Schedule: AI and Data technologies in the cloud sessions: Big data conference & machine learning training

9:00–17:00 Monday, 29 April & Tuesday, 30 April 2019

Building a serverless big data application on AWS

Data Engineering and Architecture
Location: London Suite 3

Jorge Lopez (Amazon Web Services), Nikki Rouda (Amazon Web Services), Damon Cortesi (Amazon Web Services), Sven Hansen (Amazon Web Services), Manos Samatas (Amazon Web Services), Alket Memushaj (Amazon Web Services)

Average rating:

(3.50, 2 ratings)

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join in to learn how to incorporate serverless concepts into your big data architectures. You'll explore design patterns to ingest, store, and analyze your data as you build a big data application using AWS technologies such as S3, Athena, Kinesis, and more. Read more.

9:00–12:30 Tuesday, 30 April 2019

Cross-cloud model training and serving with Kubeflow

Data Science, Machine Learning & AI
Location: Capital Suite 15

Holden Karau (Independent), Trevor Grant (IBM), Francesca Lazzeri (Microsoft)

Average rating:

(4.43, 7 ratings)

Holden Karau, Francesca Lazzeri, and Trevor Grant offer an overview of Kubeflow and walk you through using it to train and serve models across different cloud environments (and on-premises). You'll use a script to do the initial setup work, so you can jump (almost) straight into training a model on one cloud and then look at how to set up serving in another cluster/cloud. Read more.

9:00–12:30 Tuesday, 30 April 2019

Using AWS serverless technologies to analyze large datasets

Data Science, Machine Learning & AI
Location: Capital Suite 4

Krishnan Saidapet (REAN Cloud, A Hitachi Vantara company)

Average rating:

(3.43, 7 ratings)

Krishnan Saidapet offers an overview of the latest big data and machine learning serverless technologies from Amazon Web Services (AWS) and leads a deep dive into using them to process and analyze two different datasets: the publicly available Bureau of Labor Statistics dataset and the Chest X-Ray Image Data dataset. Read more.

9:00–12:30 Tuesday, 30 April 2019

Serverless machine learning with TensorFlow: Part I

Data Science, Machine Learning & AI
Location: Capital Suite 2/3

Melinda King (ROI Training)

Average rating:

(3.00, 8 ratings)

Melinda King offers an introduction to designing and building machine learning models on Google Cloud Platform. Through a combination of presentations, demos, and hands-on labs, you’ll learn machine learning (ML) and TensorFlow concepts and build skills in developing, evaluating, and productionizing ML models. Read more.

9:00–12:30 Tuesday, 30 April 2019

Architecting a data platform for enterprise use

Data Engineering and Architecture
Location: S11 A

Mark Madsen (Teradata), Todd Walter (Archimedata)

Average rating:

(3.71, 7 ratings)

Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that is not subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure. Read more.

13:30–17:00 Tuesday, 30 April 2019

Running multidisciplinary big data workloads in the cloud

Data Engineering and Architecture
Location: Capital Suite 4

Colm Moynihan (Cloudera), Jonathan Seidman (Cloudera), Michael Kohs (Cloudera)

Average rating:

(4.00, 2 ratings)

Moving to the cloud poses a number of challenges. Join Colm Moynihan, Jonathan Seidman, and Michael Kohs to explore cloud architecture and challenges and learn how to use Cloudera Altus to build data warehousing and data engineering clusters and run workloads that share metadata between them using Cloudera SDX. Read more.

13:30–17:00 Tuesday, 30 April 2019

Architecture and algorithms for end-to-end streaming data processing

Data Engineering and Architecture, Streaming and IoT
Location: S11 A

Arun Kejariwal (Independent), Karthik Ramasamy (Streamlio), Ivan Kelly (Streamlio)

Average rating:

(3.00, 10 ratings)

Many industry segments have been grappling with fast data (high-volume, high-velocity data). Arun Kejariwal and Karthik Ramasamy walk you through the state-of-the-art systems for each stage of an end-to-end data processing pipeline—messaging, compute, and storage—for real-time data and algorithms to extract insights (e.g., heavy hitters and quantiles) from data streams. Read more.

13:30–17:00 Tuesday, 30 April 2019

Serverless machine learning with TensorFlow: Part II

Data Science, Machine Learning & AI
Location: Capital Suite 11

Melinda King (ROI Training)

Average rating:

(3.12, 8 ratings)

Melinda King offers an introduction to designing and building machine learning models on Google Cloud Platform. Through a combination of presentations, demos, and hands-on labs, you’ll learn machine learning (ML) and TensorFlow concepts and build skills in developing, evaluating, and productionizing ML models. Read more.

13:30–17:00 Tuesday, 30 April 2019

Learning Presto: SQL on anything

Data Engineering and Architecture
Location: Capital Suite 15

Matt Fuller (Starburst)

Average rating:

(5.00, 2 ratings)

Used by Facebook, Netflix, Airbnb, LinkedIn, Twitter, Uber, and others, Presto has become the ubiquitous open source software for SQL on anything. Presto was built from the ground up for fast interactive SQL analytics against disparate data sources ranging in size from GBs to PBs. Join Matt Fuller to learn how to use Presto and explore use cases and best practices you can implement today. Read more.

13:30–17:00 Tuesday, 30 April 2019

Time series forecasting with Azure Machine Learning

Data Science, Machine Learning & AI
Location: Capital Suite 2/3

Francesca Lazzeri (Microsoft), Aashish Bhateja (Microsoft)

Average rating:

(4.25, 4 ratings)

Time series modeling and forecasting is fundamentally important to various practical domains; in the past few decades, machine learning model-based forecasting has become very popular in both private and public decision-making processes. Francesca Lazzeri walks you through using Azure Machine Learning to build and deploy your time series forecasting models. Read more.

11:15–11:55 Wednesday, 1 May 2019

Stream, stream, stream: Different streaming methods with Spark and Kafka

Data Engineering and Architecture, Expo Hall
Location: Expo Hall 2 (Capital Hall N24)

Itai Yaffe (Nielsen)

Average rating:

(4.45, 11 ratings)

NMC (Nielsen Marketing Cloud) provides customers (both marketers and publishers) with real-time analytics tools to profile their target audiences. To achieve that, the company needs to ingest billions of events per day into its big data stores in a scalable, cost-efficient way. Itai Yaffe explains how NMC continuously transforms its data infrastructure to support these goals. Read more.

11:15–11:55 Wednesday, 1 May 2019

The Presto Cost-Based Optimizer for interactive SQL on anything

Data Engineering and Architecture
Location: S11 A

Wojciech Biela (Starburst), Piotr Findeisen (Starburst)

Average rating:

(3.12, 8 ratings)

Presto is a popular open source distributed SQL engine for interactive queries over heterogeneous data sources (Hadoop/HDFS, Amazon S3, Azure ADLS, RDBMS, NoSQL, etc.). Wojciech Biela and Piotr Findeisen offer an overview of the Cost-Based Optimizer (CBO) for Presto, which brings a great performance boost. Join in to learn about CBO internals, the motivating use cases, and observed improvements. Read more.

11:15–11:55 Wednesday, 1 May 2019

Protecting sensitive data in huge datasets: Cloud tools you can use

Data Engineering and Architecture
Location: S11 B

Felipe Hoffa (Google)

Average rating:

(3.50, 4 ratings)

Before releasing a public dataset, practitioners need to thread the needle between utility and protection of individuals. Felipe Hoffa explores how to handle massive public datasets, taking you from theory to real life as he showcases newly available tools that help with PII detection and bring concepts like k-anonymity and l-diversity to the practical realm. Read more.

11:15–11:55 Wednesday, 1 May 2019

Executive Briefing: From the edge to AI—Taking control of your data for fun and profit

Executive Briefing and best practices, Strata Business Summit
Location: Capital Suite 13

Mick Hollison (Cloudera)

Average rating:

(3.33, 3 ratings)

Managing your data securely is difficult, as is choosing the right machine learning tools and managing models and applications in compliance with regulation and law. Mick Hollison covers the risks and the issues that matter most and explains how to address them with an enterprise data cloud and by embracing your data center and the public cloud in combination. Read more.

11:15–11:55 Wednesday, 1 May 2019

Serverless for data and AI

Data Engineering and Architecture
Location: Capital Suite 10/11

Avner Braverman (Binaris)

Average rating:

(2.71, 7 ratings)

What is serverless, and how can it be utilized for data analysis and AI? Avner Braverman outlines the benefits and limitations of serverless with respect to data transformation (ETL), AI inference and training, and real-time streaming. This is a technical talk, so expect demos and code. Read more.

12:05–12:45 Wednesday, 1 May 2019

Running SQL-based workloads in the cloud at 20x–200x lower cost using Apache Arrow

Data Engineering and Architecture
Location: S11 A

Jacques Nadeau (Dremio)

Average rating:

(4.75, 4 ratings)

Performance and cost are two important considerations in determining optimized solutions for SQL workloads in the cloud. Jacques Nadeau explains how to accelerate TPC workloads in a way that is invisible to client apps and how to use Apache Arrow, Parquet, and Calcite to provide a scalable, high-performance solution optimized for cloud deployments while significantly reducing operational costs. Read more.

14:05–14:45 Wednesday, 1 May 2019

Nielsen presents: Fun with Kafka, Spark, and offset management

Data Engineering and Architecture, Expo Hall
Location: Expo Hall 2 (Capital Hall N24)

Simona Meriam (Nielsen)

Average rating:

(4.57, 7 ratings)

Simona Meriam explains how Nielsen Marketing Cloud (NMC) used to manage its Kafka consumer offsets with the Spark-Kafka 0.8 consumer and why the company decided to upgrade from the Spark-Kafka 0.8 consumer to the 0.10 consumer. Simona reviews the problems encountered during the upgrade and details the process that led to the solution. Read more.

14:55–15:35 Wednesday, 1 May 2019

Model serving via Pulsar functions

Data Engineering and Architecture
Location: S11 B

Arun Kejariwal (Independent), Karthik Ramasamy (Streamlio)

Average rating:

(3.00, 1 rating)

Arun Kejariwal and Karthik Ramasamy walk you through an architecture in which models are served in real time and updated, using Apache Pulsar, without restarting the application at hand. They then describe how to apply Pulsar functions to support two example use cases (sampling and filtering) and explore a concrete case study of the same. Read more.

14:55–15:35 Wednesday, 1 May 2019

Processing 10M samples a second to drive smart maintenance in complex IIoT systems

Data Engineering and Architecture, Expo Hall, Streaming and IoT
Location: Expo Hall 2 (Capital Hall N24)

Geir Engdahl (Cognite), Daniel Bergqvist (Google)

Average rating:

(4.00, 2 ratings)

Geir Engdahl and Daniel Bergqvist explain how Cognite is developing IIoT smart maintenance systems that can process 10M samples a second from thousands of sensors. You'll explore an architecture designed for high performance, robust streaming sensor data ingest, and cost-effective storage of large volumes of time series data as well as best practices learned along the way. Read more.

14:55–15:35 Wednesday, 1 May 2019

Improving Spark downscaling; Or, Not throwing away all of our work

Data Engineering and Architecture
Location: S11 A

Holden Karau (Independent), Mikayla Konst (Google), Ben Sidhom (Google)

Average rating:

(3.75, 4 ratings)

As more workloads move to serverless-like environments, the importance of properly handling downscaling increases. Holden Karau, Mikayla Konst, and Ben Sidhom explore approaches for improving the scale-down experience on open source cluster managers, covering everything from how to schedule jobs to the location of blocks and their impact. Read more.

16:35–17:15 Wednesday, 1 May 2019

Deploying your real-time apps on thousands of servers and still being able to breathe

Data Engineering and Architecture, Expo Hall
Location: Expo Hall 2 (Capital Hall N24)

Constantin Muraru (Adobe), Dan Popescu (Adobe)

Average rating:

(5.00, 2 ratings)

With the current crop of cloud providers, obtaining servers to run your real-time application has never been easier. But what happens when you wish to deploy your (web) applications frequently, on hundreds or even thousands of servers, in a fast, reliable way, with minimal human intervention? Constantin Muraru and Dan Popescu tell you how to tackle this challenge. Read more.

16:35–17:15 Wednesday, 1 May 2019

Scalability-aware autoscaling of a Spark application

Data Engineering and Architecture
Location: S11 A

Anirudha Beria (Qubole), Rohit Karlupia (Qubole)

Average rating:

(3.67, 3 ratings)

Autoscaling of resources aims to achieve low latency for a big data application while reducing resource costs at the same time. Scalability-aware autoscaling uses historical information to make better scaling decisions. Anirudha Beria and Rohit Karlupia explain how to measure the efficiency of autoscaling policies and discuss more efficient autoscaling policies, in terms of latency and costs. Read more.

17:25–18:05 Wednesday, 1 May 2019

Information architecture for an enterprise data cloud

Data Engineering and Architecture
Location: S11 B

Mark Samson (Cloudera), Phillip Radley (BT)

Average rating:

(5.00, 2 ratings)

It's now possible to build a modern data platform capable of storing, processing, and analyzing a wide variety of data across multiple public and private cloud platforms and on-premises data centers. Mark Samson and Phillip Radley outline an information architecture for such a platform, informed by working with multiple large organizations that have built such platforms over the last five years. Read more.

11:15–11:55 Thursday, 2 May 2019

Big data analytics in the public cloud: Challenges and opportunities

Data Engineering and Architecture
Location: S11 B

Jian Zhang (Intel), Chendi Xue (Intel), Yuan Zhou (Intel)

Average rating:

(4.50, 2 ratings)

Jian Zhang, Chendi Xue, and Yuan Zhou explore the challenges of migrating big data analytics workloads to the public cloud (e.g., performance loss and missing features) and demonstrate how to use a new in-memory data accelerator leveraging persistent memory and RDMA NICs to resolve these issues and enable new opportunities for big data workloads on the cloud. Read more.

11:15–11:55 Thursday, 2 May 2019

Transforming a financial services data infrastructure for the modern era by building a PCI DSS-compliant data platform from the ground up on AWS

Data Engineering and Architecture
Location: Capital Suite 10/11

Eoin O'Flanagan (NewDay), Darragh McConville (Kainos)

Average rating:

(4.86, 7 ratings)

Eoin O'Flanagan and Darragh McConville explain how NewDay built a high-performance contemporary data processing platform from the ground up on AWS. Join in to explore the company's journey from a traditional legacy onsite data estate to an entirely cloud-based PCI DSS-compliant platform. Read more.

12:05–12:45 Thursday, 2 May 2019

Unleashing Apache Kafka and TensorFlow in hybrid architectures

Data Engineering and Architecture, Expo Hall
Location: Expo Hall 2 (Capital Hall N24)

Kai Wähner (Confluent)

Average rating:

(4.75, 8 ratings)

How do you leverage the flexibility and extreme scale of the public cloud and the Apache Kafka ecosystem to build scalable, mission-critical machine learning infrastructures that span multiple public clouds—or bridge your on-premises data center to the cloud? Join Kai Wähner to learn how to use technologies such as TensorFlow with Kafka’s open source ecosystem for machine learning infrastructures. Read more.

12:05–12:45 Thursday, 2 May 2019

Schema on read and the new logging way

Data Engineering and Architecture
Location: S11 A

David Josephsen (Sparkpost)

Average rating:

(3.50, 2 ratings)

David Josephsen tells the story of how Sparkpost's reliability engineering team abandoned ELK for a DIY schema-on-read logging infrastructure. Join in to learn the architectural details, trials, and tribulations from the company's Internal Event Hose data ingestion pipeline project, which uses Fluentd, Kinesis, Parquet, and AWS Athena to make logging sane. Read more.

12:05–12:45 Thursday, 2 May 2019

Herding elephants: Seamless data access in a multicluster cloud

Data Engineering and Architecture
Location: S11 B

Pradeep Bhadani (Hotels.com), Elliot West (Hotels.com)

Average rating:

(4.17, 6 ratings)

Travel platform Expedia Group likes to give its data teams flexibility and autonomy to work with different technologies. However, this approach generates challenges that cannot be solved by existing tools. Pradeep Bhadani and Elliot West explain how the company built a unified virtual data lake on top of its many heterogeneous and distributed data platforms. Read more.

14:05–14:45 Thursday, 2 May 2019

Unlocking insights in AI by building a feature store

Data Engineering and Architecture
Location: Capital Suite 8/9

Willem Pienaar (GOJEK), Zhi Ling Chen (GOJEK)

Average rating:

(4.80, 5 ratings)

Features are key to driving impact with AI at all scales, allowing organizations to dramatically accelerate innovation and time to market. Willem Pienaar and Zhi Ling Chen explain how GOJEK, Indonesia's first billion-dollar startup, unlocked insights in AI by building a feature store called Feast, and the lessons they learned along the way. Read more.

14:05–14:45 Thursday, 2 May 2019

Autoscaling Spark on Kubernetes

Data Engineering and Architecture, Expo Hall
Location: Expo Hall 2 (Capital Hall N24)

Holden Karau (Independent), Kris Nova (Independent)

Average rating:

(4.86, 7 ratings)

In the Kubernetes world, where declarative resources are a first-class citizen, running complicated workloads across distributed infrastructure is easy, and processing big data workloads using Spark is common practice, we can finally look at constructing a hybrid system of running Spark in a distributed cloud native way. Join respective experts Kris Nova and Holden Karau for a fun adventure. Read more.

14:55–15:35 Thursday, 2 May 2019

Architecting a data platform to support analytic workflows for scientific data

Data Engineering and Architecture
Location: Capital Suite 8/9

Jane McConnell (Teradata), Sun Maria Lehmann (Equinor)

Average rating:

(3.67, 3 ratings)

In upstream oil and gas, a vast amount of the data requested for analytics projects is scientific data: physical measurements about the real world. Historically, this data has been managed library style, but a new system was needed to best provide this data. Sun Maria Lehmann and Jane McConnell share architectural best practices learned from their work with subsurface data. Read more.

14:55–15:35 Thursday, 2 May 2019

Executive Briefing: AWS technology trends—Data lakes and analytics

Executive Briefing and best practices, Strata Business Summit
Location: Capital Suite 13

Nikki Rouda (Amazon Web Services)

Average rating:

(4.14, 7 ratings)

Nikki Rouda shares key trends in data lakes and analytics and explains how they shape the services offered by AWS. Specific topics include the rise of machine-generated data and semistructured and unstructured data as dominant sources of new data, the move toward serverless, API-centric computing, and the growing need for local access to data from users around the world. Read more.

14:55–15:35 Thursday, 2 May 2019

The future of cloud native data warehousing: Emerging trends and technologies

Data Engineering and Architecture
Location: S11 A

Greg Rahn (Cloudera)

Average rating:

(3.00, 7 ratings)

Data warehouses have traditionally run in the data center, and in recent years, they've been adapted to be more cloud native. Greg Rahn discusses a number of emerging trends and technologies that will impact how data warehouses are run both in the cloud and on-premises and explains what that means for architects, administrators, and end users. Read more.

16:35–17:15 Thursday, 2 May 2019

From legacy to cloud: An end-to-end data integration journey

Data Engineering and Architecture
Location: Capital Suite 8/9

Max Schultze (Zalando SE)

Average rating:

(4.83, 12 ratings)

Max Schultze details Zalando's end-to-end data integration platform to serve analytical use cases and machine learning throughout the company, covering raw data collection, standardized data preparation (binary conversion, partitioning, etc.), user-driven analytics, and machine learning. Read more.

16:35–17:15 Thursday, 2 May 2019

Deep learning with TensorFlow and Spark using GPUs and Docker containers

Data Engineering and Architecture
Location: S11 A

Thomas Phelan (HPE BlueData)

Average rating:

(3.29, 7 ratings)

Organizations need to keep ahead of their competition by using the latest AI, ML, and DL technologies such as Spark, TensorFlow, and H2O. The challenge is in how to deploy these tools and keep them running in a consistent manner while maximizing the use of scarce hardware resources, such as GPUs. Thomas Phelan discusses the effective deployment of such applications in a container environment. Read more.
