Presented By
O’Reilly + Cloudera
Make Data Work
29 April–2 May 2019
London, UK

Schedule: AI and Data technologies in the cloud sessions

9:00 - 17:00 Monday, 29 April & Tuesday, 30 April
Data Engineering and Architecture
Location: London Suite 3
Jorge Lopez (Amazon Web Services)
Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. In this workshop, we show you how to incorporate serverless concepts into your big data architectures, looking at design patterns to ingest, store, and analyze your data. You will build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.
9:00 - 12:30 Tuesday, 30 April 2019
Data Science, Machine Learning & AI
Location: Capital Suite 2/3
Holden Karau (Google), Trevor Grant (IBM), Ilan Filonenko (Bloomberg LP), Francesca Lazzeri (Microsoft)
This workshop quickly introduces Kubeflow and shows how to use it to train and serve models across different cloud environments (and on-prem). We’ll have a script ready to do the initial setup work so you can jump (almost) straight into training a model on one cloud, and then look at how to set up serving in another cluster/cloud. We will start with a simple model, with follow-up links.
9:00 - 12:30 Tuesday, 30 April 2019
Data Science, Machine Learning & AI
Location: Capital Suite 15
S.P.T. Krishnan (REAN Cloud (A Hitachi Vantara company))
S.P.T. Krishnan provides an overview of the latest serverless big data and machine learning technologies from AWS, followed by a deep dive into using them to process and analyze two different datasets. The first is publicly available Bureau of Labor Statistics data, and the second is chest X-ray image data.
9:00 - 12:30 Tuesday, 30 April 2019
Data Science, Machine Learning & AI
Location: Capital Suite 4
Amy Unruh (Google)
This tutorial provides an introduction to designing and building machine learning models on Google Cloud Platform. Through a combination of presentations, demos, and hands-on labs, you’ll learn machine learning (ML) and TensorFlow concepts and build skills in developing, evaluating, and productionizing ML models.
9:00 - 12:30 Tuesday, 30 April 2019
Data Engineering and Architecture
Location: Capital Suite 9
Mark Madsen (Think Big Analytics), Todd Walter (Teradata)
Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that is not subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure.
13:30 - 17:00 Tuesday, 30 April 2019
Data Engineering and Architecture
Location: Capital Suite 9
Jason Wang (Cloudera), Tony Wu (Cloudera), Vinithra Varadharajan (Cloudera)
Moving to the cloud poses challenges ranging from re-architecting to be cloud-native to keeping data context consistent across workloads that span multiple clusters on-prem and in the cloud. First, we’ll cover cloud architecture and its challenges in depth; second, you’ll use Cloudera Altus to build data warehousing and data engineering clusters and run workloads that share metadata between them using Cloudera SDX.
13:30 - 17:00 Tuesday, 30 April 2019
Arun Kejariwal (Independent), Karthik Ramasamy (Streamlio)
Many industry segments have been grappling with fast data (high-volume, high-velocity data). In this tutorial, we survey the state-of-the-art systems for each stage of an end-to-end pipeline for real-time data (messaging, compute, and storage), along with algorithms for extracting insights from data streams (e.g., heavy hitters and quantiles).
13:30 - 17:00 Tuesday, 30 April 2019
Data Science, Machine Learning & AI
Location: Capital Suite 4
Amy Unruh (Google)
This tutorial provides an introduction to designing and building machine learning models on Google Cloud Platform. Through a combination of presentations, demos, and hands-on labs, you’ll learn machine learning (ML) and TensorFlow concepts and build skills in developing, evaluating, and productionizing ML models.
13:30 - 17:00 Tuesday, 30 April 2019
Data Engineering and Architecture
Location: Capital Suite 10
Matt Fuller (Starburst)
Used by Facebook, Netflix, Airbnb, LinkedIn, Twitter, Uber, and others, Presto has become the ubiquitous open source software for SQL-on-Anything. Presto was built from the ground up for fast interactive SQL analytics against disparate data sources ranging in size from gigabytes to petabytes. In this tutorial, attendees will learn Presto usage and best practices, with optional hands-on exercises.
13:30 - 17:00 Tuesday, 30 April 2019
Data Science, Machine Learning & AI
Location: Capital Suite 15
Francesca Lazzeri (Microsoft), Aashish Bhateja (Microsoft)
Time series modeling and forecasting is fundamentally important in many practical domains, and over the past few decades machine learning-based forecasting has become very popular in private and public decision-making. In this tutorial, we will walk you through the core steps for using Azure Machine Learning to build and deploy your time series forecasting models.
11:15 - 11:55 Wednesday, 1 May 2019
Data Engineering and Architecture
Location: Capital Suite 7
Itai Yaffe (Nielsen)
At Nielsen Marketing Cloud, we provide our customers (marketers and publishers) with real-time analytics tools to profile their target audiences. To achieve that, we need to ingest billions of events per day into our big data stores, and we need to do it in a scalable yet cost-efficient manner. In this talk, we will discuss how we continuously transform our data infrastructure to support these goals.
11:15 - 11:55 Wednesday, 1 May 2019
Wojciech Biela (Starburst), Piotr Findeisen (Starburst)
Presto is a popular open source distributed SQL engine for interactive queries over heterogeneous data sources (Hadoop/HDFS, Amazon S3/Azure ADLS, RDBMS, NoSQL, etc.). Starburst recently contributed a cost-based optimizer (CBO) to Presto, which brings a substantial performance boost. Learn about this CBO’s internals, the motivating use cases, and the observed improvements.
11:15 - 11:55 Wednesday, 1 May 2019
Felipe Hoffa (Google)
Before releasing a public dataset, practitioners need to thread the needle between utility and protection of individuals. We will explore massive public datasets, taking you from theory to real life and showcasing newly available tools that help with PII detection and bring concepts like k-anonymity and l-diversity to the practical realm (with options such as removing, masking, and coarsening).
11:15 - 11:55 Wednesday, 1 May 2019
Mike Olson (Cloudera)
It's easier than ever to collect data, but managing it securely and in compliance with regulations and legal constraints is harder. There are plenty of tools that promise to bring machine learning techniques to your data, but choosing the right tools and managing models and applications in compliance with regulation and law is quite difficult.
12:05 - 12:45 Wednesday, 1 May 2019
Jacques Nadeau (Dremio)
Performance and cost are two important considerations in determining optimized solutions for SQL workloads in the cloud. We look at TPC workloads and how they can be accelerated in a way that is invisible to client apps. We explore how Apache Arrow, Parquet, and Calcite can be used to provide a scalable, high-performance solution optimized for cloud deployments while significantly reducing operational costs.
14:05 - 14:45 Wednesday, 1 May 2019
Data Engineering and Architecture
Location: Capital Suite 7
Simona Meriam (Nielsen)
We ingest billions of events per day into our big data stores, and we need to do it in a scalable, cost-efficient, and consistent way. When working with Spark and Kafka, the way you manage your consumer offsets has major implications for data consistency. We will go into depth on the solution we ended up implementing and discuss the working process and the dos and don'ts that led us to its final design.
14:55 - 15:35 Wednesday, 1 May 2019
Arun Kejariwal (Independent), Karthik Ramasamy (Streamlio)
In this talk, we walk through an architecture in which models are served in real time and updated using Apache Pulsar, without restarting the application at hand. We also describe how Pulsar functions can be applied to support two example use cases, viz., sampling and filtering, and lead the audience through a concrete case study.
14:55 - 15:35 Wednesday, 1 May 2019
Geir Endahl (Cognite), Daniel Bergqvist (Google)
Learn how Cognite is developing IIoT smart maintenance systems that can process 10M samples/second from thousands of sensors. We’ll review an architecture designed for high-performance, robust ingestion of streaming sensor data and cost-effective storage of large volumes of time series data, along with best practices for aggregation and fast queries and for achieving high performance with machine learning.
14:55 - 15:35 Wednesday, 1 May 2019
Holden Karau (Google), Mikayla Konst (Google), Ben Sidhom (Google)
As more workloads move to “serverless”-like environments, the importance of properly handling downscaling increases.
16:35 - 17:15 Wednesday, 1 May 2019
Data Engineering and Architecture
Location: Capital Suite 7
Constantin Muraru (Adobe), Dan Popescu (Adobe)
Obtaining servers to run your real-time application has never been easier. Cloud providers have removed the cumbersome process of provisioning new hardware to suit your needs. What happens, though, when you wish to deploy your (web) applications frequently, on hundreds or even thousands of servers, in a fast and reliable way with minimal human intervention? This session addresses this precise topic.
16:35 - 17:15 Wednesday, 1 May 2019
Anirudha Beria (Qubole), Rohit Karlupia (Qubole)
Autoscaling of resources aims to achieve low latency for a big data application while reducing resource costs at the same time. Scalability-aware autoscaling uses historical information to make better scaling decisions. In this talk, we will discuss (1) measuring the efficiency of autoscaling policies and (2) devising more efficient autoscaling policies in terms of latency and cost.
17:25 - 18:05 Wednesday, 1 May 2019
Mark Samson (Cloudera)
It is now possible to build a modern data platform capable of storing, processing and analysing a wide variety of data across multiple public and private cloud platforms and on-premises data centres. This session will outline an information architecture for such a platform, informed by work with multiple large organisations that have built such platforms over the last 5 years.
11:15 - 11:55 Thursday, 2 May 2019
Jian Zhang (Intel), Chendi Xue (Intel), Yuan Zhou (Intel)
This talk introduces the challenges of migrating big data analytics workloads to the public cloud, such as lost performance and missing features, and showcases how a new in-memory data accelerator that leverages persistent memory and RDMA NICs can resolve these issues and enable new opportunities for big data workloads in the cloud.
11:15 - 11:55 Thursday, 2 May 2019
Data Engineering and Architecture
Location: Capital Suite 10/11
Eoin O'Flanagan (Newday), Darragh McConville (Kainos)
In this session, you will learn how we built a high-performance, contemporary data processing platform from the ground up on AWS. We will discuss our journey from a legacy, on-site, traditional data estate to an entirely cloud-based, PCI DSS-compliant platform.
12:05 - 12:45 Thursday, 2 May 2019
Data Engineering and Architecture
Location: Capital Suite 7
Kai Wähner (Confluent)
How can you leverage the flexibility and extreme scale of the public cloud, combined with the Apache Kafka ecosystem, to build scalable, mission-critical machine learning infrastructures that span multiple public clouds or bridge your on-premises data centre to the cloud? Join this talk to learn how to apply technologies such as TensorFlow with Kafka’s open source ecosystem for machine learning infrastructures.
12:05 - 12:45 Thursday, 2 May 2019
David Josephsen (Sparkpost)
This is the story of how Sparkpost Reliability Engineering abandoned ELK for a DIY schema-on-read logging infrastructure. We share architectural details and tribulations from our Internal Event Hose data ingestion pipeline project, which uses Fluentd, Kinesis, Parquet, and AWS Athena to make logging sane.
14:05 - 14:45 Thursday, 2 May 2019
Data Engineering and Architecture
Location: Capital Suite 8/9
Willem Pienaar (GO-JEK), Zhi Ling Chen (GO-JEK)
Features are key to driving impact with AI at all scales. By democratizing the creation, discovery, and access of features through a unified platform, organizations are able to dramatically accelerate innovation and time to market. Find out how GO-JEK, Indonesia's first billion-dollar startup, built a feature platform to unlock insights in AI, and the lessons they learned along the way.
14:05 - 14:45 Thursday, 2 May 2019
Data Engineering and Architecture
Location: Capital Suite 7
Holden Karau (Google), Kris Nova (VMware)
In the Kubernetes world, where declarative resources are first-class citizens, running complicated workloads across distributed infrastructure is easy, and processing big data workloads using Spark is common practice, so we can finally look at constructing a hybrid system for running Spark in a distributed, cloud-native way. Join respective experts Kris Nova & Holden Karau for a fun adventure.
14:55 - 15:35 Thursday, 2 May 2019
Data Engineering and Architecture
Location: Capital Suite 8/9
Jane McConnell (Teradata), Sun Maria Lehmann (Equinor)
In upstream oil and gas, a vast amount of the data requested for analytics projects is “scientific data”: physical measurements about the real world. Historically, this data has been managed “library-style” in files, but to provide it to analytics projects, we need to do something different. Sun and Jane discuss architectural best practices learned from their work with subsurface data.
14:55 - 15:35 Thursday, 2 May 2019
Nikki Rouda (Amazon Web Services (AWS))
This talk covers some of the key trends we see in data lakes and analytics and how they shape the services we offer at AWS. Specific topics include the rise of machine-generated data and semi-structured/unstructured data as dominant sources of new data, the move towards serverless, API-centric computing, and the growing need for local access to data from users around the world.
14:55 - 15:35 Thursday, 2 May 2019
Greg Rahn (Cloudera)
Data warehouses have traditionally run in the data center, and in recent years they have adapted to be more cloud-native. In this talk, we'll discuss a number of emerging trends and technologies that will impact how data warehouses are run both in the cloud and on-prem, and share our vision of what that means for architects, administrators, and end users.
16:35 - 17:15 Thursday, 2 May 2019
Data Engineering and Architecture
Location: Capital Suite 7
Max Schultze (Zalando SE)
This talk covers data lake implementation at a large-scale company: raw data collection, standardized data preparation (e.g., binary conversion, partitioning), and user-driven analytics and machine learning.
16:35 - 17:15 Thursday, 2 May 2019
Nanda Vijaydev (BlueData), Thomas Phelan (BlueData)
Organizations need to keep ahead of their competition by using the latest AI/ML/DL technologies such as Spark, TensorFlow, and H2O. The challenge is in how to deploy these tools and keep them running in a consistent manner while maximizing the use of scarce hardware resources, such as GPUs. This session will discuss the effective deployment of such applications in a container environment.