Presented By
O’Reilly + Cloudera
Make Data Work
March 25-28, 2019
San Francisco, CA

Data Engineering & Architecture

Learn to build an analytics infrastructure that unlocks the value of your data

The right data infrastructure and architecture streamline your workflows, reduce costs, and scale your data analysis. The wrong architecture costs time and money that may never be recovered.

Selecting and building the tools and architecture you need is complex. You need to consider scalability, tools and frameworks (open source or proprietary), platforms (on-premises, cloud, or hybrid), integration, adoption and migration, and security. That’s why good data engineers are in such demand.

These sessions will help you navigate the pitfalls, select the right tools and technologies, and design a robust data pipeline.

Monday, Mar 25 - Tuesday, Mar 26: 2-Day Training (Platinum & Training passes)
Tuesday, Mar 26: Tutorials (Gold & Silver passes)
Wednesday, Mar 27: Keynotes & Sessions (Platinum, Gold, Silver & Bronze passes)
8:45am | Location: Ballroom
Strata Data Conference Keynotes
10:30am
Morning break
Thursday, Mar 28: Keynotes & Sessions (Platinum, Gold, Silver & Bronze passes)
8:45am | Location: Ballroom
Strata Data Conference Keynotes
10:30am
Morning break
9:00am - 5:00pm Monday, March 25 & Tuesday, March 26
Location: 2018
Secondary topics:  AI and Data technologies in the cloud
Jorge A. Lopez (Amazon Web Services)
Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. In this workshop, we show you how to incorporate serverless concepts into your big data architectures, looking at design patterns to ingest, store, and analyze your data. You will build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.
9:00am - 5:00pm Monday, March 25 & Tuesday, March 26
Location: 3016
Secondary topics:  Streaming and realtime analytics
Jesse Anderson (Big Data Institute)
This training takes participants through an in-depth look at Apache Kafka. We show how Kafka works and how to create real-time systems with it, including how to create consumers and publishers in Kafka. Then we look at Kafka’s ecosystem and how each component is used. We show how to use Kafka Streams, Kafka Connect, and KSQL.
9:00am - 12:30pm Tuesday, March 26, 2019
Location: 2004
Secondary topics:  Streaming and realtime analytics
Jeff Bean (data Artisans)
This hands-on session introduces Flink via the SQL interface. You will receive an overview of stream processing and a survey of Apache Flink and its various modes of use. Then we’ll use Flink to run SQL queries on data streams and contrast this with the Flink data stream API.
9:00am - 12:30pm Tuesday, March 26, 2019
Location: 2005
Mark Madsen (Think Big Analytics), Todd Walter (Teradata)
Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that is not subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure.
9:00am - 12:30pm Tuesday, March 26, 2019
Location: 2006
Secondary topics:  AI and machine learning in the enterprise
Jonathan Seidman (Cloudera), Ted Malaska (Capital One)
The enterprise data management space has changed dramatically in recent years, and this has led to new challenges for organizations in creating successful data practices. In this presentation, we’ll provide guidance and best practices from planning to implementation based on years of experience working with companies to deliver successful data projects.
9:00am - 12:30pm Tuesday, March 26, 2019
Location: 2007
Secondary topics:  Data Integration and Data Pipelines, Data preparation, data governance, and data lineage, Model lifecycle management
Boris Lublinsky (Lightbend), Dean Wampler (Lightbend)
This hands-on tutorial examines production use of ML in streaming data pipelines: how to do periodic model retraining and low-latency scoring in live streams. We'll discuss Kafka as the data backplane, pros and cons of microservices vs. systems like Spark and Flink, tips for TensorFlow and SparkML, performance considerations, model metadata tracking, and other techniques.
9:00am - 12:30pm Tuesday, March 26, 2019
Location: 2008
Secondary topics:  Data preparation, data governance, and data lineage
Santosh Kumar (Cloudera)
Cloudera SDX provides unified metadata control, simplifies administration, and maintains context as well as data lineage across storage services, workloads, and operating environments. In this three-hour tutorial, we cover the background of SDX before diving deep into its moving parts and getting hands-on with setting it up. You'll leave with the skills and experience you need to set up your own SDX.
1:30pm - 5:00pm Tuesday, March 26, 2019
Location: 2004
Secondary topics:  Streaming and realtime analytics
Matt Fuller (Starburst)
Used by Facebook, Netflix, Airbnb, LinkedIn, Twitter, Uber, and others, Presto has become the ubiquitous open source software for SQL-on-Anything. Presto was built from the ground up for fast interactive SQL analytics against disparate data sources ranging in size from gigabytes to petabytes. In this tutorial, attendees will learn Presto usage and best practices, with optional hands-on exercises.
1:30pm - 5:00pm Tuesday, March 26, 2019
Location: 2005
Secondary topics:  AI and Data technologies in the cloud, Data Integration and Data Pipelines, Streaming and realtime analytics
Arun Kejariwal (Independent), Karthik Ramasamy (Streamlio)
Many industry segments have been grappling with fast data (high-volume, high-velocity data). In this tutorial, we walk through the landscape of state-of-the-art systems for each stage of an end-to-end data processing pipeline for real-time data (messaging, compute, and storage), along with algorithms for extracting insights (e.g., heavy hitters and quantiles) from data streams.
1:30pm - 5:00pm Tuesday, March 26, 2019
Location: 2006
Secondary topics:  AI and machine learning in the enterprise
Sourav Dey (Manifold), Alex Ng (Manifold)
Many teams are still run as if data science is mainly about experimentation, but those days are over. Now it must be turnkey to take models into production. Sourav Dey and Alex Ng explain how to streamline a machine learning project and help your engineers work as an integrated part of your production teams, using a Lean AI process and the Orbyter package for Docker-first data science.
1:30pm - 5:00pm Tuesday, March 26, 2019
Location: 2007
Secondary topics:  AI and Data technologies in the cloud, Model lifecycle management
Holden Karau (Google), Francesca Lazzeri (Microsoft), Trevor Grant (IBM), Ilan Filonenko (Bloomberg LP)
This workshop will quickly introduce what Kubeflow is and how we can use it to train and serve models across different cloud environments (and on-prem). We’ll have a script ready to do the initial setup work so you can jump (almost) straight into training a model on one cloud, and then look at how to set up serving in another cluster/cloud. We will start with a simple model, with follow-up links.
1:30pm - 5:00pm Tuesday, March 26, 2019
Location: 2008
Secondary topics:  AI and Data technologies in the cloud
Jason Wang (Cloudera), Tony Wu (Cloudera), Vinithra Varadharajan (Cloudera)
Moving to the cloud poses challenges ranging from re-architecting to be cloud-native to keeping data context consistent across workloads that span multiple clusters on-prem and in the cloud. First, we’ll cover cloud architecture and its challenges in depth; second, you’ll use Cloudera Altus to build data warehousing and data engineering clusters and run workloads that share metadata between them using Cloudera SDX.
11:00am - 11:40am Wednesday, March 27, 2019
Location: 2001
Secondary topics:  Data Integration and Data Pipelines, Data preparation, data governance, and data lineage, Media, Marketing, Advertising
Jitender Aswani (Netflix), Di Lin (Netflix)
Hundreds of thousands of ETL pipelines ingest over a trillion events daily to populate millions of data tables downstream at Netflix. This session discusses Netflix’s internal data lineage service, which aims to establish end-to-end lineage across millions of data artifacts and was essential for enhancing the platform’s reliability, increasing trust in data, and improving data infrastructure efficiency.
11:00am - 11:40am Wednesday, March 27, 2019
Location: 2002
Secondary topics:  AI and machine learning in the enterprise, Automation in data science and big data, Data Platforms, Retail and e-commerce, Temporal data and time-series analytics
Jian Chang (Alibaba Group), Sanjian Chen (Alibaba Group)
We share the design of the AI Engine on the Alibaba TSDB service, which enables fast and complex analytics of large-scale retail data, and present a successful case study of the Fresh Hema Supermarket, a major “New Retail” platform operated by Alibaba Group. We will highlight our solutions to the major technical challenges in data cleaning, storage, and processing.
11:00am - 11:40am Wednesday, March 27, 2019
Location: 2004
Secondary topics:  AI and Data technologies in the cloud
Shubham Tagra (Qubole)
Running Presto in AWS at one-tenth the cost using AWS Spot nodes can be achieved with a few architectural enhancements to Presto. This talk explains the gaps in Presto's architecture that stand in the way of using Spot nodes, covers these enhancements, and showcases the resulting improvements in reliability and TCO.
11:00am - 11:40am Wednesday, March 27, 2019
Location: 2006
Secondary topics:  Data Integration and Data Pipelines, Streaming and realtime analytics
Osman Sarood (Mist Systems), Chunky Gupta (Mist Systems)
Live Aggregators (LA) is a highly reliable and scalable in-house real-time aggregation system that can autoscale for sudden changes in load. LA consumes billions of Kafka messages and does over 1.5 billion writes to Cassandra per day. It is 80% cheaper than competing streaming solutions because it runs on AWS Spot instances and achieves 70% CPU utilization.
11:00am - 11:40am Wednesday, March 27, 2019
Location: 2008
Secondary topics:  AI and Data technologies in the cloud, Automation in data science and big data, Model lifecycle management
Diego Oppenheimer (Algorithmia)
You've invested heavily in cleaning your data, feature engineering, and training and tuning your model, but now you have to deploy your model into production and you discover it's a huge challenge. In this talk, you'll learn common architectural patterns and best practices that the most advanced organizations use to deploy models for scalability and accessibility.
11:50am - 12:30pm Wednesday, March 27, 2019
Location: 2001
Secondary topics:  Data preparation, data governance, and data lineage, Retail and e-commerce
Neelesh Salian (Stitch Fix)
This talk describes the data lineage system we built at Stitch Fix and the journey of building it from the ground up.
11:50am - 12:30pm Wednesday, March 27, 2019
Location: 2002
Secondary topics:  Data Platforms, Media, Marketing, Advertising
Kurt Brown (Netflix)
The Netflix data platform is a massive-scale, cloud-only suite of tools and technologies. It includes big data technologies (e.g., Spark and Flink), enabling services (e.g., federated metadata management), and machine learning support. But with power comes complexity. I'll talk through how we are investing towards an easier, "self-service" data platform without sacrificing our enabling capabilities.
11:50am - 12:30pm Wednesday, March 27, 2019
Location: 2004
Secondary topics:  Streaming and realtime analytics
Lars Volker (Cloudera), Michael Ho (Cloudera)
In recent years, Apache Impala has been deployed to clusters that are large enough to hit architectural limitations in the stack. Our talk will cover the work done, and the results achieved, in addressing the scalability limitations of the now-legacy Thrift RPC framework by using Apache Kudu's RPC, which was built from the ground up to support asynchronous communication, multiplexed connections, TLS, and Kerberos.
11:50am - 12:30pm Wednesday, March 27, 2019
Location: 2006
Secondary topics:  Data Integration and Data Pipelines, Data Platforms, Health and Medicine, Streaming and realtime analytics
In a large global health service company, streaming data for processing and sharing comes with its own challenges. Data science and analytics platforms need data fast from relevant sources so they can act on it quickly and share the insights with consumers with the same speed and urgency. Streaming data architectures are a necessity. Kafka and Hadoop are key.
11:50am - 12:30pm Wednesday, March 27, 2019
Location: 2008
Secondary topics:  AI and Data technologies in the cloud, Automation in data science and big data, Model lifecycle management
Tobias Knaup (Mesosphere), Jörg Schad (Mesosphere, Inc.)
There are many great tutorials for training your deep learning models using TensorFlow, Keras, Spark, or one of the many other frameworks. But training is only a small part of the overall deep learning pipeline. This talk gives an overview of building a complete automated deep learning pipeline, from exploratory analysis through training, model storage, model serving, and monitoring.
2:40pm - 3:20pm Wednesday, March 27, 2019
Location: 2001
Secondary topics:  Data Integration and Data Pipelines, Transportation and Logistics
James Taylor (Lyft)
This talk will provide details of an automated feedback loop at Lyft to adapt ETL based on the aggregate cost of queries run across the cluster. In addition, future work will be outlined to enhance the system through the use of materialized views to reduce the number of ad hoc joins and sorting performed by the most expensive queries by transparently rewriting queries when possible.
2:40pm - 3:20pm Wednesday, March 27, 2019
Location: 2002
Secondary topics:  AI and Data technologies in the cloud, Data Integration and Data Pipelines
Yaron Haviv (iguazio)
Faced with the need to handle increasing volumes of data, alternative data sets ("alt data") and AI, many enterprises are working to design or redesign their big data architectures. While traditional batch platforms fail to generate sufficient ROI, Yaron Haviv suggests a Continuous Analytics approach yielding faster answers for the business while remaining simpler and less expensive for IT.
2:40pm - 3:20pm Wednesday, March 27, 2019
Location: 2004
Secondary topics:  Data Platforms, Streaming and realtime analytics, Transportation and Logistics
Zhenxiao Luo (Uber)
From determining the most convenient rider pickup points to predicting the fastest routes, Uber uses data-driven analytics to create seamless trip experiences. Inside Uber, analysts want to run analytics on any data source in real time. This talk shares Uber’s engineering effort to enable real-time analytics on any data source on the fly, without copying data.
2:40pm - 3:20pm Wednesday, March 27, 2019
Location: 2006
Secondary topics:  Streaming and realtime analytics
Adem Efe Gencer (LinkedIn)
This talk will describe our work and experiences towards alleviating the management overhead of large-scale Kafka clusters using Cruise Control at LinkedIn.
2:40pm - 3:20pm Wednesday, March 27, 2019
Location: 2008
Secondary topics:  Graph technologies and analytics
Denise Gosnell, PhD (DataStax)
The graph community has spent years defining and describing our passion: applying graph thinking to solve difficult problems. This talk will leverage years of experience from shipping large-scale applications built on graph databases. We’ll discuss some practical and tangible decisions that come into play when designing and delivering distributed graph applications … or playing SimCity 2000.
4:20pm - 5:00pm Wednesday, March 27, 2019
Location: 2001
Secondary topics:  Data Integration and Data Pipelines, Transportation and Logistics
Alex Kira (Uber)
Uber operates at scale, with thousands of microservices serving millions of rides a day, leading to more than a hundred petabytes of data. We will describe our journey towards a unified and scalable data workflow system at Uber used to manage this data. We will talk about the challenges we faced and how we have re-architected our system to make it highly available and horizontally scalable.
4:20pm - 5:00pm Wednesday, March 27, 2019
Location: 2002
Secondary topics:  AI and Data technologies in the cloud, Data Integration and Data Pipelines, Retail and e-commerce
Jowanza Joseph (OneClickRetail), Karthik Ramasamy (Streamlio)
After two years of running streaming pipelines through Kinesis and Spark at One Click Retail, we evaluated our solution and decided to explore a new platform that would (1) take advantage of Kubernetes and (2) support a simpler data processing DSL. We settled on Apache Pulsar because of its native support for Kubernetes and because of Pulsar Functions, a serverless functions model on top of Pulsar.
4:20pm - 5:00pm Wednesday, March 27, 2019
Location: 2004
Secondary topics:  Data Integration and Data Pipelines, Streaming and realtime analytics
Julien Le Dem (WeWork)
Big data infrastructure has evolved from flat files in a distributed filesystem to an efficient ecosystem to a fully deconstructed and open source database with reusable components. Julien Le Dem discusses the key open source components of the big data ecosystem and explains how they relate to each other and how they make the ecosystem more of a database and less of a filesystem.
4:20pm - 5:00pm Wednesday, March 27, 2019
Location: 2006
Secondary topics:  Streaming and realtime analytics
Sean Glover (Lightbend)
Introducing Strimzi, a Kafka project for Kubernetes. The best way to run stateful services with complex operational needs like Kafka is to use the operator pattern. This talk will review a popular new open source operator-based Apache Kafka implementation on Kubernetes called the Strimzi Kafka Operator.
5:10pm - 5:50pm Wednesday, March 27, 2019
Location: 2001
Secondary topics:  AI and Data technologies in the cloud, Data Integration and Data Pipelines
Gwen Shapira (Confluent)
As microservices, data services, and serverless APIs proliferate, data engineers need to collect and standardize data in an increasingly complex and diverse system. In this presentation, we’ll discuss how data engineering requirements have changed in a cloud-native world and share architectural patterns that are commonly used to build flexible, scalable, and reliable data pipelines.
5:10pm - 5:50pm Wednesday, March 27, 2019
Location: 2002
Secondary topics:  AI and Data technologies in the cloud, Data Integration and Data Pipelines
Rustem Feyzkhanov (Instrumental)
Serverless implementation of core processing is becoming a production-ready solution for a lot of companies. Companies with existing processing pipelines may find it hard to go completely serverless. Serverless workflows unite the serverless world and the cluster world to capture the benefits of both approaches. My talk will show how serverless workflows change our perception of software architecture.
5:10pm - 5:50pm Wednesday, March 27, 2019
Location: 2004
Tim Armstrong (Cloudera)
As the popularity and utilization of Apache Impala deployments increase, clusters often become victims of their own success when demand for resources exceeds the supply. This talk will dive into the latest resource management features in Impala to maintain high cluster availability and optimal performance as well as provide examples of how to configure them in your Impala deployment.
5:10pm - 5:50pm Wednesday, March 27, 2019
Location: 2006
Secondary topics:  Streaming and realtime analytics, Transportation and Logistics
GE produces a third of the world's power and 60% of airplane engines. These engines form a critical portion of the world's infrastructure and require meticulous monitoring of the hundreds of sensors streaming data from each turbine. Here, we share the case study of releasing into production the first real-time ML systems used to determine turbine health by GE's monitoring and diagnostics teams.
5:10pm - 5:50pm Wednesday, March 27, 2019
Location: 2008
Secondary topics:  AI and Data technologies in the cloud, Model lifecycle management
Rachel Silver (MapR Technologies)
Kubeflow separates compute and storage to provide the ability to deploy best-of-breed open source systems for machine learning to any cluster running Kubernetes, whether on-premises or in the cloud. This talk will explore the problems of state and storage and how distributed persistent storage can logically extend the compute flexibility provided by Kubeflow.
11:00am - 11:40am Thursday, March 28, 2019
Location: 2001
Secondary topics:  Data preparation, data governance, and data lineage, Transportation and Logistics
Mark Grover (Lyft), Tao Feng (Lyft)
In this talk, we'll discuss how Lyft has reduced the time taken to discover data by 10x by building its own data portal, Amundsen. We will give a demo of Amundsen, take a deep dive into its architecture, and discuss how it leverages centralized metadata, PageRank, and a comprehensive data graph to achieve its goal. We will close with the future roadmap, unsolved problems, and the collaboration model.
11:00am - 11:40am Thursday, March 28, 2019
Location: 2002
Secondary topics:  Data Platforms, Data preparation, data governance, and data lineage, Financial Services
Subha Tatavarti (PayPal), Vadim Kutsyy (PayPal)
PayPal's data ecosystem is fairly large, with 250+ PB of data, transacting in 200+ countries. Given this massive scale and complexity, discovering and accessing the right data sets in a frictionless environment is a massive challenge. PayPal’s Data Platform team is helping solve this problem holistically with a combination of self-service, integrated, and interoperable products.
11:00am - 11:40am Thursday, March 28, 2019
Location: 2004
Secondary topics:  Streaming and realtime analytics
Kamil Bajda-Pawlikowski (Starburst), Martin Traverso (Facebook)
Presto, an open source distributed SQL engine, is designed for interactive queries and the ability to query multiple data sources. With the ever-growing list of connectors (e.g., Apache Kudu, Pulsar, Netflix Iceberg, Elasticsearch), the recently introduced cost-based optimizer in Presto must account for heterogeneous data sources with incomplete statistics and for new use cases such as geospatial analytics.
11:00am - 11:40am Thursday, March 28, 2019
Location: 2006
Secondary topics:  Data Platforms, Media, Marketing, Advertising, Streaming and realtime analytics
Sijie Guo (Streamlio), 李鹏辉 (ZhaoPin)
Using a messaging system to build an event bus is very common. However, certain use cases demand a messaging system with a certain set of features. This talk will focus on the event bus requirements for Zhaopin.com, one of the biggest Chinese online recruitment service providers, and why they chose Apache Pulsar.
11:00am - 11:40am Thursday, March 28, 2019
Location: 2008
Secondary topics:  Data Platforms, Retail and e-commerce
Zhen Fan (JD.com)
JD.com has designed a brand-new architecture to optimize its Spark computing clusters. We will show the problems we faced before and how we now benefit from the in-memory distributed filesystem.
11:00am - 11:40am Thursday, March 28, 2019
Location: 2024
Secondary topics:  AI and Data technologies in the cloud, Security and Privacy
Thomas Phelan (BlueData)
Recent headline-grabbing data breaches demonstrate that protecting data is essential for every enterprise. The best-of-breed approach for Big Data is HDFS configured with Transparent Data Encryption (TDE). TDE is difficult to configure and manage, even more so when run in Docker containers. This session will discuss these challenges and how to overcome them.
11:50am - 12:30pm Thursday, March 28, 2019
Location: 2001
Secondary topics:  AI and Data technologies in the cloud, Data Platforms
Jason Wang (Cloudera), Tony Wu (Cloudera), Suraj Acharya (Cloudera)
We start with a general overview of cloud paradigms and cloud architectures for big data platforms (focusing on AWS and Azure); then we give an actionable understanding of cloud architecture with a dive into core cloud concepts: compute and virtual machine architectures, cloud storage, authentication and authorization, encryption, security and security best practices, and user management.
11:50am - 12:30pm Thursday, March 28, 2019
Location: 2002
Secondary topics:  Data Platforms, Retail and e-commerce
Learn about how a small team in Tokyo went through several evolutions as they built an analytics service to help 200+ businesses accelerate their decision-making process. This presentation will cover the background, challenges, architecture, success stories, and best practices as they built and productionalized Rakuten Analytics.
11:50am - 12:30pm Thursday, March 28, 2019
Location: 2004
Secondary topics:  Data Integration and Data Pipelines, Streaming and realtime analytics
Fabian Hueske (data Artisans)
Processing streaming data with SQL is gaining a lot of attention. In this talk, Fabian Hueske explains why SQL queries on streams should have the same semantics as SQL queries on static data. Moreover, Fabian will present a selection of common use cases and demonstrate how easily they can be addressed by Flink SQL.
11:50am - 12:30pm Thursday, March 28, 2019
Location: 2006
Secondary topics:  Data Platforms, Media, Marketing, Advertising, Streaming and realtime analytics
Vivek Pasari (Netflix)
Netflix has over 125 million members spread across 191 countries. Each day our members interact with our client applications on 250 million+ devices under highly variable network conditions. These interactions result in over 200 billion daily data points. In this session, we will highlight the data engineering and architecture that enable application performance measurement at this scale.
11:50am - 12:30pm Thursday, March 28, 2019
Location: 2008
Secondary topics:  Deep Learning, Media, Marketing, Advertising
Alex Poms (Stanford University), Will Crichton (Stanford University)
Systems like Spark made it possible to process big numerical/textual data on hundreds of machines. Today, the majority of data in the world is video. Scanner is the first open-source distributed system for building large-scale video processing applications. Scanner is being used at Stanford for analyzing TBs of film with deep learning on GCP, and at Facebook for synthesizing VR video on AWS.
11:50am - 12:30pm Thursday, March 28, 2019
Location: 2024
Secondary topics:  Automation in data science and big data, Deep Learning
Alkis Simitsis (Micro Focus), Shivnath Babu (Unravel Data Systems, Duke University)
This talk describes an automated technique for root cause analysis (RCA) for big data stack applications using deep learning techniques. Spark and Impala will be used as examples, but the concepts generalize to the big data stack.
1:50pm - 2:30pm Thursday, March 28, 2019
Location: 2001
Secondary topics:  Data Integration and Data Pipelines, Financial Services
How efficient is your data platform? The single metric we use is Time-to-Reliable-Insights: the total time spent to ingest, transform, catalog, analyze, and publish. There are three elephants in the room when it comes to Time-to-Reliable-Insights: time-to-discover, time-to-catalog, and time-to-debug for data quality. This talk covers three design patterns and/or frameworks we have implemented.
1:50pm - 2:30pm Thursday, March 28, 2019
Location: 2002
Jacques Nadeau (Dremio)
Apache Arrow Flight is a new initiative focused on providing high performance communication within data engineering and data science infrastructure. This talk will discuss how Flight works and where it has been integrated. We’ll also discuss how Flight can be used to abstract physical data management from logical access. We’ll then share benchmarks of workloads that have been improved by Flight.
1:50pm - 2:30pm Thursday, March 28, 2019
Location: 2004
Secondary topics:  Streaming and realtime analytics
Haifeng Chen (Intel)
Spark SQL is widely used today. However, it still suffers from stability and performance challenges in highly dynamic environments with large-scale data. To address these challenges, we introduced the Spark adaptive execution engine, which can handle task parallelism, join conversion, and data skew dynamically at run time, guaranteeing the best plan is chosen using run-time statistics.
1:50pm - 2:30pm Thursday, March 28, 2019
Location: 2006
Secondary topics:  Streaming and realtime analytics
Michael Freedman (Timescale)
In this talk, I focus on two newly-released features of TimescaleDB (automated adaptation of time-partitioning intervals and continuous aggregations in near-real-time), and discuss how these capabilities ease time-series data management. I discuss how these capabilities have been leveraged across several different use cases, including in use with other technologies such as Kafka.
1:50pm - 2:30pm Thursday, March 28, 2019
Location: 2008
Secondary topics:  Automation in data science and big data, Streaming and realtime analytics
Arun Kumar (University of California, San Diego)
This talk presents a couple of recent techniques from research to accelerate ML over data that is the output of joins of multiple tables. Using ideas from query optimization and learning theory, we show how to avoid joins before ML to reduce runtimes and memory/storage footprints. Open source software prototypes and sample ML code in both R and Python will also be shown.
2:40pm - 3:20pm Thursday, March 28, 2019
Location: 2001
Xiao Li (Databricks), Wenchen Fan (Databricks)
This talk will provide an overview of the major features and enhancements in the Apache Spark 2.4 release and upcoming releases, followed by a Q&A session.
2:40pm - 3:20pm Thursday, March 28, 2019
Location: 2002
Secondary topics:  Data Platforms, Data preparation, data governance, and data lineage
Rohan Dhupelia (Atlassian), Jimmy Li (Atlassian)
Analytics is easy; good analytics is hard. Here at Atlassian we know this all too well from our push to become a truly data-driven organisation. To achieve this, we've transformed the way we think about behavioural analytics, from how we define our events all the way to how we ingest and analyse them.
2:40pm - 3:20pm Thursday, March 28, 2019
Location: 2004
Secondary topics:  AI and Data technologies in the cloud
Alan Choi (Cloudera), Eva Andreasson (Cloudera), Mark Brine (Cloudera)
In this talk, you will learn how Cloudera’s Finance Department used a hybrid model to speed up report delivery and reduce the cost of end-of-quarter reporting. Drawing on our experience, we share guidelines for deploying modern data warehousing in a hybrid cloud environment: When should you choose private vs. public cloud services? What options are there? What are the dos and don’ts?
2:40pm - 3:20pm Thursday, March 28, 2019
Location: 2006
Secondary topics:  Streaming and realtime analytics
Akshai Sarma (Yahoo), Michael Natkovich (Yahoo)
Bullet is a scalable, pluggable, lightweight, multi-tenant system for querying any data flowing through a streaming system without storing it. Bullet queries are submitted first and operate on data flowing through the system from the point of submission. Bullet efficiently supports otherwise intractable operations like Top K, Count Distinct, and windowing without any storage by using sketch-based algorithms.
2:40pm - 3:20pm Thursday, March 28, 2019
Location: 2008
Secondary topics:  AI and Data technologies in the cloud
Paul Curtis (MapR Technologies)
Just like almost everybody, we needed a way for ordinary users to stand up applications on top of Kubernetes, but we had additional requirements, and we had to do it without breaking the bank. Our field sales engineering force of sixty engineers around the globe can now spin our technology up and down quickly and simply using Kubernetes, the cloud, and shared data storage.
2:40pm - 3:20pm Thursday, March 28, 2019
Location: 2024
Secondary topics:  Automation in data science and big data, Data Integration and Data Pipelines, Financial Services, Streaming and realtime analytics
Cory Watson (Stripe)
Learn how Stripe uses data sketching and off-the-shelf parts to build a novel observability pipeline that unifies measurements across its infrastructure, both improving reliability and keeping vendor costs down.
3:50pm - 4:30pm Thursday, March 28, 2019
Location: 2001
Secondary topics:  Data Integration and Data Pipelines, Data Platforms
Li Gao (Lyft Inc.), Bill Graham (Lyft Inc.)
In this talk, Li Gao and Bill Graham discuss the challenges the Lyft team faced and the solutions they developed to support Apache Spark on Kubernetes in production and at scale.
3:50pm - 4:30pm Thursday, March 28, 2019
Location: 2004
Secondary topics:  Data Platforms
Adrian Lungu (Adobe), Serban Teodorescu (Adobe)
Inspired by the Green/Blue deployment technique, the Adobe Audience Manager team developed an Active/Passive database migration procedure that allows us to test our database clusters in production, minimising risk without compromising innovation. We successfully applied this approach twice to upgrade the entire technology stack, but it was never a smooth move.
3:50pm - 4:30pm Thursday, March 28, 2019
Location: 2006
Secondary topics:  AI and machine learning in the enterprise, Data Platforms, Security and Privacy
Václav Surovec (T-Mobile Czech Republic)
Knowledge of customers' location and travel patterns is important for many companies, among them the German telecom operator Deutsche Telekom. The Commercial Roaming project, built on Cloudera Hadoop, helped the company better analyze the behavior of its customers from 13 countries in a very secure way, so that it can provide better predictions and visualizations for senior management.
3:50pm - 4:30pm Thursday, March 28, 2019
Location: 2008
Secondary topics:  Streaming and realtime analytics
Yuan Zhou (Intel), Haodong Tang (Intel), Jian Zhang (Intel)
We introduce Spark-PMOF and explain how it improves Spark analytics performance.
3:50pm - 4:30pm Thursday, March 28, 2019
Location: 2024
Secondary topics:  Data Platforms, Media, Marketing, Advertising, Security and Privacy
John Bennett (Netflix), Siamac Mirzaie (Netflix)
Data has become a foundational pillar for security teams operating in organizations of all shapes and sizes. This new norm has created a need for platforms that enable engineers to harness data for various security purposes. This talk introduces our internal platform aimed at quickly deploying data-based detection capabilities in the Netflix corporate environment.
4:40pm - 5:20pm Thursday, March 28, 2019
Location: 2001
Secondary topics:  Automation in data science and big data
Holden Karau (Google), Rachel Warren (Salesforce Einstein)
Apache Spark is an amazing distributed system, but part of the bargain we've all made with the infrastructure demons involves providing the correct set of magic numbers (aka tuning) or our jobs may be eaten by Cthulhu. This talk looks at auto-tuning jobs from historical and static job information, using systems like Mahout and internal Spark ML jobs as workloads, and covers new settings in 2.4.
4:40pm - 5:20pm Thursday, March 28, 2019
Location: 2002
Secondary topics:  Data Integration and Data Pipelines, Data preparation, data governance, and data lineage, Media, Marketing, Advertising
Sonali Sharma (Netflix), Shriya Arora (Netflix)
With so much data being generated in real time, what if we could combine all these high-volume data streams as they arrive to provide near-real-time feedback for model training and improve personalization and recommendations, taking the customer experience of the product to a whole new level? It is possible to tame large-state joins for exactly that purpose using Flink's keyed state.
4:40pm - 5:20pm Thursday, March 28, 2019
Location: 2004
Secondary topics:  AI and Data technologies in the cloud, Streaming and realtime analytics
Igor Canadi (Rockset), Dhruba Borthakur (Rockset)
Most existing big data systems prefer sequential scans for processing queries. We challenge this view and present converged indexing: a single system called ROCKSET that builds inverted, columnar, and document indices. Converged indexing is economically feasible thanks to the elasticity of cloud resources and write-optimized storage engines.
4:40pm - 5:20pm Thursday, March 28, 2019
Location: 2006
Secondary topics:  AI and Data technologies in the cloud
Jinchul Kim (SK Telecom)
Druid supports autoscaling for data ingestion, but the feature is only available on AWS EC2, so we cannot rely on it in our private cloud. In this talk, we introduce automatic scale-out and scale-in on Kubernetes. We show the benefits of our approach and where they come from, and share our work on a Druid Helm chart, rolling updates, and custom metrics for horizontal autoscaling.
4:40pm - 5:20pm Thursday, March 28, 2019
Location: 2008
Secondary topics:  Data Integration and Data Pipelines, Streaming and realtime analytics
Patrick Stuedi (IBM Research)
Modern networking and storage technologies like RDMA and NVMe are finding their way into the data center. Apache Crail (incubating) is a new project that facilitates running data processing workloads (ML, SQL, etc.) on such hardware. In this talk, I will present Apache Crail, what it does, and how workloads based on TensorFlow or Spark can benefit from it.
4:40pm - 5:20pm Thursday, March 28, 2019
Location: 2024
Secondary topics:  Data Integration and Data Pipelines, Security and Privacy, Streaming and realtime analytics
Julien Delange (Twitter), Neng Lu (Twitter)
This presentation shows how Twitter uses the Heron data processing engine to monitor and analyze its network infrastructure. Within two months, infrastructure engineers implemented a new data pipeline that ingests multiple sources and processes about 1 billion tuples to detect network issues and generate usage statistics. The talk focuses on the key technologies used, the architecture, and the challenges.