Presented By
O’Reilly + Cloudera

Make Data Work

March 25-28, 2019
San Francisco, CA

Schedule: Data Integration and Data Pipelines sessions

Machine learning applications rely on data. The first step is to bring together existing data sources and, when appropriate, enrich them with other data sets. In most cases, data needs to be refined and prepared before it’s ready for analytic applications. This series of talks showcases some modern approaches to data integration and the creation and maintenance of data pipelines.

9:00am–12:30pm Tuesday, March 26, 2019

Hands-on machine learning with Kafka-based streaming pipelines

Data Engineering & Architecture, Streaming and IoT
Location: 2007

Boris Lublinsky (Lightbend), Dean Wampler (Anyscale)

Average rating:

(3.85, 13 ratings)

Boris Lublinsky and Dean Wampler walk you through using ML in streaming data pipelines, including periodic model retraining and low-latency scoring in live streams. You'll explore using Kafka as a data backplane, the pros and cons of microservices versus systems like Spark and Flink, tips for TensorFlow and SparkML, performance considerations, model metadata tracking, and other techniques.

1:30pm–5:00pm Tuesday, March 26, 2019

Architecture and algorithms for end-to-end streaming data processing

Data Engineering & Architecture
Location: 2005

Arun Kejariwal (Independent), Karthik Ramasamy (Streamlio)

Average rating:

(2.67, 12 ratings)

Many industry segments have been grappling with fast data (high-volume, high-velocity data). Arun Kejariwal and Karthik Ramasamy walk you through state-of-the-art systems for each stage of an end-to-end data processing pipeline (messaging, compute, and storage) for real-time data, as well as algorithms for extracting insights (e.g., heavy hitters and quantiles) from data streams.

11:00am–11:40am Wednesday, March 27, 2019

Live Aggregators: A scalable, cost-effective, and reliable way of aggregating billions of messages in real time

Data Engineering & Architecture
Location: 2006

Osman Sarood (Mist Systems), Chunky Gupta (Mist Systems)

Average rating:

(4.67, 3 ratings)

Osman Sarood and Chunky Gupta discuss Mist’s real-time data pipeline, focusing on Live Aggregators (LA), a highly reliable and scalable in-house real-time aggregation system that can autoscale for sudden changes in load. LA is 80% cheaper than competing streaming solutions because it runs on AWS Spot Instances and achieves 70% CPU utilization.

11:00am–11:40am Wednesday, March 27, 2019

Scaling data lineage at Netflix to improve data infrastructure reliability and efficiency

Data Engineering & Architecture
Location: 2001

Jitender Aswani (Netflix), Di Lin (Netflix), Girish Lingappa (Netflix)

Average rating:

(3.40, 15 ratings)

Hundreds of thousands of ETL pipelines ingest over a trillion events daily to populate millions of data tables downstream at Netflix. Jitender Aswani, Girish Lingappa, and Di Lin discuss Netflix’s internal data lineage service, which was essential for enhancing the platform’s reliability, increasing trust in data, and improving data infrastructure efficiency.

11:50am–12:30pm Wednesday, March 27, 2019

Enabling insights and analytics with data streaming architectures and pipelines using Kafka and Hadoop

Data Engineering & Architecture
Location: 2006

Mohammad Quraishi (Cigna)

Average rating:

(4.60, 5 ratings)

In a large global health services company, streaming data for processing and sharing comes with its own challenges. Data science and analytics platforms need data fast, from relevant sources, so they can act on it quickly and share the insights with consumers with the same speed and urgency. Join Mohammad Quraishi to learn why streaming data architectures are a necessity and why Kafka and Hadoop are key.

11:50am–12:30pm Wednesday, March 27, 2019

How Intuit reduced time to reliable insights for data pipelines

Data Engineering & Architecture
Location: 2001

Sandeep Uttamchandani (Intuit)

Average rating:

(4.57, 7 ratings)

How efficient is your data platform? The single metric Intuit uses is time to reliable insights: the total of time spent to ingest, transform, catalog, analyze, and publish. Sandeep Uttamchandani shares three design patterns/frameworks Intuit implemented to deal with three challenges to determining time to reliable insights: time to discover, time to catalog, and time to debug for data quality.

2:40pm–3:20pm Wednesday, March 27, 2019

Goodbye, data lake: Why continuous analytics yield higher ROI

Data Engineering & Architecture
Location: 2002

Yaron Haviv (iguazio)

Average rating:

(4.00, 2 ratings)

Faced with the need to handle increasing volumes of data, alternative datasets ("alt data"), and AI, many enterprises are working to design or redesign their big data architectures, but traditional batch platforms fail to generate sufficient ROI. Yaron Haviv shares a continuous analytics approach that yields faster answers for the business while remaining simpler and less expensive for IT.

2:40pm–3:20pm Wednesday, March 27, 2019

Adaptive ETL to optimize query performance at Lyft

Data Engineering & Architecture
Location: 2001

James Taylor (Lyft)

Average rating:

(3.56, 9 ratings)

James Taylor offers an overview of an automated feedback loop at Lyft that adapts ETL based on the aggregate cost of queries run across the cluster. He also discusses future work to enhance the system with materialized views, transparently rewriting queries when possible to reduce the number of ad hoc joins and sorts performed by the most expensive queries.

4:20pm–5:00pm Wednesday, March 27, 2019

From flat files to deconstructed databases: The evolution and future of the big data ecosystem

Data Engineering & Architecture
Location: 2004

Julien Le Dem (WeWork)

Average rating:

(4.83, 6 ratings)

Big data infrastructure has evolved from flat files in a distributed filesystem to an efficient ecosystem to a fully deconstructed and open source database with reusable components. Julien Le Dem discusses the key open source components of the big data ecosystem and explains how they relate to each other and how they make the ecosystem more of a database and less of a filesystem.

4:20pm–5:00pm Wednesday, March 27, 2019

Managing Uber's data workflows at scale

Data Engineering & Architecture
Location: 2001

Alex Kira (Uber)

Average rating:

(4.00, 13 ratings)

Uber operates at scale, with thousands of microservices serving millions of rides a day, leading to 100+ PB of data. Alex Kira details Uber's journey toward a unified and scalable data workflow system used to manage this data and shares the challenges faced and how the company has rearchitected the system to make it highly available and horizontally scalable.

4:20pm–5:00pm Wednesday, March 27, 2019

Reducing stream processing complexity using Apache Pulsar Functions

Data Engineering & Architecture
Location: 2002

Jowanza Joseph (Pluralsight), Karthik Ramasamy (Streamlio)

Average rating:

(4.00, 1 rating)

After two years of running streaming pipelines through Kinesis and Spark at One Click Retail, Jowanza Joseph and Karthik Ramasamy decided to explore a new platform that would take advantage of Kubernetes and support a simpler data processing DSL. Join in to discover why they chose Apache Pulsar and learn tips and tricks for using Pulsar Functions.

5:10pm–5:50pm Wednesday, March 27, 2019

Serverless workflows for orchestrating hybrid cluster-based and serverless processing

Data Engineering & Architecture
Location: 2002

Rustem Feyzkhanov (Instrumental)

Average rating:

(3.50, 8 ratings)

Serverless implementation of core processing is quickly becoming a production-ready solution. However, companies with existing processing pipelines may find it hard to go completely serverless. Serverless workflows unite the serverless and cluster worlds, with the benefits of both approaches. Rustem Feyzkhanov demonstrates how serverless workflows change your perception of software architecture.

5:10pm–5:50pm Wednesday, March 27, 2019

Cloud native data pipelines with Apache Kafka

Data Engineering & Architecture
Location: 2001

Gwen Shapira (Confluent)

Average rating:

(4.64, 11 ratings)

As microservices, data services, and serverless APIs proliferate, data engineers need to collect and standardize data in an increasingly complex and diverse system. Gwen Shapira discusses how data engineering requirements have changed in a cloud native world and shares architectural patterns that are commonly used to build flexible, scalable, and reliable data pipelines.

11:00am–11:40am Thursday, March 28, 2019

Cloud programming simplified: A Berkeley view on serverless computing

Data Engineering & Architecture
Location: 2007

Eric Jonas (UC Berkeley)

Average rating:

(4.50, 2 ratings)

Eric Jonas offers a quick history of cloud computing, including an accounting of the predictions of the 2009 "Berkeley View of Cloud Computing" paper, explains the motivation for serverless computing, describes applications that stretch the current limits of serverless, and then lists obstacles to overcome and research opportunities to pursue for serverless computing to reach its full potential.

11:50am–12:30pm Thursday, March 28, 2019

Flink SQL in action

Data Engineering & Architecture, Streaming and IoT
Location: 2004

Fabian Hueske (Ververica)

Average rating:

(4.30, 10 ratings)

Processing streaming data with SQL is becoming increasingly popular. Fabian Hueske explains why SQL queries on streams should have the same semantics as SQL queries on static data. He then shares a selection of common use cases and demonstrates how easily they can be addressed with Flink SQL.

11:50am–12:30pm Thursday, March 28, 2019

Serverless for data and AI

Data Engineering & Architecture, Data Science, Machine Learning & AI, Streaming and IoT
Location: 2007

Avner Braverman (Binaris)

Average rating:

(4.00, 3 ratings)

What is serverless, and how can it be utilized for data analysis and AI? Avner Braverman outlines the benefits and limitations of serverless with respect to data transformation (ETL), AI inference and training, and real-time streaming. This is a technical talk, so expect demos and code.

3:50pm–4:30pm Thursday, March 28, 2019

Scaling Apache Spark on Kubernetes at Lyft

Data Engineering & Architecture
Location: 2001

Li Gao (Lyft), Bill Graham (Lyft)

Average rating:

(4.00, 2 ratings)

Li Gao and Bill Graham discuss the challenges the Lyft team faced and the solutions they developed to support Apache Spark on Kubernetes in production and at scale.

3:50pm–4:30pm Thursday, March 28, 2019

Real-time monitoring of Twitter's network infrastructure with Heron

Data Engineering & Architecture
Location: 2024

Julien Delange (Twitter), Neng Lu (Twitter)

Average rating:

(2.67, 3 ratings)

Julien Delange and Neng Lu explain how Twitter uses the Heron stream processing engine to monitor and analyze its network infrastructure, implementing a new data pipeline that ingests multiple sources and processes about 1 billion tuples to detect network issues and generate usage statistics. Join in to learn the key technologies used, the architecture, and the challenges Twitter faced.

4:40pm–5:20pm Thursday, March 28, 2019

Data processing at the speed of 100 Gbps using Apache Crail

Data Engineering & Architecture
Location: 2008

Patrick Stuedi (IBM Research)

Average rating:

(4.00, 1 rating)

Modern networking and storage technologies like RDMA and NVMe are finding their way into the data center. Patrick Stuedi offers an overview of Apache Crail (incubating), a new project that facilitates running data processing workloads (ML, SQL, etc.) on such hardware. Patrick explains what Crail does and how it benefits workloads based on TensorFlow or Spark.

4:40pm–5:20pm Thursday, March 28, 2019

Taming large state to join datasets for personalization

Data Engineering & Architecture
Location: 2002

Sonali Sharma (Netflix), Shriya Arora (Netflix)

Average rating:

(3.00, 2 ratings)

With so much data being generated in real time, what if we could combine all these high-volume data streams and provide near real-time feedback for model training, improving personalization and recommendations and taking the customer experience to a whole new level? Sonali Sharma and Shriya Arora explain how to do exactly that, using Flink's keyed state.

©2019, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • confreg@oreilly.com