Presented By
O’Reilly + Cloudera
Make Data Work
March 25-28, 2019
San Francisco, CA

Schedule: Data Integration and Data Pipelines sessions

Machine learning applications rely on data. The first step is to bring together existing data sources and, when appropriate, enrich them with other data sets. In most cases, data needs to be refined and prepared before it’s ready for analytic applications. This series of talks showcases some modern approaches to data integration and the creation and maintenance of data pipelines.

9:00am-12:30pm Tuesday, March 26, 2019
Boris Lublinsky (Lightbend), Dean Wampler (Anyscale)
Average rating: ***..
(3.85, 13 ratings)
Boris Lublinsky and Dean Wampler walk you through using ML in streaming data pipelines and doing periodic model retraining and low-latency scoring in live streams. You'll explore using Kafka as a data backplane, the pros and cons of microservices versus systems like Spark and Flink, tips for TensorFlow and SparkML, performance considerations, model metadata tracking, and other techniques.
1:30pm-5:00pm Tuesday, March 26, 2019
Arun Kejariwal (Independent), Karthik Ramasamy (Streamlio)
Average rating: **...
(2.67, 12 ratings)
Many industry segments have been grappling with fast data (high-volume, high-velocity data). Arun Kejariwal and Karthik Ramasamy walk you through the state-of-the-art systems for each stage of an end-to-end data processing pipeline—messaging, compute, and storage—for real-time data and algorithms to extract insights (e.g., heavy hitters and quantiles) from data streams.
11:00am-11:40am Wednesday, March 27, 2019
Osman Sarood (Mist Systems), Chunky Gupta (Mist Systems)
Average rating: ****.
(4.67, 3 ratings)
Osman Sarood and Chunky Gupta discuss Mist’s real-time data pipeline, focusing on Live Aggregators (LA)—a highly reliable and scalable in-house real-time aggregation system that can autoscale for sudden changes in load. LA is 80% cheaper than competing streaming solutions due to running over AWS Spot Instances and having 70% CPU utilization.
11:00am-11:40am Wednesday, March 27, 2019
Jitender Aswani (Netflix), Di Lin (Netflix), Girish Lingappa (Netflix)
Average rating: ***..
(3.40, 15 ratings)
Hundreds of thousands of ETL pipelines ingest over a trillion events daily to populate millions of data tables downstream at Netflix. Jitender Aswani, Girish Lingappa, and Di Lin discuss Netflix’s internal data lineage service, which was essential for enhancing the platform’s reliability, increasing trust in data, and improving data infrastructure efficiency.
11:50am-12:30pm Wednesday, March 27, 2019
Average rating: ****.
(4.60, 5 ratings)
In a large global health services company, streaming data for processing and sharing comes with its own challenges. Data science and analytics platforms need data fast, from relevant sources, to act on this data quickly and share the insights with consumers with the same speed and urgency. Join Mohammad Quraishi to learn why streaming data architectures are a necessity—Kafka and Hadoop are key.
11:50am-12:30pm Wednesday, March 27, 2019
Sandeep Uttamchandani (Intuit)
Average rating: ****.
(4.57, 7 ratings)
How efficient is your data platform? The single metric Intuit uses is time to reliable insights: the total of time spent to ingest, transform, catalog, analyze, and publish. Sandeep Uttamchandani shares three design patterns/frameworks Intuit implemented to deal with three challenges to determining time to reliable insights: time to discover, time to catalog, and time to debug for data quality.
2:40pm-3:20pm Wednesday, March 27, 2019
Yaron Haviv (iguazio)
Average rating: ****.
(4.00, 2 ratings)
Faced with the need to handle increasing volumes of data, alternative datasets ("alt data"), and AI, many enterprises are working to design or redesign their big data architectures, but traditional batch platforms fail to generate sufficient ROI. Yaron Haviv shares a continuous analytics approach that yields faster answers for the business while remaining simpler and less expensive for IT.
2:40pm-3:20pm Wednesday, March 27, 2019
James Taylor (Lyft)
Average rating: ***..
(3.56, 9 ratings)
James Taylor offers an overview of an automated feedback loop at Lyft to adapt ETL based on the aggregate cost of queries run across the cluster. He also discusses future work to enhance the system through the use of materialized views to reduce the number of ad hoc joins and sorting performed by the most expensive queries by transparently rewriting queries when possible.
4:20pm-5:00pm Wednesday, March 27, 2019
Julien Le Dem (WeWork)
Average rating: ****.
(4.83, 6 ratings)
Big data infrastructure has evolved from flat files in a distributed filesystem to an efficient ecosystem to a fully deconstructed and open source database with reusable components. Julien Le Dem discusses the key open source components of the big data ecosystem and explains how they relate to each other and how they make the ecosystem more of a database and less of a filesystem.
4:20pm-5:00pm Wednesday, March 27, 2019
Alex Kira (Uber)
Average rating: ****.
(4.00, 13 ratings)
Uber operates at scale, with thousands of microservices serving millions of rides a day, leading to 100+ PB of data. Alex Kira details Uber's journey toward a unified and scalable data workflow system used to manage this data and shares the challenges faced and how the company has rearchitected the system to make it highly available and horizontally scalable.
4:20pm-5:00pm Wednesday, March 27, 2019
Jowanza Joseph (Pluralsight), Karthik Ramasamy (Streamlio)
Average rating: ****.
(4.00, 1 rating)
After two years of running streaming pipelines through Kinesis and Spark at One Click Retail, Jowanza Joseph and Karthik Ramasamy decided to explore a new platform that would take advantage of Kubernetes and support a simpler data processing DSL. Join in to discover why they chose Apache Pulsar and learn tips and tricks for using Pulsar Functions.
5:10pm-5:50pm Wednesday, March 27, 2019
Rustem Feyzkhanov (Instrumental)
Average rating: ***..
(3.50, 8 ratings)
Serverless implementation of core processing is quickly becoming a production-ready solution. However, companies with existing processing pipelines may find it hard to go completely serverless. Serverless workflows unite the serverless and cluster worlds, with the benefits of both approaches. Rustem Feyzkhanov demonstrates how serverless workflows change your perception of software architecture.
5:10pm-5:50pm Wednesday, March 27, 2019
Gwen Shapira (Confluent)
Average rating: ****.
(4.64, 11 ratings)
As microservices, data services, and serverless APIs proliferate, data engineers need to collect and standardize data in an increasingly complex and diverse system. Gwen Shapira discusses how data engineering requirements have changed in a cloud native world and shares architectural patterns that are commonly used to build flexible, scalable, and reliable data pipelines.
11:00am-11:40am Thursday, March 28, 2019
Eric Jonas (UC Berkeley)
Average rating: ****.
(4.50, 2 ratings)
Eric Jonas offers a quick history of cloud computing, including an accounting of the predictions of the 2009 "Berkeley View of Cloud Computing" paper, explains the motivation for serverless computing, describes applications that stretch the current limits of serverless, and then lists obstacles and research opportunities required for serverless computing to fulfill its full potential.
11:50am-12:30pm Thursday, March 28, 2019
Fabian Hueske (Ververica)
Average rating: ****.
(4.30, 10 ratings)
Processing streaming data with SQL is becoming increasingly popular. Fabian Hueske explains why SQL queries on streams should have the same semantics as SQL queries on static data. He then shares a selection of common use cases and demonstrates how easily they can be addressed with Flink SQL.
11:50am-12:30pm Thursday, March 28, 2019
Avner Braverman (Binaris)
Average rating: ****.
(4.00, 3 ratings)
What is serverless, and how can it be utilized for data analysis and AI? Avner Braverman outlines the benefits and limitations of serverless with respect to data transformation (ETL), AI inference and training, and real-time streaming. This is a technical talk, so expect demos and code.
3:50pm-4:30pm Thursday, March 28, 2019
Li Gao (Lyft), Bill Graham (Lyft)
Average rating: ****.
(4.00, 2 ratings)
Li Gao and Bill Graham discuss the challenges the Lyft team faced and the solutions they developed to support Apache Spark on Kubernetes in production and at scale.
3:50pm-4:30pm Thursday, March 28, 2019
Julien Delange (Twitter), Neng Lu (Twitter)
Average rating: **...
(2.67, 3 ratings)
Julien Delange and Neng Lu explain how Twitter uses the Heron stream processing engine to monitor and analyze its network infrastructure—implementing a new data pipeline that ingests multiple sources and processes about 1 billion tuples to detect network issues and generate usage statistics. Join in to learn the key technologies used, the architecture, and the challenges Twitter faced.
4:40pm-5:20pm Thursday, March 28, 2019
Patrick Stuedi (IBM Research)
Average rating: ****.
(4.00, 1 rating)
Modern networking and storage technologies like RDMA or NVMe are finding their way into the data center. Patrick Stuedi offers an overview of Apache Crail (incubating), a new project that facilitates running data processing workloads (ML, SQL, etc.) on such hardware. Patrick explains what Crail does and how it benefits workloads based on TensorFlow or Spark.
4:40pm-5:20pm Thursday, March 28, 2019
Sonali Sharma (Netflix), Shriya Arora (Netflix)
Average rating: ***..
(3.00, 2 ratings)
With so much data being generated in real time, what if we could combine all these high-volume data streams and provide near real-time feedback for model training, improving personalization and recommendations and taking the customer experience to a whole new level? Sonali Sharma and Shriya Arora explain how to do exactly that, using Flink's keyed state.