Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Schedule: Data engineering and architecture sessions

11:00am–11:40am Wednesday, March 15, 2017
Location: 230 A Level: Beginner
Secondary topics:  Architecture, Data Platform, Streaming
Felix Gorodishter (GoDaddy)
Average rating: 4.25 (4 ratings)
GoDaddy ingests and analyzes 100,000 events per second (EPS) of logs, metrics, and events. Felix Gorodishter shares GoDaddy's big data journey and explains how the company makes sense of 10+-TB-per-day growth to gain operational insight into its cloud using Kafka, Hadoop, Spark, Pig, Hive, Cassandra, and Elasticsearch.
11:50am–12:30pm Wednesday, March 15, 2017
Location: LL20 A Level: Intermediate
Secondary topics:  Architecture, Data Platform, Media
Christopher Colburn (Netflix), Monal Daxini (Netflix)
Average rating: 4.00 (3 ratings)
In the past, typical real-time data processing was reserved for answering operational questions and very basic analytical questions, but with better processing frameworks and more-capable hardware, the streaming context can now enable personalization applications. Christopher Colburn and Monal Daxini explore the challenges faced when building a streaming application at scale at Netflix.
11:50am–12:30pm Wednesday, March 15, 2017
Location: 210 A/E
Secondary topics:  Architecture, Cloud
Andrei Savu (Cloudera), Jennifer Wu (Cloudera)
Average rating: 3.00 (3 ratings)
Cloud infrastructure, with a scalable data store and elastic compute, is particularly well suited for large-scale data engineering workloads. Andrei Savu and Jennifer Wu explore the latest cloud technologies and outline cost, security, and ease-of-use considerations for data engineers.
1:50pm–2:30pm Wednesday, March 15, 2017
Location: LL20 A Level: Intermediate
Secondary topics:  Architecture, Data Platform, Financial services
Average rating: 5.00 (2 ratings)
Data warehouses are critical in driving business decisions, with SQL predominantly used to build ETL pipelines. While the technology has shifted from RDBMS-centric data warehouses to data pipelines based on Hadoop and MPP databases, engineering and quality processes have not kept pace. Avinash Padmanabhan highlights the changes that Intuit's team made to improve processes and data quality.
1:50pm–2:30pm Wednesday, March 15, 2017
Location: LL20 C Level: Intermediate
Secondary topics:  Streaming
Ryan Pridgeon (Confluent), Dustin Cote (Confluent)
Average rating: 4.67 (3 ratings)
Dustin Cote and Ryan Pridgeon share their experience troubleshooting Apache Kafka in production environments and discuss how to avoid pitfalls like message loss or performance degradation in your environment.
2:40pm–3:20pm Wednesday, March 15, 2017
Location: LL20 A Level: Intermediate
Secondary topics:  Architecture, Media, Streaming
Kartik Paramasivam (LinkedIn)
Average rating: 5.00 (2 ratings)
LinkedIn has one of the largest Kafka installations in the world, ingesting more than a trillion messages per day. Apache Samza-based stream processing applications process this deluge of data. Kartik Paramasivam discusses key improvements and architectural patterns that LinkedIn has adopted in its data systems in order to process millions of requests per second while keeping costs under control.
2:40pm–3:20pm Wednesday, March 15, 2017
Location: LL20 C Level: Intermediate
Secondary topics:  Media, Streaming
Michael Edwards shares experiences from operating several Kafka clusters in a real-time streaming event ingestion pathway. He'll discuss the lessons learned from working with hundreds of terabytes flowing through every day, petabytes of retention, and gigabytes of historical data streaming to and from storage.
2:40pm–3:20pm Wednesday, March 15, 2017
Location: LL21 E/F Level: Beginner
Secondary topics:  Architecture, Cloud
Paige Liu (Microsoft), John Zhuge (Netflix)
Paige Liu and John Zhuge explore the options and trade-offs to consider when building a Cloudera cluster on Microsoft Azure. They explain how to deploy and scale a Cloudera cluster on Azure and how to connect it with other Azure services to build enterprise-grade, end-to-end big data solutions.
2:40pm–3:20pm Wednesday, March 15, 2017
Location: 210 A/E Level: Intermediate
Secondary topics:  Cloud
Shubham Tagra (Qubole)
Shubham Tagra offers an introduction to RubiX, a lightweight, cross-engine caching solution that works well with optimized columnar formats by caching only the required amount of data. RubiX can be used with any data analytics engine that reads data from remote sources via the Hadoop FileSystem interface without any changes to the source code of those engines.
4:20pm–5:00pm Wednesday, March 15, 2017
Location: LL20 C Level: Intermediate
Secondary topics:  Data Platform, Financial services, Streaming
Kevin Mao (Capital One)
Average rating: 4.67 (3 ratings)
Kevin Mao explores the value of and challenges associated with collecting raw security event data from disparate corners of enterprise infrastructure and transforming it into high-quality intelligence that can be used to forecast, detect, and mitigate cybersecurity threats.
5:10pm–5:50pm Wednesday, March 15, 2017
Location: LL20 A Level: Advanced
Secondary topics:  Architecture, Media, Platform, Streaming
Monal Daxini (Netflix)
Average rating: 4.50 (2 ratings)
Netflix Keystone processes over a trillion events per day with at-least-once processing semantics in the cloud. Monal Daxini explores what it means to offer stream processing as a service (SPaaS), how Netflix implemented a scalable, fault-tolerant multitenant SPaaS internal offering, and how it evolved the system in flight with no downtime.
11:00am–11:40am Thursday, March 16, 2017
Location: LL20 A Level: Intermediate
Secondary topics:  Data Platform
Tony Xing (Microsoft)
Average rating: 3.00 (2 ratings)
Tony Xing offers an overview of Microsoft's common anomaly detection platform, an API service built internally to give product teams the flexibility to plug in any anomaly detection algorithm that fits their own signal types.
11:50am–12:30pm Thursday, March 16, 2017
Location: LL20 A Level: Intermediate
Secondary topics:  Architecture, Media, Platform
Kurt Brown (Netflix)
Average rating: 4.90 (10 ratings)
The Netflix data platform is constantly evolving, but fundamentally it's an all-cloud platform at a massive scale (40+ PB and over 700 billion new events per day) focused on empowering developers. Kurt Brown dives into the current technology landscape at Netflix and offers some thoughts on what the future holds.
1:50pm–2:30pm Thursday, March 16, 2017
Location: LL20 D Level: Advanced
Secondary topics:  Architecture
Julien Le Dem (WeWork), Jacques Nadeau (Dremio)
Average rating: 4.00 (2 ratings)
In pursuit of speed, big data is evolving toward columnar execution. The solid foundation laid by Arrow and Parquet for a shared columnar representation across the ecosystem promises a great future. Julien Le Dem and Jacques Nadeau discuss the future of columnar and the hardware trends it takes advantage of, such as RDMA, SSDs, and nonvolatile memory.
1:50pm–2:30pm Thursday, March 16, 2017
Location: LL21 B Level: Beginner
Shirshanka Das (LinkedIn), Yael Garten (LinkedIn)
Average rating: 4.75 (4 ratings)
Shirshanka Das and Yael Garten share best practices learned using Kafka and Hadoop as the foundation of a petabyte-scale data warehouse at LinkedIn, offering concrete suggestions to help you process data seamlessly. Along the way, Shirshanka and Yael discuss their experience running governance to empower data teams.
2:40pm–3:20pm Thursday, March 16, 2017
Location: LL20 A Level: Intermediate
Secondary topics:  Architecture, Streaming
Gwen Shapira (Confluent)
Average rating: 5.00 (3 ratings)
There are many good reasons to run more than one Kafka cluster... and a few bad reasons too. Great architectures are driven by use cases, and multicluster deployments are no exception. Gwen Shapira offers an overview of several use cases, including real-time analytics and payment processing, that may require multicluster solutions, to help you choose the right architecture for your needs.
2:40pm–3:20pm Thursday, March 16, 2017
Location: LL20 C Level: Beginner
Teresa Tung (Accenture), Jürgen Weichenberger (Accenture Analytics), Ishmeet Grewal (Accenture Labs)
Average rating: 3.80 (5 ratings)
As Accenture scaled to millions of predictive models, it needed automation to manage models at scale, ensure accuracy, prevent false alarms, and preserve trust as models are created, tested, and deployed into production. Teresa Tung, Jürgen Weichenberger, and Ishmeet Grewal share their approach to implementing DevOps for models and employing a self-healing approach to model lifecycle management.
2:40pm–3:20pm Thursday, March 16, 2017
Location: 230 C Level: Beginner
Secondary topics:  Architecture, Data Platform, ecommerce
Gleicon Moraes (luc.id), Arthur Grava (Luizalabs)
Average rating: 4.00 (3 ratings)
Gleicon Moraes and Arthur Grava share war stories about developing and deploying a cloud-based large-scale recommender system for a top-three Brazilian ecommerce company. The system, which uses Cassandra and graph traversal, led to a more than 15% increase in sales.
4:20pm–5:00pm Thursday, March 16, 2017
Location: LL20 A Level: Intermediate
Secondary topics:  Architecture
Nischal HP (Unnati Data Labs), Raghotham Sripadraj (Ericsson)
Average rating: 4.67 (3 ratings)
Not all data science problems are big data problems. Lots of small and medium product companies want to start their journey to become data driven. Nischal HP and Raghotham Sripadraj share their experience building data science platforms for various enterprises, with an emphasis on making the right architecture choices and using distributed and fault-tolerant tools.
4:20pm–5:00pm Thursday, March 16, 2017
Location: LL20 D
Secondary topics:  Deep learning
Shivnath Babu (Duke University | Unravel Data Systems)
Average rating: 5.00 (1 rating)
Shivnath Babu offers an introduction to using deep learning to solve complex problems in IT operations analytics. Shivnath focuses on how deep learning can derive operations insights automatically for the complex big data application stack composed of systems such as Hadoop, Spark, Cassandra, Elasticsearch, and Impala, using examples of open source tools for deep learning.