Presented By O’Reilly and Cloudera
Make Data Work
September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

Schedule: Data Platforms sessions

Over the last few years, many companies have begun rolling out data platforms for business intelligence and business analytics. More recently, companies have started to expand toward platforms that can support growing teams of data scientists. Common features of modern data science platforms include support for notebooks and open source machine learning libraries, project management (collaboration and reproducibility), and model visualization.

9:00am–12:30pm Tuesday, 09/11/2018
Location: 1E 12/13 Level: Intermediate
Arun Kejariwal (Independent), Karthik Ramasamy (Streamlio)
Average rating: ***..
(3.12, 8 ratings)
Arun Kejariwal and Karthik Ramasamy lead a journey through the landscape of state-of-the-art systems for each stage of an end-to-end data processing pipeline, covering messaging frameworks, streaming computing frameworks, storage frameworks for real-time data, and more. They also share case studies from IoT, gaming, and healthcare and their experience operating these systems at internet scale. Read more.
9:00am–12:30pm Tuesday, 09/11/2018
Location: 1A 06/07 Level: Intermediate
Mark Madsen (Think Big Analytics), Todd Walter (Teradata)
Average rating: ***..
(3.50, 10 ratings)
Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that is not subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure. Read more.
9:00am–5:00pm Tuesday, 09/11/2018
Location: 1E 10
Paco Nathan (derwen.ai), Katharina Warzel (EveryMundo), Mike Berger (Mount Sinai Health System), Sam Helmich (Deere & Company), Stephanie Fischer (datanizing GmbH), Maryam Jahanshahi (TapRecruit), Greg Quist (SmartCover Systems), Ann Nguyen (Whole Whale), Steve Otto (Navistar), Jennifer Lim (Cerner), Anand S (Gramener), Ian Brooks (Hortonworks)
Hear practical insights from household brands and global companies: the challenges they tackled, approaches they took, and the benefits—and drawbacks—of their solutions. Read more.
1:30pm–5:00pm Tuesday, 09/11/2018
Location: 1A 06/07 Level: Advanced
Ted Malaska (Capital One), Jonathan Seidman (Cloudera)
Average rating: ***..
(3.12, 8 ratings)
Using Customer 360 and the internet of things as examples, Jonathan Seidman and Ted Malaska explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, including components like Kafka, Flink, Kudu, Spark Streaming, and Spark SQL, along with modern storage engines, to enable new forms of data processing and analytics. Read more.
11:20am–12:00pm Wednesday, 09/12/2018
Location: 1E 09 Level: Beginner
Cory Minton (Dell EMC), Colm Moynihan (Cloudera)
Average rating: *****
(5.00, 1 rating)
Cory Minton and Colm Moynihan explain how to choose the right deployment model for on-premises infrastructure to reduce risk, reduce costs, and be more nimble. Read more.
1:15pm–1:55pm Wednesday, 09/12/2018
Location: 1A 10 Level: Intermediate
Ryan Blue (Netflix), Daniel Weeks (Netflix)
Average rating: *****
(5.00, 3 ratings)
In the last few years, Netflix's data warehouse has grown to more than 100 PB in S3. Ryan Blue and Daniel Weeks share lessons learned, the tools Netflix currently uses and those it has retired, and the improvements it is rolling out, including Iceberg, a new table format for S3. Read more.
2:05pm–2:45pm Wednesday, 09/12/2018
Location: 1A 08 Level: Beginner
Atul Kale (Airbnb), Xiaohan Zeng (Airbnb)
Average rating: *****
(5.00, 3 ratings)
Atul Kale and Xiaohan Zeng offer an overview of Bighead, Airbnb's user-friendly and scalable end-to-end machine learning framework that powers Airbnb's data-driven products. Built on Python, Spark, and Kubernetes, Bighead integrates popular libraries like TensorFlow, XGBoost, and PyTorch and is designed to be used in modular pieces. Read more.
2:55pm–3:35pm Wednesday, 09/12/2018
Location: 1A 21/22 Level: Intermediate
Varant Zanoyan (Airbnb)
Average rating: ****.
(4.33, 6 ratings)
Zipline is Airbnb’s soon-to-be open-sourced data management platform specifically designed for ML use cases. It has cut the time needed for feature generation from months to days and offers features to support end-to-end data management for machine learning. Varant Zanoyan covers Zipline's architecture and dives into how it solves ML-specific problems. Read more.
4:35pm–5:15pm Wednesday, 09/12/2018
Location: 1A 21/22 Level: Beginner
Osman Sarood (Mist Systems)
Average rating: **...
(2.00, 1 rating)
Mist consumes several terabytes of telemetry data daily from its globally deployed wireless access points, a significant portion of which is consumed by ML algorithms. Last year, Mist saw 10x infrastructure growth. Osman Sarood explains how Mist runs 75% of its production infrastructure, reliably, on AWS EC2 spot instances, which has brought its annual AWS cost from $3 million to $1 million. Read more.
4:35pm–5:15pm Wednesday, 09/12/2018
Location: Expo Hall Level: Intermediate
Dan Harple (Context Labs)
Dan Harple explains how distributed systems are being influenced by and are influencing operational, financial, and social impact requirements of a wide range of enterprises and how trust in these distributed systems is being challenged, elevated, and resolved by engineers and architects today. Read more.
11:20am–12:00pm Thursday, 09/13/2018
Location: 1A 10 Level: Intermediate
Jonathan Hung (LinkedIn), Keqiu Hu (LinkedIn), Zhe Zhang (LinkedIn)
Jonathan Hung, Keqiu Hu, and Zhe Zhang offer an overview of TensorFlow on YARN (TonY), a framework to natively run TensorFlow on Hadoop. TonY enables running TensorFlow distributed training as a new type of Hadoop application. Its native Hadoop connector, together with other features, aims to run TensorFlow jobs as reliably and flexibly as other first-class citizens on Hadoop. Read more.
11:20am–12:00pm Thursday, 09/13/2018
Location: Expo Hall Level: Intermediate
Michelle Ufford (Netflix)
Average rating: ****.
(4.40, 5 ratings)
Michelle Ufford shares some of the cool things Netflix is doing with data and the big bets the company is making on data infrastructure, covering workflow orchestration, machine learning, interactive notebooks, centralized alerting, event-based processing, platform intelligence, and more. Read more.
1:10pm–1:50pm Thursday, 09/13/2018
Location: 1A 10 Level: Intermediate
Wangda Tan (Hortonworks)
Average rating: ****.
(4.50, 2 ratings)
In order to train deep learning and machine learning models, you must leverage applications such as TensorFlow, MXNet, Caffe, and XGBoost. Wangda Tan discusses new features in Apache Hadoop 3.x to better support deep learning workloads and demonstrates how to run these applications on YARN. Read more.
1:10pm–1:50pm Thursday, 09/13/2018
Location: 1A 15/16 Level: Intermediate
Moty Fania (Intel), Sergei Kom (Intel)
Average rating: *****
(5.00, 1 rating)
Moty Fania and Sergei Kom share their experience and lessons learned implementing an AI inference platform to enable internal visual inspection use cases. The platform is based on open source technologies and was designed for real-time, streaming, and online actuation. Read more.
1:10pm–1:50pm Thursday, 09/13/2018
Location: 1A 21/22 Level: Intermediate
Milene Darnis (Uber)
Average rating: ****.
(4.22, 9 ratings)
Every new launch at Uber is vetted via robust A/B testing. Given the pace at which Uber operates, the metrics needed to assess the impact of experiments constantly evolve. Milene Darnis explains how the team built a scalable and self-serve platform that lets users plug in any metric to analyze. Read more.
2:00pm–2:40pm Thursday, 09/13/2018
Location: 1E 09 Level: Beginner
Tao Huang (JD.com), Mang Zhang (JD.com), Bing Bai (JD.com)
Average rating: ***..
(3.00, 1 rating)
Tao Huang, Mang Zhang, and Bing Bai explain how JD.com uses Alluxio to provide support for ad hoc and real-time stream computing, using Alluxio-compatible HDFS URLs and Alluxio as a pluggable optimization component. To give just one example, one framework, JDPresto, has seen a 10x performance improvement on average. Read more.
2:00pm–2:40pm Thursday, 09/13/2018
Location: 1A 23/24 Level: Beginner
Tim Walpole (BJSS)
Financial service clients demand increased data-driven personalization, faster insight-based decisions, and multichannel real-time access. Tim Walpole details how organizations can deliver real-time, vendor-agnostic, personalized chat services and explores issues around security, privacy, legal sign-off, data compliance, and how the internet of things can be used as a delivery platform. Read more.
2:00pm–2:40pm Thursday, 09/13/2018
Location: 1A 21/22 Level: Intermediate
Occhio Orsini (Aetna)
Occhio Orsini offers an overview of Aetna's Data Fabric platform. Join in to learn the needs and desires that led to the creation of the advanced analytics platform, explore the platform's architecture, technology, and capabilities, and understand the key choices that made it possible to build a hybrid solution across on-premises and cloud-hosted data centers. Read more.
3:30pm–4:10pm Thursday, 09/13/2018
Location: 1E 10/11 Level: Non-technical
Francesco Mucio (Zalando SE)
Average rating: ***..
(3.50, 2 ratings)
Francesco Mucio tells the story of how Zalando went from an old-school BI company to an AI-driven company built on a solid data platform. Along the way, he shares what Zalando learned in the process and the challenges that still lie ahead. Read more.
3:30pm–4:10pm Thursday, 09/13/2018
Location: 1E 09 Level: Intermediate
Kevin Lu (PayPal), Maulin Vasavada (PayPal), Na Yang (PayPal)
Average rating: ****.
(4.00, 3 ratings)
PayPal is one of the biggest Kafka users in the industry; it manages and maintains over 40 production Kafka clusters in three geodistributed data centers and supports 400 billion Kafka messages a day. Kevin Lu, Maulin Vasavada, and Na Yang explore the management and monitoring PayPal applies to Kafka, from client-perceived statistics to configuration management, failover, and data loss auditing. Read more.