Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Schedule: Architecture sessions

9:00am - 5:00pm Monday, March 13 & Tuesday, March 14
Jesse Anderson (Big Data Institute)
Average rating: ****.
(4.00, 1 rating)
To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data and how will you process that much data? Jesse Anderson explores the latest real-time frameworks (both open source and managed cloud services), discusses the leading cloud providers, and explains how to choose the right one for your company.
9:00am - 12:30pm Tuesday, March 14, 2017
Spark & beyond
Location: 210 D/H Level: Intermediate
John Akred (Silicon Valley Data Science), Stephen O'Sullivan (Data Whisperers)
Average rating: ****.
(4.60, 10 ratings)
What are the essential components of a data platform? John Akred and Stephen O'Sullivan explain how the various parts of the Hadoop, Spark, and big data ecosystems fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.
9:00am - 12:30pm Tuesday, March 14, 2017
Big data and the Cloud
Location: LL21 A Level: Intermediate
Jennifer Wu (Cloudera), Eugene Fratkin (Cloudera), Andrei Savu (Cloudera), Tony Wu (Cloudera)
Average rating: ****.
(4.50, 2 ratings)
Jennifer Wu, Eugene Fratkin, Andrei Savu, and Tony Wu explore best practices for Hadoop deployments in the public cloud and provide detailed guidance for deploying, configuring, and managing Hive, Spark, and Impala in the public cloud.
1:30pm - 5:00pm Tuesday, March 14, 2017
Hadoop platform and applications
Location: LL21 E/F Level: Intermediate
Jonathan Seidman (Cloudera), Ted Malaska (Capital One), Mark Grover (Lyft), Gwen Shapira (Confluent)
Average rating: ****.
(4.17, 6 ratings)
Using Entity 360 as an example, Jonathan Seidman, Ted Malaska, Mark Grover, and Gwen Shapira explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.
1:30pm - 5:00pm Tuesday, March 14, 2017
James Malone (Google), John Mikula (Google Cloud)
Average rating: **...
(2.00, 6 ratings)
James Malone explores using managed Spark and Hadoop solutions in public clouds alongside cloud products for storage, analysis, and message queues to meet enterprise requirements via the Spark and Hadoop ecosystem.
11:00am - 11:40am Wednesday, March 15, 2017
Hadoop platform and applications
Location: LL21 E/F Level: Intermediate
Todd Lipcon (Cloudera)
Average rating: ****.
(4.75, 4 ratings)
Todd Lipcon offers a very brief refresher on the goals and feature set of the Kudu storage engine, covering the development that has taken place over the last year, including new features such as improved support for time series workloads, performance improvements, Spark integration, and highly available replicated masters.
11:00am - 11:40am Wednesday, March 15, 2017
Big data and the Cloud
Location: 210 A/E Level: Intermediate
Sriram Ganesan (Qubole), Prakhar Jain (Qubole)
Average rating: ***..
(3.00, 2 ratings)
Qubole started out by offering Hadoop as a service in AWS. Over time, it extended its big data capabilities beyond Hadoop and its cloud infrastructure support beyond AWS. Sriram Ganesan and Prakhar Jain explain how and why Qubole built Cloudman, a simple, cloud-agnostic, multipurpose provisioning tool that can be extended for further engines and further cloud support.
11:00am - 11:40am Wednesday, March 15, 2017
Data engineering and architecture, Enterprise adoption
Location: 230 A Level: Beginner
Felix Gorodishter (GoDaddy)
Average rating: ****.
(4.25, 4 ratings)
GoDaddy ingests and analyzes logs, metrics, and events at a rate of 100,000 events per second (EPS). Felix Gorodishter shares GoDaddy's big data journey and explains how the company makes sense of data growth of more than 10 TB per day to gain operational insights into its cloud, leveraging Kafka, Hadoop, Spark, Pig, Hive, Cassandra, and Elasticsearch.
11:50am - 12:30pm Wednesday, March 15, 2017
Sensors, IOT & Industrial Internet
Location: LL20 D Level: Advanced
Tim Gasper (Janrain)
Average rating: *****
(5.00, 1 rating)
Food production and preparation have always been labor and capital intensive, but with the internet of things, low-cost sensors, cloud-computing ubiquity, and big data analysis, farmers and chefs are being replaced with connected, big data robots, not just in the field but also in your kitchen. Tim Gasper explores the tech stack, data science techniques, and use cases driving this revolution.
11:50am - 12:30pm Wednesday, March 15, 2017
Data engineering and architecture
Location: LL20 A Level: Intermediate
Christopher Colburn (Netflix), Monal Daxini (Netflix)
Average rating: ****.
(4.00, 3 ratings)
In the past, typical real-time data processing was reserved for answering operational questions and very basic analytical questions, but with better processing frameworks and more-capable hardware, the streaming context can now enable personalization applications. Christopher Colburn and Monal Daxini explore the challenges faced when building a streaming application at scale at Netflix.
11:50am - 12:30pm Wednesday, March 15, 2017
Andrei Savu (Cloudera), Jennifer Wu (Cloudera)
Average rating: ***..
(3.00, 3 ratings)
Cloud infrastructure, with a scalable data store and elastic compute, is particularly well suited for large-scale data engineering workloads. Andrei Savu and Jennifer Wu explore the latest cloud technologies and outline cost, security, and ease-of-use considerations for data engineers.
1:50pm - 2:30pm Wednesday, March 15, 2017
Enterprise adoption
Location: 230 A Level: Intermediate
Eric Richardson (American Chemical Society)
Average rating: **...
(2.50, 2 ratings)
Eric Richardson explains how ACS used Hadoop, HBase, Spark, Kafka, and Solr to create a hybrid cloud enterprise data hub that scales without drama and drives adoption by ease of use, covering the architecture, technologies used, the challenges faced and defeated, and problems yet to solve.
1:50pm - 2:30pm Wednesday, March 15, 2017
Big data and the Cloud
Location: 210 A/E Level: Intermediate
Henry Robinson (Cloudera), Alex Gutow (Cloudera)
Henry Robinson and Alex Gutow explain how to best take advantage of the flexibility and cost-effectiveness of the cloud with your BI and SQL analytic workloads using Apache Hadoop and Apache Impala (incubating) to provide the same great functionality, partner ecosystem, and flexibility of on-premises deployments.
1:50pm - 2:30pm Wednesday, March 15, 2017
Hadoop platform and applications
Location: LL21 E/F Level: Intermediate
Daniel Templeton (Cloudera)
Average rating: ****.
(4.00, 4 ratings)
Docker makes it easy to bundle an application with its dependencies and provide full isolation, and YARN now supports Docker as an execution engine for submitted applications. Daniel Templeton explains how YARN's Docker support works, why you'd want to use it, and when you shouldn't.
1:50pm - 2:30pm Wednesday, March 15, 2017
Data engineering and architecture
Location: LL20 A Level: Intermediate
Average rating: *****
(5.00, 2 ratings)
Data warehouses are critical in driving business decisions, with SQL predominantly used to build ETL pipelines. While the technology has shifted from using RDBMS-centric data warehouses to data pipelines based on Hadoop and MPP databases, engineering and quality processes have not kept pace. Avinash Padmanabhan highlights the changes that Intuit's team made to improve processes and data quality.
2:40pm - 3:20pm Wednesday, March 15, 2017
Big data and the Cloud, Enterprise adoption
Location: 230 A Level: Intermediate
Gwen Shapira (Confluent), Bob Lehmann (Bayer)
Average rating: ****.
(4.50, 2 ratings)
Gwen Shapira and Bob Lehmann share their experience and patterns building a cross-data-center streaming data platform for Monsanto. Learn how to facilitate your move to the cloud while "keeping the lights on" for legacy applications. In addition to integrating private and cloud data centers, you'll discover how to establish a solid foundation for a transition from batch to stream processing.
2:40pm - 3:20pm Wednesday, March 15, 2017
Data engineering and architecture, Real-time applications
Location: LL20 A Level: Intermediate
Kartik Paramasivam (LinkedIn)
Average rating: *****
(5.00, 2 ratings)
LinkedIn has one of the largest Kafka installations in the world, ingesting more than a trillion messages per day. Apache Samza-based stream processing applications process this deluge of data. Kartik Paramasivam discusses key improvements and architectural patterns that LinkedIn has adopted in its data systems in order to process millions of requests per second while keeping costs in control.
2:40pm - 3:20pm Wednesday, March 15, 2017
Platform Security and Cybersecurity
Location: LL21 B Level: Intermediate
Ajit Gaddam (VISA), Jiphun Satapathy (VISA)
Average rating: ***..
(3.83, 6 ratings)
Apache Kafka is used by over 35% of Fortune 500 companies to store and process some of their most sensitive datasets. Ajit Gaddam and Jiphun Satapathy provide a security reference architecture to secure your Kafka cluster while leveraging it to support your organization's cybersecurity requirements.
2:40pm - 3:20pm Wednesday, March 15, 2017
Big data and the Cloud, Data engineering and architecture
Location: LL21 E/F Level: Beginner
Paige Liu (Microsoft), John Zhuge (Netflix)
Paige Liu and John Zhuge explore the options and trade-offs to consider when building a Cloudera cluster on Microsoft Azure Cloud and explain how to deploy and scale a Cloudera cluster on Azure and how to connect a Cloudera cluster with other Azure services to build enterprise-grade end-to-end big data solutions.
4:20pm - 5:00pm Wednesday, March 15, 2017
Spark & beyond
Location: LL21 C/D Level: Beginner
Average rating: ***..
(3.00, 3 ratings)
Spark powers various services in Bing, but the Bing team had to customize and extend Spark to cover its use cases and scale the implementation of Spark-based data pipelines to handle internet-scale data volume. Kaarthik Sivashanmugam explores these use cases, covering the architecture of Spark-based data platforms, challenges faced, and the customization done to Spark to address the challenges.
4:20pm - 5:00pm Wednesday, March 15, 2017
Hadoop platform and applications
Location: LL21 E/F Level: Intermediate
Dwai Lahiri (Cloudera)
Average rating: ****.
(4.50, 2 ratings)
Dwai Lahiri explains how to leverage private cloud infrastructure to successfully build Hadoop clusters and outlines dos, don'ts, and gotchas for running Hadoop on private clouds.
4:20pm - 5:00pm Wednesday, March 15, 2017
Real-time applications, Stream processing and analytics
Location: LL20 A Level: Intermediate
Sridhar Alla (BlueWhale), Shekhar Agrawal (Comcast)
Average rating: *****
(5.00, 2 ratings)
Sridhar Alla and Shekhar Agrawal explain how Comcast built the largest Kudu cluster in the world (scaling to PBs of storage) and explore the new kinds of analytics being performed there, including real-time processing of 1 trillion events and joining multiple reference datasets on demand.
4:20pm - 5:00pm Wednesday, March 15, 2017
Kishore R (GE)
Average rating: ***..
(3.00, 1 rating)
Kishore Reddipalli explores how to stream data at a large scale from the edge to the cloud to the client, detect anomalies, analyze industrial machine data both in stream and at rest, and optimize industrial operations by providing real-time insights and recommendations using big data technologies.
5:10pm - 5:50pm Wednesday, March 15, 2017
Data engineering and architecture
Location: LL20 A Level: Advanced
Monal Daxini (Netflix)
Average rating: ****.
(4.50, 2 ratings)
Netflix Keystone processes over a trillion events per day with at-least-once processing semantics in the cloud. Monal Daxini explores what it means to offer stream processing as a service (SPaaS), how Netflix implemented a scalable, fault-tolerant multitenant SPaaS internal offering, and how it evolved the system in flight with no downtime.
5:10pm - 5:50pm Wednesday, March 15, 2017
Big data and the Cloud
Location: LL21 E/F Level: Intermediate
Naghman Waheed (Bayer Crop Science), Martin Mendez-Costabel (Bayer Crop Science)
Average rating: ****.
(4.00, 1 rating)
Recently, the volume of data collected from farmers' fields via sensors, rovers, drones, in-cabin technologies, and other sources has forced Monsanto to rethink its geospatial processing capabilities. Naghman Waheed and Martin Mendez-Costabel explain how Monsanto built a scalable geospatial platform using cloud and open source technologies.
5:10pm - 5:50pm Wednesday, March 15, 2017
Big data and the Cloud
Location: 210 A/E
Dale Kim (Arcadia Data)
Big data applications in the cloud are becoming more about the global distribution and access of data than about easier deployments. Dale Kim shares insights on architecting big data applications for the cloud, using an example reference application his team built and published as context for describing several key requirements for cloud-based environments.
11:00am - 11:40am Thursday, March 16, 2017
Hadoop platform and applications
Location: LL21 E/F Level: Intermediate
Todd Lipcon (Cloudera), Marcel Kornacker (Cloudera)
Average rating: ****.
(4.00, 1 rating)
Todd Lipcon and Marcel Kornacker offer an introduction to using Impala and Kudu to power your real-time data-centric applications for use cases like time series analysis (fraud detection, stream market data), machine data analytics, and online reporting.
11:50am - 12:30pm Thursday, March 16, 2017
Data engineering and architecture
Location: LL20 A Level: Intermediate
Kurt Brown (Netflix)
Average rating: ****.
(4.90, 10 ratings)
The Netflix data platform is constantly evolving, but fundamentally it's an all-cloud platform at a massive scale (40+ PB and over 700 billion new events per day) focused on empowering developers. Kurt Brown dives into the current technology landscape at Netflix and offers some thoughts on what the future holds.
11:50am - 12:30pm Thursday, March 16, 2017
Hadoop platform and applications
Location: LL21 E/F Level: Intermediate
Yang Li (Kyligence)
Average rating: *****
(5.00, 3 ratings)
Apache Kylin, which started as a big data OLAP engine, is reaching its v2.0. Yang Li explains how, armed with snowflake schema support, a full SQL interface, Spark cubing, and the ability to consume real-time streaming data, Apache Kylin is closing the gap to becoming a real-time data warehouse.
11:50am - 12:30pm Thursday, March 16, 2017
Big data and the Cloud
Location: LL21 C/D Level: Beginner
Haoyuan Li (Alluxio), Gene Pang (Alluxio)
Average rating: ****.
(4.00, 1 rating)
Alluxio (formerly Tachyon) is an open source memory-speed virtual distributed storage system. The project has experienced a tremendous improvement in performance and scalability and was extended with key new features. Haoyuan Li and Gene Pang explore Alluxio's goal of making its product accessible to an even wider set of users through a focus on security, new language bindings, and APIs.
1:50pm - 2:30pm Thursday, March 16, 2017
Data engineering and architecture, Emerging Technologies
Location: LL20 D Level: Advanced
Julien Le Dem (WeWork), Jacques Nadeau (Dremio)
Average rating: ****.
(4.00, 2 ratings)
In pursuit of speed, big data is evolving toward columnar execution. The solid foundation laid by Arrow and Parquet for a shared columnar representation across the ecosystem promises a great future. Julien Le Dem and Jacques Nadeau discuss the future of columnar and the hardware trends it takes advantage of, such as RDMA, SSDs, and nonvolatile memory.
2:40pm - 3:20pm Thursday, March 16, 2017
Gwen Shapira (Confluent)
Average rating: *****
(5.00, 3 ratings)
There are many good reasons to run more than one Kafka cluster...and a few bad reasons too. Great architectures are driven by use cases, and multicluster deployments are no exception. Gwen Shapira offers an overview of several use cases, including real-time analytics and payment processing, that may require multicluster solutions to help you better choose the right architecture for your needs.
2:40pm - 3:20pm Thursday, March 16, 2017
Gleicon Moraes (luc.id), Arthur Grava (Luizalabs)
Average rating: ****.
(4.00, 3 ratings)
Gleicon Moraes and Arthur Grava share war stories about developing and deploying a cloud-based large-scale recommender system for a top-three Brazilian ecommerce company. The system, which uses Cassandra and graph traversal, led to a more than 15% increase in sales.
4:20pm - 5:00pm Thursday, March 16, 2017
Data engineering and architecture
Location: LL20 A Level: Intermediate
Nischal HP (Unnati Data Labs), Raghotham Sripadraj (Ericsson)
Average rating: ****.
(4.67, 3 ratings)
Not all data science problems are big data problems. Lots of small and medium product companies want to start their journey to become data driven. Nischal HP and Raghotham Sripadraj share their experience building data science platforms for various enterprises, with an emphasis on making the right architecture choices and using distributed and fault-tolerant tools.