Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Schedule: Media sessions

11:50am–12:30pm Wednesday, March 15, 2017
Data engineering and architecture
Location: LL20 A Level: Intermediate
Christopher Colburn (Netflix), Monal Daxini (Netflix)
Average rating: ****.
(4.00, 3 ratings)
In the past, typical real-time data processing was reserved for answering operational questions and very basic analytical questions, but with better processing frameworks and more-capable hardware, the streaming context can now enable personalization applications. Christopher Colburn and Monal Daxini explore the challenges faced when building a streaming application at scale at Netflix.
11:50am–12:30pm Wednesday, March 15, 2017
Jure Leskovec (Pinterest)
Average rating: ****.
(4.82, 11 ratings)
Pinterest built a flexible, graph-based system for making recommendations to users in real time. The system uses random walks on a user-and-object graph in order to make personalized recommendations to 100+ million Pinterest users out of a catalog of over a billion items. Jure Leskovec explains how Pinterest built its modern recommendation engine and the lessons learned along the way.
2:40pm–3:20pm Wednesday, March 15, 2017
Data engineering and architecture, Real-time applications
Location: LL20 A Level: Intermediate
Kartik Paramasivam (LinkedIn)
Average rating: *****
(5.00, 2 ratings)
LinkedIn has one of the largest Kafka installations in the world, ingesting more than a trillion messages per day. Apache Samza-based stream processing applications process this deluge of data. Kartik Paramasivam discusses key improvements and architectural patterns that LinkedIn has adopted in its data systems in order to process millions of requests per second while keeping costs in control.
2:40pm–3:20pm Wednesday, March 15, 2017
Michael Edwards shares experiences from operating several Kafka clusters in a real-time streaming event ingestion pathway. He'll discuss the lessons learned from working with hundreds of terabytes flowing through every day, petabytes of retention, and gigabytes of historical data streaming to and from storage.
4:20pm–5:00pm Wednesday, March 15, 2017
Business case studies, Strata Business Summit
Location: 210 D/H Level: Intermediate
Alan Chaney (Bitvore Corp)
Average rating: ***..
(3.50, 2 ratings)
Bitvore Corp’s Bitvore for Munis personalized news surveillance system is rapidly becoming a must-have for all major fixed-income securities analysts, investors, and brokers working in the three-trillion-dollar municipal bond market in the USA. Alan Chaney explains how Bitvore delivers the few important and relevant articles out of thousands each day, saving users many hours daily.
4:20pm–5:00pm Wednesday, March 15, 2017
Spark & beyond
Location: LL21 C/D Level: Beginner
Average rating: ***..
(3.00, 3 ratings)
Spark powers various services in Bing, but the Bing team had to customize and extend Spark to cover its use cases and scale the implementation of Spark-based data pipelines to handle internet-scale data volume. Kaarthik Sivashanmugam explores these use cases, covering the architecture of Spark-based data platforms, challenges faced, and the customization done to Spark to address the challenges.
4:20pm–5:00pm Wednesday, March 15, 2017
Real-time applications, Stream processing and analytics
Location: LL20 A Level: Intermediate
Sridhar Alla (Comcast), Shekhar Agrawal (Comcast)
Average rating: *****
(5.00, 2 ratings)
Sridhar Alla and Shekhar Agrawal explain how Comcast built the largest Kudu cluster in the world (scaling to PBs of storage) and explore the new kinds of analytics being performed there, including real-time processing of 1 trillion events and joining multiple reference datasets on demand.
5:10pm–5:50pm Wednesday, March 15, 2017
Data engineering and architecture
Location: LL20 A Level: Advanced
Monal Daxini (Netflix)
Average rating: ****.
(4.50, 2 ratings)
Netflix Keystone processes over a trillion events per day with at-least-once processing semantics in the cloud. Monal Daxini explores what it means to offer stream processing as a service (SPaaS), how Netflix implemented a scalable, fault-tolerant multitenant SPaaS internal offering, and how it evolved the system in flight with no downtime.
5:10pm–5:50pm Wednesday, March 15, 2017
Business case studies, Strata Business Summit
Location: 210 D/H Level: Beginner
Viral Bajaria (6Sense)
Average rating: ****.
(4.00, 1 rating)
What if companies could predict what products people will buy, how much they will buy, and when? It would be a game changer, and it’s already possible with the power of predictive intelligence. Viral Bajaria explores how BlueJeans Network was able to leverage predictive analytics to uncover buyers earlier, convert them at a 20x higher rate, and build a $33M pipeline.
5:10pm–5:50pm Wednesday, March 15, 2017
Sijie Guo (ASF)
Average rating: **...
(2.00, 2 ratings)
Apache DistributedLog (incubating) is a low-latency, high-throughput replicated log service. Sijie Guo shares how Twitter has used DistributedLog as the real-time data foundation in production for years, supporting services like distributed databases, pub-sub messaging, and real-time stream computing and delivering more than 1.5 trillion (17 PB) events per day.
11:00am–11:40am Thursday, March 16, 2017
Stream processing and analytics
Location: LL20 D Level: Intermediate
Bill Graham (Twitter), Avrilia Floratau (Microsoft), Ashvin Agrawal (Microsoft)
Twitter processes billions of events per day the instant the data is generated using Heron, an open source streaming engine tailored for large-scale environments. Bill Graham, Avrilia Floratau, and Ashvin Agrawal explore the techniques Heron uses to elastically scale resources in order to handle highly varying loads without sacrificing real-time performance or user experience.
11:00am–11:40am Thursday, March 16, 2017
Platform Security and Cybersecurity
Location: LL21 B Level: Beginner
Yinglian Xie (DataVisor)
How many of your users are really fraudsters waiting to strike? These sleeper cells exist in all online communities. Using data from more than 400M users and 500B events from online services across the world, Yinglian Xie explores sleeper cells, explains sophisticated attack techniques being used to evade detection, and shows how Spark's in-memory big data security analytics can help.
11:00am–11:40am Thursday, March 16, 2017
Dorna Bandari (Jetlore)
Average rating: ****.
(4.00, 2 ratings)
Most internet companies record a constant stream of logs as a user interacts with their application. Depending on the complexity of the application, the logs can be extremely difficult to decipher. Dorna Bandari presents a novel NLP-based method for clustering user sessions in consumer internet applications, which has proved to be extremely effective in both driving strategy and personalization.
11:50am–12:30pm Thursday, March 16, 2017
Data engineering and architecture
Location: LL20 A Level: Intermediate
Kurt Brown (Netflix)
Average rating: ****.
(4.90, 10 ratings)
The Netflix data platform is constantly evolving, but fundamentally it's an all-cloud platform at a massive scale (40+ PB and over 700 billion new events per day) focused on empowering developers. Kurt Brown dives into the current technology landscape at Netflix and offers some thoughts on what the future holds.
11:50am–12:30pm Thursday, March 16, 2017
Stream processing and analytics
Location: LL20 D Level: Intermediate
Arun Kejariwal (Independent), Karthik Ramasamy (Twitter)
Average rating: ***..
(3.00, 1 rating)
Anomaly detection plays a key role in the analysis of real-time streams, for example in detecting real-life incidents from tweet storms. Arun Kejariwal and Karthik Ramasamy walk you through how anomaly detection is supported in real-time data streams in Heron, the streaming system built in-house at Twitter (and open sourced) for real-time computation.
11:50am–12:30pm Thursday, March 16, 2017
Brian Lange (Datascope)
Average rating: ****.
(4.00, 2 ratings)
The goal of RCSA's Scialog conferences is to foster collaboration between scientists with different specialties and approaches, and, working with Datascope, the company has been doing so in a quantitative way for the last six years. Brian Lange discusses how Datascope and RCSA arrived at the problem, the design choices made in the survey and optimization, and how the results were visualized.
1:50pm–2:30pm Thursday, March 16, 2017
Data science & advanced analytics
Location: 230 A Level: Intermediate
Michelangelo D'Agostino (ShopRunner), Bill Lattner (Civis Analytics)
Average rating: ****.
(4.00, 2 ratings)
How do we know that an advertisement or promotion truly drives incremental revenue? Michelangelo D'Agostino and Bill Lattner share their experience developing machine-learning techniques for predicting treatment responsiveness from randomized controlled experiments and explore the use of these “persuasion” models at scale in politics, social good, and marketing.
1:50pm–2:30pm Thursday, March 16, 2017
Data science & advanced analytics
Location: LL21 E/F Level: Advanced
Chao Zhong (Microsoft)
Average rating: ****.
(4.67, 3 ratings)
Chao Zhong offers an overview of a new predictive model for customer lifetime value (LTV) in a cloud-computing business. This model is also the first known application of the Fader RFM approach to a cloud business, a Bayesian approach that predicts a customer's LTV with a symmetric absolute percentage error (SAPE) of only 3% on an out-of-time testing dataset.
1:50pm–2:30pm Thursday, March 16, 2017
Grace Huang (Pinterest)
Average rating: ***..
(3.33, 3 ratings)
With over 75 billion pins, the Pinterest content corpus is one of the largest human-curated collections of ideas. Grace Huang walks you through the lifecycle of a piece of content in Pinterest, a portfolio of metrics developed to monitor the health of the content corpus, and the story of creating a cross-functional initiative to preserve a healthy, sustainable content ecosystem.
2:40pm–3:20pm Thursday, March 16, 2017
Data science & advanced analytics
Location: 230 A Level: Intermediate
Michelle Casbon (Google)
Average rating: *****
(5.00, 1 rating)
Supporting multiple locales involves the maintenance and generation of localized strings. Michelle Casbon explains how machine learning and natural language processing are applied to the underserved domain of localization using primarily open source tools, including Scala, Apache Spark, Apache Cassandra, and Apache Kafka.
2:40pm–3:20pm Thursday, March 16, 2017
Business case studies, Strata Business Summit
Location: 210 D/H Level: Intermediate
Romit Jadhwani (Pinterest)
Average rating: ****.
(4.75, 4 ratings)
Over the course of just six years, Pinterest has helped over 100 million pinners discover and collect more than 75 billion ideas to plan their everyday lives. Romit Jadhwani walks you through the different phases of this hypergrowth journey and explores the focuses, thought processes, and decisions of Pinterest’s data team as they scaled and enabled this growth.