Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Schedule: Big data and data science in the cloud sessions

9:00am - 5:00pm Monday, March 5 & Tuesday, March 6
Location: 114
Jesse Anderson (Big Data Institute)
To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data and how will you process that much data? Jesse Anderson explores the latest real-time frameworks (both open source and managed cloud services), discusses the leading cloud providers, and explains how to choose the right one for your company. Read more.
9:00am - 12:30pm Tuesday, March 6, 2018
Location: LL21 A Level: Beginner
Secondary topics:  Graphs and Time-series
Mo Patel (Independent), Neejole Patel (Virginia Tech)
Since its arrival in early 2017, PyTorch has won over many deep learning researchers and developers due to its dynamic computation framework. Mo Patel and Neejole Patel walk you through using PyTorch to build a content recommendation model. Read more.
9:00am - 12:30pm Tuesday, March 6, 2018
Location: LL21 B Level: Intermediate
Jorge A. Lopez (Amazon Web Services), Radhika Ravirala (Amazon Web Services), Paul Sears (Amazon Web Services), Ryan Nienhuis (Amazon Web Services), Randy Ridgley (Amazon Web Services)
Want to learn how to use Amazon's big data web services to launch your first big data application in the cloud? Jorge Lopez walks you through building a big data application using a combination of open source technologies and AWS managed services. Read more.
9:00am - 12:30pm Tuesday, March 6, 2018
Location: 210 D/H Level: Intermediate
Aishwarya Venkataraman, Jason Wang, Mala Ramakrishnan, Stefan Salandy, and Vinithra Varadharajan lead a deep dive into running data analytic workloads in a managed service capacity in the public cloud and highlight cloud infrastructure best practices. Read more.
9:00am - 5:00pm Tuesday, March 6, 2018
Location: LL20 A
Madhav Madaboosi (BP), Meenakshisundaram Thandavarayan (Infosys), Matt Conners (Microsoft), Katie Malone (Civis Analytics), Mike Prorock (mesur.io), Thomas Miller (Northwestern University), Ann Nguyen (Whole Whale), Rajiv Synghal (Kaiser Permanente), Valentin Bercovici (Pencil Data Inc.), Wayde Fleener (General Mills), Joe Dumoulin (Next IT), Jules Malin (GoPro), Taylor Martin (O'Reilly Media), Divya Ramachandran (Captricity)
Hear practical insights from household brands and global companies: the challenges they tackled, approaches they took, and the benefits—and drawbacks—of their solutions. Read more.
1:30pm - 5:00pm Tuesday, March 6, 2018
Location: LL21 C/D Level: Intermediate
Ronny Kohavi (Microsoft), Alex Deng (Microsoft), Somit Gupta (Microsoft), Paul Raff (Microsoft)
Controlled experiments such as A/B tests have revolutionized the way software is being developed, allowing real users to objectively evaluate new ideas. Ronny Kohavi, Alex Deng, Somit Gupta, and Paul Raff lead an introduction to A/B testing and share lessons learned from one of the largest A/B testing platforms on the planet, running at Microsoft, which executes over 10K experiments a year. Read more.
11:00am - 11:40am Wednesday, March 7, 2018
Location: LL21 C/D Level: Intermediate
Tom Fisher (MapR Technologies)
The monolithic cloud is dying. Delivering capabilities across multiple clouds and, simultaneously, transitioning to next-generation platforms and applications is the challenge today. Tom Fisher explores technological approaches and solutions that make this possible while delivering data-driven applications and operations. Read more.
11:00am - 11:40am Wednesday, March 7, 2018
Location: LL20 D Level: Beginner
Wee Hyong Tok (Microsoft), Danielle Dean (Microsoft)
Artificial intelligence (AI) has tremendous potential to extend our capabilities and empower organizations to accelerate their digital transformation. Wee Hyong Tok and Danielle Dean demystify AI for big data professionals and explain how they can leverage and evolve their valuable big data skills by getting started with AI. Read more.
11:50am - 12:30pm Wednesday, March 7, 2018
Location: LL21 C/D Level: Intermediate
Bin Fan (Alluxio), Shaoshan Liu (PerceptIn)
Bin Fan and Shaoshan Liu explain how PerceptIn designed and implemented a cloud architecture to support video streaming and online object recognition tasks and demonstrate how Alluxio delivers high throughput, low latency, and a unified namespace to support these emerging cloud architectures. Read more.
11:50am - 12:30pm Wednesday, March 7, 2018
Location: LL21 E/F Level: Intermediate
Manu Mukerji (Criteo)
Criteo is a global leader in commerce marketing. Manu Mukerji walks you through Criteo's machine learning example for universal catalogs, explaining how the training and test sets are generated and annotated; how the model is pushed to production, automatically evaluated, and used; the issues that arise when applying ML at scale in production; and lessons learned. Read more.
11:50am - 12:30pm Wednesday, March 7, 2018
Location: 230 A Level: Intermediate
Secondary topics:  Graphs and Time-series
William Chambers (Databricks), Michael Armbrust (Databricks)
William Chambers and Michael Armbrust discuss the motivation and basics of Apache Spark's Structured Streaming processing engine and share lessons they've learned running hundreds of Structured Streaming workloads in the cloud. Read more.
11:50am - 12:30pm Wednesday, March 7, 2018
Location: LL21 B Level: Intermediate
Shivaram Venkataraman (Microsoft Research), Sergey Ermolin (Intel)
The BigDL framework scales deep learning for large datasets using Apache Spark. However, there is significant scheduling overhead from Spark when running BigDL at large scale. Shivaram Venkataraman and Sergey Ermolin outline a new parameter manager implementation that, along with coarse-grained scheduling, can provide significant speedups for deep learning models like Inception and VGG. Read more.
11:50am - 12:30pm Wednesday, March 7, 2018
Location: 210 D/H Level: Beginner
Ari Gesher (Kairos Aerospace)
A warming planet needs precise, localized predictions about the effects of climate change to support good long- and medium-term economic decision making. Ari Gesher demonstrates how to use a mix of physical simulation, enhanced scientific models, machine learning verification, and high-scale computing to generate climate predictions and package them as data products. Read more.
11:50am - 12:30pm Wednesday, March 7, 2018
Location: 212 A-B Level: Intermediate
Secondary topics:  Data Integration and Data Pipelines
Eugene Kirpichov (Google)
Apache Beam equips users with a novel programming model in which the classic batch/streaming data processing dichotomy is erased. Eugene Kirpichov details the modularity and composability advantages created by treating data ingestion as just another data processing task and walks you through building highly modular data ingestion APIs using the new Beam programming model primitive Splittable DoFn. Read more.
1:50pm - 2:30pm Wednesday, March 7, 2018
Location: LL21 E/F Level: Beginner
Zhen Fan (JD.com), Wei Ting Chen (Intel)
Zhen Fan and Wei Ting Chen explain how JD.com uses Spark on Kubernetes in a production environment and why the company chose Spark on Kubernetes for its AI workloads. You'll learn how to run Spark with Kubernetes and gain an understanding of the advantages this provides. Read more.
1:50pm - 2:30pm Wednesday, March 7, 2018
Location: LL20 A Level: Intermediate
Secondary topics:  Graphs and Time-series
Alexandra Gunderson (Arundo Analytics)
Heavy industries, such as oil and gas, have tremendous amounts of data from which predictive models could be built, but it takes weeks or even months to create a comprehensive dataset from all of the various data sources. Alexandra Gunderson details the methodology behind an industry-tested approach that incorporates machine learning to structure and link data from different sources. Read more.
1:50pm - 2:30pm Wednesday, March 7, 2018
Location: 230 C Level: Beginner
Siddharth Teotia (Dremio)
Query processing technology has rapidly evolved since the iconic C-Store paper was published in 2005, with a focus on designing query processing algorithms and data structures that efficiently utilize CPU and leverage the changing trends in hardware to deliver optimal performance. Siddharth Teotia outlines the different types of vectorized query processing in Dremio using Apache Arrow. Read more.
2:40pm - 3:20pm Wednesday, March 7, 2018
Location: 230 A Level: Intermediate
Henry Cai (Pinterest), Yi Yin (Pinterest)
With the rise of large-scale real-time computation, there is a growing need to link legacy MySQL systems with real-time platforms. Henry Cai and Yi Yin offer an overview of WaterMill, Pinterest's continuous DB ingestion system for streaming SQL data into near-real-time computation pipelines to support dynamic personalized recommendations and search indices. Read more.
2:40pm - 3:20pm Wednesday, March 7, 2018
Location: LL20 A Level: Non-technical
Secondary topics:  Graphs and Time-series
Baron Schwartz (VividCortex)
Anomaly detection is white hot in the monitoring industry, but many don't really understand or care about it, while others repeat the same pattern many times. Why? And what can we do about it? Baron Schwartz explains how he arrived at a "post-anomaly detection" point of view. Read more.
2:40pm - 3:20pm Wednesday, March 7, 2018
Location: LL20 D Level: Intermediate
Joseph Bradley (Databricks)
Joseph Bradley discusses common paths to productionizing Apache Spark MLlib models and shares engineering challenges and corresponding best practices. Along the way, Joseph covers several deployment scenarios, including batch scoring, Structured Streaming, and real-time low-latency serving. Read more.
2:40pm - 3:20pm Wednesday, March 7, 2018
Location: LL21 B Level: Beginner
Mohamed AbdelHady (Microsoft), Zoran Dzunic (Microsoft)
Mohamed AbdelHady and Zoran Dzunic demonstrate how to build a domain-specific entity extraction system from unstructured text using deep learning. In the model, domain-specific word embedding vectors are trained on a Spark cluster using millions of PubMed abstracts and then used as features to train an LSTM recurrent neural network for entity extraction. Read more.
2:40pm - 3:20pm Wednesday, March 7, 2018
Location: 230 C Level: Intermediate
Ritesh Agrawal (Uber), Anirban Deb (Uber)
Presto has emerged as the de facto query engine to quickly process petabytes of data. However, rogue SQL queries can waste a significant amount of critical compute resource and reduce Presto's throughput. Ritesh Agrawal and Anirban Deb explain how Uber uses machine learning to identify and stop rogue queries, saving both computational power and money. Read more.
2:40pm - 3:20pm Wednesday, March 7, 2018
Location: 212 A-B Level: Intermediate
Secondary topics:  Data Integration and Data Pipelines
Sean Ma (Trifacta)
Organizations leverage reporting, analytic, and machine learning pipelines to drive decision making and power critical operational systems. Sean Ma discusses methods for detecting, visualizing, and resolving inconsistencies between source and target data models across these pipelines. Read more.
2:40pm - 3:20pm Wednesday, March 7, 2018
Location: Expo Hall 1 Level: Advanced
Secondary topics:  Expo Hall, Graphs and Time-series
Yu Xu (TigerGraph)
Graph databases are the fastest growing category in data management. However, most graph queries only traverse two hops in big graphs due to limitations in most graph databases. Real-world applications require deep link analytics that traverse far more than three hops. Yu Xu offers an overview of a fraud detection system that manages 100 billion graph elements to detect risk and fraudulent groups. Read more.
4:20pm - 5:00pm Wednesday, March 7, 2018
Location: LL21 C/D Level: Beginner
Carlo Torniai (Pirelli Tyre)
Carlo Torniai shares the architectural challenges Pirelli faced in building Pirelli Connesso, an IoT cloud-based system providing information on tire operating conditions, consumption, and maintenance, and highlights the operative approaches that enabled the integration of contributions across cross-functional teams. Read more.
4:20pm - 5:00pm Wednesday, March 7, 2018
Location: LL20 C Level: Intermediate
Secondary topics:  Graphs and Time-series
Vlad A Ionescu (ShiftLeft), Fabian Yamaguchi (ShiftLeft)
Vlad Ionescu and Fabian Yamaguchi outline Code Property Graph (CPG), a unique approach that allows the functional elements of code to be represented in an interconnected graph of data and control flows, which enables semantic information about code to be stored scalably on distributed graph databases over the web while allowing it to be rapidly accessed. Read more.
5:10pm - 5:50pm Wednesday, March 7, 2018
Location: LL20 D Level: Beginner
Balasubramanian Narasimhan (Stanford University), John-Mark Agosta (Microsoft), Philip Lavori (Stanford University)
Clinical collaboration benefits from pooling data to train models from large datasets, but it's hampered by concerns about sharing data. Balasubramanian Narasimhan, John-Mark Agosta, and Philip Lavori outline a privacy-preserving alternative that creates statistical models equivalent to those built from the entire dataset. Read more.
5:10pm - 5:50pm Wednesday, March 7, 2018
Location: LL21 B Level: Intermediate
Sergey Ermolin (Intel), Suqiang Song (Mastercard)
Sergey Ermolin and Suqiang Song demonstrate how to use Spark BigDL wide and deep and neural collaborative filtering (NCF) algorithms to predict a user’s probability of shopping at a particular offer merchant during a campaign period. Along the way, they compare the deep learning results with those obtained by MLlib’s alternating least squares (ALS) approach. Read more.
5:10pm - 5:50pm Wednesday, March 7, 2018
Location: 212 A-B Level: Non-technical
Secondary topics:  Data Integration and Data Pipelines
Abe Gong (Superconductive Health), James Campbell (USG)
Data science and engineering have been missing out on one of the biggest productivity boosters in modern software development: automated testing. Abe Gong and James Campbell discuss the concept of pipeline tests and offer an overview of Great Expectations, an open source Python framework for bringing data pipelines and products under test. Read more.
11:00am - 11:40am Thursday, March 8, 2018
Location: LL21 E/F Level: Intermediate
Greg Rahn (Cloudera)
For many organizations, the cloud will likely be the destination of their next big data warehouse. Greg Rahn shares considerations when evaluating the cloud for analytics and big data warehousing in order to help you get the most from the cloud. You'll leave with an understanding of different architectural approaches and impacts for moving analytic workloads to the cloud. Read more.
11:00am - 11:40am Thursday, March 8, 2018
Location: 210 C/G Level: Beginner
Secondary topics:  Graphs and Time-series
Michael Schrenk (Self-Employed)
Big data becomes much more powerful when it has context. Fortunately, creative data scientists can create needed context through the use of metadata. Michael Schrenk explains how metadata is created and used to gain competitive advantages, predict troop strength, or even guess Social Security numbers. Read more.
11:50am - 12:30pm Thursday, March 8, 2018
Location: LL21 C/D Level: Beginner
Dong Meng (MapR)
Deep learning model performance relies on underlying data. Dong Meng offers an overview of a converged data platform that serves as a data infrastructure, providing a distributed filesystem, key-value storage and streams, and Kubernetes as an orchestration layer to manage containers to train and deploy deep learning models using GPU clusters. Read more.
11:50am - 12:30pm Thursday, March 8, 2018
Location: LL21 E/F Level: Beginner
Szehon Ho (Criteo), Pawel Szostek (Criteo)
Hive is the main data transformation tool at Criteo, and hundreds of analysts and thousands of automated jobs run Hive queries every day. Szehon Ho and Pawel Szostek discuss the evolution of Criteo's Hive platform from an error-prone add-on installed on some spare machines to a best-in-class installation capable of self-healing and automatically scaling to handle its growing load. Read more.
11:50am - 12:30pm Thursday, March 8, 2018
Location: LL21 B Level: Beginner
Jennie Wang (Intel), Valentina Pedoia (UCSF), Berk Norman (UCSF), Yulia Tell (Intel)
Damage to the meniscus is a physically limiting injury that can lead to further medical complications. Automatically classifying this damage at the time of an MRI scan would allow quicker and more accurate diagnosis. Jennie Wang, Valentina Pedoia, Berk Norman, and Yulia Tell offer an overview of their classification system built with 3D convolutional neural networks using BigDL on Apache Spark. Read more.
11:50am - 12:30pm Thursday, March 8, 2018
Location: Expo Hall 1 Level: Beginner
Secondary topics:  Expo Hall
Mike Driscoll (Metamarkets)
There’s a make-or-break step ahead for AI development. AI tools shouldn’t be designed to replace humans; they should be built with them in mind. We need to focus on translating data from machine learning models into beautiful, intuitive visuals. Mike Driscoll shares advice for creators of next-gen predictive algorithms from his experience turning big data into interactive visualizations. Read more.
1:50pm - 2:30pm Thursday, March 8, 2018
Location: LL20 A Level: Intermediate
Ram Sriharsha (Databricks)
How do you scale geospatial analytics on big data? And while you're at it, can you make it easy to use while achieving state-of-the-art performance on a single node? Ram Sriharsha offers an overview of Magellan—a geospatial optimization engine that seamlessly integrates with Spark—and explains how it provides scalability and performance without sacrificing simplicity. Read more.
1:50pm - 2:30pm Thursday, March 8, 2018
Location: LL21 B Level: Non-technical
Delip Rao (R7 Speech Science)
Spoken conversations have rich information beyond what was said in words. Delip Rao details the potential of spoken conversational datasets, including identifying speakers and their demographic attributes, understanding intent and dynamics between speakers, and so on. Delip also discusses some of the latest science, including some of the work developed at R7. Read more.
1:50pm - 2:30pm Thursday, March 8, 2018
Location: 210 D/H Level: Intermediate
Michael Lysaght (Weight Watchers), Steven Levine (Weight Watchers)
For organizations stuck in a myriad of legacy infrastructure, the path to AI and deep learning seems impossible. Michael Lysaght, Steven Levine, and Nicolas Chikhani discuss Weight Watchers's transition from a traditional BI organization to one that uses data effectively, covering the company's needs, the changes that were required, and the technologies and architecture used to achieve its goals. Read more.
2:40pm - 3:20pm Thursday, March 8, 2018
Location: LL21 C/D Level: Intermediate
Michelle Casbon (Google Cloud Platform Developer Relations)
Michelle Casbon explains how to speed up the development of ML models by using open source tools such as Kubernetes, Docker, Scala, Apache Spark, and Weave Flux, detailing how to build resilient systems so that you can spend more of your time on product improvement rather than triage and uptime. Read more.
2:40pm - 3:20pm Thursday, March 8, 2018
Location: 230 C Level: Intermediate
Tomer Shiran (Dremio), Jacques Nadeau (Dremio)
Most organizations manage 5 to 15 copies of their data in multiple systems and formats to support different analytical use cases. Tomer Shiran and Jacques Nadeau introduce a new approach called data reflections, which dramatically reduces the need for data copies, demonstrate an open source implementation built with Apache Calcite, and explore two production case studies. Read more.
2:40pm - 3:20pm Thursday, March 8, 2018
Location: Expo Hall 1 Level: Intermediate
Secondary topics:  Expo Hall
Chris Fregly (PipelineAI)
Chris Fregly demonstrates how to extend existing Spark-based data pipelines to include TensorFlow model training and deployment and offers an overview of TensorFlow’s TFRecord format, including libraries for converting to and from other popular file formats such as Parquet, CSV, JSON, and Avro stored in HDFS and S3. Read more.
4:20pm - 5:00pm Thursday, March 8, 2018
Location: LL21 E/F Level: Intermediate
Shenghu Yang (Lyft)
Lyft’s business has grown over 100x in the past four years. Shenghu Yang explains how Lyft’s data pipeline has evolved over the years to serve its ever-growing analytics use cases, migrating from the world's largest AWS Redshift clusters to Apache Hive and Presto to overcome hard limits on scalability and concurrency. Read more.
4:20pm - 5:00pm Thursday, March 8, 2018
Location: LL20 D Level: Intermediate
Goodman Gu (Atlassian)
Machine learning is a pivotal technology. However, bringing an ML application to life often requires overcoming bottlenecks not just in the model code but in operationalizing the end-to-end system itself. Goodman Gu shares a case study from a leading SaaS company that quickly and easily built, trained, optimized, and deployed an XGBoost churn prediction ML app at scale with Amazon SageMaker. Read more.
4:20pm - 5:00pm Thursday, March 8, 2018
Location: 230 C Level: Intermediate
Tomer Kaftan (University of Washington)
Tomer Kaftan offers an overview of Cuttlefish, a lightweight framework prototyped in Apache Spark that helps developers adaptively improve the performance of their data processing applications by inserting a few library calls into their code. These calls construct tuning primitives that use reinforcement learning to adaptively modify execution as they observe application performance over time. Read more.
4:20pm - 5:00pm Thursday, March 8, 2018
Location: 210 C/G Level: Beginner
Felix Gorodishter (GoDaddy)
GoDaddy ingests and analyzes over 100,000 data points per second. Felix Gorodishter discusses the company's big data journey from ingest to automation, how it is evolving its systems to scale to over 10 TB of new data per day, and how it uses tools like anomaly detection to produce valuable insights, such as the worth of a reminder email. Read more.