Presented By O’Reilly and Cloudera

San Jose • London • New York

Make Data Work

March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Schedule: Wednesday, March 7


LL20 A

11:00am Interpretable machine learning products Mike Lee Williams (Cloudera Fast Forward Labs)

11:50am Deploying and monitoring interactive machine learning applications with Clipper Dan Crankshaw (UC Berkeley RISELab)

1:50pm Machine learning to tackle industrial data fusion Alexandra Gunderson (Arundo Analytics)

2:40pm Why nobody cares about your anomaly detection Baron Schwartz (VividCortex)

4:20pm Machine learning applications for the industrial internet Joseph Richards (GE Digital)

5:10pm Continuous machine learning over streaming data Roger Barga (Amazon Web Services), Nina Mishra (Amazon Web Services), Sudipto Guha (Amazon Web Services), Ryan Nienhuis (Amazon Web Services)

LL20 C

11:00am Breaking up the block: Using heterogenous population modeling to drive growth Daniel Lurie (1989)

11:50am Who are we? The largest-scale study of professional data scientists Miryung Kim (UCLA), Muhammad Gulzar (UCLA)

1:50pm Writing distributed graph algorithms Andrew Ray (Sam’s Club Technology)

2:40pm Taming deep learning Evan Sparks (Determined AI)

4:20pm Code Property Graph: A modern, queryable data storage for source code Vlad A Ionescu (ShiftLeft), Fabian Yamaguchi (ShiftLeft)

5:10pm Fast and effective natural language understanding Mike Conover (SkipFlag)

LL20 D

11:00am How does a big data professional get started with AI? Wee Hyong Tok (Microsoft), Danielle Dean (iRobot)

11:50am Want to build a better chatbot? Start with your data. Andrew Mattarella-Micke (Intuit)

1:50pm Spark ML optimization at Intel: A case study Weisheng Xie (Orange Financial), Peng Meng (Intel)

2:40pm Best practices for productionizing Apache Spark MLlib models Joseph Bradley (Databricks)

4:20pm Transforming a machine learning prototype to a deployable solution leveraging Spark in healthcare Rachita Chandra (IBM Watson Health)

5:10pm Distributed clinical models: Inference without sharing patient data Balasubramanian Narasimhan (Stanford University), John-Mark Agosta (Microsoft), Philip Lavori (Stanford University)

LL21 B

11:00am The current state of TensorFlow and where it's headed in 2018 Rajat Monga (Google)

11:50am Accelerating deep learning on Apache Spark using BigDL with coarse-grained scheduling Sergey Ermolin (Intel), Shivaram Venkataraman (Microsoft Research)

1:50pm Deep credit risk ranking with LSTM Kyle Grove (Teradata)

2:40pm Deep learning for domain-specific entity extraction from unstructured text Mohamed AbdelHady (Microsoft), Zoran Dzunic (Microsoft)

4:20pm Detecting time series anomalies at Uber scale with recurrent neural networks Andrea Pasqua (Uber), Anny Chen (Uber)

5:10pm Improving user-merchant propensity modeling using neural collaborative filtering and wide and deep models on Spark BigDL at scale Sergey Ermolin (Intel), Suqiang Song (Mastercard)

LL21 C/D

11:00am Cloud, multicloud, and the data refinery Tom Fisher (MapR Technologies)

11:50am Powering robotics clouds with Alluxio Bin Fan (Alluxio), Shaoshan Liu (PerceptIn)

1:50pm 20 Netflix-style principles and practices to get the most out of your data platform Kurt Brown (Netflix)

2:40pm Dogfooding data at Lyft Mark Grover (Lyft), Arup Malakar (Lyft)

4:20pm Pirelli Connesso: Where the road meets the cloud Carlo Torniai (Pirelli Tyre)

5:10pm How to protect big data in a containerized environment Thomas Phelan (HPE BlueData)

LL21 E/F

11:00am Accelerating development velocity of production ML systems with Docker Kinnary Jangla (Pinterest)

11:50am Machine learning versus machine learning in production Manu Mukerji (8x8)

1:50pm Spark on Kubernetes: A case study from JD.com Zhen Fan (JD.com), Wei Ting Chen (Intel Corporate)

2:40pm DataOps: An Agile methodology for data-driven organizations Ellen Friedman (Independent)

4:20pm Personalization at scale: Mastering the challenges of personalization to create compelling user experiences Rahim Daya (Pinterest)

5:10pm Better machine learning logistics with the rendezvous architecture Ted Dunning (MapR, now part of HPE)

230 A

11:00am Using machine learning to simplify Kafka operations Shivnath Babu (Duke University | Unravel Data Systems), Dhruv Goel (Microsoft)

11:50am Streaming big data in the cloud: What to consider and why Bill Chambers (Databricks), Michael Armbrust (Databricks)

1:50pm Approximation data structures in streaming data processing Debasish Ghosh (Lightbend)

2:40pm Moving the needle of the pin: Streaming hundreds of terabytes of pins from MySQL to S3/Hadoop continuously Henry Cai (Pinterest), Yi Yin (Pinterest)

4:20pm Stream storage with Apache BookKeeper Sijie Guo (StreamNative)

5:10pm Streaming SQL to unify batch and stream processing: Theory and practice with Apache Flink at Uber Fabian Hueske (data Artisans), Shuyi Chen (Uber)

230 C

11:00am What's new in Hadoop 3.0 Daniel Templeton (Cloudera), Andrew Wang (Cloudera)

11:50am Metrics-driven tuning of Apache Spark at scale Edwina Lu (LinkedIn), Ye Zhou (LinkedIn), Min Shen (LinkedIn)

1:50pm Vectorized query processing using Apache Arrow Siddharth Teotia (Dremio)

2:40pm Presto query gate: Identifying and stopping rogue queries Ritesh Agrawal (Uber), Anirban Deb (Uber)

4:20pm NoSQL no more: SQL on Druid with Apache Calcite Gian Merlino (Imply)

5:10pm Classifying job execution using deep learning Ash Munshi (Pepperdata)

210 A/E

11:00am Executive Briefing: BI on big data Mark Madsen (Teradata), Shant Hovsepian (Arcadia Data)

11:50am Executive Briefing: The conversational AI revolution Yishay Carmiel (IntelligentWire)

1:50pm Executive Briefing: Building effective heterogeneous data communities—Driving organizational outcomes with broad-based data science Frances Haugen (Pinterest), Patrick Phelps (Pinterest)

2:40pm Executive Briefing: Artificial intelligence—The next digital frontier? Michael Chui (McKinsey Global Institute)

4:20pm Executive Briefing: Managing successful data projects—Technology selection and team building Ted Malaska (Capital One), Jonathan Seidman (Cloudera)

5:10pm Executive Briefing: Legal best practices for making data work Alysa Z. Hutnik (Kelley Drye & Warren LLP), Crystal Skelton (Kelley Drye & Warren LLP)

210 C/G

11:00am Progressive data governance for emerging technologies Anne Buff (SAS)

11:50am Managing data science at scale Matthew Granade (Domino Data Lab)

1:50pm The rise of big data governance: Insight on this emerging trend from active open source initiatives John Mertic (Linux Foundation), Maryna Strelchuk (ING)

2:40pm Building a data science idea factory: How to prioritize the portfolio of a large, diverse, and opinionated data science team Katie Malone (Civis Analytics), Skipper Seabold (Civis Analytics)

4:20pm Make data work: A VC panel discussion on prospectives and trends Lisha Li (Amplify Partners), Katherine Boyle (General Catalyst), Wayne Hu (SignalFire), Andrew Parker (Spark Capital), Brandon Reeves (Lux Capital)

5:10pm The mathematical corporation: A new leadership mindset for the machine intelligence era Stephanie Beben (Booz Allen Hamilton)

210 D/H

11:00am Bladder cancer diagnosis using deep learning Mauro Damo (Dell EMC), Wei Lin (Dell EMC)

11:50am Big data, big problems: Predicting climate change Ari Gesher (Kairos Aerospace)

1:50pm Reinventing healthcare: Early detection of Alzheimer’s disease with deep learning Ayin Vala (DeepMD | Foundation for Precision Medicine)

2:40pm AI-powered crime prediction Or Herman-Saffar (Dell), Ran Taig (Dell EMC)

4:20pm If you can’t measure it, you can’t improve it: How reporting and experimentation fuel product innovation at LinkedIn Kapil Surlaker (LinkedIn), Ya Xu (LinkedIn)

5:10pm How to avoid pitfalls when reasoning with data Derek Ruths (CAI)

Expo Hall 1

11:00am Being smarter than dinosaurs: How NASA uses deep learning for planetary defense Siddha Ganju (NVIDIA)

11:50am Spark NLP in action: Improving patient flow forecasting at Kaiser Permanente David Talby (Pacific AI), Santosh Kulkarni (Kaiser Permanente)

1:50pm Lessons learned deploying machine learning and deep learning models in production at major tech companies Harish Doddi (Datatron), Jerry Xu (Datatron Technologies)

2:40pm Real-time deep link analytics: The next stage of graph analytics Yu Xu (TigerGraph)

4:20pm Leveraging live data to realize the smart cities vision Arun Kejariwal (Independent), Roman Smolgovsky (MZ)

5:10pm Small pieces, loosely joined: A skater's code Rodney Mullen (Almost Skateboards)

212 A-B

11:00am The future of ETL isn’t what it used to be Gwen Shapira (Confluent)

11:50am Radically modular data ingestion APIs in Apache Beam Eugene Kirpichov (Google)

1:50pm How to build leakproof stream processing pipelines with Apache Kafka and Apache Spark Jordan Hambleton (Cloudera), GuruDharmateja Medasani (Domino Data Lab)

2:40pm Semi-automated analytic pipeline creation and validation using active learning Sean Ma (Trifacta)

4:20pm Building a flexible ML pipeline at a B2B AI startup Dorna Bandari (Jetlore)

5:10pm Pipeline testing with Great Expectations Abe Gong (Superconductive Health), James Campbell (USG)

LL20 B

11:00am Digital transformation demands faster, more productive data science (sponsored by DataScience.com) Ian Swanson (DataScience.com)

11:50am The four elements of modern analytics (sponsored by MicroStrategy) Vijay Kotu (Oath)

1:50pm Data at scale and speed: Real-world use cases (sponsored by MapR) Ted Dunning (MapR, now part of HPE)

2:40pm Building machine learning systems for scale: Amazon insights and best practices (sponsored by Amazon Web Services) Guy Ernest (Amazon Web Services)

4:20pm BI and big data convergence in modern cloud architecture (sponsored by Arcadia Data) Terry McFadden (P&G)

LL21 A

11:00am Architecting an edge-to-cloud data pipeline to unify multiple data sources and processing engines (sponsored by NetApp) Santosh Rao (NetApp)

11:50am Journey to digital (sponsored by IBM) Seth Dobrin, PhD (IBM)

2:40pm Managing the intelligent data pipeline and the connected enterprise (sponsored by Hitachi Vantara) Chuck Yarbrough (Hitachi Vantara)

230 B

11:00am Focus on your business: Case studies on building data solutions that meet your needs (sponsored by Microsoft) Tobias Ternstrom (Microsoft)

11:50am Analytics in real time, the (Grey's) anatomy of event streaming (sponsored by MemSQL) Adam Ahringer (Disney-ABC TV Digital Media)

1:50pm Accelerating analytics and AI from the edge to the cloud (sponsored by Intel) Kevin Huiskes (Intel), Radhika Rangarajan (Intel)

2:40pm The Snowflake data warehouse: How Sharethrough analyzes petabytes of event data in a SQL database (sponsored by Snowflake) Dave Abercrombie (Sharethrough)

4:20pm Winning the big data war pays big dividends for Wargaming (sponsored by SAS) Alexander Ryabov (Wargaming), Jonathan Crow (Wargaming)

5:10pm Bringing AI into the IoT (sponsored by SAS) Evan Guarnaccia (SAS)

210 B/F

11:00am Data and ethics: Brainstorming Session Natalie Evans Harris (BrightHive)

11:50am Speed up mission-critical analytics in the cloud (sponsored by Kyligence) Billy Liu (Kyligence)

5:10pm The future of ETL isn’t what it used to be Gwen Shapira (Confluent)

Grand Ballroom 220

8:45am Wednesday keynote welcome Ben Lorica (O'Reilly), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)

8:50am Machine learning: What’s real and what’s hype Hilary Mason (Cloudera Fast Forward Labs)

9:05am Merging human and machine learning for everyday solutions Li Fan (Pinterest)

9:20am To a hammer, everything is a nail: Choosing the right tool for your business problems (sponsored by Microsoft) Tobias Ternstrom (Microsoft)

9:30am Privacy in the age of machine learning Ben Lorica (O'Reilly)

9:40am Crisis Text Line data usage and insights Nancy Lublin (Crisis Text Line)

9:55am Building the foundation of a latency-free life (sponsored by MemSQL) Nikita Shamgunov (MemSQL)

10:00am Defining responsible data practices: A community-driven approach Natalie Evans Harris (BrightHive)

10:10am Operationalizing machine learning (sponsored by IBM) Dinesh Nirmal (IBM)

10:15am Data science in the cloud Alex Smola (Amazon)

10:30am Morning break sponsored by MemSQL | Room: Hall 1, 2, 3

12:30pm Lunch sponsored by Microsoft; Wednesday Topic Tables at lunch | Room: Hall 1, 2, 3

12:30pm Women in Big Data Luncheon (sponsored by LinkedIn) | Room: Almaden Ballroom, San Jose Hilton

12:30pm Wednesday Business Summit Lunch | Room: San Jose Ballroom, Marriott

3:20pm Afternoon break sponsored by IBM | Room: Hall 1, 2, 3

5:50pm Booth Crawl | Room: Hall 1, 2, 3

7:00pm Data After Dark: Night at the Market | Room: San Pedro Market

8:00am Speed Networking | Room: Concourse foyer

11:00am-11:40am (40m) Data science and machine learning

Interpretable machine learning products

Mike Lee Williams (Cloudera Fast Forward Labs)

Interpretable models result in more accurate, safer, and more profitable machine learning products. But interpretability can be hard to ensure. Michael Lee Williams explores the growing business case for interpretability and its concrete applications, including churn, finance, and healthcare. Along the way, Michael offers an overview of the open source, model-agnostic tool LIME.

11:50am-12:30pm (40m) Data science and machine learning, Streaming systems and real-time applications

Deploying and monitoring interactive machine learning applications with Clipper

Dan Crankshaw (UC Berkeley RISELab)

Clipper is an open source, general-purpose model-serving system that provides low-latency predictions under heavy serving workloads for interactive applications. Dan Crankshaw offers an overview of the Clipper serving system and explains how to use it to serve Apache Spark and TensorFlow models on Kubernetes.

1:50pm-2:30pm (40m) Big data and data science in the cloud, Data science and machine learning, Graphs and Time-series

Machine learning to tackle industrial data fusion

Alexandra Gunderson (Arundo Analytics)

Heavy industries, such as oil and gas, have tremendous amounts of data from which predictive models could be built, but it takes weeks or even months to create a comprehensive dataset from all of the various data sources. Alexandra Gunderson details the methodology behind an industry-tested approach that incorporates machine learning to structure and link data from different sources.

2:40pm-3:20pm (40m) Big data and data science in the cloud, Data science and machine learning, Visualization and user experience, Graphs and Time-series

Why nobody cares about your anomaly detection

Baron Schwartz (VividCortex)

Anomaly detection is white hot in the monitoring industry, but many don't really understand or care about it, while others repeat the same pattern many times. Why? And what can we do about it? Baron Schwartz explains how he arrived at a "post-anomaly detection" point of view.

4:20pm-5:00pm (40m) Data science and machine learning, Graphs and Time-series

Machine learning applications for the industrial internet

Joseph Richards (GE Digital)

Deploying ML software applications for use cases in the industrial internet presents a unique set of challenges. Data-driven problems require approaches that are highly accurate, robust, fast, scalable, and fault tolerant. Joseph Richards shares GE's approach to building production-grade ML applications and explores work across GE in industries such as power, aviation, and oil and gas.

5:10pm-5:50pm (40m) Data science and machine learning, Streaming systems and real-time applications, Graphs and Time-series

Continuous machine learning over streaming data

Roger Barga (Amazon Web Services), Nina Mishra (Amazon Web Services), Sudipto Guha (Amazon Web Services), Ryan Nienhuis (Amazon Web Services)

Roger Barga, Nina Mishra, Sudipto Guha, and Ryan Nienhuis detail continuous machine learning algorithms that discover useful information in streaming data. They focus on explainable machine learning, including anomaly detection with attribution, the ability to reduce false positives through user feedback, and the detection of anomalies in directed graphs.

11:00am-11:40am (40m) Data science and machine learning

Breaking up the block: Using heterogenous population modeling to drive growth

Daniel Lurie (1989)

All successful startups thrive on tight product-market fit, which can produce homogenous initial user bases. To become the next big thing, your user base will need to diversify, and your product must change to accommodate new needs. Daniel Lurie explains how Pinterest leverages external data to measure racial and income diversity in its user base and changed user modeling to drive growth.

11:50am-12:30pm (40m) Data science and machine learning

Who are we? The largest-scale study of professional data scientists

Miryung Kim (UCLA), Muhammad Gulzar (UCLA)

Even though we know that there are more data scientists in the workforce today, neither what those data scientists actually do nor what we even mean by data scientists has been studied quantitatively. Miryung Kim and Muhammad Gulzar share the results of a large-scale survey with 793 professional data scientists and detail several trends about data scientists in the software engineering context.

1:50pm-2:30pm (40m) Data science and machine learning, Graphs and Time-series

Writing distributed graph algorithms

Andrew Ray (Sam’s Club Technology)

Andrew Ray offers a brief introduction to the distributed graph algorithm abstractions provided by Pregel, PowerGraph, and GraphX, drawing on real-world examples, and provides historical context for the evolution between these three abstractions.

2:40pm-3:20pm (40m) Data engineering and architecture, Data science and machine learning

Taming deep learning

Evan Sparks (Determined AI)

Deep learning has shown tremendous improvements in a number of areas and has justifiably generated enormous excitement. However, several key challenges—from prohibitive hardware requirements to immature software offerings—are impeding widespread enterprise adoption. Evan Sparks details fundamental challenges facing organizations looking to adopt deep learning and shares possible solutions.

4:20pm-5:00pm (40m) Big data and data science in the cloud, Data science and machine learning, Graphs and Time-series

Code Property Graph: A modern, queryable data storage for source code

Vlad A Ionescu (ShiftLeft), Fabian Yamaguchi (ShiftLeft)

Vlad Ionescu and Fabian Yamaguchi outline Code Property Graph (CPG), a unique approach that allows the functional elements of code to be represented in an interconnected graph of data and control flows, which enables semantic information about code to be stored scalably in distributed graph databases over the web and accessed rapidly.

5:10pm-5:50pm (40m) Data science and machine learning

Fast and effective natural language understanding

Mike Conover (SkipFlag)

Mike Conover offers an overview of the essential techniques for understanding and working with natural language. From off-the-shelf neural networks and snappy preprocessing libraries to architectural patterns for bulletproof productionization, this talk will be of interest to anyone who uses language on a regular basis.

11:00am-11:40am (40m) Big data and data science in the cloud, Data science and machine learning

How does a big data professional get started with AI?

Wee Hyong Tok (Microsoft), Danielle Dean (iRobot)

Artificial intelligence (AI) has tremendous potential to extend our capabilities and empower organizations to accelerate their digital transformation. Wee Hyong Tok and Danielle Dean demystify AI for big data professionals and explain how they can leverage and evolve their valuable big data skills by getting started with AI.

11:50am-12:30pm (40m) Data science and machine learning, Streaming systems and real-time applications

Want to build a better chatbot? Start with your data.

Andrew Mattarella-Micke (Intuit)

When building a chatbot, it’s important to develop one that is humanized, has contextual responses, and can simulate true empathy for the end users. Andrew Mattarella-Micke shares how Intuit's data science team preps, cleans, organizes, and augments training data along with best practices he's learned along the way.

1:50pm-2:30pm (40m) Data science and machine learning

Spark ML optimization at Intel: A case study

Weisheng Xie (Orange Financial), Peng Meng (Intel)

Intel has been deeply involved in Spark from its earliest moments. Weisheng Xie and Peng Meng share what Intel has been working on with Spark ML and introduce the methodology behind Intel's work on Spark ML optimization.

2:40pm-3:20pm (40m) Big data and data science in the cloud, Data science and machine learning

Best practices for productionizing Apache Spark MLlib models

Joseph Bradley (Databricks)

Joseph Bradley discusses common paths to productionizing Apache Spark MLlib models and shares engineering challenges and corresponding best practices. Along the way, Joseph covers several deployment scenarios, including batch scoring, Structured Streaming, and real-time low-latency serving.

4:20pm-5:00pm (40m) Data science and machine learning

Transforming a machine learning prototype to a deployable solution leveraging Spark in healthcare

Rachita Chandra (IBM Watson Health)

Rachita Chandra outlines challenges and considerations for transforming a research prototype built for a single machine to a deployable healthcare solution that leverages Spark in a distributed environment.

5:10pm-5:50pm (40m) Big data and data science in the cloud, Data science and machine learning, Platform security and cybersecurity

Distributed clinical models: Inference without sharing patient data

Balasubramanian Narasimhan (Stanford University), John-Mark Agosta (Microsoft), Philip Lavori (Stanford University)

Clinical collaboration benefits from pooling data to train models from large datasets, but it's hampered by concerns about sharing data. Balasubramanian Narasimhan, John-Mark Agosta, and Philip Lavori outline a privacy-preserving alternative that creates statistical models equivalent to one from the entire dataset.

11:00am-11:40am (40m) Data science and machine learning

The current state of TensorFlow and where it's headed in 2018

Rajat Monga (Google)

Rajat Monga offers an overview of TensorFlow's progress and adoption in 2017 before looking ahead to the areas of importance in the future—performance, usability, and ubiquity—and the efforts TensorFlow is making in those areas.

11:50am-12:30pm (40m) Big data and data science in the cloud, Data science and machine learning

Accelerating deep learning on Apache Spark using BigDL with coarse-grained scheduling

Sergey Ermolin (Intel), Shivaram Venkataraman (Microsoft Research)

The BigDL framework scales deep learning for large datasets using Apache Spark. However, there is significant scheduling overhead from Spark when running BigDL at large scale. Shivaram Venkataraman and Sergey Ermolin outline a new parameter manager implementation that, along with coarse-grained scheduling, can provide significant speedups for deep learning models like Inception and VGG.

1:50pm-2:30pm (40m) Data science and machine learning, Graphs and Time-series

Deep credit risk ranking with LSTM

Kyle Grove (Teradata)

Kyle Grove explains how Teradata and some of the world’s largest financial institutions are innovating credit risk ranking with deep learning techniques and AnalyticOps. With the AnalyticOps framework, these organizations have built models with increased accuracy to drive more profitable lending decisions while being explainable to regulators.

2:40pm-3:20pm (40m) Big data and data science in the cloud, Data science and machine learning

Deep learning for domain-specific entity extraction from unstructured text

Mohamed AbdelHady (Microsoft), Zoran Dzunic (Microsoft)

Mohamed AbdelHady and Zoran Dzunic demonstrate how to build a domain-specific entity extraction system from unstructured text using deep learning. In the model, domain-specific word embedding vectors are trained on a Spark cluster using millions of PubMed abstracts and then used as features to train an LSTM recurrent neural network for entity extraction.

4:20pm-5:00pm (40m) Data science and machine learning, Graphs and Time-series

Detecting time series anomalies at Uber scale with recurrent neural networks

Andrea Pasqua (Uber), Anny Chen (Uber)

Time series forecasting and anomaly detection are of utmost importance at Uber. However, the scale of the problem, the need for speed, and the importance of accuracy make anomaly detection a challenging data science problem. Andrea Pasqua and Anny Chen explain how the use of recurrent neural networks is allowing Uber to meet this challenge.

5:10pm-5:50pm (40m) Big data and data science in the cloud, Data science and machine learning

Improving user-merchant propensity modeling using neural collaborative filtering and wide and deep models on Spark BigDL at scale

Sergey Ermolin (Intel), Suqiang Song (Mastercard)

Sergey Ermolin and Suqiang Song demonstrate how to use Spark BigDL wide and deep and neural collaborative filtering (NCF) algorithms to predict a user’s probability of shopping at a particular offer merchant during a campaign period. Along the way, they compare the deep learning results with those obtained by MLlib’s alternating least squares (ALS) approach.

11:00am-11:40am (40m) Big data and data science in the cloud, Data engineering and architecture

Cloud, multicloud, and the data refinery

Tom Fisher (MapR Technologies)

The monolithic cloud is dying. Delivering capabilities across multiple clouds and, simultaneously, transitioning to next-generation platforms and applications is the challenge today. Tom Fisher explores technological approaches and solutions that make this possible while delivering data-driven applications and operations.

11:50am-12:30pm (40m) Big data and data science in the cloud, Data engineering and architecture

Powering robotics clouds with Alluxio

Bin Fan (Alluxio), Shaoshan Liu (PerceptIn)

Bin Fan and Shaoshan Liu explain how PerceptIn designed and implemented a cloud architecture to support video streaming and online object recognition tasks and demonstrate how Alluxio delivers high throughput, low latency, and a unified namespace to support these emerging cloud architectures.

1:50pm-2:30pm (40m) Data engineering and architecture

20 Netflix-style principles and practices to get the most out of your data platform

Kurt Brown (Netflix)

Kurt Brown explains how to get the most out of your data infrastructure with 20 principles and practices used at Netflix. Kurt covers each in detail and explores how they relate to the technologies used at Netflix, including S3, Spark, Presto, Druid, R, Python, and Jupyter.

2:40pm-3:20pm (40m) Data engineering and architecture

Dogfooding data at Lyft

Mark Grover (Lyft), Arup Malakar (Lyft)

Mark Grover and Arup Malakar offer an overview of how Lyft leverages application metrics, logs, and auditing to monitor and troubleshoot its data platform and share how the company dogfoods the platform to provide security, auditing, alerting, and replayability. They also detail some of the internal services and tools Lyft has developed to make its data more robust, scalable, and self-serving.

4:20pm-5:00pm (40m) Big data and data science in the cloud, Data engineering and architecture

Pirelli Connesso: Where the road meets the cloud

Carlo Torniai (Pirelli Tyre)

Carlo Torniai shares the architectural challenges Pirelli faced in building Pirelli Connesso, an IoT cloud-based system providing information on tire operating conditions, consumption, and maintenance, and highlights the operative approaches that enabled the integration of contributions across cross-functional teams.

5:10pm-5:50pm (40m) Data engineering and architecture, Platform security and cybersecurity

How to protect big data in a containerized environment

Thomas Phelan (HPE BlueData)

Recent headline-grabbing data breaches demonstrate that protecting data is essential for every enterprise. The best-of-breed approach for big data is HDFS configured with Transparent Data Encryption (TDE). However, TDE can be difficult to configure and manage—issues that are only compounded when running on Docker containers. Thomas Phelan explores these challenges and how to overcome them.

11:00am-11:40am (40m) Data engineering and architecture

Accelerating development velocity of production ML systems with Docker

Kinnary Jangla (Pinterest)

Having trouble coordinating development of your production ML system between a team of developers? Microservices drifting and causing problems during debugging? Kinnary Jangla explains how Pinterest dockerized the services powering its home feed and how it impacted the engineering productivity of the company's ML teams while increasing uptime and ease of deployment.

11:50am-12:30pm (40m) Big data and data science in the cloud, Data engineering and architecture, Media, entertainment, and advertising, Streaming systems and real-time applications

Machine learning versus machine learning in production

Manu Mukerji (8x8)

Acme Corporation is a global leader in commerce marketing. Manu Mukerji walks you through Acme Corporation's machine learning example for universal catalogs, explaining how the training and test sets are generated and annotated; how the model is pushed to production, automatically evaluated, and used; production issues that arise when applying ML at scale in production; lessons learned; and more.

1:50pm-2:30pm (40m) Big data and data science in the cloud, Data engineering and architecture, Visualization and user experience

Spark on Kubernetes: A case study from JD.com

Zhen Fan (JD.com), Wei Ting Chen (Intel Corporate)

Zhen Fan and Wei Ting Chen explain how JD.com uses Spark on Kubernetes in a production environment and why the company chose Spark on Kubernetes for its AI workloads. You'll learn how to run Spark with Kubernetes and gain an understanding of the advantages this provides.

2:40pm-3:20pm (40m) Data engineering and architecture

DataOps: An Agile methodology for data-driven organizations

Ellen Friedman (Independent)

DataOps—a culture and practice for building data-intensive applications, including machine learning pipelines—expands DevOps philosophy to include data-heavy roles such as data engineering and data science. DataOps is based on cross-functional collaboration resulting in fast time to value and an agile workflow. Ellen Friedman offers an overview of DataOps and explains how to implement it.

4:20pm-5:00pm (40m) Data engineering and architecture, Visualization and user experience

Personalization at scale: Mastering the challenges of personalization to create compelling user experiences

Rahim Daya (Pinterest)

Personalization is a powerful tool for building sticky and impactful product experiences. Rahim Daya shares Pinterest's frameworks for building personalized user experiences, from sourcing the right contextual data to designing and evaluating personalization algorithms that can delight the user.

5:10pm-5:50pm (40m) Data engineering and architecture

Better machine learning logistics with the rendezvous architecture

Ted Dunning (MapR, now part of HPE)

Ted Dunning offers an overview of the rendezvous architecture, developed to be the "continuous integration" system for machine learning models. It allows always-hot zero latency rollout and rollback of new models and supports extensive metrics and diagnostics so models can be compared as they process production data. It can even hot-swap the framework itself with no downtime.

11:00am-11:40am (40m) Data engineering and architecture, Streaming systems and real-time applications, Graphs and Time-series

Using machine learning to simplify Kafka operations

Shivnath Babu (Duke University | Unravel Data Systems), Dhruv Goel (Microsoft)

Getting the best performance, predictability, and reliability for Kafka-based applications is a complex art. Shivnath Babu and Dhruv Goel explain how to simplify the process by leveraging recent advances in machine learning and AI and outline a methodology for applying statistical learning to the rich and diverse monitoring data that is available from Kafka.

11:50am-12:30pm (40m) Big data and data science in the cloud, Data engineering and architecture, Streaming systems and real-time applications Graphs and Time-series

Streaming big data in the cloud: What to consider and why

Bill Chambers (Databricks), Michael Armbrust (Databricks)

William Chambers and Michael Armbrust discuss the motivation and basics of Apache Spark's Structured Streaming processing engine and share lessons they've learned running hundreds of Structured Streaming workloads in the cloud.

1:50pm-2:30pm (40m) Data engineering and architecture, Streaming systems and real-time applications

Approximation data structures in streaming data processing

Debasish Ghosh (Lightbend)

Debasish Ghosh explores the role that approximation data structures play in processing streaming data. Typically, streams are unbounded in space and time, and processing has to be done online using sublinear space. Debasish covers the probabilistic bounds that these data structures offer and shows how they can be used to implement solutions for fast and streaming architectures.

2:40pm-3:20pm (40m) Big data and data science in the cloud, Data engineering and architecture, Streaming systems and real-time applications

Moving the needle of the pin: Streaming hundreds of terabytes of pins from MySQL to S3/Hadoop continuously

Henry Cai (Pinterest), Yi Yin (Pinterest)

With the rise of large-scale real-time computation, there is a growing need to link legacy MySQL systems with real-time platforms. Henry Cai and Yi Yin offer an overview of WaterMill, Pinterest's continuous DB ingestion system for streaming SQL data into near-real-time computation pipelines to support dynamic personalized recommendations and search indices.

4:20pm-5:00pm (40m) Data engineering and architecture, Streaming systems and real-time applications Graphs and Time-series

Stream storage with Apache BookKeeper

Sijie Guo (StreamNative)

Apache BookKeeper, a scalable, fault-tolerant, and low-latency storage service optimized for real-time workloads, has been widely adopted by enterprises like Twitter, Yahoo, and Salesforce to store and serve mission-critical data. Sijie Guo explains how Apache BookKeeper satisfies the needs of stream storage.

5:10pm-5:50pm (40m) Data engineering and architecture, Streaming systems and real-time applications

Streaming SQL to unify batch and stream processing: Theory and practice with Apache Flink at Uber

Fabian Hueske (data Artisans), Shuyi Chen (Uber)

Fabian Hueske and Shuyi Chen explore SQL's role in the world of streaming data and its implementation in Apache Flink and cover fundamental concepts, such as streaming semantics, event time, and incremental results. They also share their experience using Flink SQL in production at Uber, explaining how Uber leverages Flink SQL to solve its unique business challenges.

11:00am-11:40am (40m) Data engineering and architecture

What's new in Hadoop 3.0

Daniel Templeton (Cloudera), Andrew Wang (Cloudera)

Hadoop 3.0 has been years in the making, and now it's finally arriving. Andrew Wang and Daniel Templeton offer an overview of new features, including HDFS erasure coding, YARN Timeline Service v2, YARN federation, and much more, and discuss current release management status and community testing efforts dedicated to making Hadoop 3.0 the best Hadoop major release yet.

11:50am-12:30pm (40m) Data engineering and architecture

Metrics-driven tuning of Apache Spark at scale

Edwina Lu (LinkedIn), Ye Zhou (LinkedIn), Min Shen (LinkedIn)

Spark applications need to be well tuned so that individual applications run quickly and reliably and cluster resources are efficiently utilized. Edwina Lu, Ye Zhou, and Min Shen outline a fast, reliable, and automated process used at LinkedIn for tuning Spark applications, enabling users to quickly identify and fix problems.

1:50pm-2:30pm (40m) Big data and data science in the cloud, Data engineering and architecture, Streaming systems and real-time applications

Vectorized query processing using Apache Arrow

Siddharth Teotia (Dremio)

Query processing technology has rapidly evolved since the iconic C-Store paper was published in 2005, with a focus on designing query processing algorithms and data structures that efficiently utilize CPU and leverage the changing trends in hardware to deliver optimal performance. Siddharth Teotia outlines the different types of vectorized query processing in Dremio using Apache Arrow.

2:40pm-3:20pm (40m) Big data and data science in the cloud, Data engineering and architecture

Presto query gate: Identifying and stopping rogue queries

Ritesh Agrawal (Uber), Anirban Deb (Uber)

Presto has emerged as the de facto query engine to quickly process petabytes of data. However, rogue SQL queries can waste a significant amount of critical compute resource and reduce Presto's throughput. Ritesh Agrawal and Anirban Deb explain how Uber uses machine learning to identify and stop rogue queries, saving both computational power and money.

4:20pm-5:00pm (40m) Data engineering and architecture

NoSQL no more: SQL on Druid with Apache Calcite

Gian Merlino (Imply)

Gian Merlino discusses the SQL layer recently added to the open source Druid project. It's based on Apache Calcite, which bills itself as "the foundation for your next high-performance database." Gian explains how Druid and Calcite are integrated and why you should stop worrying and learn to love relational algebra in your own projects.

5:10pm-5:50pm (40m) Data engineering and architecture

Classifying job execution using deep learning

Ash Munshi (Pepperdata)

Ash Munshi shares techniques for labeling big data apps using runtime measurements of CPU, memory, I/O, and network and details a deep neural network to help operators understand the types of apps running on the cluster and better predict runtimes, tune resource utilization, and increase efficiency. These methods are new and are the first approach to classify multivariate time series.

11:00am-11:40am (40m) Strata Business Summit

Executive Briefing: BI on big data

Mark Madsen (Teradata), Shant Hovsepian (Arcadia Data)

There are 70+ BI tools in the market and a dozen or more SQL- or OLAP-on-Hadoop open source projects. Mark Madsen and Shant Hovsepian outline the trade-offs between a number of architectures that provide self-service access to data and discuss the pros and cons of architectures, deployment strategies, and examples of BI on big data.

11:50am-12:30pm (40m) Strata Business Summit

Executive Briefing: The conversational AI revolution

Yishay Carmiel (IntelligentWire)

One of the most important tasks of AI has been to understand humans. People want machines to understand not only what they say but also what they mean and to take particular actions based on that information. This goal is the essence of conversational AI. Yishay Carmiel explores the latest breakthroughs and revolutions in this field and the challenges still to come.

1:50pm-2:30pm (40m) Data-driven business management, Strata Business Summit

Executive Briefing: Building effective heterogeneous data communities—Driving organizational outcomes with broad-based data science

Frances Haugen (Pinterest), Patrick Phelps (Pinterest)

Data science is most powerful when combined with deep domain knowledge, but those with domain knowledge don't work on data-focused teams. So how do you empower employees with diverse backgrounds and skill sets to be effective users of data? Frances Haugen and Patrick Phelps dive into the social side of data and share strategies for unlocking otherwise unobtainable insights.

2:40pm-3:20pm (40m) Strata Business Summit

Executive Briefing: Artificial intelligence—The next digital frontier?

Michael Chui (McKinsey Global Institute)

After decades of extravagant promises, artificial intelligence is finally starting to deliver real-life benefits to early adopters. However, we're still early in the cycle of adoption. Michael Chui explains where investment is going, patterns of AI adoption and value capture by enterprises, and how the value potential of AI across sectors and business functions is beginning to emerge.

4:20pm-5:00pm (40m) Data-driven business management, Strata Business Summit

Executive Briefing: Managing successful data projects—Technology selection and team building

Ted Malaska (Capital One), Jonathan Seidman (Cloudera)

Recent years have seen dramatic advancements in the technologies available for managing and processing data. While these technologies provide powerful tools to build data applications, they also require new skills. Ted Malaska and Jonathan Seidman explain how to evaluate these new technologies and build teams to effectively leverage these technologies and achieve ROI with your data initiatives.

5:10pm-5:50pm (40m) Strata Business Summit

Executive Briefing: Legal best practices for making data work

Alysa Z. Hutnik (Kelley Drye & Warren LLP), Crystal Skelton (Kelley Drye & Warren LLP)

Big data promises enormous benefits for companies. But what about privacy, data protection, and consumer laws? Having a solid understanding of the legal and self-regulatory rules of the road is key to maximizing the value of your data while avoiding data disasters. Alysa Hutnik and Crystal Skelton share legal best practices and practical tips to avoid becoming a big data “don’t.”

11:00am-11:40am (40m) Law, ethics, and governance, Strata Business Summit

Progressive data governance for emerging technologies

Anne Buff (SAS)

Emerging technologies such as the IoT, AI, and ML present businesses with enormous opportunities for innovation, but to maximize the potential of these technologies, businesses must radically shift their approach to governance. Anne Buff explains what it takes to shift the focus of governance from standards, conformity, and control to accountability, extensibility, and enablement.

11:50am-12:30pm (40m) Data-driven business management, Strata Business Summit

Managing data science at scale

Matthew Granade (Domino Data Lab)

Predictive analytics and artificial intelligence have become critical competitive capabilities. Yet IT teams struggle to provide the support data science teams need to succeed. Matthew Granade explains how leading banks, insurance and pharmaceutical companies, and others manage data science at scale.

1:50pm-2:30pm (40m) Law, ethics, and governance, Strata Business Summit

The rise of big data governance: Insight on this emerging trend from active open source initiatives

John Mertic (Linux Foundation), Maryna Strelchuk (ING)

John Mertic and Maryna Strelchuk detail the benefits of a vendor-neutral approach to data governance, explain the need for an open metadata standard, and share how companies like ING, IBM, Hortonworks, and more are delivering solutions to this challenge as an open source initiative.

2:40pm-3:20pm (40m) Data-driven business management, Strata Business Summit

Building a data science idea factory: How to prioritize the portfolio of a large, diverse, and opinionated data science team

Katie Malone (Civis Analytics), Skipper Seabold (Civis Analytics)

A huge challenge for data science managers is determining priorities for their teams, which often have more good ideas than they have time. Katie Malone and Skipper Seabold share a framework that their large and diverse data science team uses to identify, discuss, select, and manage data science projects for a fast-moving startup.

4:20pm-5:00pm (40m) Strata Business Summit

Make data work: A VC panel discussion on prospectives and trends

Lisha Li (Amplify Partners), Katherine Boyle (General Catalyst), Wayne Hu (SignalFire), Andrew Parker (Spark Capital), Brandon Reeves (Lux Capital)

To anticipate who will succeed and to invest wisely, investors spend a lot of time trying to understand the longer-term trends within an industry. In this panel discussion, top-tier VCs look over the horizon to consider the big trends in how data is being put to work in startups and share what they think the field will look like in a few years (or more).

5:10pm-5:50pm (40m) Data-driven business management, Strata Business Summit

The mathematical corporation: A new leadership mindset for the machine intelligence era

Stephanie Beben (Booz Allen Hamilton)

How can you most effectively use machine intelligence to drive strategy? By merging it in the right way with the human ingenuity of leaders throughout your organization. Stephanie Beben shares insights from her work with pioneering companies, government agencies, and nonprofits that are successfully navigating this partnership by becoming “mathematical corporations.”

11:00am-11:40am (40m) Strata Business Summit

Bladder cancer diagnosis using deep learning

Mauro Damo (Dell EMC), Wei Lin (Dell EMC)

Image recognition classification of diseases will minimize the possibility of medical mistakes, improve patient treatment, and speed up patient diagnosis. Mauro Damo and Wei Lin offer an overview of an approach to identify bladder cancer in patients using nonsupervised and supervised machine learning techniques on more than 5,000 magnetic resonance images from the Cancer Imaging Archive.

11:50am-12:30pm (40m) Big data and data science in the cloud, Data engineering and architecture, Data science and machine learning, Strata Business Summit

Big data, big problems: Predicting climate change

Ari Gesher (Kairos Aerospace)

A warming planet needs precise, localized predictions about the effects of climate change to support good long-term and medium-term economic decision making. Ari Gesher demonstrates how to use a mix of physical simulation, enhanced scientific models, machine learning verification, and high-scale computing to predict and package climate predictions as data products.

1:50pm-2:30pm (40m) Strata Business Summit

Reinventing healthcare: Early detection of Alzheimer’s disease with deep learning

Ayin Vala (DeepMD | Foundation for Precision Medicine)

Complex diseases like Alzheimer’s cannot be cured by pharmaceutical or genetic sciences alone, and current treatments and therapies lead to mixed successes. Ayin Vala explains how to use the power of big data and AI to treat challenging diseases with personalized medicine, which takes into account individual variability in medicine intake, lifestyle, and genetic factors for each patient.

2:40pm-3:20pm (40m) Law, ethics, and governance, Strata Business Summit

AI-powered crime prediction

Or Herman-Saffar (Dell), Ran Taig (Dell EMC)

What if we could predict when and where crimes will be committed? Or Herman-Saffar and Ran Taig offer an overview of Crimes in Chicago, a publicly published dataset of reported incidents of crime that have occurred in Chicago since 2001. Or and Ran explain how to use this data to explore committed crimes to find interesting trends and make predictions for the future.

4:20pm-5:00pm (40m) Data-driven business management, Strata Business Summit

If you can’t measure it, you can’t improve it: How reporting and experimentation fuel product innovation at LinkedIn

Kapil Surlaker (LinkedIn), Ya Xu (LinkedIn)

Metrics measurement and experimentation play crucial roles in every product decision at LinkedIn. Kapil Surlaker and Ya Xu explain why, to meet the company's needs, LinkedIn built the UMP and XLNT platforms for metrics computation and experimentation, respectively, which have allowed the company to perform measurement and experimentation very efficiently at scale while preserving trust in data.

5:10pm-5:50pm (40m) Strata Business Summit

How to avoid pitfalls when reasoning with data

Derek Ruths (CAI)

Unreasonable sales forecasts, badly overstocked inventory, misguided investments . . . bad analyses happen all the time, leading to bad decisions and costing businesses millions of dollars. Derek Ruths shares the five most common issues that lead to bad data-informed thinking.

11:00am-11:40am (40m) Data science and machine learning Expo Hall

Being smarter than dinosaurs: How NASA uses deep learning for planetary defense

Siddha Ganju (NVIDIA)

Siddha Ganju explains how the FDL lab at NASA uses artificial intelligence to improve and automate the identification of meteors above human-level performance using meteor shower images, and to recover known meteor shower streams and characterize previously unknown meteor showers using orbital data. The research aims to provide more warning time for long-period comet impacts.

11:50am-12:30pm (40m) Data science and machine learning Expo Hall

Spark NLP in action: Improving patient flow forecasting at Kaiser Permanente

David Talby (Pacific AI), Santosh Kulkarni (Kaiser Permanente)

David Talby and Santosh Kulkarni explain how Kaiser Permanente uses the open source NLP library for Apache Spark to tackle one of the most common challenges with applying natural language processing in practice: integrating domain-specific NLP as part of a scalable, performant, measurable, and reproducible machine learning pipeline.

1:50pm-2:30pm (40m) Data science and machine learning Expo Hall

Lessons learned deploying machine learning and deep learning models in production at major tech companies

Harish Doddi (Datatron), Jerry Xu (Datatron Technologies)

Deploying machine learning models and deep learning models in production is hard. Harish Doddi and Jerry Xu outline the enterprise data science lifecycle, covering how production model deployment flow works, challenges, best practices, and lessons learned. Along the way, they explain why monitoring models in production should be mandatory.

2:40pm-3:20pm (40m) Big data and data science in the cloud, Data engineering and architecture, Data-driven business management, Platform security and cybersecurity, Streaming systems and real-time applications Expo Hall, Graphs and Time-series

Real-time deep link analytics: The next stage of graph analytics

Yu Xu (TigerGraph)

Graph databases are the fastest growing category in data management. However, most graph queries only traverse two hops in big graphs due to limitations in most graph databases. Real-world applications require deep link analytics that traverse far more than three hops. Yu Xu offers an overview of a fraud detection system that manages 100 billion graph elements to detect risk and fraudulent groups.

4:20pm-5:00pm (40m) Data engineering and architecture Expo Hall

Leveraging live data to realize the smart cities vision

Arun Kejariwal (Independent), Roman Smolgovsky (MZ)

One of the key application domains leveraging live data is smart cities, but success depends on the availability of generic platforms that support high throughput and ultralow latency. Arun Kejariwal and Francois Orsini offer an overview of Satori's live data platform and walk you through a country-scale case study of its implementation.

5:10pm-5:50pm (40m) Data science and machine learning Expo Hall

Small pieces, loosely joined: A skater's code

Rodney Mullen (Almost Skateboards)

The essence of modern skating is learning tricks that couple with specific terrain. Activision’s video game franchise testifies to the nearly endless possibilities. Rodney Mullen offers a nuanced look at how skaters nudge the endpoints of disparate submovements to create new combinations that may shine a different light on ideas in machine learning—plus it’s a lot of fun.

11:00am-11:40am (40m) Data engineering and architecture Data Integration and Data Pipelines

The future of ETL isn’t what it used to be

Gwen Shapira (Confluent)

Gwen Shapira shares design and architecture patterns that are used to modernize data engineering and details how Apache Kafka, microservices, and event streams are used by modern engineering organizations to efficiently build data pipelines that are scalable, reliable, and built to evolve.

11:50am-12:30pm (40m) Big data and data science in the cloud, Data engineering and architecture, Streaming systems and real-time applications Data Integration and Data Pipelines

Radically modular data ingestion APIs in Apache Beam

Eugene Kirpichov (Google)

Apache Beam equips users with a novel programming model in which the classic batch/streaming data processing dichotomy is erased. Eugene Kirpichov details the modularity and composability advantages created by treating data ingestion as just another data processing task and walks you through building highly modular data ingestion APIs using the new Beam programming model primitive Splittable DoFn.

1:50pm-2:30pm (40m) Data engineering and architecture, Streaming systems and real-time applications Data Integration and Data Pipelines

How to build leakproof stream processing pipelines with Apache Kafka and Apache Spark

Jordan Hambleton (Cloudera), GuruDharmateja Medasani (Domino Data Lab)

When Kafka stream processing pipelines fail, they can leave users panicked about data loss when restarting their application. Jordan Hambleton and Guru Medasani explain how offset management provides users the ability to restore the state of the stream throughout its lifecycle, deal with unexpected failure, and improve accuracy of results.

2:40pm-3:20pm (40m) Big data and data science in the cloud, Data engineering and architecture, Visualization and user experience Data Integration and Data Pipelines

Semi-automated analytic pipeline creation and validation using active learning

Sean Ma (Trifacta)

Organizations leverage reporting, analytic, and machine learning pipelines to drive decision making and power critical operational systems. Sean Ma discusses methods for detecting, visualizing, and resolving inconsistencies between source and target data models across these pipelines.

4:20pm-5:00pm (40m) Data engineering and architecture Data Integration and Data Pipelines

Building a flexible ML pipeline at a B2B AI startup

Dorna Bandari (Jetlore)

Dorna Bandari offers an overview of the machine learning pipeline at B2B AI startup Jetlore and explains why even small B2B startups in AI should invest in a flexible machine learning pipeline. Dorna covers the design choices, the trade-offs made when implementing and maintaining the pipeline, and how it has accelerated Jetlore's product development and growth.

5:10pm-5:50pm (40m) Big data and data science in the cloud, Data engineering and architecture Data Integration and Data Pipelines

Pipeline testing with Great Expectations

Abe Gong (Superconductive Health), James Campbell (USG)

Data science and engineering have been missing out on one of the biggest productivity boosters in modern software development: automated testing. Abe Gong and James Campbell discuss the concept of pipeline tests and offer an overview of Great Expectations, an open source Python framework for bringing data pipelines and products under test.

11:00am-11:40am (40m) Sponsored

Digital transformation demands faster, more productive data science (sponsored by DataScience.com)

Ian Swanson (DataScience.com)

Ian Swanson shares strategies for leading more productive data science teams, along with steps you can take today to meet growing demands for AI and machine learning use cases.

11:50am-12:30pm (40m) Sponsored

The four elements of modern analytics (sponsored by MicroStrategy)

Vijay Kotu (Oath)

Vijay Kotu details how Oath is using MicroStrategy to combine elements of data science, enterprise mobility, information design, and data lakes in its transformation into an intelligent enterprise.

1:50pm-2:30pm (40m) Sponsored

Data at scale and speed: Real-world use cases (sponsored by MapR)

Ted Dunning (MapR, now part of HPE)

Getting value from data at large scale and on a variety of time scales is hard. True, it's not as hard as it used to be, but you still don’t win by default. Ted Dunning explains why it takes good design, the right technology, and a pragmatic approach to succeed.

2:40pm-3:20pm (40m) Sponsored

Building machine learning systems for scale: Amazon insights and best practices (sponsored by Amazon Web Services)

Guy Ernest (Amazon Web Services)

Amazon SageMaker is a platform to build, train, and deploy machine learning models at any scale. Guy Ernest explores the scalable algorithms that SageMaker provides, distributed training with Apache MXNet and TensorFlow, automatic tuning of hyperparameters, and model deployments.

4:20pm-5:00pm (40m) Sponsored

BI and big data convergence in modern cloud architecture (sponsored by Arcadia Data)

Terry McFadden (P&G)

Procter & Gamble relies heavily on data, particularly for BI. Running compute where the data lives is critical for performance, and the company has found added benefits to this architecture, which complements its Hadoop and BI needs. Terry McFadden offers an overview of P&G's modern analytics architecture and explains how it differs from traditional approaches.

11:00am-11:40am (40m) Sponsored

Architecting an edge-to-cloud data pipeline to unify multiple data sources and processing engines (sponsored by NetApp)

Santosh Rao (NetApp)

Santosh Rao explores the architecture of a data pipeline from edge to core to cloud and across various data sources and processing engines and explains how to build a solution architecture that helps businesses maximize competitive differentiation by unifying data insights in compelling yet efficient ways.

11:50am-12:30pm (40m) Sponsored

Journey to digital (sponsored by IBM)

Seth Dobrin, PhD (IBM)

Companies that want to become truly digital must take a journey of three steps: data transformation, data science transformation, and digital transformation. This also requires transforming the business with machine learning to fundamentally change the relationship with customers. Seth Dobrin explains the detailed steps along the way to digital transformation—and the pitfalls.

2:40pm-3:20pm (40m) Sponsored

Managing the intelligent data pipeline and the connected enterprise (sponsored by Hitachi Vantara)

Chuck Yarbrough (Hitachi Vantara)

Intelligently managing the data pipeline is the key to driving business acceleration and reducing costs. Chuck Yarbrough outlines ways to gain control over the data pipeline. Along the way, you’ll learn how cloud, big data, and machine learning models intersect and how streaming and cloud integration can help create the connected enterprise.

11:00am-11:40am (40m) Sponsored

Focus on your business: Case studies on building data solutions that meet your needs (sponsored by Microsoft)

Tobias Ternstrom (Microsoft)

Tobias Ternstrom leads a deep dive into case studies from three Microsoft customers who put technology before solutions. Tobias examines the decisions that brought them there and outlines how they got back on track and solved their business problems.

11:50am-12:30pm (40m) Sponsored

Analytics in real time, the (Grey's) anatomy of event streaming (sponsored by MemSQL)

Adam Ahringer (Disney-ABC TV Digital Media)

Adam Ahringer explains how Disney-ABC TV leverages Amazon Kinesis and MemSQL to provide real-time insights based on user telemetry as well as the platform for traditional data warehousing activities.

1:50pm-2:30pm (40m) Sponsored

Accelerating analytics and AI from the edge to the cloud (sponsored by Intel)

Kevin Huiskes (Intel), Radhika Rangarajan (Intel)

Advanced analytics and AI workloads require a scalable and optimized architecture, from hardware and storage to software and applications. Kevin Huiskes and Radhika Rangarajan share best practices for accelerating analytics and AI and explain how businesses globally are leveraging Intel’s technology portfolio, along with optimized frameworks and libraries, to build AI workloads at scale.

2:40pm-3:20pm (40m) Sponsored

The Snowflake data warehouse: How Sharethrough analyzes petabytes of event data in a SQL database (sponsored by Snowflake)

Dave Abercrombie (Sharethrough)

Dave Abercrombie explains how Sharethrough used Snowflake to build an analytic and reporting platform that handles petabyte-scale data with ease.

4:20pm-5:00pm (40m) Sponsored

Winning the big data war pays big dividends for Wargaming (sponsored by SAS)

Alexander Ryabov (Wargaming), Jonathan Crow (Wargaming)

Alexander Ryabov and Jonathan Crow explain how Wargaming is winning the battle for bigger profits in the virtual world of online gaming using a best-in-class business intelligence solution to equip its business units with decision-making tools.

5:10pm-5:50pm (40m) Sponsored

Bringing AI into the IoT (sponsored by SAS)

Evan Guarnaccia (SAS)

As the internet of things grows, there is an increasing need for sophisticated but lightweight analytics at the edge. Evan Guarnaccia walks you through a multiphase analytics approach to IoT data, analyzing data at rest to discover patterns of interest and develop analytical models that can be easily deployed into a streaming analytics engine out at the edge, in the fog, or in the cloud.

11:00am-11:40am (40m)

Data and ethics: Brainstorming session

Natalie Evans Harris (BrightHive)

Join Natalie Evans Harris for a brainstorming session on data and ethics. You'll cover the current Community Principles on Ethical Data Practices (CPEDP) and next steps, existing tools that support ethical data practices, how the community can support the needs of the individual, and whether or not the community needs to be held accountable to regulations (or something more like fiduciary duty).

11:50am-12:30pm (40m) Sponsored

Speed up mission-critical analytics in the cloud (sponsored by Kyligence)

Billy Liu (Kyligence)

As organizations look to scale their analytics capability, the need to grow beyond a traditional data warehouse becomes critical, and cloud-based solutions allow more flexibility while being more cost efficient. Billy Liu offers an overview of Kyligence Cloud, a managed Apache Kylin online service designed to speed up mission-critical analytics at web scale for big data.

5:10pm-5:50pm (40m) Data engineering and architecture Data Integration and Data Pipelines

The future of ETL isn’t what it used to be

Gwen Shapira (Confluent)

8:45am-8:50am (5m)

Wednesday keynote welcome

Ben Lorica (O'Reilly), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

8:50am-9:05am (15m)

Machine learning: What’s real and what’s hype

Hilary Mason (Cloudera Fast Forward Labs)

The power of machine learning is very real, but so too is the hype and confusion about when, where, and how to apply it. Hilary Mason explores practical business applications for intelligent machines and details the tools and processes required to implement machine learning successfully.

9:05am-9:20am (15m)

Merging human and machine learning for everyday solutions

Li Fan (Pinterest)

Li Fan shares insights into how Pinterest improves products based on usage and explains how the company is using AI to predict what’s in an image, what a user wants, and what they’ll want next, answering subjective questions better than machines or humans alone could achieve.

9:20am-9:30am (10m) Sponsored keynote

To a hammer, everything is a nail: Choosing the right tool for your business problems (sponsored by Microsoft)

Tobias Ternstrom (Microsoft)

The emergence of the cloud combined with open source software ushered in an explosive use of a broad range of technologies. Tobias Ternstrom explains why you should step back and attempt to objectively evaluate the problem you are trying to solve before choosing the tool to fix it.

9:30am-9:40am (10m) Data science and machine learning

Privacy in the age of machine learning

Ben Lorica (O'Reilly)

Ben Lorica shares emerging security best practices for business intelligence, machine learning, and mobile computing products and explores new tools, methods, and products that can help ease the way for companies interested in deploying secure and privacy-preserving analytics.

9:40am-9:55am (15m)

Crisis Text Line data usage and insights

Nancy Lublin (Crisis Text Line)

Nancy Lublin shares insights from Crisis Text Line.

9:55am-10:00am (5m) Sponsored keynote

Building the foundation of a latency-free life (sponsored by MemSQL)

Nikita Shamgunov (MemSQL)

We live in a world that’s always connected. As a result, today’s intelligent applications need to react immediately to changing conditions. To achieve this, applications require a foundation that is latency free. Nikita Shamgunov shares a vision of latency-free life supported by modern data architectures.

10:00am-10:10am (10m)

Defining responsible data practices: A community-driven approach

Natalie Evans Harris (BrightHive)

Natalie Evans Harris explores the Community Principles on Ethical Data Practices (CPEDP), a community-driven code of ethics for data collection, sharing, and utilization that provides people in the data science community a standard set of easily digestible, recognizable principles for guiding their behaviors.

10:10am-10:15am (5m) Sponsored keynote

Operationalizing machine learning (sponsored by IBM)

Dinesh Nirmal (IBM)

Machine learning research and incubation projects are everywhere, but less common, and far more valuable, is the innovation unlocked once you bring machine learning out of research and into production. Dinesh Nirmal explains how real-world machine learning reveals assumptions embedded in business processes and in the models themselves that cause expensive and time-consuming misunderstandings.

10:15am-10:30am (15m)

Data science in the cloud

Alex Smola (Amazon)

Alex Smola discusses lessons learned from AWS SageMaker, an integrated framework for handling all stages of analysis. AWS uses open source components such as Jupyter, Docker containers, Python, and well-established deep learning frameworks such as Apache MXNet and TensorFlow for an easy-to-learn workflow.

10:30am-11:00am (30m)

Break: Morning break sponsored by MemSQL

12:30pm-1:50pm (1h 20m)

Wednesday Topic Tables at lunch

Topic Table discussions are a great way to informally network with people in similar industries or interested in the same topics.

12:30pm-1:50pm (1h 20m)

Women in Big Data Luncheon (sponsored by LinkedIn)

Want to network with other women attending Strata Data Conference? Then be sure to come to the Women in Big Data Luncheon on Wednesday.

12:30pm-1:50pm (1h 20m)

Wednesday Business Summit Lunch

Strata Business Summit attendees and speakers are invited to join fellow executives, business leaders, and strategists for a networking lunch on Wednesday.

3:20pm-4:20pm (1h)

Break: Afternoon break sponsored by IBM

5:50pm-6:50pm (1h)

Booth Crawl

Join us for vendor-hosted libations (plus snacks) after sessions on Wednesday.

7:00pm-9:30pm (2h 30m)

Data After Dark: Night at the Market

Join us at San Pedro Square Market for an exciting evening filled with cocktails, food, and entertainment at Data After Dark at Strata in San Jose.

8:00am-8:30am (30m)

Speed Networking

Gather before keynotes on Wednesday morning for a speed networking event. Enjoy casual conversation while meeting fellow attendees. If one of your goals at Strata is to meet new people, this session will jumpstart your networking.

Sponsorship Opportunities

For exhibition and sponsorship opportunities, email strataconf@oreilly.com

Partner Opportunities

For information on trade opportunities with O'Reilly conferences, email partners@oreilly.com

©2018, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • confreg@oreilly.com