Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Monday, 03/05/2018

9:00am

9:00am–5:00pm Monday, 03/05/2018
Brooke Wenig (Databricks)
Brooke Wenig walks you through the core APIs for using Spark, fundamental mechanisms and basic internals of the framework, SQL and other high-level data access tools, and Spark’s streaming capabilities and machine learning APIs. Read more.
9:00am–5:00pm Monday, 03/05/2018
Location: 212 C
Brian Bloechle (Cloudera, Inc.)
Brian Bloechle demonstrates how to implement typical data science workflows using Apache Spark. You'll learn how to wrangle and explore data using Spark SQL DataFrames and how to build, evaluate, and tune machine learning models using Spark MLlib. Read more.
9:00am–5:00pm Monday, 03/05/2018
Location: 212 D
Angie Ma (ASI)
Angie Ma offers a condensed introduction to key data science and machine learning concepts and techniques, showing you what is (and isn't) possible with these exciting new tools and how they can benefit your organization. Read more.
9:00am–5:00pm Monday, 03/05/2018
Robert Schroll (The Data Incubator), Dana Mastropole (The Data Incubator)
The TensorFlow library enables the use of data flow graphs for numerical computations, with automatic parallelization across several CPUs or GPUs, making it ideal for implementing neural networks and other machine learning algorithms. Robert Schroll and Dana Mastropole demonstrate TensorFlow's capabilities and walk you through building machine learning models on real-world data. Read more.
9:00am–5:00pm Monday, 03/05/2018
Jesse Anderson (Big Data Institute)
To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data and how will you process that much data? Jesse Anderson explores the latest real-time frameworks (both open source and managed cloud services), discusses the leading cloud providers, and explains how to choose the right one for your company. Read more.
9:00am–5:00pm Monday, 03/05/2018
Location: San Jose Ballroom (salon 1&2)
Delip Rao (R7 Speech Science), Brian McMahan (Joostware)
PyTorch is a recent deep learning framework from Facebook that is gaining massive momentum in the deep learning community. Its fundamentally flexible design makes building and debugging models straightforward, simple, and fun. Delip Rao and Brian McMahan walk you through PyTorch's capabilities and demonstrate how to use PyTorch to build deep learning models and apply them to real-world problems. Read more.
9:00am–5:00pm Monday, 03/05/2018
Location: Willow Glen (1&2)
Zachary Glassman (The Data Incubator)
Instructors from the Data Incubator demonstrate how to build intelligent business applications using machine learning, taking you through each step in developing a machine learning pipeline, from prototyping to production. You'll explore data cleaning, feature engineering, model building and evaluation, and deployment, and you'll extend your knowledge by building two applications from real-world datasets. Read more.

Tuesday, 03/06/2018

9:00am

9:00am–5:00pm Tuesday, 03/06/2018
Location: LL20 A
Madhav Madaboosi (BP), Meenakshisundaram Thandavarayan (Infosys), Meagan O'Leary (Microsoft), Katie Malone (Civis Analytics), Mike Prorock (mesur.io), Thomas Miller (Northwestern University), Ann Nguyen (Whole Whale), Rajiv Synghal (Kaiser Permanente), Rishi Ranjan (Freddie Mac), Wayde Fleener (General Mills), Jules Malin (GoPro), Alistair Croll (Solve For Interesting)
Hear practical insights from household brands and global companies: the challenges they tackled, approaches they took, and the benefits—and drawbacks—of their solutions. Read more.
9:00am–5:00pm Tuesday, 03/06/2018
Location: LL20 B
Ray Bernard (SuprFanz), Jennifer Webb (SuprFanz)
Hear from innovators in ad tech, measurement, automation, and audience engagement about where the media industry is today—and where it's likely to go next. Read more.
9:00am–12:30pm Tuesday, 03/06/2018
Burcu Baran (LinkedIn), Wei Di (LinkedIn), Michael Li (LinkedIn), Chi-Yi Kuan (LinkedIn)
Burcu Baran, Wei Di, Michael Li, and Chi-Yi Kuan walk you through the big data analytics and data science lifecycle and share their experience and lessons learned leveraging advanced analytics and machine learning techniques such as predictive modeling to drive and grow business at LinkedIn. Read more.
9:00am–5:00pm Tuesday, 03/06/2018
Join in for an introduction to Apache Spark 2.0 core concepts with a focus on Spark's machine learning library, using text mining on real-world data as the primary end-to-end use case. Read more.
9:00am–12:30pm Tuesday, 03/06/2018
Philip Langdale (Cloudera), Eugene Fratkin (Cloudera), Vinithra Varadharajan (Cloudera)
Vinithra Varadharajan, Philip Langdale, and Eugene Fratkin lead a deep dive into running data analytic workloads in a managed service capacity in the public cloud and highlight cloud infrastructure best practices. Read more.
9:00am–12:30pm Tuesday, 03/06/2018
Jorge A. Lopez (Amazon Web Services)
Want to learn how to use Amazon's big data web services to launch your first big data application on the cloud? Jorge Lopez walks you through building a big data application using a combination of open source technologies and AWS managed services. Read more.
9:00am–12:30pm Tuesday, 03/06/2018
Data science and machine learning
Location: LL21 C/D Level: Intermediate
Mario Inchiosa (Microsoft), Vanja Paunic (Microsoft), Robert Horton (Microsoft), Debraj GuhaThakurta (Microsoft), Ali Zaidi (Microsoft), Tomas Singliar (Microsoft), John-Mark Agosta (Microsoft)
R and Python top the list of languages used in data science and machine learning, and data scientists and engineers fluent in one of these languages are increasingly marketable. Come learn how to build and operationalize machine learning models using distributed functions and do scalable, end-to-end data science in R and Python on single machines, Spark clusters, and cloud-based infrastructure. Read more.
9:00am–12:30pm Tuesday, 03/06/2018
Data science and machine learning
Location: LL21 E/F Level: Intermediate
Yufeng Guo (Google), Amy Unruh (Google)
Yufeng Guo and Amy Unruh walk you through training and deploying a machine learning system using TensorFlow, a popular open source library. Yufeng and Amy take you from a conceptual overview all the way to building complex classifiers and explain how you can apply deep learning to complex problems in science and industry. Read more.
9:00am–12:30pm Tuesday, 03/06/2018
Data science and machine learning
Location: 210 A/E Level: Intermediate
Vartika Singh (Cloudera), Jeffrey Shmain (Cloudera)
Vartika Singh and Jeffrey Shmain outline approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, text, etc.) that leverage Spark, its extended ecosystem of libraries, and deep learning frameworks. Read more.
9:00am–12:30pm Tuesday, 03/06/2018
Secondary topics: Graphs and Time-series
Karthik Ramasamy (Streamlio), Sanjeev Kulkarni (Streamlio), Sijie Guo (Streamlio)
Across diverse segments in industry, there has been a shift in focus from big data to fast data. Karthik Ramasamy, Sanjeev Kulkarni, and Sijie Guo walk you through state-of-the-art streaming architectures, streaming frameworks, and streaming algorithms, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them. Read more.
9:00am–12:30pm Tuesday, 03/06/2018
Tim Berglund (Confluent)
Tim Berglund leads a basic architectural introduction to Kafka and walks you through using Kafka Streams and KSQL to process streaming data. Read more.
9:00am–12:30pm Tuesday, 03/06/2018
Secondary topics: Graphs and Time-series
Mo Patel (Independent), Neejole Patel (Virginia Tech)
Since its arrival in early 2017, PyTorch has won over many deep learning researchers and developers due to its dynamic computation framework. Mo Patel and Neejole Patel walk you through using PyTorch to build a content recommendation model. Read more.

10:30am

10:30am–11:00am Tuesday, 03/06/2018
Location: Break
Break (30m)

12:30pm

12:30pm–1:30pm Tuesday, 03/06/2018
Location: Lunch
Break (1h)

1:30pm

1:30pm–5:00pm Tuesday, 03/06/2018
Data-driven business management, Strata Business Summit
Location: LL20 C Level: Non-technical
Nick Elprin (Domino Data Lab)
The honeymoon era of data science is ending, and accountability is coming. Not content to wait for results that may or may not arrive, successful data science leaders deliver measurable impact on an increasing share of an enterprise's KPIs. Nick Elprin details how leading organizations have taken a holistic approach to people, process, and technology to build a sustainable competitive advantage. Read more.
1:30pm–5:00pm Tuesday, 03/06/2018
Data engineering and architecture
Location: LL21 A Level: Intermediate
Juan Yu (Cloudera)
Apache Impala (incubating) is an exceptional, best-of-breed massively parallel processing SQL query engine that is a fundamental component of the big data software stack. Juan Yu explores the cost model the Impala planner uses, how Impala optimizes queries, how to identify performance bottlenecks through query plans and profiles, and how to drive Impala to its full potential. Read more.
1:30pm–5:00pm Tuesday, 03/06/2018
Data engineering and architecture
Location: LL21 B Level: Intermediate
Ron Bodkin (Google)
TensorFlow and Keras are popular libraries for machine learning because of their support for deep learning and GPU deployment. Join Ron Bodkin to learn how to execute these libraries in production with vision and recommendation models and how to export, package, deploy, optimize, serve, monitor, and test models using Docker and TensorFlow Serving in Kubernetes. Read more.
1:30pm–5:00pm Tuesday, 03/06/2018
Ronny Kohavi (Microsoft), Alex Deng (Microsoft), Paul Raff (Microsoft)
Controlled experiments such as A/B tests have revolutionized the way software is being developed, allowing real users to objectively evaluate new ideas. Ronny Kohavi, Alex Deng, Pavel Dmitriev, and Paul Raff lead an introduction to A/B testing and share lessons learned from one of the largest A/B testing platforms on the planet, running at Microsoft, which executes over 10K experiments a year. Read more.
1:30pm–5:00pm Tuesday, 03/06/2018
Abhishek Kumar (SapientRazorfish), Dr. Vijay Srinivas Agneeswaran (SapientRazorfish)
Abhishek Kumar and Vijay Srinivas Agneeswaran offer an introduction to deep learning-based recommendation and learning-to-rank systems using TensorFlow. You'll learn how to build a recommender system based on intent prediction using deep learning that is based on a real-world implementation for an ecommerce client. Read more.
1:30pm–5:00pm Tuesday, 03/06/2018
Data science and machine learning
Location: 210 A/E Level: Intermediate
David Talby (Pacific AI), Claudiu Branzan (G2 Web Services), Alexander Thomas (Indeed)
Natural language processing is a key component in many data science systems. David Talby, Claudiu Branzan, and Alex Thomas lead a hands-on tutorial on scalable NLP using spaCy for building annotation pipelines, Spark NLP for building distributed natural language machine-learned pipelines, and Spark ML and TensorFlow for using deep learning to build and apply word embeddings. Read more.
1:30pm–5:00pm Tuesday, 03/06/2018
Data engineering and architecture
Location: 210 B/F Level: Intermediate
Secondary topics: Graphs and Time-series
Ted Malaska (Blizzard Entertainment)
If you have data that has a time factor to it, then you need to think in terms of time series datasets. Ted Malaska explores time series in all of its forms, from tumbling windows to sessionization in batch or in streaming. You'll gain exposure to the tools and background you need to be successful in the world of time-oriented data. Read more.
1:30pm–5:00pm Tuesday, 03/06/2018
Dean Wampler (Lightbend), Boris Lublinsky (Lightbend)
Join Dean Wampler and Boris Lublinsky to learn how to build two microservice streaming applications based on Kafka using Akka Streams and Kafka Streams for data processing. You'll explore the strengths and weaknesses of each tool for particular design needs and contrast them with Spark Streaming and Flink, so you'll know when to choose them instead. Read more.
1:30pm–5:00pm Tuesday, 03/06/2018
James Bednar (Anaconda), Philipp Rudiger (Anaconda)
Python lets you solve data science problems by stitching together packages from its ecosystem, but it can be difficult to choose packages that work well together. James Bednar and Philipp Rudiger walk you through a concise, fast, easily customizable, and fully reproducible recipe for interactive visualization of millions or billions of datapoints—all in just 30 lines of Python code. Read more.

3:00pm

3:00pm–3:30pm Tuesday, 03/06/2018
Location: Lunch
Break (30m)

5:00pm

5:00pm–6:30pm Tuesday, 03/06/2018
Location: Hall 1, 2, 3
Join us after tutorials on Tuesday in the Expo Hall. Grab a drink and mingle with fellow Strata attendees while you check out all of the exhibitors. Read more.

6:30pm

6:30pm–8:00pm Tuesday, 03/06/2018
Location: TBD (Ignite)
Ignite is happening at Strata on Tuesday, March 6. Join us for a fun, high-energy evening of five-minute talks—all aspiring to live up to the Ignite motto: Enlighten us, but make it quick. Read more.

Wednesday, 03/07/2018

8:00am

8:00am–8:30am Wednesday, 03/07/2018
Location: TBD (Speed Networking)
Gather before keynotes on Wednesday morning for a speed networking event. Enjoy casual conversation while meeting fellow attendees. If one of your goals at Strata is to meet new people, this session will jumpstart your networking with fellow attendees. Read more.

8:30am

8:30am–8:45am Wednesday, 03/07/2018
Location: Break
Coffee break (15m)

8:45am

8:45am–8:50am Wednesday, 03/07/2018
Location: Grand Ballroom 220
Ben Lorica (O'Reilly Media), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes. Read more.

8:50am

8:50am–9:05am Wednesday, 03/07/2018
Location: Grand Ballroom 220
Details to come. Read more.

9:05am

9:05am–9:20am Wednesday, 03/07/2018
Location: Grand Ballroom 220
Li Fan (Pinterest)
Li Fan, Senior Vice President, Engineering, Pinterest. Read more.

9:35am

9:35am–9:50am Wednesday, 03/07/2018
Location: Grand Ballroom 220
Wayne Peacock (Blizzard Entertainment)
Wayne Peacock, Vice President, Global Insights for Blizzard Entertainment. Read more.

10:00am

10:00am–10:10am Wednesday, 03/07/2018
Location: Grand Ballroom 220
Natalie Evans Harris (BrightHive)
Natalie Evans Harris, COO and VP of Ecosystem Development, BrightHive, Inc. Read more.

10:15am

10:15am–10:30am Wednesday, 03/07/2018
Location: Grand Ballroom 220
Seth Stephens-Davidowitz (Everybody Lies | NY Times)
Keynote with Seth Stephens-Davidowitz. Read more.

10:30am

10:30am–11:00am Wednesday, 03/07/2018
Location: Break
Morning break sponsored by MemSQL (30m)

11:00am

11:00am–11:40am Wednesday, 03/07/2018
Data engineering and architecture
Location: 212 A-B Level: Intermediate
Secondary topics: Data Integration and Data Pipelines
Gwen Shapira (Confluent)
Gwen Shapira shares design and architecture patterns that are used to modernize data engineering and details how Apache Kafka, microservices, and event streams are used by modern engineering organizations to efficiently build data pipelines that are scalable, reliable, and built to evolve. Read more.
11:00am–11:40am Wednesday, 03/07/2018
Data science and machine learning
Location: LL20 A Level: Intermediate
Michael Lee Williams (Fast Forward Labs)
Interpretable models result in more accurate, safer, and more profitable machine learning products. But interpretability can be hard to ensure. Michael Lee Williams explores the growing business case for interpretability and its concrete applications, including churn, finance, and healthcare. Along the way, Michael offers an overview of the open source, model-agnostic tool LIME. Read more.
11:00am–11:40am Wednesday, 03/07/2018
Data science and machine learning
Location: LL20 C Level: Non-technical
Daniel Lurie (Pinterest)
All successful startups thrive on tight product-market fit, which can produce homogeneous initial user bases. To become the next big thing, your user base will need to diversify and your product must change to accommodate new needs. Daniel Lurie explains how Pinterest leverages external data to measure racial and income diversity in its user base and how it changed user modeling to drive growth. Read more.
11:00am–11:40am Wednesday, 03/07/2018
Wee Hyong Tok (Microsoft), Danielle Dean (Microsoft)
Artificial intelligence (AI) has tremendous potential to extend our capabilities and empower organizations to accelerate their digital transformation. Wee Hyong Tok and Danielle Dean demystify AI for big data professionals and explain how they can leverage and evolve their valuable big data skills by getting started with AI. Read more.
11:00am–11:40am Wednesday, 03/07/2018
Rajat Monga (Google)
Rajat Monga offers an overview of TensorFlow progress and adoption in 2017 before looking ahead to the areas of importance in the future—performance, usability, and ubiquity—and the efforts TensorFlow is making in those areas. Read more.
11:00am–11:40am Wednesday, 03/07/2018
Strata Business Summit
Location: LL21 C/D Level: Intermediate
Mark Madsen (Third Nature)
If your goal is to provide data to an analyst rather than a data scientist, what’s the best way to deliver analytics? There are 70+ BI tools in the market and a dozen or more SQL- or OLAP-on-Hadoop open source projects. A panel of experts details the trade-offs between a number of architectures that provide self-service access to data. Read more.
11:00am–11:40am Wednesday, 03/07/2018
Law, ethics, and governance, Strata Business Summit
Location: LL21 E/F Level: Intermediate
Anne Buff (SAS Institute)
Emerging technologies such as the IoT, AI, and ML present businesses with enormous opportunities for innovation, but to maximize the potential of these technologies, businesses must radically shift their approach to governance. Anne Buff explains what it takes to shift the focus of governance from standards, conformity, and control to accountability, extensibility, and enablement. Read more.
11:00am–11:40am Wednesday, 03/07/2018
Tom Fisher (MapR Technologies)
The monolithic cloud is dying. Delivering capabilities across multiple clouds and, simultaneously, transitioning to the next generation platforms and applications is the challenge today. Tom Fisher explores technological approaches and solutions that make this possible while delivering data-driven applications and operations. Read more.
11:00am–11:40am Wednesday, 03/07/2018
Data engineering and architecture
Location: 210 C/G Level: Intermediate
Kinnary Jangla (Pinterest)
Having trouble coordinating development of your production ML system across a team of developers? Microservices drifting and causing problems debugging? Kinnary Jangla explains how Pinterest dockerized the services powering its home feed and how it impacted the engineering productivity of the company's ML teams while increasing uptime and ease of deployment. Read more.
11:00am–11:40am Wednesday, 03/07/2018
Secondary topics: Graphs and Time-series
Shivnath Babu (Duke University | Unravel Data Systems), Sumit Jindal (Unravel Data Systems)
Getting the best performance, predictability, and reliability for Kafka-based applications is a complex art. Shivnath Babu and Sumit Jindal explain how to simplify the process by leveraging recent advances in machine learning and AI and outline a methodology for applying statistical learning to the rich and diverse monitoring data that is available from Kafka. Read more.
11:00am–11:40am Wednesday, 03/07/2018
Strata Business Summit
Location: 230 A Level: Intermediate
Mauro Damo (Dell EMC), Wei Lin (Dell EMC)
Image recognition classification of diseases will minimize the possibility of medical mistakes, improve patient treatment, and speed up patient diagnosis. Mauro Damo and Wei Lin offer an overview of an approach to identify bladder cancer in patients using unsupervised and supervised machine learning techniques on more than 5,000 magnetic resonance images from the Cancer Imaging Archive. Read more.
11:00am–11:40am Wednesday, 03/07/2018
Data engineering and architecture
Location: 230 C Level: Intermediate
Daniel Templeton (Cloudera), Andrew Wang (Cloudera)
Hadoop 3.0 has been years in the making, and now it's finally arriving. Andrew Wang and Daniel Templeton offer an overview of new features, including HDFS erasure coding, YARN Timeline Service v2, YARN federation, and much more, and discuss current release management status and community testing efforts dedicated to making Hadoop 3.0 the best Hadoop major release yet. Read more.

11:50am

11:50am–12:30pm Wednesday, 03/07/2018
Secondary topics: Data Integration and Data Pipelines
Eugene Kirpichov (Google)
Apache Beam equips users with a novel programming model in which the classic batch/streaming data processing dichotomy is erased. Eugene Kirpichov details the modularity and composability advantages created by treating data ingestion as just another data processing task and walks you through building highly modular data ingestion APIs using the new Beam programming model primitive SplittableDoFn. Read more.
11:50am–12:30pm Wednesday, 03/07/2018
Dan Crankshaw (UC Berkeley RISE Lab)
Clipper is an open source, general-purpose model-serving system that provides low-latency predictions under heavy serving workloads for interactive applications. Dan Crankshaw offers an overview of the Clipper serving system and explains how to use it to serve Apache Spark and TensorFlow models on Kubernetes. Read more.
11:50am–12:30pm Wednesday, 03/07/2018
Balasubramanian Narasimhan (Stanford University), John-Mark Agosta (Microsoft), Daniel Rubin (Stanford)
Clinical collaboration benefits from pooling data to train models from large datasets, but it's hampered by concerns about sharing data. Balasubramanian Narasimhan, John-Mark Agosta, and Daniel Rubin outline a privacy-preserving alternative that creates statistical models equivalent to one built from the entire dataset. Read more.
11:50am–12:30pm Wednesday, 03/07/2018
Diane Chang (Intuit)
When building a chatbot, it’s important to develop one that is humanized, has contextual responses, and can simulate true empathy for the end users. Diane Chang shares how Intuit's data science team preps, cleans, organizes, and augments training data along with best practices she's learned along the way. Read more.
11:50am–12:30pm Wednesday, 03/07/2018
Shivaram Venkataraman (UC Berkeley), Sergey Ermolin (Intel), Ding Ding (Intel)
The BigDL framework scales deep learning for large datasets using Apache Spark. However, there is significant scheduling overhead from Spark when running BigDL at large scale. Shivaram Venkataraman, Sergey Ermolin, and Ding Ding outline a new parameter manager implementation that, along with coarse-grained scheduling, can provide significant speedups for deep learning models like Inception and VGG. Read more.
11:50am–12:30pm Wednesday, 03/07/2018
Strata Business Summit
Location: LL21 C/D Level: Non-technical
Yishay Carmiel (IntelligentWire | Spoken Labs)
One of the most important tasks of AI has been to understand humans. People want machines to understand not only what they say but also what they mean and to take particular actions based on that information. This goal is the essence of conversational AI. Yishay Carmiel explores the latest breakthroughs and revolutions in this field and the challenges still to come. Read more.
11:50am–12:30pm Wednesday, 03/07/2018
Data-driven business management, Strata Business Summit
Location: LL21 E/F Level: Beginner
Matthew Granade (Domino Data Lab)
Predictive analytics and artificial intelligence have become critical competitive capabilities. Yet IT teams struggle to provide the support data science teams need to succeed. Matthew Granade explains how leading banks, insurance and pharmaceutical companies, and others manage data science at scale. Read more.
11:50am–12:30pm Wednesday, 03/07/2018
Bin Fan (Alluxio), Shaoshan Liu (PerceptIn)
Bin Fan and Shaoshan Liu explain how PerceptIn designed and implemented a cloud architecture to support video streaming and online object recognition tasks and demonstrate how Alluxio delivers high throughput, low latency, and a unified namespace to support these emerging cloud architectures. Read more.
11:50am–12:30pm Wednesday, 03/07/2018
Data engineering and architecture
Location: 210 C/G Level: Non-technical
Crystal Valentine (MapR Technologies)
DataOps—a methodology for developing and deploying data-intensive applications, especially those involving data science and machine learning pipelines—supports cross-functional collaboration and fast time to value with an Agile, self-service workflow. Crystal Valentine offers an overview of this emerging field and explains how to implement a DataOps process. Read more.
11:50am–12:30pm Wednesday, 03/07/2018
Secondary topics: Graphs and Time-series
William Chambers (Databricks), Michael Armbrust (Databricks)
William Chambers and Michael Armbrust discuss the motivation and basics of Apache Spark's Structured Streaming processing engine and share lessons they've learned running hundreds of Structured Streaming workloads in the cloud. Read more.
11:50am–12:30pm Wednesday, 03/07/2018 TBC
11:50am–12:30pm Wednesday, 03/07/2018
Data engineering and architecture
Location: 230 C Level: Intermediate
Edwina Lu (LinkedIn), Ye Zhou (LinkedIn), Min Shen (LinkedIn)
Spark applications need to be well tuned so that individual applications run quickly and reliably, and cluster resources are efficiently utilized. Edwina Lu, Ye Zhou, and Min Shen outline a fast, reliable, and automated process used at LinkedIn for tuning Spark applications, enabling users to quickly identify and fix problems. Read more.

12:30pm

12:30pm–1:50pm Wednesday, 03/07/2018
Location: Hall 1, 2, 3
Topic Table discussions are a great way to informally network with people in similar industries or interested in the same topics. Read more.
12:30pm–1:50pm Wednesday, 03/07/2018
Location: Almaden Ballroom, San Jose Hilton
Want to network with other women attending Strata Data Conference? Then be sure to come to the Women in Data Luncheon on Wednesday. Read more.
12:30pm–1:50pm Wednesday, 03/07/2018
Location: TBD (SBS lunch)
Join fellow executives, business leaders, and strategists for a networking lunch on Wednesday for Strata Business Summit attendees and speakers. Read more.

1:50pm

1:50pm–2:30pm Wednesday, 03/07/2018
Secondary topics: Data Integration and Data Pipelines
Jordan Hambleton (Cloudera), Guru Medasani (Cloudera)
When Kafka stream processing pipelines fail, they can leave users panicked about data loss when restarting their application. Jordan Hambleton and Guru Medasani explain how offset management provides users the ability to restore the state of the stream throughout its lifecycle, deal with unexpected failure, and improve accuracy of results. Read more.
1:50pm–2:30pm Wednesday, 03/07/2018
Secondary topics: Graphs and Time-series
Alexandra Gunderson (Arundo Analytics)
Heavy industries, such as oil and gas, have tremendous amounts of data from which predictive models could be built, but it takes weeks and even months to create a comprehensive dataset from all of the various data sources. Alexandra Gunderson details the methodology behind an industry-tested approach that incorporates machine learning to structure and link data from different sources. Read more.
1:50pm–2:30pm Wednesday, 03/07/2018
Data science and machine learning
Location: LL20 C Level: Advanced
Secondary topics: Graphs and Time-series
Andrew Ray (Sam’s Club Technology)
Andrew Ray offers a brief introduction to the distributed graph algorithm abstractions provided by Pregel, PowerGraph, and GraphX, drawing on real-world examples, and provides historical context for the evolution between these three abstractions. Read more.
1:50pm–2:30pm Wednesday, 03/07/2018
Data science and machine learning
Location: LL20 D Level: Intermediate
Vincent Xie (Intel), Peng Meng (Intel)
Intel has been deeply involved in Spark from its earliest moments. Vincent Xie and Peng Meng share what Intel has been working on in Spark ML and introduce the methodology behind Intel's work on Spark ML optimization. Read more.
1:50pm–2:30pm Wednesday, 03/07/2018
Location: LL21 B
TBC
1:50pm–2:30pm Wednesday, 03/07/2018
Data-driven business management, Strata Business Summit
Location: LL21 C/D Level: Non-technical
Frances Haugen (Pinterest), Patrick Phelps (Pinterest)
Data science is most powerful when combined with deep domain knowledge, but those with domain knowledge don't work on data-focused teams. So how do you empower employees with diverse backgrounds and skill sets to be effective users of data? Frances Haugen and Patrick Phelps dive into the social side of data and share strategies for unlocking otherwise unobtainable insights. Read more.
1:50pm–2:30pm Wednesday, 03/07/2018
Data-driven business management, Strata Business Summit
Location: LL21 E/F Level: Beginner
Paco Nathan (O'Reilly Media)
Human in the loop (HITL) has emerged as a key design pattern for managing teams where people and machines collaborate. Such systems are mostly automated, with exceptions referred to human experts, who help train the machines further. Paco Nathan offers an overview of HITL from the perspective of a business manager, focusing on use cases within O'Reilly Media. Read more.
1:50pm–2:30pm Wednesday, 03/07/2018
Data engineering and architecture
Location: 210 A/E Level: Intermediate
Kurt Brown (Netflix)
Kurt Brown explains how to get the most out of your data infrastructure with 20 principles and practices used at Netflix. Kurt covers each in detail and explores how they relate to the technologies used at Netflix, including S3, Spark, Presto, Druid, R, Python, and Jupyter. Read more.
1:50pm–2:30pm Wednesday, 03/07/2018
Zhen Fan (JD.com), Wei Ting Chen (Intel)
Zhen Fan and Wei Ting Chen explain how JD.com uses Spark on Kubernetes in a production environment and why the company chose Spark on Kubernetes for its AI workloads. You'll learn how to run Spark with Kubernetes and gain an understanding of the advantages this provides. Read more.
1:50pm–2:30pm Wednesday, 03/07/2018
Debasish Ghosh (Lightbend)
Debasish Ghosh explores the role that approximation data structures play in processing streaming data. Typically streams are unbounded in space and time, and processing has to be done online using sublinear space. Debasish covers the probabilistic bounds that these data structures offer and how they can be used to implement solutions for the fast and streaming architectures. Read more.
1:50pm–2:30pm Wednesday, 03/07/2018
Strata Business Summit
Location: 230 A Level: Beginner
Ayin Vala (Foundation for Precision Medicine)
Complex diseases like Alzheimer’s cannot be cured by pharmaceutical or genetic sciences alone, and current treatments and therapies lead to mixed successes. Ayin Vala explains how to use the power of big data and AI to treat challenging diseases with personalized medicine, which takes into account individual variability in medicine intake, lifestyle, and genetic factors for each patient. Read more.
1:50pm–2:30pm Wednesday, 03/07/2018
Siddharth Teotia (Dremio)
Query processing technology has rapidly evolved since the iconic C-Store paper was published in 2005, with a focus on designing query processing algorithms and data structures that efficiently utilize CPU and leverage the changing trends in hardware to deliver optimal performance. Siddharth Teotia outlines the different types of vectorized query processing in Dremio using Apache Arrow. Read more.

2:40pm

2:40pm–3:20pm Wednesday, 03/07/2018 Secondary topics:  Data Integration and Data Pipelines
Sean Kandel (Trifacta)
Organizations leverage reporting, analytic, and machine learning pipelines to drive decision making and power critical operational systems. Sean Kandel discusses methods for detecting, visualizing, and resolving inconsistencies between source and target data models across these pipelines. Read more.
2:40pm–3:20pm Wednesday, 03/07/2018 Secondary topics:  Graphs and Time-series
Baron Schwartz (VividCortex)
Anomaly detection is white hot in the monitoring industry, but many don't really understand or care about it, while others repeat the same pattern many times. Why? And what can we do about it? Baron Schwartz explains how he arrived at a "post-anomaly detection" point of view. Read more.
2:40pm–3:20pm Wednesday, 03/07/2018 Secondary topics:  Graphs and Time-series
TBC
2:40pm–3:20pm Wednesday, 03/07/2018
Joseph Bradley (Databricks)
Joseph Bradley discusses common paths to productionizing Apache Spark MLlib models and shares engineering challenges and corresponding best practices. Along the way, Joseph covers several deployment scenarios, including batch scoring, Structured Streaming, and real-time low-latency serving. Read more.
2:40pm–3:20pm Wednesday, 03/07/2018
Mohamed AbdelHady (Microsoft), Zoran Dzunic (Microsoft)
Mohamed AbdelHady and Zoran Dzunic demonstrate how to build a domain-specific entity extraction system from unstructured text using deep learning. In the model, domain-specific word embedding vectors are trained on a Spark cluster using millions of PubMed abstracts and then used as features to train an LSTM recurrent neural network for entity extraction. Read more.
2:40pm–3:20pm Wednesday, 03/07/2018
Strata Business Summit
Location: LL21 C/D
Michael Chui (McKinsey Global Institute)
After decades of extravagant promises, artificial intelligence is finally starting to deliver real-life benefits to early adopters. However, we're still early in the cycle of adoption. Michael Chui explains where investment is going, patterns of AI adoption and value capture by enterprises, and how the value potential of AI across sectors and business functions is beginning to emerge. Read more.
2:40pm–3:20pm Wednesday, 03/07/2018
Data-driven business management, Strata Business Summit
Location: LL21 E/F Level: Non-technical
Katie Malone (Civis Analytics), Skipper Seabold
A huge challenge for data science managers is determining priorities for their team. Every data science team has more good ideas than they have time, so it’s critical to quickly prioritize the highest-impact projects. This talk shares a framework that our large and diverse data science team uses to identify, discuss, select, and manage a data science portfolio for a fast-moving startup. Read more.
2:40pm–3:20pm Wednesday, 03/07/2018
Data engineering and architecture
Location: 210 A/E Level: Intermediate
Mark Grover (Lyft), Arup Malakar (Lyft)
Mark Grover and Arup Malakar offer an overview of how Lyft leverages application metrics, logs, and auditing to monitor and troubleshoot its data platform and share how the company dogfoods the platform to provide security, auditing, alerting, and replayability. They also detail some of the internal services and tools Lyft has developed to make its data more robust, scalable, and self-serving. Read more.
2:40pm–3:20pm Wednesday, 03/07/2018
Manu Mukerji (Criteo)
Criteo is a global leader in commerce marketing. Manu Mukerji walks you through Criteo's machine learning example for universal catalogs, explaining how the training and test sets are generated and annotated, how the model is pushed to production, evaluated (automatically), and used, production issues that arise when applying ML at scale in production, lessons learned, and more. Read more.
2:40pm–3:20pm Wednesday, 03/07/2018
Henry Cai (Pinterest), Yi Yin (Pinterest)
With the rise of large-scale real-time computation, there is a growing need to link legacy MySQL systems with real-time platforms. Henry Cai and Yi Yin offer an overview of WaterMill, Pinterest's continuous DB ingestion system for streaming SQL data into near-real-time computation pipelines to support dynamic personalized recommendations and search indices. Read more.
2:40pm–3:20pm Wednesday, 03/07/2018
Law, ethics, and governance, Strata Business Summit
Location: 230 A Level: Beginner
Or Herman-Saffar (Dell), Ran Taig (Dell)
What if we could predict when and where the next crimes will be committed? Or Herman-Saffar and Ran Taig offer an overview of Crimes in Chicago, a publicly published dataset of reported incidents of crime that have occurred in Chicago since 2001. Or and Ran explain how to use this data to explore committed crimes to find interesting trends and make predictions for the future. Read more.
2:40pm–3:20pm Wednesday, 03/07/2018
Ritesh Agrawal (Uber), Anirban Deb (Uber)
Presto has emerged as the de facto query engine to quickly process petabytes of data. However, rogue SQL queries can waste a significant amount of critical compute resource and reduce Presto's throughput. Ritesh Agrawal and Anirban Deb explain how Uber uses machine learning to identify and stop rogue queries, saving both computational power and money. Read more.

3:20pm

3:20pm–4:20pm Wednesday, 03/07/2018
Location: Break
Afternoon break sponsored by IBM (1h)

4:20pm

4:20pm–5:00pm Wednesday, 03/07/2018 Secondary topics:  Data Integration and Data Pipelines
Dorna Bandari (Jetlore)
Dorna Bandari offers an overview of the machine learning pipeline at B2B AI startup Jetlore and explains why even small B2B startups in AI should invest in a flexible machine learning pipeline. Dorna covers the design choices, the trade-offs made when implementing and maintaining the pipeline, and how it has accelerated Jetlore's product development and growth. Read more.
4:20pm–5:00pm Wednesday, 03/07/2018
Data science and machine learning
Location: LL20 A Level: Intermediate
Secondary topics:  Graphs and Time-series
Joseph Richards (GE Digital)
Deploying ML software applications for use cases in the industrial internet presents a unique set of challenges. Data-driven problems require approaches that are highly accurate, robust, fast, scalable, and fault tolerant. Joseph Richards shares GE's approach to building production-grade ML applications and explores work across GE in industries such as power, aviation, and oil and gas. Read more.
4:20pm–5:00pm Wednesday, 03/07/2018 Secondary topics:  Graphs and Time-series
Vlad A Ionescu (ShiftLeft), Fabian Yamaguchi (ShiftLeft)
Vlad Ionescu and Fabian Yamaguchi outline Code Property Graph (CPG), a unique approach that allows the functional elements of code to be represented in an interconnected graph of data and control flows, which enables semantic information about code to be stored scalably on distributed graph databases over the web while allowing them to be rapidly accessed. Read more.
4:20pm–5:00pm Wednesday, 03/07/2018
Data science and machine learning
Location: LL20 D Level: Intermediate
Rachita Chandra (IBM Watson Health)
Rachita Chandra outlines challenges and considerations for transforming a research prototype built for a single machine to a deployable healthcare solution that leverages Spark in a distributed environment. Read more.
4:20pm–5:00pm Wednesday, 03/07/2018
Data science and machine learning
Location: LL21 B Level: Intermediate
Secondary topics:  Graphs and Time-series
Andrea Pasqua (Uber), Anny Chen (Uber)
Time series forecasting and anomaly detection are of utmost importance at Uber. However, the scale of the problem, the need for speed, and the importance of accuracy make anomaly detection a challenging data science problem. Andrea Pasqua and Anny Chen explain how the use of recurrent neural networks is allowing Uber to meet this challenge. Read more.
4:20pm–5:00pm Wednesday, 03/07/2018
Data-driven business management, Strata Business Summit
Location: LL21 C/D Level: Intermediate
Ted Malaska (Blizzard Entertainment), Jonathan Seidman (Cloudera)
Recent years have seen dramatic advancements in the technologies available for managing and processing data. While these technologies provide powerful tools to build data applications, they also require new skills. Ted Malaska and Jonathan Seidman explain how to evaluate these new technologies and build teams to effectively leverage these technologies and achieve ROI with your data initiatives. Read more.
4:20pm–5:00pm Wednesday, 03/07/2018
Location: LL21 E/F
TBC
4:20pm–5:00pm Wednesday, 03/07/2018
Carlo Torniai (Pirelli Tyre)
Carlo Torniai shares the architectural challenges Pirelli faced in building Pirelli Connesso, an IoT cloud-based system providing information on tire operating conditions, consumption, and maintenance, and highlights the operative approaches that enabled the integration of different contributions across cross-functional teams. Read more.
4:20pm–5:00pm Wednesday, 03/07/2018
Rahim Daya (Pinterest)
Personalization is a powerful tool for building sticky and impactful product experiences. Rahim Daya shares Pinterest's frameworks for building personalized user experiences, from sourcing the right contextual data to designing and evaluating personalization algorithms that can delight the user. Read more.
4:20pm–5:00pm Wednesday, 03/07/2018 Secondary topics:  Graphs and Time-series
Sijie Guo (Streamlio)
Apache BookKeeper, a scalable, fault-tolerant, and low-latency storage service optimized for real-time workloads, has been widely adopted by enterprises like Twitter, Yahoo, and Salesforce to store and serve mission-critical data. Sijie Guo explains how Apache BookKeeper satisfies the needs of stream storage. Read more.
4:20pm–5:00pm Wednesday, 03/07/2018
Data-driven business management, Strata Business Summit
Location: 230 A Level: Intermediate
Kapil Surlaker (LinkedIn), Ya Xu (LinkedIn)
Metrics measurement and experimentation play a crucial role in every product decision at LinkedIn. Kapil Surlaker and Ya Xu explain why, to meet the company's needs, LinkedIn built the UMP and XLNT platforms for metrics computation and experimentation, respectively, which have allowed the company to perform measurement and experimentation very efficiently at scale while preserving trust in data. Read more.
4:20pm–5:00pm Wednesday, 03/07/2018
Location: 230 C
TBC

5:10pm

5:10pm–5:50pm Wednesday, 03/07/2018 Secondary topics:  Data Integration and Data Pipelines
Abe Gong (Superconductive Health), James Campbell (USG)
Data science and engineering have been missing out on one of the biggest productivity boosters in modern software development: automated testing. Abe Gong and James Campbell discuss the concept of pipeline tests and offer an overview of Great Expectations, an open source Python framework for bringing data pipelines and products under test. Read more.
5:10pm–5:50pm Wednesday, 03/07/2018 Secondary topics:  Graphs and Time-series
Roger Barga (Amazon Web Services), Nina Mishra (Amazon Web Services), Sudipto Guha (Amazon Web Services), Ryan Nienhuis (Amazon Web Services)
Roger Barga, Nina Mishra, Sudipto Guha, and Ryan Nienhuis detail continuous machine learning algorithms that discover useful information in streaming data, focusing on explainable machine learning, including anomaly detection with attribution, ability to reduce false positives through user feedback, and detection of anomalies in directed graphs. Read more.
5:10pm–5:50pm Wednesday, 03/07/2018
Data science and machine learning
Location: LL20 C Level: Intermediate
Mike Conover (SkipFlag)
Mike Conover offers an overview of the essential techniques for understanding and working with natural language. From off-the-shelf neural networks and snappy preprocessing libraries to architectural patterns for bulletproof productionization, this talk will be of interest to anyone who uses language on a regular basis. Read more.
5:10pm–5:50pm Wednesday, 03/07/2018
Data science and machine learning
Location: LL20 D Level: Non-technical
Miryung Kim (UCLA), Muhammad Gulzar (UCLA)
Even though we know that there are more data scientists in the workforce today, what those data scientists actually do and what we mean by data scientists have not been studied quantitatively. In this talk, we present a large-scale survey with 793 professional data scientists. Our study should inform managers on how to leverage data science capability effectively within their teams. Read more.
5:10pm–5:50pm Wednesday, 03/07/2018
Sergey Ermolin (Intel), Suqiang Song (MasterCard)
Sergey Ermolin and Suqiang Song demonstrate how to use Spark BigDL Wide-and-Deep and Neural Collaborative Filtering (NCF) algorithms to predict a user’s probability of shopping at a particular offer merchant during a campaign period. Along the way, they compare the deep learning results with those obtained by the classical MLlib Alternating Least Squares (ALS) approach. Read more.
5:10pm–5:50pm Wednesday, 03/07/2018
Strata Business Summit
Location: LL21 C/D
Alysa Z. Hutnik (Kelley Drye & Warren LLP), Crystal Skelton (Kelley Drye & Warren LLP)
Big data promises enormous benefits for companies. But what about privacy, data protection, and consumer laws? Having a solid understanding of the legal and self-regulatory rules of the road is key to maximizing the value of your data while avoiding data disasters. Alysa Hutnik and Crystal Skelton share legal best practices and practical tips to avoid becoming a big data “don’t.” Read more.
5:10pm–5:50pm Wednesday, 03/07/2018
Data-driven business management, Strata Business Summit
Location: LL21 E/F Level: Non-technical
Angela Zutavern (Booz Allen Hamilton)
How can you most effectively use machine intelligence to drive strategy? By merging it in the right way with the human ingenuity of leaders throughout your organization. Angela Zutavern shares insights from her work with pioneering companies, government agencies, and nonprofits that are successfully navigating this partnership by becoming “mathematical corporations.” Read more.
5:10pm–5:50pm Wednesday, 03/07/2018
Thomas Phelan (BlueData)
Recent headline-grabbing data breaches demonstrate that protecting data is essential for every enterprise. The best-of-breed approach for big data is HDFS configured with transparent data encryption (TDE). However, TDE can be difficult to configure and manage; issues that are only compounded when running on Docker containers. Thomas Phelan explores these challenges and how to overcome them. Read more.
5:10pm–5:50pm Wednesday, 03/07/2018
Data engineering and architecture
Location: 210 C/G Level: Beginner
Ted Dunning (MapR Technologies)
Ted Dunning offers an overview of the rendezvous architecture, developed to be the "continuous integration" system for machine learning models. It allows always-hot zero latency rollout and rollback of new models and supports extensive metrics and diagnostics so models can be compared as they process production data. It can even hot-swap the framework itself with no downtime. Read more.
5:10pm–5:50pm Wednesday, 03/07/2018
Fabian Hueske (data Artisans), Shuyi Chen (Uber)
Fabian Hueske and Shuyi Chen explore SQL's role in the world of streaming data and its implementation in Apache Flink and cover fundamental concepts, such as streaming semantics, event time, and incremental results. They also share their experience using Flink SQL in production at Uber, explaining how Uber leverages Flink SQL to solve its unique business challenges. Read more.
5:10pm–5:50pm Wednesday, 03/07/2018
Location: 230 A
TBC
5:10pm–5:50pm Wednesday, 03/07/2018
Data engineering and architecture
Location: 230 C Level: Advanced
Ash Munshi (Pepperdata)
Ash Munshi shares techniques for labeling big data apps using runtime measurements of CPU, memory, I/O and network and details a deep neural network to help operators understand the types of apps running on the cluster and better predict runtimes, tune resource utilization, and increase efficiency. These methods are new and are the first approach to classify multivariate time series. Read more.

5:50pm

5:50pm–6:50pm Wednesday, 03/07/2018
Location: Hall 1, 2, 3
Join us for vendor-hosted libations (plus snacks) after sessions on Wednesday. Read more.

7:00pm

7:00pm–9:30pm Wednesday, 03/07/2018
Location: San Pedro Square Market
Join us at San Pedro Square Market for an exciting evening filled with cocktails, food, and entertainment at Data After Dark at Strata in San Jose. Read more.

Thursday, 03/08/2018

8:00am

8:00am–8:30am Thursday, 03/08/2018
Location: TBD (Speed Networking)
Gather before keynotes on Thursday morning for a speed networking event. Enjoy casual conversation while meeting fellow attendees. If one of your goals at Strata is to meet new people, this session will jumpstart your networking with fellow attendees. Read more.

8:45am

8:45am–10:30am Thursday, 03/08/2018
Location: Grand Ballroom 220
Ben Lorica (O'Reilly Media), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes. Read more.

10:30am

10:30am–11:00am Thursday, 03/08/2018
Location: Break
Break (30m)

11:00am

11:00am–11:40am Thursday, 03/08/2018
Clare Gollnick (Terbium Labs)
At the heart of the reproducibility crisis in the sciences is the widespread misapplication of statistics. Data science relies on the same statistical methodology as these scientific fields. Can we avoid the same crisis of integrity? Clare Gollnick considers the philosophy of data science and shares a framework that explains (and even predicts) the likelihood of success of a data project. Read more.
11:00am–11:40am Thursday, 03/08/2018
Data science and machine learning
Location: LL20 C Level: Beginner
Mike Ruberry (ZestFinance)
What does it mean to explain a machine learning model, and why is it important? Mike Ruberry offers an overview of several modern explainability methods, including traditional feature contributions, LIME, and DeepLift. Each of these techniques presents a different perspective, and their clever application can reveal new insights and solve business requirements. Read more.
11:00am–11:40am Thursday, 03/08/2018
Data science and machine learning
Location: LL20 D Level: Intermediate
Josh Wills (Slack)
Josh Wills describes recent data science and machine learning projects at Slack. Read more.
11:00am–11:40am Thursday, 03/08/2018
Data science and machine learning
Location: LL21 B Level: Intermediate
Karthik Ramasamy (Uber), Lenny Evans (Uber)
Stolen credit cards are a major problem faced by many companies, including Uber. Karthik Ramasamy and Lenny Evans detail a new weapon against stolen credit cards that uses computer vision to scan credit cards, verifying possession of the physical card with basic fake card detection capabilities. Read more.
11:00am–11:40am Thursday, 03/08/2018
Strata Business Summit
Location: LL21 C/D
Mike Olson (Cloudera)
Mike Olson shares examples of real-world machine learning applications, explores a variety of challenges in putting these capabilities into production (including the speed with which technology is moving, cloud versus in-data-center consumption, security and regulatory compliance, and skills and agility in getting data and answers into the right hands), and outlines proven ways to meet them. Read more.
11:00am–11:40am Thursday, 03/08/2018 Secondary topics:  Graphs and Time-series
Michael Schrenk (Self-Employed)
Big data becomes much more powerful when it has context. Fortunately, creative data scientists can create needed context through the use of metadata. Michael Schrenk explains how metadata is created and used to gain competitive advantages, predict troop strength, or even guess Social Security numbers. Read more.
11:00am–11:40am Thursday, 03/08/2018
Data engineering and architecture
Location: 210 A/E Level: Intermediate
Francesca Lazzeri (Microsoft), Fidan Boylu Uz (Microsoft)
Francesca Lazzeri and Fidan Boylu Uz explain how to operationalize LSTM networks to predict the remaining useful life of aircraft engines. They use simulated aircraft sensor values to predict when an aircraft engine will fail in the future so that maintenance can be planned in advance. Read more.
11:00am–11:40am Thursday, 03/08/2018
Greg Rahn (Cloudera)
For many organizations, the cloud will likely be the destination of their next big data warehouse. Greg Rahn shares considerations when evaluating the cloud for analytics and big data warehousing in order to help you get the most from the cloud. You'll leave with an understanding of different architectural approaches and impacts for moving analytic workloads to the cloud. Read more.
11:00am–11:40am Thursday, 03/08/2018
Tyler Akidau (Google)
What does it mean to execute streaming queries in SQL? What is the relationship of streaming queries to classic relational queries? Are streams and tables the same thing? And how does all of this relate to the programmatic frameworks we’re all familiar with? Tyler Akidau answers these questions and more as he walks you through key concepts underpinning data processing in general. Read more.
11:00am–11:40am Thursday, 03/08/2018
Law, ethics, and governance, Strata Business Summit
Location: 230 A Level: Beginner
Sugreev Chawla (Thorn)
Sugreev Chawla offers an overview of Spotlight, a tool created by Thorn, a nonprofit that uses technology to fight online child sexual exploitation. It allows law enforcement to process millions of escort ads per month in an effort to fight sex trafficking, using graph analysis, time series analysis and NLP techniques to surface important networks of ads and characterize their behavior over time. Read more.
11:00am–11:40am Thursday, 03/08/2018
Data engineering and architecture
Location: 230 C Level: Intermediate
Jiangjie Qin (LinkedIn)
LinkedIn runs more than 1,800 Kafka brokers that deliver more than two trillion messages a day. Running Kafka at such a scale makes automated operations a necessity. Jiangjie Qin shares lessons learned from operating Kafka at scale with minimum human intervention. Read more.

11:50am

11:50am–12:30pm Thursday, 03/08/2018 Secondary topics:  Graphs and Time-series
Ram Shankar Siva Kumar (Microsoft (Azure Security Data Science))
How should you best debug a security data science system: change the ML approach, redefine the security scenario, or start over from scratch? Ram Shankar answers this question by sharing the results of failed experiments and the lessons learned when building ML detections for cloud lateral movement, identifying anomalous executables, and automating the incident response process. Read more.
11:50am–12:30pm Thursday, 03/08/2018
Data science and machine learning
Location: LL20 C Level: Intermediate
Stephen O'Sullivan (Data Whisperers)
Stephen O'Sullivan takes you along the data science journey, from on-boarding data (using a number of data/object stores) to understanding and choosing the right data format for the data assets to using query engines (and basic query tuning). You'll learn some new skills to help you be more productive and reduce contention with the data engineering team. Read more.
11:50am–12:30pm Thursday, 03/08/2018
Data science and machine learning
Location: LL20 D Level: Intermediate
Adam Greenhall explains how Lyft uses simulation to test out new algorithms, help develop new features, and study the economics of ride-sharing markets as they grow. Read more.
11:50am–12:30pm Thursday, 03/08/2018
Jennie Wang (Intel), Valentina Pedoia (UCSF), Berk Norman (UCSF), Yulia Tell (Intel)
Damage to the meniscus is a physically limiting injury that can lead to further medical complications. Automatically classifying this damage at the time of an MRI scan would allow quicker and more accurate diagnosis. Jennie Wang, Valentina Pedoia, Berk Norman, and Yulia Tell offer an overview of their classification system built with 3D convolutional neural networks using BigDL on Apache Spark. Read more.
11:50am–12:30pm Thursday, 03/08/2018
Data-driven business management, Strata Business Summit
Location: LL21 C/D Level: Intermediate
David Talby (Pacific AI)
Machine learning and data science systems often fail in production in unexpected ways. David Talby shares real-world case studies showing why this happens and explains what you can do about it, covering best practices and lessons learned from a decade of experience building and operating such systems at Fortune 500 companies across several industries. Read more.
11:50am–12:30pm Thursday, 03/08/2018
Law, ethics, and governance, Strata Business Summit
Location: LL21 E/F Level: Beginner
John Mertic (The Linux Foundation), Ferd Scheepers (ING)
John Mertic and Ferd Scheepers detail the benefits of a vendor-neutral approach to data governance, explain the need for an open metadata standard, and share insight into how companies such as ING, IBM, and Hortonworks are delivering solutions to this challenge as an open source initiative. Read more.
11:50am–12:30pm Thursday, 03/08/2018
Dong Meng (MapR)
Deep learning model performance relies on underlying data. Dong Meng offers an overview of a converged data platform that serves as a data infrastructure, providing a distributed filesystem, key-value storage and streams, and Kubernetes as an orchestration layer to manage containers to train and deploy deep learning models using GPU clusters. Read more.
11:50am–12:30pm Thursday, 03/08/2018
Szehon Ho (Criteo), Pawel Szostek (Criteo)
Hive is the main data transformation tool at Criteo, and hundreds of analysts and thousands of automated jobs run Hive queries every day. Szehon Ho and Pawel Szostek discuss the evolution of Criteo's Hive platform from an error-prone add-on installed on some spare machines to a best-in-class installation capable of self-healing and automatically scaling to handle its growing load. Read more.
11:50am–12:30pm Thursday, 03/08/2018
Karthik Ramasamy (Streamlio), Sanjeev Kulkarni (Streamlio)
Stream processing systems must support a number of different types of processing semantics due to the diverse nature of streaming applications. Karthik Ramasamy and Sanjeev Kulkarni explore effectively once, exactly once, and other types of stateful processing techniques, explain how they are implemented in Heron, and discuss how applications can benefit. Read more.
11:50am–12:30pm Thursday, 03/08/2018
Data-driven business management, Strata Business Summit
Location: 230 A Level: Intermediate
With so many business intelligence tools in the Hadoop ecosystem and no common measure to identify the efficiency of each tool, where do you begin to build or modify your enterprise data lake strategy? Sagar Kewalramani shares real-world BI problems and how they were resolved with Hadoop tools and demonstrates how to build an effective data lake strategy with open source tools and components. Read more.
11:50am–12:30pm Thursday, 03/08/2018
Data engineering and architecture
Location: 230 C Level: Intermediate
Secondary topics:  Graphs and Time-series
Alexis Roos (Salesforce), Noah Burbank (Salesforce)
In the customer age, being able to extract relevant communications information in real time and cross-reference it with context is key. Alexis Roos and Noah Burbank explain how Salesforce uses data science and engineering to enable salespeople to monitor their emails in real time to surface insights and recommendations using a graph that models contextual data. Read more.

12:30pm

12:30pm–1:50pm Thursday, 03/08/2018
Location: Hall 1, 2, 3
Topic Table discussions are a great way to informally network with people in similar industries or interested in the same topics. Read more.
12:30pm–1:50pm Thursday, 03/08/2018
Location: TBD (SBS lunch)
Join Strata Business Summit speakers and attendees for a networking lunch on Thursday. Read more.

1:50pm

1:50pm–2:30pm Thursday, 03/08/2018
Ram Sriharsha (Databricks)
How do you scale geospatial analytics on big data? And while you're at it, can you make it easy to use while achieving state-of-the-art performance on a single node? Ram Sriharsha offers an overview of Magellan—a geospatial optimization engine that seamlessly integrates with Spark—and explains how it provides scalability and performance without sacrificing simplicity. Read more.
1:50pm–2:30pm Thursday, 03/08/2018
Veronica Mapes (Pinterest), Garner Chung (Pinterest)
Veronica Mapes and Garner Chung detail the human evaluation platform Pinterest developed to better serve its deep learning and operational teams when its needs grew beyond platforms like Mechanical Turk. Along the way, they cover tricks for increasing data reliability and judgement reproducibility and explain how Pinterest integrated end-user-sourced judgements into its in-house platform. Read more.
1:50pm–2:30pm Thursday, 03/08/2018
Jennifer Prendki (Atlassian)
Jennifer Prendki explains how to develop machine learning models even if the data is protected by privacy and compliance laws and cannot be used without anonymizing, covering techniques ranging from contextual bandits to document vector representation. Read more.
1:50pm–2:30pm Thursday, 03/08/2018
Delip Rao (R7 Speech Science)
Spoken conversations have rich information beyond what was said in words. Delip Rao details the potential of spoken conversational datasets, including identifying speakers and their demographic attributes, understanding intent and dynamics between speakers, and so on. Delip also discusses some of the latest science, including some of the work developed at R7. Read more.
1:50pm–2:30pm Thursday, 03/08/2018
Law, ethics, and governance, Strata Business Summit
Location: LL21 C/D Level: Intermediate
Mark Donsky (Cloudera)
In May 2018, the General Data Protection Regulation (GDPR) goes into effect for firms doing business in the EU, but many companies aren't prepared for the strict regulation or fines for noncompliance (up to €20 million or 4% of global annual revenue). Mark Donsky outlines the capabilities your data environment needs to simplify compliance with GDPR and future regulations. Read more.
1:50pm–2:30pm Thursday, 03/08/2018
Location: LL21 E/F
TBC
1:50pm–2:30pm Thursday, 03/08/2018
Emre Velipasaoglu (Lightbend)
Most machine learning algorithms are designed to work on stationary data, but real-life streaming data is rarely stationary. Models lose prediction accuracy over time if they are not retrained. Without model quality monitoring, retraining decisions are suboptimal and costly. Emre Velipasaoglu evaluates monitoring methods for applicability in modern fast data and streaming applications. Read more.
1:50pm–2:30pm Thursday, 03/08/2018
Data engineering and architecture
Location: 210 C/G Level: Intermediate
Chris Harland (Textio)
The number of resources explaining how to build a machine learning model from data greatly overshadows information on how to make real data products from such models, creating a gap between what machine learning engineers and data scientists know is possible and what users experience. Using examples from Textio's augmented writing platform, Chris Harland walks you through building a data product. Read more.
1:50pm–2:30pm Thursday, 03/08/2018
Data engineering and architecture
Location: 210 D/H Level: Intermediate
Secondary topics:  Graphs and Time-series
Michael Freedman (TimescaleDB | Princeton)
Michael Freedman offers an overview of TimescaleDB, a new open source scale-out database designed for time series workloads and engineered as a plugin to Postgres. Unlike most time series newcomers, TimescaleDB supports full SQL while achieving fast ingest and complex queries. Read more.
1:50pm–2:30pm Thursday, 03/08/2018
Steven Levine (Weight Watchers), Nicolas Chikhani (Weight Watchers International)
For organizations stuck in a myriad of legacy infrastructure, the path to AI and deep learning seems impossible. Michael Lysaght, Steven Levine, and Nicolas Chikhani discuss Weight Watchers's transition from a traditional BI organization to one that uses data effectively, covering the company's needs, the changes that were required, and the technologies and architecture used to achieve its goals. Read more.
1:50pm–2:30pm Thursday, 03/08/2018
Holden Karau (Google), Rachel Warren (Independent)
Holden Karau and Rachel Warren explore the state of the current big data ecosystem and explain how to best work with it in non-JVM languages. While much of the focus will be on Python + Spark, the talk will also include interesting anecdotes about how these lessons apply to other systems (including Kafka). Read more.

2:40pm

2:40pm–3:20pm Thursday, 03/08/2018
Data science and machine learning
Location: LL20 A Level: Intermediate
Ian Cook (Cloudera)
The popular R package dplyr provides a consistent grammar for data manipulation that can abstract over diverse data sources. Ian Cook shows how you can use dplyr to query large-scale data using different processing engines including Spark and Impala. He demonstrates the R package sparklyr (from RStudio) and the new R package implyr (from Cloudera) and shares tips for making dplyr code portable. Read more.
2:40pm–3:20pm Thursday, 03/08/2018
Data science and machine learning, Law, ethics, and governance
Location: LL20 C Level: Intermediate
Pramit Choudhary (DataScience.com)
Pramit Choudhary explores the usefulness of a generative approach that applies Bayesian inference to generate human-interpretable decision sets in the form of "if...and else" statements. These human-interpretable decision lists with high posterior probabilities might be the right way to balance model interpretability, performance, and computation. Read more.
2:40pm–3:20pm Thursday, 03/08/2018
Data science and machine learning
Location: LL20 D Level: Intermediate
Simon Hughes (Dice.com), Yuri Bykov (Dice.com)
Dice.com recently released several free career advisory tools for technology professionals, including a salary predictor, a tool that recommends the next skills to learn, and a career path explorer. Simon Hughes and Yuri Bykov offer an overview of the machine learning algorithms behind these tools and the technologies used to build, deploy, and monitor these solutions in production. Read more.
2:40pm–3:20pm Thursday, 03/08/2018
Data science and machine learning
Location: LL21 B Level: Intermediate
Patrick Harrison (S&P Global)
Word vector embeddings are everywhere, but relatively few understand how they produce their remarkable results. Patrick Harrison opens up the black box of a popular word embedding algorithm and walks you through how it works its magic. Along the way, Patrick also covers core neural network concepts, including hidden layers, loss gradients, backpropagation, and more. Read more.
2:40pm–3:20pm Thursday, 03/08/2018
TBC
2:40pm–3:20pm Thursday, 03/08/2018
Data-driven business management, Strata Business Summit
Location: LL21 E/F Level: Beginner
Brian Karfunkel (Pinterest)
When software companies use A/B tests to evaluate product changes and fail to accurately estimate the long-term impact of such experiments, they risk optimizing for the users they have at the expense of the users they want to have. Brian Karfunkel explains how to estimate an experiment’s impact over time, thus mitigating this risk and giving full credit to experiments targeted at noncore users. Read more.
2:40pm–3:20pm Thursday, 03/08/2018
Michelle Casbon (Qordoba)
Michelle Casbon explains how to speed up the development of ML models by using open source tools such as Kubernetes, Docker, Scala, Apache Spark, and Weave Flux, detailing how to build resilient systems so that you can spend more of your time on product improvement rather than triage and uptime. Read more.
2:40pm–3:20pm Thursday, 03/08/2018
Ajay Mothukuri (Sapient), Arunkumar Ramanatha (Sapient), Dr. Vijay Srinivas Agneeswaran (SapientRazorfish)
Ajay Mothukuri, Arunkumar Ramanatha, and Vijay Srinivas Agneeswaran explain how to use open source blockchain technologies such as Hyperledger to implement the European Union's General Data Protection Regulation (GDPR). Read more.
2:40pm–3:20pm Thursday, 03/08/2018
Secondary topics:  Graphs and Time-series
Stephan Ewen (data Artisans), Flavio Junqueira (Dell EMC)
Stephan Ewen and Flavio Junqueira detail an open source streaming data stack consisting of Pravega (stream storage) and Apache Flink (computation on streams). The stack offers an unprecedented way of handling “everything as a stream,” with unbounded streaming storage, a unified batch and streaming abstraction, and dynamic accommodation of workload variations. Read more.
2:40pm–3:20pm Thursday, 03/08/2018
Data-driven business management, Strata Business Summit
Location: 230 A Level: Beginner
Matt Derda (Trifacta), Jonathon Whitton (PRGX USA)
PRGX is a global leader in recovery audit and source-to-pay (S2P) analytics services, serving around 75% of the top 20 global retailers. Matt Derda and Jonathon Whitton explain how PRGX uses Trifacta and Cloudera to scale current processes and increase revenue for the products and services it offers clients. Read more.
2:40pm–3:20pm Thursday, 03/08/2018
Tomer Shiran (Dremio), Jacques Nadeau (Dremio)
Most organizations manage 5 to 15 copies of their data in multiple systems and formats to support different analytical use cases. Tomer Shiran and Jacques Nadeau introduce a new approach called data reflections, which dramatically reduces the need for data copies, demonstrate an open source implementation built with Apache Calcite, and explore two production case studies. Read more.

3:20pm

3:20pm–4:20pm Thursday, 03/08/2018
Location: Break
Break (1h)

4:20pm

4:20pm–5:00pm Thursday, 03/08/2018
Data science and machine learning
Location: LL20 A Level: Intermediate
Keno Fischer (Julia Computing)
Julia is rapidly becoming a popular language at the forefront of scientific discovery. Keno Fischer explores one of the most ambitious use cases for Julia: using machine learning to derive a catalog of astronomical objects from multiterabyte astronomical image datasets. This work was a collaboration between MIT, UC Berkeley, LBNL, and Julia Computing. Read more.
4:20pm–5:00pm Thursday, 03/08/2018
Data science and machine learning
Location: LL20 C Level: Intermediate
David Talby (Pacific AI), Ganesh Thondikulam (Kaiser Permanente)
This is a real-world case study of applying the open source NLP library for Apache Spark and tackling one of the most common challenges in applying natural language processing in practice: integrating domain-specific NLP as part of a scalable, performant, measurable, and reproducible machine learning pipeline. Read more.
4:20pm–5:00pm Thursday, 03/08/2018
Goodman Gu (Atlassian)
Machine learning is a pivotal technology. However, bringing an ML application to life often requires overcoming bottlenecks not just in the model code but in operationalizing the end-to-end system itself. Goodman Gu shares a case study from a leading SaaS company that quickly and easily built, trained, optimized, and deployed an XGBoost churn prediction ML app at scale with Amazon SageMaker. Read more.
4:20pm–5:00pm Thursday, 03/08/2018
Data science and machine learning
Location: LL21 B Level: Intermediate
Siddha Ganju (Deep Vision)
Siddha Ganju explains how the FDL lab at NASA uses artificial intelligence to automate the identification of meteors in meteor shower images at above human-level performance, and to recover known meteor shower streams and characterize previously unknown meteor showers using orbital data. The research is aimed at providing more warning time for long-period comet impacts. Read more.
4:20pm–5:00pm Thursday, 03/08/2018
Location: LL21 C/D
TBC
4:20pm–5:00pm Thursday, 03/08/2018
Mike Driscoll (Metamarkets)
There’s a make-or-break step ahead for AI development. AI tools shouldn’t be designed to replace humans; they should be built with them in mind. We need to focus on translating data from machine learning models into beautiful, intuitive visuals. Mike Driscoll shares advice for creators of next-gen predictive algorithms from his experience turning big data into interactive visualizations. Read more.
4:20pm–5:00pm Thursday, 03/08/2018
Secondary topics:  Graphs and Time-series
Yu Xu (TigerGraph)
Graph databases are the fastest growing category in data management. However, most graph queries only traverse two hops in big graphs due to limitations in most graph databases. Real-world applications require deep link analytics that traverse far more than three hops. Yu Xu offers an overview of a fraud detection system that manages 100 billion graph elements to detect risk and fraudulent groups. Read more.
4:20pm–5:00pm Thursday, 03/08/2018
Secondary topics:  Graphs and Time-series
Roy Ben-Alta (Amazon Web Services)
Many domains, such as mobile, web, the IoT, ecommerce, and more, have turned to analyzing streaming data. However, this presents challenges both in transforming the raw data into metrics and in automatically analyzing the metrics to produce insights. Roy Ben-Alta shares a solution implemented using Amazon Kinesis as the real-time pipeline feeding Anodot's anomaly detection solution. Read more.
4:20pm–5:00pm Thursday, 03/08/2018
Dean Wampler (Lightbend)
Dean Wampler explores two microservice streaming applications based on Kafka to compare and contrast using Akka Streams and Kafka Streams for data processing. Dean discusses the strengths and weaknesses of each tool for particular design needs and contrasts them with Spark Streaming and Flink, so you'll know when to choose them instead. Read more.
4:20pm–5:00pm Thursday, 03/08/2018
Data-driven business management, Strata Business Summit
Location: 230 A Level: Beginner
Marcin Pilarczyk (Ryanair)
Managing fuel at a company flying 120 million passengers yearly is not a trivial task. Marcin Pilarczyk explores the main aspects of fuel management of a modern airline and offers an overview of machine learning methods supporting long-term planning and daily decisions. Read more.
4:20pm–5:00pm Thursday, 03/08/2018
Tomer Kaftan (University of Washington)
Tomer Kaftan offers an overview of Cuttlefish, a lightweight framework prototyped in Apache Spark that helps developers adaptively improve the performance of their data processing applications by inserting a few library calls into their code. These calls construct tuning primitives that use reinforcement learning to adaptively modify execution as they observe application performance over time. Read more.