Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Monday, 03/05/2018

9:00am

9:00am–5:00pm Monday, 03/05/2018
Brooke Wenig (Databricks)
Brooke Wenig walks you through the core APIs for using Spark, fundamental mechanisms and basic internals of the framework, SQL and other high-level data access tools, and Spark’s streaming capabilities and machine learning APIs. Read more.
9:00am–5:00pm Monday, 03/05/2018
Brian Bloechle (Cloudera)
Brian Bloechle demonstrates how to implement typical data science workflows using Apache Spark. You'll learn how to wrangle and explore data using Spark SQL DataFrames and how to build, evaluate, and tune machine learning models using Spark MLlib. Read more.
9:00am–5:00pm Monday, 03/05/2018
Strata Business Summit
Location: 212 D
Angie Ma (ASI)
Angie Ma offers a condensed introduction to key data science and machine learning concepts and techniques, showing you what is (and isn't) possible with these exciting new tools and how they can benefit your organization. Read more.
9:00am–5:00pm Monday, 03/05/2018
Robert Schroll (The Data Incubator), Dana Mastropole (The Data Incubator)
The TensorFlow library enables the use of data flow graphs for numerical computations, with automatic parallelization across several CPUs or GPUs, making it ideal for implementing neural networks and other machine learning algorithms. Robert Schroll and Dana Mastropole demonstrate TensorFlow's capabilities and walk you through building machine learning models on real-world data. Read more.
9:00am–5:00pm Monday, 03/05/2018
Jesse Anderson (Big Data Institute)
To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data and how will you process that much data? Jesse Anderson explores the latest real-time frameworks (both open source and managed cloud services), discusses the leading cloud providers, and explains how to choose the right one for your company. Read more.
9:00am–5:00pm Monday, 03/05/2018
Data science and machine learning
Location: San Jose Ballroom (salon 1&2)
Delip Rao (R7 Speech Science), Brian McMahan (Joostware)
PyTorch is a recent deep learning framework from Facebook that is gaining massive momentum in the deep learning community. Its fundamentally flexible design makes building and debugging models straightforward, simple, and fun. Delip Rao and Brian McMahan walk you through PyTorch's capabilities and demonstrate how to use PyTorch to build deep learning models and apply them to real-world problems. Read more.
9:00am–5:00pm Monday, 03/05/2018
Data science and machine learning
Location: Willow Glen (1&2)
Zachary Glassman (The Data Incubator)
Zachary Glassman demonstrates how to build intelligent business applications using machine learning, taking you through each step in developing a machine learning pipeline, from prototyping to production. You'll explore data cleaning, feature engineering, model building and evaluation, and deployment and extend your knowledge by building two applications from real-world datasets. Read more.

10:30am

10:30am–11:00am Monday, 03/05/2018
Location: Executive Concourse
Morning break (30m)

12:30pm

12:30pm–1:30pm Monday, 03/05/2018
Location: The Hub
Lunch (1h)

3:00pm

3:00pm–3:30pm Monday, 03/05/2018
Location: Executive Concourse
Afternoon break (30m)

7:00pm

7:00pm–10:00pm Monday, 03/05/2018
Location: Various locations
Get to know your fellow attendees over dinner. We've made reservations for you at some of the most sought-after restaurants in town. This is a great chance to make new connections and sample some of the great cuisine San Jose has to offer. Read more.

Tuesday, 03/06/2018

9:00am

9:00am–5:00pm Tuesday, 03/06/2018
Location: LL20 A
Madhav Madaboosi (BP), Meenakshisundaram Thandavarayan (Infosys), Matt Conners (Microsoft), Katie Malone (Civis Analytics), Mike Prorock (mesur.io), Thomas Miller (Northwestern University), Ann Nguyen (Whole Whale), Rajiv Synghal (Kaiser Permanente), Valentin Bercovici (Pencil Data Inc.), Wayde Fleener (General Mills), Joe Dumoulin (Next IT), Jules Malin (GoPro), Taylor Martin (O'Reilly Media), Divya Ramachandran (Captricity)
Hear practical insights from household brands and global companies: the challenges they tackled, approaches they took, and the benefits—and drawbacks—of their solutions. Read more.
9:00am–5:00pm Tuesday, 03/06/2018
Location: LL20 B
David Boyle (MasterClass), Violeta Hennessey (Warner Bros.), April Chen (Civis Analytics), Sridhar Alla (Comcast), Noah Gift (UC Davis), Blake Irvine (Netflix), Kevin Lyons (Nielsen Marketing Cloud), Jennifer Webb (SuprFanz), Rizwan Patel (Caesars Entertainment), Anthony Accardo (Disney), Amanda Gerdes (Blizzard Entertainment)
Hear from innovators in ad tech, measurement, automation, and audience engagement about where the media industry is today—and where it's likely to go next. Read more.
9:00am–12:30pm Tuesday, 03/06/2018
Data engineering and architecture, Law, ethics, and governance
Location: LL20 C Level: Intermediate
Mark Donsky (Cloudera), Andre Araujo (Cloudera), Syed Rafice (Cloudera), Mubashir Kazia (Cloudera)
New regulations such as GDPR are driving new compliance, governance, and security challenges for big data. Infosec and security groups must ensure a consistently secured and governed environment across multiple workloads that span on-premises, private cloud, multicloud, and hybrid cloud deployments. Mark Donsky walks you through securing a Hadoop cluster, with special attention to GDPR. Read more.
9:00am–5:00pm Tuesday, 03/06/2018
Jacob D Parr (Databricks)
Join Jacob Parr for an introduction to Apache Spark 2.0 core concepts with a focus on Spark's machine learning library, using text mining on real-world data as the primary end-to-end use case. Read more.
9:00am–12:30pm Tuesday, 03/06/2018
Secondary topics: Graphs and Time-series
Mo Patel (Independent), Neejole Patel (Virginia Tech)
Since its arrival in early 2017, PyTorch has won over many deep learning researchers and developers due to its dynamic computation framework. Mo Patel and Neejole Patel walk you through using PyTorch to build a content recommendation model. Read more.
9:00am–12:30pm Tuesday, 03/06/2018
Jorge A. Lopez (Amazon Web Services), Radhika Ravirala (Amazon Web Services (AWS)), Paul Sears (Amazon Web Services), Ryan Nienhuis (Amazon Web Services), Randy Ridgley (Amazon Web Services)
Want to learn how to use Amazon's big data web services to launch your first big data application in the cloud? Jorge Lopez walks you through building a big data application using a combination of open source technologies and AWS managed services. Read more.
9:00am–12:30pm Tuesday, 03/06/2018
Data science and machine learning
Location: LL21 C/D Level: Intermediate
Mario Inchiosa (Microsoft), Vanja Paunic (Microsoft), Robert Horton (Microsoft), Debraj GuhaThakurta (Microsoft), Ali Zaidi (Microsoft), Tomas Singliar (Microsoft), John-Mark Agosta (Microsoft)
R and Python top the list of languages used in data science and machine learning, and data scientists and engineers fluent in one of these languages are increasingly marketable. Come learn how to build and operationalize machine learning models using distributed functions and do scalable, end-to-end data science in R and Python on single machines, Spark clusters, and cloud-based infrastructure. Read more.
9:00am–12:30pm Tuesday, 03/06/2018
Data science and machine learning
Location: LL21 E/F Level: Intermediate
Martin Görner (Google)
Martin Görner walks you through training and deploying a machine learning system using popular open source library TensorFlow. Martin takes you from a conceptual overview all the way to building complex classifiers and explains how you can apply deep learning to complex problems in science and industry. Read more.
9:00am–12:30pm Tuesday, 03/06/2018
Burcu Baran (LinkedIn), Wei Di (LinkedIn), Michael Li (LinkedIn), Chi-Yi Kuan (LinkedIn)
Burcu Baran, Wei Di, Michael Li, and Chi-Yi Kuan walk you through the big data analytics and data science lifecycle and share their experience and lessons learned leveraging advanced analytics and machine learning techniques such as predictive modeling to drive and grow business at LinkedIn. Read more.
9:00am–12:30pm Tuesday, 03/06/2018
Secondary topics: Graphs and Time-series
Karthik Ramasamy (Streamlio), Sanjeev Kulkarni (Streamlio), Sijie Guo (Streamlio), Arun Kejariwal (MZ)
Across diverse segments in industry, there has been a shift in focus from big data to fast data. Karthik Ramasamy, Sanjeev Kulkarni, Arun Kejariwal, and Sijie Guo walk you through state-of-the-art streaming architectures, streaming frameworks, and streaming algorithms, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them. Read more.
9:00am–12:30pm Tuesday, 03/06/2018
Tim Berglund (Confluent)
Tim Berglund leads a basic architectural introduction to Kafka and walks you through using Kafka Streams and KSQL to process streaming data. Read more.
9:00am–12:30pm Tuesday, 03/06/2018
Aishwarya Venkataraman, Jason Wang, Mala Ramakrishnan, Stefan Salandy, and Vinithra Varadharajan lead a deep dive into running data analytic workloads in a managed service capacity in the public cloud and highlight cloud infrastructure best practices. Read more.

10:30am

10:30am–11:00am Tuesday, 03/06/2018
Location: Executive Concourse
Morning break (30m)

12:30pm

12:30pm–1:30pm Tuesday, 03/06/2018
Location: 230 A-C
Lunch (1h)

1:30pm

1:30pm–5:00pm Tuesday, 03/06/2018
Data science and machine learning
Location: LL20 C Level: Intermediate
David Talby (Pacific AI), Claudiu Branzan (G2 Web Services), Alexander Thomas (Indeed)
Natural language processing is a key component in many data science systems. David Talby, Claudiu Branzan, and Alex Thomas lead a hands-on tutorial on scalable NLP, using spaCy for building annotation pipelines, Spark NLP for building distributed natural language machine-learned pipelines, and Spark ML and TensorFlow for using deep learning to build and apply word embeddings. Read more.
1:30pm–5:00pm Tuesday, 03/06/2018
Data engineering and architecture
Location: LL21 A Level: Intermediate
Juan Yu (Cloudera)
Apache Impala (incubating) is an exceptional, best-of-breed massively parallel processing SQL query engine that is a fundamental component of the big data software stack. Juan Yu demystifies the cost model the Impala planner uses, shows how Impala optimizes queries, and explains how to identify performance bottlenecks through query plans and profiles and how to drive Impala to its full potential.
1:30pm–5:00pm Tuesday, 03/06/2018
Data engineering and architecture
Location: LL21 B Level: Intermediate
Ron Bodkin (Google), Brian Foo (Google)
TensorFlow and Keras are popular libraries for machine learning because of their support for deep learning and GPU deployment. Join Ron Bodkin and Brian Foo to learn how to execute these libraries in production with vision and recommendation models and how to export, package, deploy, optimize, serve, monitor, and test models using Docker and TensorFlow Serving in Kubernetes. Read more.
1:30pm–5:00pm Tuesday, 03/06/2018
Ronny Kohavi (Microsoft), Alex Deng (Microsoft), Somit Gupta (Microsoft), Paul Raff (Microsoft)
Controlled experiments such as A/B tests have revolutionized the way software is being developed, allowing real users to objectively evaluate new ideas. Ronny Kohavi, Alex Deng, Somit Gupta, and Paul Raff lead an introduction to A/B testing and share lessons learned from one of the largest A/B testing platforms on the planet, running at Microsoft, which executes over 10K experiments a year.
1:30pm–5:00pm Tuesday, 03/06/2018
Abhishek Kumar (SapientRazorfish), Dr. Vijay Srinivas Agneeswaran (SapientRazorfish)
Abhishek Kumar and Vijay Srinivas Agneeswaran offer an introduction to deep learning-based recommendation and learning-to-rank systems using TensorFlow. You'll learn how to build a recommender system based on intent prediction using deep learning that is based on a real-world implementation for an ecommerce client. Read more.
1:30pm–5:00pm Tuesday, 03/06/2018
Data-driven business management, Strata Business Summit
Location: 210 A/E Level: Non-technical
Nick Elprin (Domino Data Lab)
The honeymoon era of data science is ending, and accountability is coming. Not content to wait for results that may or may not arrive, successful data science leaders deliver measurable impact on an increasing share of an enterprise's KPIs. Nick Elprin details how leading organizations have taken a holistic approach to people, process, and technology to build a sustainable competitive advantage. Read more.
1:30pm–5:00pm Tuesday, 03/06/2018
Data engineering and architecture
Location: 210 B/F Level: Intermediate
Secondary topics: Graphs and Time-series
Ted Malaska (Blizzard Entertainment)
If you have data that has a time factor to it, then you need to think in terms of time series datasets. Ted Malaska explores time series in all of its forms, from tumbling windows to sessionization in batch or in streaming. You'll gain exposure to the tools and background you need to be successful in the world of time-oriented data. Read more.
1:30pm–5:00pm Tuesday, 03/06/2018
Dean Wampler (Lightbend), Boris Lublinsky (Lightbend)
Join Dean Wampler and Boris Lublinsky to learn how to build two microservice streaming applications based on Kafka using Akka Streams and Kafka Streams for data processing. You'll explore the strengths and weaknesses of each tool for particular design needs and contrast them with Spark Streaming and Flink, so you'll know when to choose them instead. Read more.
1:30pm–5:00pm Tuesday, 03/06/2018
James Bednar (Anaconda), Philipp Rudiger (Anaconda)
Python lets you solve data science problems by stitching together packages from its ecosystem, but it can be difficult to choose packages that work well together. James Bednar and Philipp Rudiger walk you through a concise, fast, easily customizable, and fully reproducible recipe for interactive visualization of millions or billions of datapoints—all in just 30 lines of Python code. Read more.

3:00pm

3:00pm–3:30pm Tuesday, 03/06/2018
Location: Executive Concourse
Afternoon break (30m)

5:00pm

5:00pm–6:30pm Tuesday, 03/06/2018
Location: Hall 1, 2, 3
Join us after tutorials on Tuesday in the Expo Hall. Grab a drink and mingle with fellow Strata attendees while you check out all of the exhibitors. Read more.

6:30pm

6:30pm–8:00pm Tuesday, 03/06/2018
Location: Grand Ballroom 220
Ignite is happening at Strata on Tuesday, March 6. Join us for a fun, high-energy evening of five-minute talks—all aspiring to live up to the Ignite motto: Enlighten us, but make it quick. Read more.

Wednesday, 03/07/2018

8:00am

8:00am–8:30am Wednesday, 03/07/2018
Location: Concourse foyer
Gather before keynotes on Wednesday morning for a speed networking event. Enjoy casual conversation while meeting fellow attendees. If one of your goals at Strata is to meet new people, this session will jumpstart your networking with fellow attendees. Read more.

8:45am

8:45am–8:50am Wednesday, 03/07/2018
Location: Grand Ballroom 220
Ben Lorica (O'Reilly Media), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes. Read more.

8:50am

8:50am–9:05am Wednesday, 03/07/2018
Location: Grand Ballroom 220
Hilary Mason (Cloudera Fast Forward Labs)
The power of machine learning is very real, but so too is the hype and confusion about when, where, and how to apply it. Hilary Mason explores practical business applications for intelligent machines and details the tools and processes required to implement machine learning successfully. Read more.

9:05am

9:05am–9:20am Wednesday, 03/07/2018
Location: Grand Ballroom 220
Li Fan (Pinterest)
Li Fan shares insights into how Pinterest improves products based on usage and explains how the company is using AI to predict what’s in an image, what a user wants, and what they’ll want next, answering subjective questions better than machines or humans alone could achieve. Read more.

9:20am

9:20am–9:30am Wednesday, 03/07/2018
Location: Grand Ballroom 220
Tobias Ternstrom (Microsoft)
The emergence of the cloud combined with open source software ushered in an explosive use of a broad range of technologies. Tobias Ternstrom explains why you should step back and attempt to objectively evaluate the problem you are trying to solve before choosing the tool to fix it. Read more.

9:30am

9:30am–9:40am Wednesday, 03/07/2018
Location: Grand Ballroom 220 Level: Non-technical
Ben Lorica (O'Reilly Media)
Ben Lorica shares emerging security best practices for business intelligence, machine learning, and mobile computing products and explores new tools, methods, and products that can help ease the way for companies interested in deploying secure and privacy-preserving analytics. Read more.

9:40am

9:40am–9:55am Wednesday, 03/07/2018
Location: Grand Ballroom 220
Nancy Lublin (Crisis Text Line)
Keynote with Nancy Lublin

9:55am

9:55am–10:00am Wednesday, 03/07/2018
Location: Grand Ballroom 220
Nikita Shamgunov (MemSQL)
We live in a world that’s always connected. As a result, today’s intelligent applications need to react immediately to changing conditions. To achieve this, applications require a foundation that is latency free. Nikita Shamgunov shares a vision of latency-free life supported by modern data architectures. Read more.

10:00am

10:00am–10:10am Wednesday, 03/07/2018
Location: Grand Ballroom 220
Natalie Evans Harris (BrightHive)
Keynote with Natalie Evans Harris

10:10am

10:10am–10:15am Wednesday, 03/07/2018
Location: Grand Ballroom 220
Machine learning research and incubation projects are everywhere, but less common, and far more valuable, is the innovation unlocked once you bring machine learning out of research and into production. Dinesh Nirmal explains how real-world machine learning reveals assumptions embedded in business processes and in the models themselves that cause expensive and time-consuming misunderstandings. Read more.

10:15am

10:15am–10:30am Wednesday, 03/07/2018
Location: Grand Ballroom 220
Alex Smola (Amazon)
Keynote with Alex Smola

10:30am

10:30am–11:00am Wednesday, 03/07/2018
Location: Hall 1, 2, 3
Morning break sponsored by MemSQL (30m)

11:00am

11:00am–11:40am Wednesday, 03/07/2018
Data engineering and architecture
Location: 212 A-B Level: Intermediate
Secondary topics: Data Integration and Data Pipelines
Gwen Shapira (Confluent)
Gwen Shapira shares design and architecture patterns that are used to modernize data engineering and details how Apache Kafka, microservices, and event streams are used by modern engineering organizations to efficiently build data pipelines that are scalable, reliable, and built to evolve. Read more.
11:00am–11:40am Wednesday, 03/07/2018
Data science and machine learning
Location: LL20 A Level: Intermediate
Mike Lee Williams (Cloudera Fast Forward Labs)
Interpretable models result in more accurate, safer, and more profitable machine learning products. But interpretability can be hard to ensure. Michael Lee Williams explores the growing business case for interpretability and its concrete applications, including churn, finance, and healthcare. Along the way, Michael offers an overview of the open source, model-agnostic tool LIME. Read more.
11:00am–11:40am Wednesday, 03/07/2018
Sponsored
Location: LL20 B
Ian Swanson (DataScience.com)
Ian Swanson shares strategies for leading more productive data science teams, along with steps you can take today to meet growing demands for AI and machine learning use cases. Read more.
11:00am–11:40am Wednesday, 03/07/2018
Data science and machine learning
Location: LL20 C Level: Non-technical
Daniel Lurie (Pinterest)
All successful startups thrive on tight product-market fit, which can produce homogeneous initial user bases. To become the next big thing, your user base will need to diversify, and your product must change to accommodate new needs. Daniel Lurie explains how Pinterest leverages external data to measure racial and income diversity in its user base and how it changed its user modeling to drive growth.
11:00am–11:40am Wednesday, 03/07/2018
Wee Hyong Tok (Microsoft), Danielle Dean (Microsoft)
Artificial intelligence (AI) has tremendous potential to extend our capabilities and empower organizations to accelerate their digital transformation. Wee Hyong Tok and Danielle Dean demystify AI for big data professionals and explain how they can leverage and evolve their valuable big data skills by getting started with AI. Read more.
11:00am–11:40am Wednesday, 03/07/2018
Sponsored
Location: LL21 A
Santosh Rao (NetApp)
Santosh Rao explores the architecture of a data pipeline from edge to core to cloud and across various data sources and processing engines and explains how to build a solution architecture that enables businesses to maximize the competitive differentiation with the ability to unify data insights in compelling yet efficient ways. Read more.
11:00am–11:40am Wednesday, 03/07/2018
Rajat Monga (Google)
Rajat Monga offers an overview of TensorFlow's progress and adoption in 2017 before looking ahead to the areas of importance in the future—performance, usability, and ubiquity—and the efforts TensorFlow is making in those areas. Read more.
11:00am–11:40am Wednesday, 03/07/2018
Tom Fisher (MapR Technologies)
The monolithic cloud is dying. Delivering capabilities across multiple clouds and, simultaneously, transitioning to next-generation platforms and applications is the challenge today. Tom Fisher explores technological approaches and solutions that make this possible while delivering data-driven applications and operations. Read more.
11:00am–11:40am Wednesday, 03/07/2018
Data engineering and architecture
Location: LL21 E/F Level: Intermediate
Kinnary Jangla (Pinterest)
Having trouble coordinating development of your production ML system across a team of developers? Are your microservices drifting and causing problems during debugging? Kinnary Jangla explains how Pinterest dockerized the services powering its home feed and how doing so affected the engineering productivity of the company's ML teams while increasing uptime and ease of deployment.
11:00am–11:40am Wednesday, 03/07/2018
Strata Business Summit
Location: 210 A/E Level: Intermediate
Mark Madsen (Third Nature), Shant Hovsepian (Arcadia Data)
If your goal is to provide data to an analyst rather than a data scientist, what’s the best way to deliver analytics? There are 70+ BI tools in the market and a dozen or more SQL- or OLAP-on-Hadoop open source projects. This session details trade-offs between a number of architectures that provide self-service access to data, and discusses the pros and cons of architectures, deployment... Read more.
11:00am–11:40am Wednesday, 03/07/2018
Law, ethics, and governance, Strata Business Summit
Location: 210 C/G Level: Intermediate
Anne Buff (SAS Institute)
Emerging technologies such as the IoT, AI, and ML present businesses with enormous opportunities for innovation, but to maximize the potential of these technologies, businesses must radically shift their approach to governance. Anne Buff explains what it takes to shift the focus of governance from standards, conformity, and control to accountability, extensibility, and enablement. Read more.
11:00am–11:40am Wednesday, 03/07/2018
Strata Business Summit
Location: 210 D/H Level: Intermediate
Mauro Damo (Dell EMC), Wei Lin (Dell EMC)
Image-recognition-based classification of diseases can minimize the possibility of medical mistakes, improve patient treatment, and speed up patient diagnosis. Mauro Damo and Wei Lin offer an overview of an approach to identify bladder cancer in patients using unsupervised and supervised machine learning techniques on more than 5,000 magnetic resonance images from the Cancer Imaging Archive.
11:00am–11:40am Wednesday, 03/07/2018
Secondary topics: Graphs and Time-series
Shivnath Babu (Duke University | Unravel Data Systems), Sumit Jindal (Unravel Data Systems)
Getting the best performance, predictability, and reliability for Kafka-based applications is a complex art. Shivnath Babu and Sumit Jindal explain how to simplify the process by leveraging recent advances in machine learning and AI and outline a methodology for applying statistical learning to the rich and diverse monitoring data that is available from Kafka. Read more.
11:00am–11:40am Wednesday, 03/07/2018
Sponsored
Location: 230 B
Tobias Ternstrom (Microsoft)
Tobias Ternstrom leads a deep dive into case studies from three Microsoft customers who put technology before solutions. Tobias examines the decisions that brought them there and outlines how they got back on track and solved their business problems. Read more.
11:00am–11:40am Wednesday, 03/07/2018
Data engineering and architecture
Location: 230 C Level: Intermediate
Daniel Templeton (Cloudera), Andrew Wang (Cloudera)
Hadoop 3.0 has been years in the making, and now it's finally arriving. Andrew Wang and Daniel Templeton offer an overview of new features, including HDFS erasure coding, YARN Timeline Service v2, YARN federation, and much more, and discuss current release management status and community testing efforts dedicated to making Hadoop 3.0 the best Hadoop major release yet. Read more.
11:00am–11:40am Wednesday, 03/07/2018
Data science and machine learning
Location: Expo Hall 1 Level: Intermediate
Secondary topics: Expo Hall
Siddha Ganju (Deep Vision)
Siddha Ganju explains how the FDL lab at NASA uses artificial intelligence to automate the identification of meteors in meteor shower images, exceeding human-level performance, and to recover known meteor shower streams and characterize previously unknown meteor showers using orbital data. The research is aimed at providing more warning time for long-period comet impacts.

11:50am

11:50am–12:30pm Wednesday, 03/07/2018
Secondary topics: Data Integration and Data Pipelines
Eugene Kirpichov (Google)
Apache Beam equips users with a novel programming model in which the classic batch/streaming data processing dichotomy is erased. Eugene Kirpichov details the modularity and composability advantages created by treating data ingestion as just another data processing task and walks you through building highly modular data ingestion APIs using the new Beam programming model primitive Splittable DoFn. Read more.
11:50am–12:30pm Wednesday, 03/07/2018
Dan Crankshaw (UC Berkeley RISELab)
Clipper is an open source, general-purpose model-serving system that provides low-latency predictions under heavy serving workloads for interactive applications. Dan Crankshaw offers an overview of the Clipper serving system and explains how to use it to serve Apache Spark and TensorFlow models on Kubernetes. Read more.
11:50am–12:30pm Wednesday, 03/07/2018
Data science and machine learning
Location: LL20 C Level: Non-technical
Miryung Kim (UCLA), Muhammad Gulzar (UCLA)
Even though we know that there are more data scientists in the workforce today, neither what those data scientists actually do nor what we even mean by data scientists has been studied quantitatively. Miryung Kim and Muhammad Gulzar share the results of a large-scale survey with 793 professional data scientists and detail several trends about data scientists in the software engineering context. Read more.
11:50am–12:30pm Wednesday, 03/07/2018
When building a chatbot, it’s important to develop one that is humanized, has contextual responses, and can simulate true empathy for the end users. Andrew Mattarella-Micke shares how Intuit's data science team preps, cleans, organizes, and augments training data along with best practices he's learned along the way. Read more.
11:50am–12:30pm Wednesday, 03/07/2018
Sponsored
Location: LL21 A
Companies that want to become truly digital must take a journey of three steps: data transformation, data science transformation, and digital transformation. This also requires transforming the business with machine learning to fundamentally change the relationship with customers. Seth Dobrin explains the detailed steps along the way to digital transformation—and the pitfalls. Read more.
11:50am–12:30pm Wednesday, 03/07/2018
Shivaram Venkataraman (Microsoft Research), Sergey Ermolin (Intel)
The BigDL framework scales deep learning for large datasets using Apache Spark. However, there is significant scheduling overhead from Spark when running BigDL at large scale. Shivaram Venkataraman and Sergey Ermolin outline a new parameter manager implementation that, along with coarse-grained scheduling, can provide significant speedups for deep learning models like Inception and VGG.
11:50am–12:30pm Wednesday, 03/07/2018
Bin Fan (Alluxio), Shaoshan Liu (PerceptIn)
Bin Fan and Shaoshan Liu explain how PerceptIn designed and implemented a cloud architecture to support video streaming and online object recognition tasks and demonstrate how Alluxio delivers high throughput, low latency, and a unified namespace to support these emerging cloud architectures. Read more.
11:50am–12:30pm Wednesday, 03/07/2018
Manu Mukerji (Criteo)
Criteo is a global leader in commerce marketing. Manu Mukerji walks you through Criteo's machine learning example for universal catalogs, explaining how the training and test sets are generated and annotated; how the model is pushed to production, automatically evaluated, and used; and which issues arise when applying ML at scale in production, along with lessons learned and more.
11:50am–12:30pm Wednesday, 03/07/2018
Strata Business Summit
Location: 210 A/E Level: Non-technical
Yishay Carmiel (IntelligentWire | Spoken Labs)
One of the most important tasks of AI has been to understand humans. People want machines to understand not only what they say but also what they mean and to take particular actions based on that information. This goal is the essence of conversational AI. Yishay Carmiel explores the latest breakthroughs and revolutions in this field and the challenges still to come. Read more.
11:50am–12:30pm Wednesday, 03/07/2018
Data-driven business management, Strata Business Summit
Location: 210 C/G Level: Beginner
Matthew Granade (Domino Data Lab)
Predictive analytics and artificial intelligence have become critical competitive capabilities. Yet IT teams struggle to provide the support data science teams need to succeed. Matthew Granade explains how leading banks, insurance and pharmaceutical companies, and others manage data science at scale. Read more.
11:50am–12:30pm Wednesday, 03/07/2018
Ari Gesher (Kairos Aerospace)
A warming planet needs precise, localized predictions about the effects of climate change to support good long-term and medium-term economic decision making. Ari Gesher demonstrates how to use a mix of physical simulation, enhanced scientific models, machine learning verification, and high-scale computing to produce climate predictions and package them as data products. Read more.
11:50am–12:30pm Wednesday, 03/07/2018 Secondary topics:  Graphs and Time-series
William Chambers (Databricks), Michael Armbrust (Databricks)
William Chambers and Michael Armbrust discuss the motivation and basics of Apache Spark's Structured Streaming processing engine and share lessons they've learned running hundreds of Structured Streaming workloads in the cloud. Read more.
11:50am–12:30pm Wednesday, 03/07/2018
Data engineering and architecture
Location: 230 C Level: Intermediate
Edwina Lu (LinkedIn), Ye Zhou (LinkedIn), Min Shen (LinkedIn)
Spark applications need to be well tuned so that individual applications run quickly and reliably and cluster resources are efficiently utilized. Edwina Lu, Ye Zhou, and Min Shen outline a fast, reliable, and automated process used at LinkedIn for tuning Spark applications, enabling users to quickly identify and fix problems. Read more.
11:50am–12:30pm Wednesday, 03/07/2018
Data science and machine learning
Location: Expo Hall 1 Level: Intermediate
Secondary topics:  Expo Hall
David Talby (Pacific AI), Santosh Kulkarni (Kaiser Permanente)
David Talby and Santosh Kulkarni explain how Kaiser Permanente uses the open source NLP library for Apache Spark to tackle one of the most common challenges with applying natural language processing in practice: integrating domain-specific NLP as part of a scalable, performant, measurable, and reproducible machine learning pipeline. Read more.

12:30pm

12:30pm–1:50pm Wednesday, 03/07/2018
Location: Hall 1, 2, 3
Topic Table discussions are a great way to informally network with people in similar industries or interested in the same topics. Read more.
12:30pm–1:50pm Wednesday, 03/07/2018
Location: Almaden Ballroom, San Jose Hilton
Want to network with other women attending Strata Data Conference? Then be sure to come to the Women in Big Data Luncheon on Wednesday. Read more.
12:30pm–1:50pm Wednesday, 03/07/2018
Location: San Jose Ballroom, Marriott
Join fellow executives, business leaders, and strategists for a networking lunch on Wednesday for Strata Business Summit attendees and speakers. Read more.

1:50pm

1:50pm–2:30pm Wednesday, 03/07/2018 Secondary topics:  Data Integration and Data Pipelines
Jordan Hambleton (Cloudera), Guru Medasani (Cloudera)
When Kafka stream processing pipelines fail, they can leave users panicked about data loss when restarting their application. Jordan Hambleton and Guru Medasani explain how offset management provides users the ability to restore the state of the stream throughout its lifecycle, deal with unexpected failure, and improve accuracy of results. Read more.
1:50pm–2:30pm Wednesday, 03/07/2018 Secondary topics:  Graphs and Time-series
Alexandra Gunderson (Arundo Analytics)
Heavy industries, such as oil and gas, have tremendous amounts of data from which predictive models could be built, but it takes weeks or even months to create a comprehensive dataset from all of the various data sources. Alexandra Gunderson details the methodology behind an industry-tested approach that incorporates machine learning to structure and link data from different sources. Read more.
1:50pm–2:30pm Wednesday, 03/07/2018
Sponsored
Location: LL20 B
Ted Dunning (MapR Technologies)
Getting value from data at large scale and on a variety of time scales is hard. True, it's not as hard as it used to be, but you still don’t win by default. Ted Dunning explains why it takes good design, the right technology, and a pragmatic approach. Read more.
1:50pm–2:30pm Wednesday, 03/07/2018
Data science and machine learning
Location: LL20 C Level: Advanced
Secondary topics:  Graphs and Time-series
Andrew Ray (Sam’s Club Technology)
Andrew Ray offers a brief introduction to the distributed graph algorithm abstractions provided by Pregel, PowerGraph, and GraphX, drawing on real-world examples, and provides historical context for the evolution between these three abstractions. Read more.
1:50pm–2:30pm Wednesday, 03/07/2018
Data science and machine learning
Location: LL20 D Level: Intermediate
Weisheng Xie (Intel), Peng Meng (Intel)
Intel has been deeply involved in Spark from its earliest moments. Vincent Xie and Peng Meng share what Intel has been working on with Spark ML and introduce the methodology behind Intel's work on Spark ML optimization. Read more.
1:50pm–2:30pm Wednesday, 03/07/2018
Data science and machine learning
Location: LL21 B Level: Intermediate
Secondary topics:  Graphs and Time-series
Chanchal Chatterjee (Google Cloud Platform)
Chanchal Chatterjee reveals how Wells Fargo was able to productionize credit risk analytics by leveraging LSTM-TensorSpark. Through a unique algorithm and process for model interpretations, Wells Fargo is now achieving an unprecedented 90%+ accuracy rate with its credit risk analysis. Read more.
1:50pm–2:30pm Wednesday, 03/07/2018
Data engineering and architecture
Location: LL21 C/D Level: Intermediate
Kurt Brown (Netflix)
Kurt Brown explains how to get the most out of your data infrastructure with 20 principles and practices used at Netflix. Kurt covers each in detail and explores how they relate to the technologies used at Netflix, including S3, Spark, Presto, Druid, R, Python, and Jupyter. Read more.
1:50pm–2:30pm Wednesday, 03/07/2018
Zhen Fan (JD.com), Wei Ting Chen (Intel)
Zhen Fan and Wei Ting Chen explain how JD.com uses Spark on Kubernetes in a production environment and why the company chose Spark on Kubernetes for its AI workloads. You'll learn how to run Spark with Kubernetes and gain an understanding of the advantages this provides. Read more.
1:50pm–2:30pm Wednesday, 03/07/2018
Data-driven business management, Strata Business Summit
Location: 210 A/E Level: Non-technical
Frances Haugen (Pinterest), Patrick Phelps (Pinterest)
Data science is most powerful when combined with deep domain knowledge, but those with domain knowledge don't work on data-focused teams. So how do you empower employees with diverse backgrounds and skill sets to be effective users of data? Frances Haugen and Patrick Phelps dive into the social side of data and share strategies for unlocking otherwise unobtainable insights. Read more.
1:50pm–2:30pm Wednesday, 03/07/2018
Law, ethics, and governance, Strata Business Summit
Location: 210 C/G Level: Beginner
John Mertic (The Linux Foundation), Maryna Strelchuk (ING)
John Mertic and Maryna Strelchuk detail the benefits of a vendor-neutral approach to data governance, explain the need for an open metadata standard, and share how companies like ING, IBM, Hortonworks, and more are delivering solutions to this challenge as an open source initiative. Read more.
1:50pm–2:30pm Wednesday, 03/07/2018
Strata Business Summit
Location: 210 D/H Level: Beginner
Ayin Vala (Foundation for Precision Medicine)
Complex diseases like Alzheimer’s cannot be cured by pharmaceutical or genetic sciences alone, and current treatments and therapies lead to mixed successes. Ayin Vala explains how to use the power of big data and AI to treat challenging diseases with personalized medicine, which takes into account individual variability in medicine intake, lifestyle, and genetic factors for each patient. Read more.
1:50pm–2:30pm Wednesday, 03/07/2018
Debasish Ghosh (Lightbend)
Debasish Ghosh explores the role that approximation data structures play in processing streaming data. Typically, streams are unbounded in space and time, and processing has to be done online using sublinear space. Debasish covers the probabilistic bounds that these data structures offer and shows how they can be used to implement solutions for fast and streaming architectures. Read more.
1:50pm–2:30pm Wednesday, 03/07/2018
Sponsored
Location: 230 B
Advanced analytics and AI workloads require a scalable and optimized architecture, from hardware and storage to software and applications. Kevin Huiskes and Radhika Rangarajan share best practices for accelerating analytics and AI and explain how businesses globally are leveraging Intel’s technology portfolio, along with optimized frameworks and libraries, to build AI workloads at scale. Read more.
1:50pm–2:30pm Wednesday, 03/07/2018
Siddharth Teotia (Dremio)
Query processing technology has rapidly evolved since the iconic C-Store paper was published in 2005, with a focus on designing query processing algorithms and data structures that efficiently utilize CPU and leverage the changing trends in hardware to deliver optimal performance. Siddharth Teotia outlines the different types of vectorized query processing in Dremio using Apache Arrow. Read more.
1:50pm–2:30pm Wednesday, 03/07/2018
Data science and machine learning
Location: Expo Hall 1 Level: Non-technical
Secondary topics:  Expo Hall
Harish Doddi (Datatron Technologies), Jerry Xu (Datatron Technologies)
Deploying machine learning and deep learning models in production is hard. Harish Doddi and Jerry Xu outline the enterprise data science lifecycle, covering how the production model deployment flow works, challenges, best practices, and lessons learned. Along the way, they explain why monitoring models in production should be mandatory. Read more.

2:40pm

2:40pm–3:20pm Wednesday, 03/07/2018 Secondary topics:  Data Integration and Data Pipelines
Sean Ma (Trifacta)
Organizations leverage reporting, analytic, and machine learning pipelines to drive decision making and power critical operational systems. Sean Ma discusses methods for detecting, visualizing, and resolving inconsistencies between source and target data models across these pipelines. Read more.
2:40pm–3:20pm Wednesday, 03/07/2018 Secondary topics:  Graphs and Time-series
Baron Schwartz (VividCortex)
Anomaly detection is white hot in the monitoring industry, but many don't really understand or care about it, while others repeat the same pattern many times. Why? And what can we do about it? Baron Schwartz explains how he arrived at a "post-anomaly detection" point of view. Read more.
2:40pm–3:20pm Wednesday, 03/07/2018
Evan Sparks (Determined AI)
Deep learning has shown tremendous improvements in a number of areas and has justifiably generated enormous excitement. However, several key challenges—from prohibitive hardware requirements to immature software offerings—are impeding widespread enterprise adoption. Evan Sparks details fundamental challenges facing organizations looking to adopt deep learning and shares possible solutions. Read more.
2:40pm–3:20pm Wednesday, 03/07/2018
Joseph Bradley (Databricks)
Joseph Bradley discusses common paths to productionizing Apache Spark MLlib models and shares engineering challenges and corresponding best practices. Along the way, Joseph covers several deployment scenarios, including batch scoring, Structured Streaming, and real-time low-latency serving. Read more.
2:40pm–3:20pm Wednesday, 03/07/2018
Sponsored
Location: LL21 A
Chuck Yarbrough (Hitachi Vantara)
Intelligently managing the data pipeline is the key to driving business acceleration and reducing costs. Chuck Yarbrough outlines ways to gain control over the data pipeline. Along the way, you’ll learn how cloud, big data, and machine learning models intersect and how streaming and cloud integration can help create the connected enterprise. Read more.
2:40pm–3:20pm Wednesday, 03/07/2018
Mohamed AbdelHady (Microsoft), Zoran Dzunic (Microsoft)
Mohamed AbdelHady and Zoran Dzunic demonstrate how to build a domain-specific entity extraction system from unstructured text using deep learning. In the model, domain-specific word embedding vectors are trained on a Spark cluster using millions of PubMed abstracts and then used as features to train an LSTM recurrent neural network for entity extraction. Read more.
2:40pm–3:20pm Wednesday, 03/07/2018
Data engineering and architecture
Location: LL21 C/D Level: Intermediate
Mark Grover (Lyft), Arup Malakar (Lyft)
Mark Grover and Arup Malakar offer an overview of how Lyft leverages application metrics, logs, and auditing to monitor and troubleshoot its data platform and share how the company dogfoods the platform to provide security, auditing, alerting, and replayability. They also detail some of the internal services and tools Lyft has developed to make its data more robust, scalable, and self-serving. Read more.
2:40pm–3:20pm Wednesday, 03/07/2018
Data engineering and architecture
Location: LL21 E/F Level: Non-technical
Ellen Friedman (MapR Technologies)
DataOps—a culture and practice for building data-intensive applications, including machine learning pipelines—expands DevOps philosophy to include data-heavy roles such as data engineering and data science. DataOps is based on cross-functional collaboration resulting in fast time to value and an agile workflow. Ellen Friedman offers an overview of DataOps and explains how to implement it. Read more.
2:40pm–3:20pm Wednesday, 03/07/2018
Strata Business Summit
Location: 210 A/E
Michael Chui (McKinsey Global Institute)
After decades of extravagant promises, artificial intelligence is finally starting to deliver real-life benefits to early adopters. However, we're still early in the cycle of adoption. Michael Chui explains where investment is going, patterns of AI adoption and value capture by enterprises, and how the value potential of AI across sectors and business functions is beginning to emerge. Read more.
2:40pm–3:20pm Wednesday, 03/07/2018
Data-driven business management, Strata Business Summit
Location: 210 C/G Level: Non-technical
Katie Malone (Civis Analytics), Skipper Seabold (Civis Analytics)
A huge challenge for data science managers is determining priorities for their teams, which often have more good ideas than they have time. Katie Malone and Skipper Seabold share a framework that their large and diverse data science team uses to identify, discuss, select, and manage data science projects for a fast-moving startup. Read more.
2:40pm–3:20pm Wednesday, 03/07/2018
Law, ethics, and governance, Strata Business Summit
Location: 210 D/H Level: Beginner
Or Herman-Saffar (Dell), Ran Taig (Dell EMC)
What if we could predict when and where crimes will be committed? Or Herman-Saffar and Ran Taig offer an overview of Crimes in Chicago, a publicly published dataset of reported incidents of crime that have occurred in Chicago since 2001. Or and Ran explain how to use this data to explore committed crimes to find interesting trends and make predictions for the future. Read more.
2:40pm–3:20pm Wednesday, 03/07/2018
Henry Cai (Pinterest), Yi Yin (Pinterest)
With the rise of large-scale real-time computation, there is a growing need to link legacy MySQL systems with real-time platforms. Henry Cai and Yi Yin offer an overview of WaterMill, Pinterest's continuous DB ingestion system for streaming SQL data into near-real-time computation pipelines to support dynamic personalized recommendations and search indices. Read more.
2:40pm–3:20pm Wednesday, 03/07/2018
Ritesh Agrawal (Uber), Anirban Deb (Uber)
Presto has emerged as the de facto query engine to quickly process petabytes of data. However, rogue SQL queries can waste a significant amount of critical compute resource and reduce Presto's throughput. Ritesh Agrawal and Anirban Deb explain how Uber uses machine learning to identify and stop rogue queries, saving both computational power and money. Read more.
2:40pm–3:20pm Wednesday, 03/07/2018 Secondary topics:  Expo Hall, Graphs and Time-series
Yu Xu (TigerGraph)
Graph databases are the fastest growing category in data management. However, most graph queries only traverse two hops in big graphs due to limitations in most graph databases. Real-world applications require deep link analytics that traverse far more than three hops. Yu Xu offers an overview of a fraud detection system that manages 100 billion graph elements to detect risk and fraudulent groups. Read more.

3:20pm

3:20pm–4:20pm Wednesday, 03/07/2018
Location: Hall 1, 2, 3
Afternoon break sponsored by IBM (1h)

4:20pm

4:20pm–5:00pm Wednesday, 03/07/2018 Secondary topics:  Data Integration and Data Pipelines
Dorna Bandari (Jetlore)
Dorna Bandari offers an overview of the machine learning pipeline at B2B AI startup Jetlore and explains why even small B2B startups in AI should invest in a flexible machine learning pipeline. Dorna covers the design choices, the trade-offs made when implementing and maintaining the pipeline, and how it has accelerated Jetlore's product development and growth. Read more.
4:20pm–5:00pm Wednesday, 03/07/2018
Data science and machine learning
Location: LL20 A Level: Intermediate
Secondary topics:  Graphs and Time-series
Joseph Richards (GE Digital)
Deploying ML software applications for use cases in the industrial internet presents a unique set of challenges. Data-driven problems require approaches that are highly accurate, robust, fast, scalable, and fault tolerant. Joseph Richards shares GE's approach to building production-grade ML applications and explores work across GE in industries such as power, aviation, and oil and gas. Read more.
4:20pm–5:00pm Wednesday, 03/07/2018
Sponsored
Location: LL20 B
Procter & Gamble relies heavily on data, particularly for BI. Running compute where the data lives is critical for performance, and the company has found added benefits to this architecture, which complements its Hadoop and BI needs. Terry McFadden offers an overview of P&G's modern analytics architecture and explains how it differs from traditional approaches. Read more.
4:20pm–5:00pm Wednesday, 03/07/2018 Secondary topics:  Graphs and Time-series
Vlad A Ionescu (ShiftLeft), Fabian Yamaguchi (ShiftLeft)
Vlad Ionescu and Fabian Yamaguchi outline Code Property Graph (CPG), a unique approach that represents the functional elements of code in an interconnected graph of data and control flows, which enables semantic information about code to be stored scalably on distributed graph databases over the web while allowing it to be rapidly accessed. Read more.
4:20pm–5:00pm Wednesday, 03/07/2018
Data science and machine learning
Location: LL20 D Level: Intermediate
Rachita Chandra (IBM Watson Health)
Rachita Chandra outlines challenges and considerations for transforming a research prototype built for a single machine to a deployable healthcare solution that leverages Spark in a distributed environment. Read more.
4:20pm–5:00pm Wednesday, 03/07/2018
Data science and machine learning
Location: LL21 B Level: Intermediate
Secondary topics:  Graphs and Time-series
Andrea Pasqua (Uber), Anny Chen (Uber)
Time series forecasting and anomaly detection are of utmost importance at Uber. However, the scale of the problem, the need for speed, and the importance of accuracy make anomaly detection a challenging data science problem. Andrea Pasqua and Anny Chen explain how the use of recurrent neural networks is allowing Uber to meet this challenge. Read more.
4:20pm–5:00pm Wednesday, 03/07/2018
Carlo Torniai (Pirelli Tyre)
Carlo Torniai shares the architectural challenges Pirelli faced in building Pirelli Connesso, an IoT cloud-based system providing information on tire operating conditions, consumption, and maintenance, and highlights the operative approaches that enabled the integration of contributions across cross-functional teams. Read more.
4:20pm–5:00pm Wednesday, 03/07/2018
Rahim Daya (Pinterest)
Personalization is a powerful tool for building sticky and impactful product experiences. Rahim Daya shares Pinterest's frameworks for building personalized user experiences, from sourcing the right contextual data to designing and evaluating personalization algorithms that can delight the user. Read more.
4:20pm–5:00pm Wednesday, 03/07/2018
Data-driven business management, Strata Business Summit
Location: 210 A/E Level: Intermediate
Ted Malaska (Blizzard Entertainment), Jonathan Seidman (Cloudera)
Recent years have seen dramatic advancements in the technologies available for managing and processing data. While these technologies provide powerful tools to build data applications, they also require new skills. Ted Malaska and Jonathan Seidman explain how to evaluate these new technologies and build teams that can leverage them effectively and achieve ROI with your data initiatives. Read more.
4:20pm–5:00pm Wednesday, 03/07/2018
Location: 210 C/G
TBC
4:20pm–5:00pm Wednesday, 03/07/2018
Data-driven business management, Strata Business Summit
Location: 210 D/H Level: Intermediate
Kapil Surlaker (LinkedIn), Ya Xu (LinkedIn)
Metrics measurement and experimentation play crucial roles in every product decision at LinkedIn. Kapil Surlaker and Ya Xu explain why, to meet the company's needs, LinkedIn built the UMP and XLNT platforms for metrics computation and experimentation, respectively, which have allowed the company to perform measurement and experimentation very efficiently at scale while preserving trust in data. Read more.
4:20pm–5:00pm Wednesday, 03/07/2018 Secondary topics:  Graphs and Time-series
Sijie Guo (Streamlio)
Apache BookKeeper, a scalable, fault-tolerant, and low-latency storage service optimized for real-time workloads, has been widely adopted by enterprises like Twitter, Yahoo, and Salesforce to store and serve mission-critical data. Sijie Guo explains how Apache BookKeeper satisfies the needs of stream storage. Read more.
4:20pm–5:00pm Wednesday, 03/07/2018
Sponsored
Location: 230 B
Alexander Ryabov (Wargaming)
Alexander Ryabov explains how Wargaming is winning the battle for bigger profits in the virtual world of online gaming using a best-in-class business intelligence solution to equip its business units with decision-making tools. Read more.
4:20pm–5:00pm Wednesday, 03/07/2018
Data engineering and architecture
Location: 230 C Level: Intermediate
Gian Merlino (Imply)
Gian Merlino discusses the SQL layer recently added to the open source Druid project. It's based on Apache Calcite, which bills itself as "the foundation for your next high-performance database." Gian explains how Druid and Calcite are integrated and why you should stop worrying and learn to love relational algebra in your own projects. Read more.
4:20pm–5:00pm Wednesday, 03/07/2018
Data engineering and architecture
Location: Expo Hall 1
Secondary topics:  Expo Hall
One of the key application domains leveraging live data is smart cities, but success depends on the availability of generic platforms that support high throughput and ultralow latency. Arun Kejariwal and Francois Orsini offer an overview of Satori's live data platform and walk you through a country-scale case study of its implementation. Read more.

5:10pm

5:10pm–5:50pm Wednesday, 03/07/2018 Secondary topics:  Data Integration and Data Pipelines
Abe Gong (Superconductive Health), James Campbell (USG)
Data science and engineering have been missing out on one of the biggest productivity boosters in modern software development: automated testing. Abe Gong and James Campbell discuss the concept of pipeline tests and offer an overview of Great Expectations, an open source Python framework for bringing data pipelines and products under test. Read more.
5:10pm–5:50pm Wednesday, 03/07/2018 Secondary topics:  Graphs and Time-series
Roger Barga (Amazon Web Services), Nina Mishra (Amazon Web Services), Sudipto Guha (Amazon Web Services), Ryan Nienhuis (Amazon Web Services)
Roger Barga, Nina Mishra, Sudipto Guha, and Ryan Nienhuis detail continuous machine learning algorithms that discover useful information in streaming data. They focus on explainable machine learning, including anomaly detection with attribution, the ability to reduce false positives through user feedback, and the detection of anomalies in directed graphs. Read more.
5:10pm–5:50pm Wednesday, 03/07/2018
Data science and machine learning
Location: LL20 C Level: Intermediate
Mike Conover (SkipFlag)
Mike Conover offers an overview of the essential techniques for understanding and working with natural language. From off-the-shelf neural networks and snappy preprocessing libraries to architectural patterns for bulletproof productionization, this talk will be of interest to anyone who uses language on a regular basis. Read more.
5:10pm–5:50pm Wednesday, 03/07/2018
Balasubramanian Narasimhan (Stanford University), John-Mark Agosta (Microsoft), Philip Lavori (Stanford University)
Clinical collaboration benefits from pooling data to train models from large datasets, but it's hampered by concerns about sharing data. Balasubramanian Narasimhan, John-Mark Agosta, and Philip Lavori outline a privacy-preserving alternative that creates statistical models equivalent to one from the entire dataset. Read more.
5:10pm–5:50pm Wednesday, 03/07/2018
Sergey Ermolin (Intel), Suqiang Song (Mastercard)
Sergey Ermolin and Suqiang Song demonstrate how to use Spark BigDL wide and deep and neural collaborative filtering (NCF) algorithms to predict a user’s probability of shopping at a particular offer merchant during a campaign period. Along the way, they compare the deep learning results with those obtained by MLlib’s alternating least squares (ALS) approach. Read more.
5:10pm–5:50pm Wednesday, 03/07/2018
Thomas Phelan (BlueData)
Recent headline-grabbing data breaches demonstrate that protecting data is essential for every enterprise. The best-of-breed approach for big data is HDFS configured with Transparent Data Encryption (TDE). However, TDE can be difficult to configure and manage—issues that are only compounded when running on Docker containers. Thomas Phelan explores these challenges and how to overcome them. Read more.
5:10pm–5:50pm Wednesday, 03/07/2018
Data engineering and architecture
Location: LL21 E/F Level: Beginner
Ted Dunning (MapR Technologies)
Ted Dunning offers an overview of the rendezvous architecture, developed to be the "continuous integration" system for machine learning models. It allows always-hot, zero-latency rollout and rollback of new models and supports extensive metrics and diagnostics so models can be compared as they process production data. It can even hot-swap the framework itself with no downtime. Read more.
5:10pm–5:50pm Wednesday, 03/07/2018
Strata Business Summit
Location: 210 A/E
Alysa Z. Hutnik (Kelley Drye & Warren LLP), Crystal Skelton (Kelley Drye & Warren LLP)
Big data promises enormous benefits for companies. But what about privacy, data protection, and consumer laws? Having a solid understanding of the legal and self-regulatory rules of the road is key to maximizing the value of your data while avoiding data disasters. Alysa Hutnik and Crystal Skelton share legal best practices and practical tips to avoid becoming a big data “don’t.” Read more.
5:10pm–5:50pm Wednesday, 03/07/2018
Data-driven business management, Strata Business Summit
Location: 210 C/G Level: Non-technical
Angela Zutavern (Booz Allen Hamilton)
How can you most effectively use machine intelligence to drive strategy? By merging it in the right way with the human ingenuity of leaders throughout your organization. Angela Zutavern shares insights from her work with pioneering companies, government agencies, and nonprofits that are successfully navigating this partnership by becoming “mathematical corporations.” Read more.
5:10pm–5:50pm Wednesday, 03/07/2018
Location: 210 D/H
Derek Ruths (CAI)
Unreasonable sales forecasts, badly overstocked inventory, misguided investments . . . bad analyses happen all the time, leading to bad decisions and costing businesses millions of dollars. Derek Ruths shares the five most common issues that lead to bad data-informed thinking. Read more.
5:10pm–5:50pm Wednesday, 03/07/2018
Fabian Hueske (data Artisans), Shuyi Chen (Uber)
Fabian Hueske and Shuyi Chen explore SQL's role in the world of streaming data and its implementation in Apache Flink and cover fundamental concepts, such as streaming semantics, event time, and incremental results. They also share their experience using Flink SQL in production at Uber, explaining how Uber leverages Flink SQL to solve its unique business challenges. Read more.
5:10pm–5:50pm Wednesday, 03/07/2018
Data engineering and architecture
Location: 230 C Level: Advanced
Ash Munshi (Pepperdata)
Ash Munshi shares techniques for labeling big data apps using runtime measurements of CPU, memory, I/O, and network and details a deep neural network to help operators understand the types of apps running on the cluster and better predict runtimes, tune resource utilization, and increase efficiency. These methods are new and are the first approach to classify multivariate time series. Read more.
5:10pm–5:50pm Wednesday, 03/07/2018
Data science and machine learning
Location: Expo Hall 1
Secondary topics:  Expo Hall
Rodney Mullen (Almost Skateboards)
The essence of modern skating is learning tricks that couple with specific terrain. Activision’s video game franchise testifies to the nearly endless possibilities. Rodney Mullen offers a nuanced look at how skaters nudge the endpoints of disparate submovements to create new combinations that may shine a different light on ideas in machine learning—plus it’s a lot of fun. Read more.

5:50pm

5:50pm–6:50pm Wednesday, 03/07/2018
Location: Hall 1, 2, 3
Join us for vendor-hosted libations (plus snacks) after sessions on Wednesday. Read more.

7:00pm

7:00pm–9:30pm Wednesday, 03/07/2018
Location: San Pedro Market
Join us at San Pedro Square Market for an exciting evening filled with cocktails, food, and entertainment at Data After Dark at Strata in San Jose. Read more.

Thursday, 03/08/2018

8:00am

8:00am–8:30am Thursday, 03/08/2018
Location: Concourse foyer
Gather before keynotes on Thursday morning for a speed networking event. Enjoy casual conversation while meeting fellow attendees. If one of your goals at Strata is to meet new people, this session will jumpstart your networking with fellow attendees. Read more.

8:45am

8:45am–8:50am Thursday, 03/08/2018
Location: Grand Ballroom 220
Ben Lorica (O'Reilly Media), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes. Read more.

8:50am

8:50am–9:00am Thursday, 03/08/2018
Location: Grand Ballroom 220
Janelle Shane (aiweirdness.com)
At AIweirdness.com Janelle Shane posts the results of neural network experiments gone delightfully wrong. But machine learning mistakes can also be very embarrassing or even dangerous. Using silly datasets as examples, Janelle talks about some ways that algorithms fail. Read more.

9:10am

9:10am–9:30am Thursday, 03/08/2018
Location: Grand Ballroom 220 Level: Intermediate
Eric Colson (Stitch Fix)
While companies often use data science as a supportive function, the emergence of new business models has made it possible for some companies to differentiate via data science. Eric Colson explores what it means to differentiate by data science and explains why companies must now think very differently about the role and placement of data science in the organization. Read more.

9:30am

9:30am–9:40am Thursday, 03/08/2018
Location: Grand Ballroom 220
Amr Awadallah (Cloudera)
Amr Awadallah explains why the cloud requires a different approach to machine learning and analytics and what you can do about it. Read more.

9:45am

9:45am–10:05am Thursday, 03/08/2018
Location: Grand Ballroom 220
Ajey Gore (GO-JEK)
Ajey Gore details GO-JEK's evolution from a small bike-hailing startup to a technology-focused unicorn in the areas of transportation, lifestyle, payments, and social enterprise and explains how the company is focusing its attention beyond urban Indonesia to impact more than a million people across the country's rural areas. Read more.

10:05am

10:05am–10:25am Thursday, 03/08/2018
Location: Grand Ballroom 220
Seth Stephens-Davidowitz (Everybody Lies | NY Times)
Seth Stephens-Davidowitz explains how to use Google searches to uncover behaviors or attitudes that may be hidden from traditional surveys, such as racism, sexuality, child abuse, and abortion. Read more.

10:30am

10:30am–11:00am Thursday, 03/08/2018
Location: Hall 1, 2, 3
Break (30m)

11:00am

11:00am–11:40am Thursday, 03/08/2018
Ask Me Anything
Location: 212 A-B
Burcu Baran (LinkedIn), Wei Di (LinkedIn)
Join Burcu Baran and Wei Di to discuss big data in business analytics, machine learning in business analytics, and achieving actionable insights from big data. Read more.
11:00am–11:40am Thursday, 03/08/2018
Clare Gollnick (Terbium Labs)
At the heart of the reproducibility crisis in the sciences is the widespread misapplication of statistics. Data science relies on the same statistical methodology as these scientific fields. Can we avoid the same crisis of integrity? Clare Gollnick considers the philosophy of data science and shares a framework that explains (and even predicts) the likelihood of success of a data project. Read more.
11:00am–11:40am Thursday, 03/08/2018
Secondary topics: Graphs and Time-series
Ryan Boyd (Neo4j)
Ryan Boyd explains how he and his team reconstructed a subset of the Twitter network of Russian troll accounts and applied graph analytics to the data using the Neo4j graph database to uncover how these accounts were spreading fake news. Read more.
11:00am–11:40am Thursday, 03/08/2018
Data science and machine learning
Location: LL20 C Level: Beginner
Mike Ruberry (ZestFinance)
What does it mean to explain a machine learning model, and why is it important? Mike Ruberry offers an overview of several modern explainability methods, including traditional feature contributions, LIME, and DeepLift. Each of these techniques presents a different perspective, and their clever application can reveal new insights and solve business requirements. Read more.
11:00am–11:40am Thursday, 03/08/2018
Data science and machine learning
Location: LL20 D Level: Intermediate
Josh Wills (Slack)
Josh Wills describes recent data science and machine learning projects at Slack. Read more.
11:00am–11:40am Thursday, 03/08/2018
Data science and machine learning
Location: LL21 B Level: Intermediate
Karthik Ramasamy (Uber), Lenny Evans (Uber)
Stolen credit cards are a major problem faced by many companies, including Uber. Karthik Ramasamy and Lenny Evans detail a new weapon against stolen credit cards that uses computer vision to scan credit cards, verifying possession of the physical card with basic fake card detection capabilities. Read more.
11:00am–11:40am Thursday, 03/08/2018
Data engineering and architecture
Location: LL21 C/D Level: Intermediate
Francesca Lazzeri (Microsoft), Fidan Boylu Uz (Microsoft)
Francesca Lazzeri and Fidan Boylu Uz explain how to operationalize LSTM networks to predict the remaining useful life of aircraft engines. They use simulated aircraft sensor values to predict when an aircraft engine will fail in the future so that maintenance can be planned in advance. Read more.
11:00am–11:40am Thursday, 03/08/2018
Greg Rahn (Cloudera)
For many organizations, the cloud will likely be the destination of their next big data warehouse. Greg Rahn shares considerations when evaluating the cloud for analytics and big data warehousing in order to help you get the most from the cloud. You'll leave with an understanding of different architectural approaches and impacts for moving analytic workloads to the cloud. Read more.
11:00am–11:40am Thursday, 03/08/2018
Strata Business Summit
Location: 210 A/E
Mike Olson (Cloudera)
Mike Olson shares examples of real-world machine learning applications, explores a variety of challenges in putting these capabilities into production (including the speed with which technology is moving, cloud versus in-data-center consumption, security and regulatory compliance, and skills and agility in getting data and answers into the right hands), and outlines proven ways to meet them. Read more.
11:00am–11:40am Thursday, 03/08/2018
Secondary topics: Graphs and Time-series
Michael Schrenk (Self-Employed)
Big data becomes much more powerful when it has context. Fortunately, creative data scientists can create needed context through the use of metadata. Michael Schrenk explains how metadata is created and used to gain competitive advantages, predict troop strength, or even guess Social Security numbers. Read more.
11:00am–11:40am Thursday, 03/08/2018
Law, ethics, and governance, Strata Business Summit
Location: 210 D/H Level: Beginner
Sugreev Chawla (Thorn)
Sugreev Chawla offers an overview of Spotlight, a tool created by Thorn, a nonprofit that uses technology to fight online child sexual exploitation. It allows law enforcement to process millions of escort ads per month in an effort to fight sex trafficking, using graph analysis, time series analysis and NLP techniques to surface important networks of ads and characterize their behavior over time. Read more.
11:00am–11:40am Thursday, 03/08/2018
Tyler Akidau (Google)
What does it mean to execute streaming queries in SQL? What is the relationship of streaming queries to classic relational queries? Are streams and tables the same thing? And how does all of this relate to the programmatic frameworks we’re all familiar with? Tyler Akidau answers these questions and more as he walks you through key concepts underpinning data processing in general. Read more.
11:00am–11:40am Thursday, 03/08/2018
Data engineering and architecture
Location: 230 C Level: Intermediate
Jiangjie Qin (LinkedIn)
LinkedIn runs more than 1,800 Kafka brokers that deliver more than two trillion messages a day. Running Kafka at such a scale makes automated operations a necessity. Jiangjie Qin shares lessons learned from operating Kafka at scale with minimum human intervention. Read more.
11:00am–11:40am Thursday, 03/08/2018
Secondary topics: Expo Hall
Dean Wampler (Lightbend)
Dean Wampler explores two microservice streaming applications based on Kafka to compare and contrast using Akka Streams and Kafka Streams for data processing. Dean discusses the strengths and weaknesses of each tool for particular design needs and contrasts them with Spark Streaming and Flink, so you'll know when to choose them instead. Read more.

11:50am

11:50am–12:30pm Thursday, 03/08/2018
Ask Me Anything
Location: 212 A-B
Dean Wampler (Lightbend), Boris Lublinsky (Lightbend)
Join Dean Wampler and Boris Lublinsky to discuss all things streaming, from architecture and implementation to streaming engines and frameworks. Be sure to bring your questions about techniques for serving machine learning models in production, traditional big data systems, or software architecture in general. Read more.
11:50am–12:30pm Thursday, 03/08/2018
Secondary topics: Graphs and Time-series
Ram Shankar Siva Kumar (Microsoft (Azure Security Data Science))
How should you best debug a security data science system: change the ML approach, redefine the security scenario, or start over from scratch? Ram Shankar answers this question by sharing the results of failed experiments and the lessons learned when building ML detections for cloud lateral movement, identifying anomalous executables, and automating the incident response process. Read more.
11:50am–12:30pm Thursday, 03/08/2018
Data engineering and architecture
Location: LL20 B Level: Beginner
Umur Cubukcu (Citus Data)
PostgreSQL is often regarded as the world’s most advanced open source database—and it’s on fire. Umur Cubukcu moves beyond the typical list of features in the next release to explore why so many new projects “just use Postgres” as their system of record (or system of engagement) at scale. Along the way, you'll learn how PostgreSQL's extension APIs are fueling innovations in relational databases. Read more.
11:50am–12:30pm Thursday, 03/08/2018
Data science and machine learning
Location: LL20 C Level: Intermediate
Stephen O'Sullivan (Data Whisperers)
Stephen O'Sullivan takes you along the data science journey, from on-boarding data (using a number of data/object stores) to understanding and choosing the right data format for the data assets to using query engines (and basic query tuning). You'll learn some new skills to help you be more productive and reduce contention with the data engineering team. Read more.
11:50am–12:30pm Thursday, 03/08/2018
Data science and machine learning
Location: LL20 D Level: Intermediate
Adam Greenhall explains how Lyft uses simulation to test out new algorithms, help develop new features, and study the economics of ride-sharing markets as they grow. Read more.
11:50am–12:30pm Thursday, 03/08/2018
Jennie Wang (Intel), Valentina Pedoia (UCSF), Berk Norman (UCSF), Yulia Tell (Intel)
Damage to the meniscus is a physically limiting injury that can lead to further medical complications. Automatically classifying this damage at the time of an MRI scan would allow quicker and more accurate diagnosis. Jennie Wang, Valentina Pedoia, Berk Norman, and Yulia Tell offer an overview of their classification system built with 3D convolutional neural networks using BigDL on Apache Spark. Read more.
11:50am–12:30pm Thursday, 03/08/2018
Dong Meng (MapR)
Deep learning model performance relies on underlying data. Dong Meng offers an overview of a converged data platform that serves as a data infrastructure, providing a distributed filesystem, key-value storage and streams, and Kubernetes as an orchestration layer to manage containers to train and deploy deep learning models using GPU clusters. Read more.
11:50am–12:30pm Thursday, 03/08/2018
Szehon Ho (Criteo), Pawel Szostek (Criteo)
Hive is the main data transformation tool at Criteo, and hundreds of analysts and thousands of automated jobs run Hive queries every day. Szehon Ho and Pawel Szostek discuss the evolution of Criteo's Hive platform from an error-prone add-on installed on some spare machines to a best-in-class installation capable of self-healing and automatically scaling to handle its growing load. Read more.
11:50am–12:30pm Thursday, 03/08/2018
Data-driven business management, Strata Business Summit
Location: 210 A/E Level: Intermediate
David Talby (Pacific AI)
Machine learning and data science systems often fail in production in unexpected ways. David Talby shares real-world case studies showing why this happens and explains what you can do about it, covering best practices and lessons learned from a decade of experience building and operating such systems at Fortune 500 companies across several industries. Read more.
11:50am–12:30pm Thursday, 03/08/2018
Data-driven business management, Strata Business Summit
Location: 210 C/G Level: Beginner
Paco Nathan (O'Reilly Media)
Human in the loop (HITL) has emerged as a key design pattern for managing teams where people and machines collaborate. Such systems are mostly automated, with exceptions referred to human experts, who help train the machines further. Paco Nathan offers an overview of HITL from the perspective of a business manager, focusing on use cases within O'Reilly Media. Read more.
11:50am–12:30pm Thursday, 03/08/2018
Data-driven business management, Strata Business Summit
Location: 210 D/H Level: Intermediate
With so many business intelligence tools in the Hadoop ecosystem and no common measure to identify the efficiency of each tool, where do you begin to build or modify your enterprise data lake strategy? Sagar Kewalramani shares real-world BI problems and how they were resolved with Hadoop tools and demonstrates how to build an effective data lake strategy with open source tools and components. Read more.
11:50am–12:30pm Thursday, 03/08/2018
Karthik Ramasamy (Streamlio), Sanjeev Kulkarni (Streamlio)
Stream processing systems must support a number of different types of processing semantics due to the diverse nature of streaming applications. Karthik Ramasamy and Sanjeev Kulkarni explore effectively once, exactly once, and other types of stateful processing techniques, explain how they are implemented in Heron, and discuss how applications can benefit. Read more.
11:50am–12:30pm Thursday, 03/08/2018
Data engineering and architecture
Location: 230 C Level: Intermediate
Secondary topics: Graphs and Time-series
Alexis Roos (Salesforce), Noah Burbank (Salesforce)
In the customer age, being able to extract relevant communications information in real time and cross-reference it with context is key. Alexis Roos and Noah Burbank explain how Salesforce uses data science and engineering to enable salespeople to monitor their emails in real time to surface insights and recommendations using a graph that models contextual data. Read more.
11:50am–12:30pm Thursday, 03/08/2018
Secondary topics: Expo Hall
Mike Driscoll (Metamarkets)
There’s a make-or-break step ahead for AI development. AI tools shouldn’t be designed to replace humans; they should be built with them in mind. We need to focus on translating data from machine learning models into beautiful, intuitive visuals. Mike Driscoll shares advice for creators of next-gen predictive algorithms from his experience turning big data into interactive visualizations. Read more.

12:30pm

12:30pm–1:50pm Thursday, 03/08/2018
Location: Hall 1, 2, 3
Topic Table discussions are a great way to informally network with people in similar industries or interested in the same topics. Read more.
12:30pm–1:50pm Thursday, 03/08/2018
Location: San Jose Ballroom, Marriott
Join Strata Business Summit speakers and attendees for a networking lunch on Thursday. Read more.

1:50pm

1:50pm–2:30pm Thursday, 03/08/2018
Location: 212 A-B
TBC
1:50pm–2:30pm Thursday, 03/08/2018
Ram Sriharsha (Databricks)
How do you scale geospatial analytics on big data? And while you're at it, can you make it easy to use while achieving state-of-the-art performance on a single node? Ram Sriharsha offers an overview of Magellan—a geospatial optimization engine that seamlessly integrates with Spark—and explains how it provides scalability and performance without sacrificing simplicity. Read more.
1:50pm–2:30pm Thursday, 03/08/2018
Data engineering and architecture
Location: LL20 B Level: Intermediate
Josef Viehhauser (BMW Group), Tobias Bürger (BMW Group)
The BMW Group IT team drives the usage of data-driven technologies and forms the nucleus of a data-centric culture inside of the organization. Josef Viehhauser and Tobias Bürger discuss the E-to-E relationship of data and models and share best practices for scaling applications in real-world environments. Read more.
1:50pm–2:30pm Thursday, 03/08/2018
Veronica Mapes (Pinterest), Garner Chung (Pinterest)
Veronica Mapes and Garner Chung detail the human evaluation platform Pinterest developed to better serve its deep learning and operational teams when its needs grew beyond platforms like Mechanical Turk. Along the way, they cover tricks for increasing data reliability and judgement reproducibility and explain how Pinterest integrated end-user-sourced judgements into its in-house platform. Read more.
1:50pm–2:30pm Thursday, 03/08/2018
Jennifer Prendki (Atlassian)
Jennifer Prendki explains how to develop machine learning models even if the data is protected by privacy and compliance laws and cannot be used without anonymizing, covering techniques ranging from contextual bandits to document vector representation. Read more.
1:50pm–2:30pm Thursday, 03/08/2018
Delip Rao (R7 Speech Science)
Spoken conversations have rich information beyond what was said in words. Delip Rao details the potential of spoken conversational datasets, including identifying speakers and their demographic attributes, understanding intent and dynamics between speakers, and so on. Delip also discusses some of the latest science, including some of the work developed at R7. Read more.
1:50pm–2:30pm Thursday, 03/08/2018
Emre Velipasaoglu (Lightbend)
Most machine learning algorithms are designed to work on stationary data, but real-life streaming data is rarely stationary. Models lose prediction accuracy over time if they are not retrained. Without model quality monitoring, retraining decisions are suboptimal and costly. Emre Velipasaoglu evaluates monitoring methods for applicability in modern fast data and streaming applications. Read more.
1:50pm–2:30pm Thursday, 03/08/2018
Data engineering and architecture
Location: LL21 E/F Level: Intermediate
Chris Harland (Textio)
The number of resources explaining how to build a machine learning model from data greatly overshadows information on how to make real data products from such models, creating a gap between what machine learning engineers and data scientists know is possible and what users experience. Using examples from Textio's augmented writing platform, Chris Harland walks you through building a data product. Read more.
1:50pm–2:30pm Thursday, 03/08/2018
Law, ethics, and governance, Strata Business Summit
Location: 210 A/E Level: Intermediate
Mark Donsky (Cloudera), Steven Ross (Cloudera)
In May 2018, the General Data Protection Regulation (GDPR) goes into effect for firms doing business in the EU, but many companies aren't prepared for the strict regulation or fines for noncompliance (up to €20 million or 4% of global annual revenue). Mark Donsky outlines the capabilities your data environment needs to simplify compliance with GDPR and future regulations. Read more.
1:50pm–2:30pm Thursday, 03/08/2018
Location: 210 C/G
Alex Rosenblat (Data & Society Research Institute)
Ride-hail drivers work alone, but they’re banding together online to compare notes, uncover new policies, and help each other navigate a workplace characterized by information scarcity. Alex Rosenblat explores how ride-hail workers are using online forums to create their own workplace culture as employment relationships grow more remote and algorithms replace human managers. Read more.
1:50pm–2:30pm Thursday, 03/08/2018
Michael Lysaght (Weight Watchers), Steven Levine (Weight Watchers)
For organizations stuck in a myriad of legacy infrastructure, the path to AI and deep learning seems impossible. Michael Lysaght, Steven Levine, and Nicolas Chikhani discuss Weight Watchers's transition from a traditional BI organization to one that uses data effectively, covering the company's needs, the changes that were required, and the technologies and architecture used to achieve its goals. Read more.
1:50pm–2:30pm Thursday, 03/08/2018
Data engineering and architecture
Location: 230 A Level: Intermediate
Secondary topics: Graphs and Time-series
Michael Freedman (TimescaleDB | Princeton)
Michael Freedman offers an overview of TimescaleDB, a new open source scale-out database designed for time series workloads yet engineered as a plugin to Postgres. Unlike most time series newcomers, TimescaleDB supports full SQL while achieving fast ingest and complex queries. Read more.
1:50pm–2:30pm Thursday, 03/08/2018
Holden Karau (Google), Rachel Warren (Salesforce Einstein)
Holden Karau and Rachel Warren explore the state of the current big data ecosystem and explain how to best work with it in non-JVM languages. While much of the focus will be on Python + Spark, the talk will also include interesting anecdotes about how these lessons apply to other systems (including Kafka). Read more.
1:50pm–2:30pm Thursday, 03/08/2018
Secondary topics: Expo Hall, Graphs and Time-series
Roy Ben-Alta (Amazon Web Services), Ira Cohen (Anodot)
Many domains, such as mobile, web, the IoT, ecommerce, and more, have turned to analyzing streaming data. However, this presents challenges both in transforming the raw data to metrics and in automatically analyzing the metrics to produce insights. Roy Ben-Alta and Ira Cohen share a solution implemented using Amazon Kinesis as the real-time pipeline feeding Anodot's anomaly detection solution. Read more.

2:40pm

2:40pm–3:20pm Thursday, 03/08/2018
Ask Me Anything
Location: 212 A-B
Dr. Vijay Srinivas Agneeswaran (SapientRazorfish), Abhishek Kumar (SapientRazorfish)
Join Dr. Vijay Srinivas Agneeswaran and Abhishek Kumar to discuss recommender systems, deep learning, why to use deep learning for recommender systems, TensorFlow, and deep learning-based recommender systems in TensorFlow. Read more.
2:40pm–3:20pm Thursday, 03/08/2018
Data science and machine learning
Location: LL20 A Level: Intermediate
Ian Cook (Cloudera)
The popular R package dplyr provides a consistent grammar for data manipulation that can abstract over diverse data sources. Ian Cook shows how you can use dplyr to query large-scale data using different processing engines including Spark and Impala. He demonstrates the R package sparklyr (from RStudio) and the new R package implyr (from Cloudera) and shares tips for making dplyr code portable. Read more.
2:40pm–3:20pm Thursday, 03/08/2018
Data science and machine learning, Law, ethics, and governance
Location: LL20 C Level: Intermediate
Pramit Choudhary (DataScience.com)
Pramit Choudhary explores the usefulness of a generative approach that applies Bayesian inference to generate human-interpretable decision sets in the form of "if. . .and else" statements. These human-interpretable decision lists with high posterior probabilities might be the right way to balance model interpretability, performance, and computation. Read more.
2:40pm–3:20pm Thursday, 03/08/2018
Data science and machine learning
Location: LL20 D Level: Intermediate
Simon Hughes (Dice.com), Yuri Bykov (Dice.com)
Dice.com recently released several free career advisory tools for technology professionals, including a salary predictor, a tool that recommends the next skills to learn, and a career path explorer. Simon Hughes and Yuri Bykov offer an overview of the machine learning algorithms behind these tools and the technologies used to build, deploy, and monitor these solutions in production. Read more.
2:40pm–3:20pm Thursday, 03/08/2018
Sponsored
Location: LL21 A
Tendu Yogurtcu (Syncsort)
Chefs must be able to trust the authenticity, quality, and origin of their ingredients; data analysts must be able to do the same of their data—and what happens to it along the way. Tendü Yoğurtçu explains how to seamlessly track the lineage and quality of your data—on and off the cluster, on-premises or in the cloud—to deliver meaningful insights and meet regulatory compliance requirements. Read more.
2:40pm–3:20pm Thursday, 03/08/2018
Data science and machine learning
Location: LL21 B Level: Intermediate
Patrick Harrison (S&P Global)
Word vector embeddings are everywhere, but relatively few understand how they produce their remarkable results. Patrick Harrison opens up the black box of a popular word embedding algorithm and walks you through how it works its magic. Along the way, Patrick also covers core neural network concepts, including hidden layers, loss gradients, backpropagation, and more. Read more.
2:40pm–3:20pm Thursday, 03/08/2018
Michelle Casbon (Google Cloud Platform Developer Relations)
Michelle Casbon explains how to speed up the development of ML models by using open source tools such as Kubernetes, Docker, Scala, Apache Spark, and Weave Flux, detailing how to build resilient systems so that you can spend more of your time on product improvement rather than triage and uptime. Read more.
2:40pm–3:20pm Thursday, 03/08/2018
Ajay Mothukuri (Sapient), Arunkumar Ramanatha (Sapient), Dr. Vijay Srinivas Agneeswaran (SapientRazorfish)
Ajay Mothukuri, Arunkumar Ramanatha, and Vijay Srinivas Agneeswaran explain how to use open source blockchain technologies such as Hyperledger to implement the European Union's General Data Protection Regulation (GDPR). Read more.
2:40pm–3:20pm Thursday, 03/08/2018
Data-driven business management, Strata Business Summit
Location: 210 A/E Level: Non-technical
Anjali Thakur (Accenture)
Whether you are a technology or a services provider, understanding your value in the ecosystem and focusing on the right partners to reach your market goals is critical. Anjali Thakur shares examples of teaming models and leading practices for accelerating value from your ecosystem strategy. Read more.
2:40pm–3:20pm Thursday, 03/08/2018
Data-driven business management, Strata Business Summit
Location: 210 C/G Level: Beginner
Brian Karfunkel (Pinterest)
When software companies use A/B tests to evaluate product changes and fail to accurately estimate the long-term impact of such experiments, they risk optimizing for the users they have at the expense of the users they want to have. Brian Karfunkel explains how to estimate an experiment’s impact over time, thus mitigating this risk and giving full credit to experiments targeted at noncore users. Read more.
2:40pm–3:20pm Thursday, 03/08/2018
Matt Derda (Trifacta), Harrison Lynch, Sr. (Consensus Corporation)
In this session, Consensus will discuss how it leverages the combined power of data wrangling and machine learning to more efficiently identify and reduce retail fraud. The session will also explain how adopting data wrangling technology Trifacta has reduced data wrangling from six weeks to one week. Read more.
2:40pm–3:20pm Thursday, 03/08/2018
Secondary topics: Graphs and Time-series
Fabian Hueske (data Artisans), Flavio Junqueira (Dell EMC)
Flavio Junqueira and Fabian Hueske detail an open source streaming data stack consisting of Pravega (stream storage) and Apache Flink (computation on streams) that offers an unprecedented way of handling “everything as a stream” that includes unbounded streaming storage and unified batch and streaming abstraction and dynamically accommodates workload variations in a novel way. Read more.
2:40pm–3:20pm Thursday, 03/08/2018
Sponsored
Location: 230 B
Andreas Pfadler (TalkingData)
Andreas Pfadler offers an overview of current technological trends for on-device deep learning and edge computing. Along the way, Andreas explores major players and platforms and computational challenges and solutions. Andreas concludes with a discussion of TalkingData's vision for the future of mobile deep learning. Read more.
2:40pm–3:20pm Thursday, 03/08/2018
Tomer Shiran (Dremio), Jacques Nadeau (Dremio)
Most organizations manage 5 to 15 copies of their data in multiple systems and formats to support different analytical use cases. Tomer Shiran and Jacques Nadeau introduce a new approach called data reflections, which dramatically reduces the need for data copies, demonstrate an open source implementation built with Apache Calcite, and explore two production case studies. Read more.
2:40pm–3:20pm Thursday, 03/08/2018
Secondary topics: Expo Hall
Chris Fregly (PipelineAI)
Chris Fregly demonstrates how to extend existing Spark-based data pipelines to include TensorFlow model training and deployment and offers an overview of TensorFlow’s TFRecord format, including libraries for converting to and from other popular file formats such as Parquet, CSV, JSON, and Avro stored in HDFS and S3. Read more.

3:20pm

3:20pm–4:20pm Thursday, 03/08/2018
Location: Hall 1, 2, 3
Break (1h)

4:20pm

4:20pm–5:00pm Thursday, 03/08/2018
Data science and machine learning
Location: LL20 A Level: Intermediate
Keno Fischer (Julia Computing)
Julia is rapidly becoming a popular language at the forefront of scientific discovery. Keno Fischer explores one of the most ambitious use cases for Julia: using machine learning to derive a catalog of astronomical objects from multiterabyte astronomical image datasets. This work was a collaboration between MIT, UC Berkeley, LBNL, and Julia Computing. Read more.
4:20pm–5:00pm Thursday, 03/08/2018
Location: LL20 C
TBC
4:20pm–5:00pm Thursday, 03/08/2018
Goodman Gu (Atlassian)
Machine learning is a pivotal technology. However, bringing an ML application to life often requires overcoming bottlenecks not just in the model code but in operationalizing the end-to-end system itself. Goodman Gu shares a case study from a leading SaaS company that quickly and easily built, trained, optimized, and deployed an XGBoost churn prediction ML app at scale with Amazon SageMaker. Read more.
4:20pm–5:00pm Thursday, 03/08/2018
Jeff Dean (Google)
The Google Brain team conducts research on difficult problems in artificial intelligence and builds large-scale computer systems for machine learning research, both of which have been applied to dozens of Google products. Jeff Dean highlights some of Google Brain's projects with an eye toward how they can be used to solve challenging problems. Read more.
4:20pm–5:00pm Thursday, 03/08/2018
Kimoon Kim (Pepperdata), Ilan Filonenko (Bloomberg LP)
There is growing interest in running Spark natively on Kubernetes, and Spark data is often stored in HDFS. Kimoon Kim and Ilan Filonenko explain how to make Spark on Kubernetes work seamlessly with HDFS by addressing challenges such as HDFS data locality and secure HDFS support. Read more.
4:20pm–5:00pm Thursday, 03/08/2018
Shenghu Yang (Lyft)
Lyft’s business has grown over 100x in the past four years. Shenghu Yang explains how Lyft’s data pipeline has evolved over the years to serve its ever-growing analytics use cases, migrating from the world's largest AWS Redshift clusters to Apache Hive and Presto to overcome hard limits on scalability and concurrency. Read more.
4:20pm–5:00pm Thursday, 03/08/2018
Data-driven business management
Location: 210 A/E Level: Non-technical
Jesse Anderson (Big Data Institute)
There's been an explosion of new architectures, but is this because engineers love new things or is there a good business reason for these changes? Jesse Anderson explores new architectures and the actual business problems they solve. You may find out that your team would be far more productive if you moved to these architectures. Read more.
4:20pm–5:00pm Thursday, 03/08/2018
Felix Gorodishter (GoDaddy)
GoDaddy ingests and analyzes over 100,000 data points per second. Felix Gorodishter discusses the company's big data journey from ingest to automation, how it is evolving its systems to scale to over 10 TB of new data per day, and how it uses tools like anomaly detection to produce valuable insights, such as the worth of a reminder email. Read more.
4:20pm–5:00pm Thursday, 03/08/2018
Data-driven business management, Strata Business Summit
Location: 210 D/H Level: Beginner
Marcin Pilarczyk (Ryanair)
Managing fuel at a company flying 120 million passengers yearly is not a trivial task. Marcin Pilarczyk explores the main aspects of fuel management at a modern airline and offers an overview of machine learning methods supporting long-term planning and daily decisions. Read more.
4:20pm–5:00pm Thursday, 03/08/2018
Matteo Merli (Streamlio)
Traditionally, messaging systems have offered at-least-once delivery semantics, leaving the task of implementing idempotent processing to application developers. Matteo Merli explains how to add effectively-once semantics to Apache Pulsar using a message deduplication layer that can ensure those stricter semantics with guaranteed accuracy and no performance penalty. Read more.
4:20pm–5:00pm Thursday, 03/08/2018
Tomer Kaftan (University of Washington)
Tomer Kaftan offers an overview of Cuttlefish, a lightweight framework prototyped in Apache Spark that helps developers adaptively improve the performance of their data processing applications by inserting a few library calls into their code. These calls construct tuning primitives that use reinforcement learning to adaptively modify execution as they observe application performance over time. Read more.