Presented By
O’Reilly + Cloudera

Make Data Work

March 25-28, 2019
San Francisco, CA

Schedule

Monday, 03/25/2019

8:30am

8:30am–9:00am Monday, 03/25/2019

Location: 2nd floor lobby

Early morning coffee (30m)

9:00am

Big data for managers

9:00am–5:00pm Monday, 03/25/2019

Training

Strata Business Summit
Location: 2010

Secondary topics: AI and machine learning in the enterprise

Michael Li (The Data Incubator), Rich Ott (The Pragmatic Institute)

Average rating:

(4.50, 4 ratings)

Michael Li and Rich Ott offer a nontechnical overview of AI and data science. Learn common techniques, how to apply them in your organization, and common pitfalls to avoid. You’ll pick up the language and develop a framework to be able to effectively engage with technical experts and utilize their input and analysis for your business’s strategic priorities and decision making. Read more.

Machine learning from scratch in TensorFlow

9:00am–5:00pm Monday, 03/25/2019

Training

Data Science, Machine Learning & AI
Location: 2014

Secondary topics: Deep Learning

Robert Schroll (The Data Incubator)

Average rating:

(4.50, 2 ratings)

The TensorFlow library provides for the use of computational graphs, with automatic parallelization across resources. This architecture is ideal for implementing neural networks. Robert Schroll offers an overview of TensorFlow's capabilities in Python, demonstrating how to build machine learning algorithms piece by piece and how to use TensorFlow's Keras API with several hands-on applications. Read more.

Hands-on data science with Python

9:00am–5:00pm Monday, 03/25/2019

Training

Data Science, Machine Learning & AI
Location: 2016

Don Fox (The Data Incubator)

Average rating:

(4.75, 12 ratings)

Don Fox walks you through developing a machine learning pipeline, from prototyping to production. You'll learn about data cleaning, feature engineering, model building and evaluation, and deployment and then extend these models into two applications from real-world datasets. All work will be done in Python. Read more.

Building a serverless big data application on AWS

9:00am–5:00pm Monday, 03/25/2019

Training

Data Engineering & Architecture
Location: 2018

Secondary topics: AI and Data technologies in the cloud, Storage

Jorge Lopez (Amazon Web Services), Roy Hasson (Amazon Web Services), Rajeev Chakrabarti (Amazon Web Services), Jesse Gebhardt (Amazon Web Services), Gautam Srinivasan (Amazon Web Services), Anthony Nguyen (Amazon Web Services)

Average rating:

(4.50, 4 ratings)

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join in to learn how to incorporate serverless concepts into your big data architectures, looking at design patterns to ingest, store, and analyze your data. You'll then build a big data application using AWS technologies such as S3, Athena, Kinesis, and more. Read more.

Expand your data science and machine learning skills with Python, R, SQL, Spark, and TensorFlow

9:00am–5:00pm Monday, 03/25/2019

Training

Data Science, Machine Learning & AI
Location: 2020

Secondary topics: Deep Learning

Ian Cook (Cloudera)

Average rating:

(4.00, 1 rating)

Advancing your career in data science requires learning new languages and frameworks—but learners face an overwhelming array of choices, each with different syntaxes, conventions, and terminology. Ian Cook simplifies the learning process by elucidating the abstractions common to these systems. Through hands-on exercises, you'll overcome obstacles to getting started using new tools. Read more.

Professional Kafka development

9:00am–5:00pm Monday, 03/25/2019

Training

Data Engineering & Architecture
Location: 3016

Secondary topics: Streaming, realtime analytics, and IoT

Jesse Anderson (Big Data Institute)

Average rating:

(3.00, 1 rating)

Jesse Anderson leads a deep dive into Apache Kafka. You'll learn how Kafka works and how to create real-time systems with it. You'll also discover how to create consumers and publishers in Kafka and how to use Kafka Streams, Kafka Connect, and KSQL as you explore the Kafka ecosystem. Read more.

Forecasting financial time series with deep learning on Azure

9:00am–5:00pm Monday, 03/25/2019

Training

Data Science, Machine Learning & AI
Location: 3018

Secondary topics: Deep Learning, Financial Services, Temporal data and time-series analytics

Francesca Lazzeri (Microsoft), Jen Ren (Microsoft)

Francesca Lazzeri and Jen Ren walk you through the core steps for using Azure Machine Learning services to train your machine learning models both locally and on remote compute resources. Read more.

10:30am

10:30am–11:00am Monday, 03/25/2019

Location: 2nd floor lobby

Morning break (30m)

12:30pm

12:30pm–1:30pm Monday, 03/25/2019

Location: 2nd floor lobby

Lunch (1h)

3:00pm

3:00pm–3:30pm Monday, 03/25/2019

Location: 2nd floor lobby

Afternoon break (30m)

Tuesday, 03/26/2019

7:30am

7:30am–9:00am Tuesday, 03/26/2019

Location: 2nd floor lobby

Early morning coffee (1h 30m)

9:00am

AI privacy and ethical compliance toolkit

9:00am–12:30pm Tuesday, 03/26/2019

Tutorial

Data Science, Machine Learning & AI
Location: 2001

Secondary topics: Ethics, Security and Privacy

Iman Saleh (Intel), Cory Ilo (Intel), Cindy Tseng (Intel)

Average rating:

(5.00, 3 ratings)

From healthcare to smart home to autonomous vehicles, new applications of autonomous systems are raising ethical concerns about a host of issues, including bias, transparency, and privacy. Iman Saleh, Cory Ilo, and Cindy Tseng demonstrate tools and capabilities that can help data scientists address these concerns and bridge the gap between ethicists, regulators, and machine learning practitioners. Read more.

Recurrent neural networks without a PhD

9:00am–12:30pm Tuesday, 03/26/2019

Tutorial

Data Science, Machine Learning & AI
Location: 2002

Secondary topics: Deep Learning, Temporal data and time-series analytics

Martin Gorner (Google)

Average rating:

(4.50, 4 ratings)

Martin Gorner leads a hands-on introduction to recurrent neural networks and TensorFlow. Join in to discover what makes RNNs so powerful for time series analysis. Read more.

Managing data science in the enterprise

9:00am–12:30pm Tuesday, 03/26/2019

Tutorial

Executive Briefing and best practices, Strata Business Summit
Location: 2003

Secondary topics: AI and machine learning in the enterprise

Joshua Poduska (Domino Data Lab), Kimberly Shenk (NakedPoppy), Mac Steele (Domino)

Average rating:

(4.60, 15 ratings)

The honeymoon era of data science is ending, and accountability is coming. Successful data science leaders must deliver measurable impact on an increasing share of an enterprise's KPIs. Joshua Poduska, Kimberly Shenk, and Mac Steele explain how leading organizations take a holistic approach to people, process, and technology to build a sustainable competitive advantage. Read more.

Introduction to Flink via Flink SQL

9:00am–12:30pm Tuesday, 03/26/2019

Tutorial

Data Engineering & Architecture, Streaming and IoT
Location: 2004

Secondary topics: Streaming, realtime analytics, and IoT

Fabian Hueske (Ververica)

Average rating:

(5.00, 1 rating)

Fabian Hueske offers an overview of Apache Flink via the SQL interface, covering stream processing and Flink's various modes of use. Then you'll use Flink to run SQL queries on data streams and contrast this with the Flink DataStream API. Read more.

Architecting a data platform for enterprise use

9:00am–12:30pm Tuesday, 03/26/2019

Tutorial

Data Engineering & Architecture
Location: 2005

Mark Madsen (Teradata), Todd Walter (Archimedata)

Average rating:

(4.21, 28 ratings)

Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that isn't subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure. Read more.

Foundations for successful data projects

9:00am–12:30pm Tuesday, 03/26/2019

Tutorial

Data Engineering & Architecture
Location: 2006

Secondary topics: AI and machine learning in the enterprise

Jonathan Seidman (Cloudera), Ted Malaska (Capital One)

Average rating:

(4.00, 6 ratings)

The enterprise data management space has changed dramatically in recent years, and this has led to new challenges for organizations in creating successful data practices. Jonathan Seidman and Ted Malaska share guidance and best practices from planning to implementation based on years of experience working with companies to deliver successful data projects. Read more.

Hands-on machine learning with Kafka-based streaming pipelines

9:00am–12:30pm Tuesday, 03/26/2019

Tutorial

Data Engineering & Architecture, Streaming and IoT
Location: 2007

Secondary topics: Data Integration and Data Pipelines, Data preparation, data governance, and data lineage, Model lifecycle management

Boris Lublinsky (Lightbend), Dean Wampler (Anyscale)

Average rating:

(3.85, 13 ratings)

Boris Lublinsky and Dean Wampler walk you through using ML in streaming data pipelines and doing periodic model retraining and low-latency scoring in live streams. You'll explore using Kafka as a data backplane, the pros and cons of microservices versus systems like Spark and Flink, tips for TensorFlow and SparkML, performance considerations, model metadata tracking, and other techniques. Read more.

Hands-on with Cloudera SDX: Setting up your own shared data experience

9:00am–12:30pm Tuesday, 03/26/2019

Tutorial

Data Engineering & Architecture
Location: 2008

Secondary topics: Data preparation, data governance, and data lineage, Storage

Santosh Kumar (Cloudera), Andre Araujo (Cloudera), Wim Stoop (Cloudera)

Average rating:

(5.00, 1 rating)

Cloudera SDX provides unified metadata control, simplifies administration, and maintains context and data lineage across storage services, workloads, and operating environments. Santosh Kumar, Andre Araujo, and Wim Stoop offer an overview of SDX before diving deep into the moving parts and guiding you through setting it up. You'll leave with the skills to set up your own SDX. Read more.

Natural language understanding at scale with Spark NLP

9:00am–12:30pm Tuesday, 03/26/2019

Tutorial

Data Science, Machine Learning & AI
Location: 2009

Secondary topics: Deep Learning, Text and Language processing and analysis

David Talby (Pacific AI), Alex Thomas (John Snow Labs), Claudiu Branzan (Accenture)

Average rating:

(4.75, 8 ratings)

David Talby, Alex Thomas, and Claudiu Branzan lead a hands-on introduction to scalable NLP using the highly performant, highly scalable open source Spark NLP library. You’ll spend about half your time coding as you work through four sections, each with an end-to-end working codebase that you can change and improve. Read more.

9:00am–12:30pm Tuesday, 03/26/2019

Location: 2011

Tutorial TBC

Data Case Studies

9:00am–5:00pm Tuesday, 03/26/2019

Location: 2022

Alex Kudriashova (Astro Digital), Jonathan Francis (Starbucks), JoLynn Lavin (General Mills), Robin Way (Corios), June Andrews (GE), Kyungtaak Noh (SK Telecom), Taposh DuttaRoy (Kaiser Permanente), Sabrina Dahlgren (Kaiser Permanente), Craig Rowley (Columbia Sportswear), Ambal Balakrishnan (IBM), Benjamin Glicksberg (UCSF), Patrick Lucey (Stats Perform), Rhonda Textor (True Fit)

Hear practical insights from household brands and global companies: the challenges they tackled, approaches they took, and the benefits—and drawbacks—of their solutions. Read more.

Strata Data Ethics Summit

9:00am–5:00pm Tuesday, 03/26/2019

Location: 2024

Susan Etlinger (Altimeter Group), Alistair Croll (Solve For Interesting), Jake Metcalf (Ethical Resolve), Emanuel Moss (Data & Society), Bradley Voytek (UC San Diego), Jonathan Foster (Microsoft), Yiannis Kanellopoulos (Code4Thought), Kathy Baxter (Salesforce), Bulbul Gupta (Socos Labs), Brian Rieger (Labelbox), Carole Piovesan (INQ Data Law), Jana Eggers (Nara Logics), Irina Raicu (Santa Clara University), Brian Green (Santa Clara University), Tim O'Reilly (O'Reilly Media), Rachel Thomas (fast.ai), Rumman Chowdhury (Accenture), Stuart Buck (Arnold Ventures)

In this day-long event, academics, practitioners, and innovators dive deep into the thorny issues of data, privacy, bias, and morality that are at the forefront of today's headlines. Read more.

10:30am

10:30am–11:00am Tuesday, 03/26/2019

Location: 2nd floor lobby

Morning break (30m)

12:30pm

12:30pm–1:30pm Tuesday, 03/26/2019

Location: 2nd and 3rd floor lobbies

Lunch (1h)

1:30pm

Practical techniques for interpretable machine learning

1:30pm–5:00pm Tuesday, 03/26/2019

Tutorial

Data Science, Machine Learning & AI
Location: 2001

Secondary topics: Ethics

Patrick Hall (bnh.ai | H2O.ai)

Average rating:

(4.00, 9 ratings)

If machine learning can lead to financial gains for your organization, why isn’t everyone doing it? One reason is that training machine learning systems with transparent inner workings and auditable predictions is difficult. Patrick Hall details the good, bad, and downright ugly lessons learned from his years of experience implementing solutions for interpretable machine learning. Read more.

The hitchhiker's guide to deep learning-based recommenders in production

1:30pm–5:00pm Tuesday, 03/26/2019

Tutorial

Data Science, Machine Learning & AI
Location: 2002

Secondary topics: Deep Learning, Media, Marketing, Advertising, Model lifecycle management

Abhishek Kumar (Publicis Sapient), Pramod Singh (Walmart Labs)

Average rating:

(4.17, 6 ratings)

Abhishek Kumar and Pramod Singh walk you through deep learning-based recommender and personalization systems they've built for clients. Join in to learn how to use TensorFlow Serving and MLflow for end-to-end productionalization, including model serving, Dockerization, reproducibility, and experimentation, and Kubernetes for deployment and orchestration of ML-based microarchitectures. Read more.

Successfully deploy machine learning while managing its risks

1:30pm–5:00pm Tuesday, 03/26/2019

Tutorial

Executive Briefing and best practices, Strata Business Summit
Location: 2003

Secondary topics: AI and machine learning in the enterprise, Ethics, Security and Privacy

Andrew Burt (bnh.ai), Steven Touw (Immuta), Richard Geering (Immuta), Joseph Regensburger (Immuta), Alfred Rossi (Immuta)

Average rating:

(5.00, 2 ratings)

As ML becomes increasingly important for businesses and data science teams alike, managing its risks is quickly becoming one of the biggest challenges to the technology’s widespread adoption. Join Andrew Burt, Steven Touw, Richard Geering, Joseph Regensburger, and Alfred Rossi for a hands-on overview of how to train, validate, and audit machine learning (ML) models in practice. Read more.

Learning Presto: SQL on anything

1:30pm–5:00pm Tuesday, 03/26/2019

Tutorial

Data Engineering & Architecture
Location: 2004

Secondary topics: Streaming, realtime analytics, and IoT

Matt Fuller (Starburst)

Average rating:

(3.57, 7 ratings)

Used by Facebook, Netflix, Airbnb, LinkedIn, Twitter, Uber, and others, Presto has become the ubiquitous open source software for SQL on anything. Presto was built from the ground up for fast interactive SQL analytics against disparate data sources ranging in size from GBs to PBs. Join Matt Fuller to learn how to use Presto and explore use cases and best practices you can implement today. Read more.

Architecture and algorithms for end-to-end streaming data processing

1:30pm–5:00pm Tuesday, 03/26/2019

Tutorial

Data Engineering & Architecture
Location: 2005

Secondary topics: AI and Data technologies in the cloud, Data Integration and Data Pipelines, Storage, Streaming, realtime analytics, and IoT

Arun Kejariwal (Independent), Karthik Ramasamy (Streamlio)

Average rating:

(2.67, 12 ratings)

Many industry segments have been grappling with fast data (high-volume, high-velocity data). Arun Kejariwal and Karthik Ramasamy walk you through the state-of-the-art systems for each stage of an end-to-end data processing pipeline—messaging, compute, and storage—for real-time data and algorithms to extract insights (e.g., heavy hitters and quantiles) from data streams. Read more.

Streamlining a machine learning project team

1:30pm–5:00pm Tuesday, 03/26/2019

Tutorial

Data Engineering & Architecture
Location: 2006

Secondary topics: AI and machine learning in the enterprise

Sourav Dey (Manifold), Alex Ng (Manifold)

Average rating:

(4.25, 4 ratings)

Many teams are still run as if data science is mainly about experimentation, but those days are over. Now it must offer turnkey solutions to take models into production. Sourav Dey and Alex Ng explain how to streamline an ML project and help your engineers work as an integrated part of your production teams, using a Lean AI process and the Orbyter package for Docker-first data science. Read more.

Cross-cloud model training and serving with Kubeflow

1:30pm–5:00pm Tuesday, 03/26/2019

Tutorial

Data Engineering & Architecture
Location: 2007

Secondary topics: AI and Data technologies in the cloud, Model lifecycle management

Holden Karau (Independent), Francesca Lazzeri (Microsoft), Trevor Grant (IBM)

Average rating:

(3.00, 2 ratings)

Holden Karau, Francesca Lazzeri, and Trevor Grant offer an overview of Kubeflow and walk you through using it to train and serve models across different cloud environments (and on-premises). You'll use a script to do the initial setup work, so you can jump (almost) straight into training a model on one cloud and then look at how to set up serving in another cluster/cloud. Read more.

Running multidisciplinary big data workloads in the cloud

1:30pm–5:00pm Tuesday, 03/26/2019

Tutorial

Data Engineering & Architecture
Location: 2008

Secondary topics: AI and Data technologies in the cloud

Jason Wang (Cloudera), Brandon Freeman (Cloudera), Michael Kohs (Cloudera), Akihiro Ishikawa (Cloudera), Toby Ferguson (Cloudera)

Average rating:

(3.20, 5 ratings)

There are many challenges with moving multidisciplinary big data workloads to the cloud and running them. Jason Wang, Brandon Freeman, Michael Kohs, Akihiro Nishikawa, and Toby Ferguson explore cloud architecture and its challenges and walk you through using Cloudera Altus to build data warehousing and data engineering clusters and run workloads that share metadata between them using Cloudera SDX. Read more.

Analytics Zoo: Distributed TensorFlow and Keras on Apache Spark

1:30pm–5:00pm Tuesday, 03/26/2019

Tutorial

Data Science, Machine Learning & AI
Location: 2009

Secondary topics: Deep Learning, Temporal data and time-series analytics

Jason Dai (Intel), Yuhao Yang (Intel), Jiao (Jennie) Wang (Intel), Guoqiong Song (Intel)

Average rating:

(3.00, 6 ratings)

Jason Dai, Yuhao Yang, Jennie Wang, and Guoqiong Song explain how to build and productionize deep learning applications for big data with Analytics Zoo—a unified analytics and AI platform that seamlessly unites Spark, TensorFlow, Keras, and BigDL programs into an integrated pipeline—using real-world use cases from JD.com, MLSListings, the World Bank, Baosight, and Midea/KUKA. Read more.

Using the full spectrum of data science to drive business decisions

1:30pm–5:00pm Tuesday, 03/26/2019

Tutorial

Data Science, Machine Learning & AI
Location: 2011

Secondary topics: AI and machine learning in the enterprise

Chi-Yi Kuan (LinkedIn), Tiger Zhang (LinkedIn), Xiaojing Dong (LinkedIn), Burcu Baran (LinkedIn), Emily Huang (LinkedIn)

Wednesday, 03/27/2019

7:30am

7:30am–8:45am Wednesday, 03/27/2019

Location: 3rd floor lobby

Early morning coffee (1h 15m)

9:30am

Sustaining machine learning in the enterprise

9:30am–9:40am Wednesday, 03/27/2019

Keynote

Location: Ballroom

Secondary topics: AI and machine learning in the enterprise

Ben Lorica (O'Reilly)

Average rating:

(4.21, 29 ratings)

Keynote with Ben Lorica. Read more.

9:40am

Cyberconflict: A new era of war, sabotage, and fear

9:40am–10:00am Wednesday, 03/27/2019

Keynote

Location: Ballroom

Secondary topics: Security and Privacy

David Sanger (The New York Times)

Average rating:

(4.32, 50 ratings)

David Sanger explains how the rise of cyberweapons has transformed geopolitics like nothing since the invention of the atomic bomb. From crippling infrastructure to sowing discord and doubt, cyber is now the weapon of choice for democracies, dictators, and terrorists. Read more.

10:00am

AI and cryptography: Challenges and opportunities

10:00am–10:20am Wednesday, 03/27/2019

Keynote

Location: Ballroom

Secondary topics: Security and Privacy

Shafi Goldwasser (UC Berkeley | MIT | Weizmann Institute of Science | Duality)

Average rating:

(3.41, 22 ratings)

Keynote with Shafi Goldwasser. Read more.

10:30am

10:30am–11:00am Wednesday, 03/27/2019

Location: Expo Hall (Exhibit Hall - Level 1)

Morning break sponsored by Dataiku (30m)

11:00am

Scaling data lineage at Netflix to improve data infrastructure reliability and efficiency

11:00am–11:40am Wednesday, 03/27/2019

Session

Data Engineering & Architecture
Location: 2001

Secondary topics: Data Integration and Data Pipelines, Data preparation, data governance, and data lineage, Media, Marketing, Advertising

Jitender Aswani (Netflix), Di Lin (Netflix), Girish Lingappa (Netflix)

Average rating:

(3.40, 15 ratings)

Hundreds of thousands of ETL pipelines ingest over a trillion events daily to populate millions of data tables downstream at Netflix. Jitender Aswani, Girish Lingappa, and Di Lin discuss Netflix’s internal data lineage service, which was essential for enhancing the platform’s reliability, increasing trust in data, and improving data infrastructure efficiency. Read more.

Building the AI engine for retail in the new era

11:00am–11:40am Wednesday, 03/27/2019

Session

Data Engineering & Architecture
Location: 2002

Secondary topics: AI and machine learning in the enterprise, Automation in data science and big data, Data Platforms, Retail and e-commerce, Storage, Temporal data and time-series analytics

Jian Chang (Alibaba Group), Sanjian Chen (Alibaba Group)

Average rating:

(4.50, 4 ratings)

Jian Chang and Sanjian Chen outline the design of the AI engine on Alibaba's TSDB service, which enables fast and complex analytics of large-scale retail data. They then share a successful case study of the Fresh Hema Supermarket, a major “new retail” platform operated by Alibaba Group, highlighting solutions to the major technical challenges in data cleaning, storage, and processing. Read more.

How to compete in the AI arms race (sponsored by Oracle Cloud Infrastructure)

11:00am–11:40am Wednesday, 03/27/2019

Session

Sponsored
Location: 2003

Ian Swanson (Oracle)

Average rating:

(3.00, 2 ratings)

Being an AI-driven enterprise earlier than a competitor is an opportunity within your reach. Join in to find out how, as Ian Swanson dives into problem domains, platform differentiators, ease of use, automation, and scale and shares best practices on quick starts with the right infrastructure choices. Read more.

Cost-effective Presto on AWS with Spot nodes

11:00am–11:40am Wednesday, 03/27/2019

Session

Data Engineering & Architecture
Location: 2004

Secondary topics: AI and Data technologies in the cloud

Shubham Tagra (Qubole)

Average rating:

(3.50, 8 ratings)

Did you know you can run Presto on AWS at a tenth of the cost with AWS Spot nodes and just a few architectural enhancements to Presto? Shubham Tagra explores the gaps in Presto architecture, explains how to use Spot nodes, covers enhancements, and showcases the improvements in terms of reliability and TCO achieved through them. Read more.

The death of coding: How AI redefines our relationship with computers (sponsored by IBM)

11:00am–11:40am Wednesday, 03/27/2019

Session

Data Science, Machine Learning & AI, Sponsored
Location: 2005

Sam Lightstone (IBM)

Average rating:

(4.50, 4 ratings)

Sam Lightstone discusses how AI is fundamentally changing computer science and the practice of coding. Join in to discover what machine learning means today and explore recent advances in hardware and software and breakthrough innovations. Read more.

Live Aggregators: A scalable, cost-effective, and reliable way of aggregating billions of messages in real time

11:00am–11:40am Wednesday, 03/27/2019

Session

Data Engineering & Architecture
Location: 2006

Secondary topics: Data Integration and Data Pipelines, Storage, Streaming, realtime analytics, and IoT

Osman Sarood (Mist Systems), Chunky Gupta (Mist Systems)

Average rating:

(4.67, 3 ratings)

Osman Sarood and Chunky Gupta discuss Mist’s real-time data pipeline, focusing on Live Aggregators (LA)—a highly reliable and scalable in-house real-time aggregation system that can autoscale for sudden changes in load. LA is 80% cheaper than competing streaming solutions due to running over AWS Spot Instances and having 70% CPU utilization. Read more.

Building a data science team at Levi’s (sponsored by Dataiku)

11:00am–11:40am Wednesday, 03/27/2019

Session

Sponsored
Location: 2014

Secondary topics: Jupyter

Alan Chin (IBM), Luciano Resende (IBM)

Average rating:

(4.75, 4 ratings)

Alan Chin and Luciano Resende explain how to introduce Jupyter Enterprise Gateway into new and existing notebook environments to enable a "bring your own notebook" model while simultaneously optimizing resources consumed by the notebook kernels running across managed clusters within the enterprise. Read more.

Deep learning applications for non-engineers

11:00am–11:40am Wednesday, 03/27/2019

Session

Data Science, Machine Learning & AI
Location: 2016

Secondary topics: AI and Data technologies in the cloud, Deep Learning, Open Data, Data Generation and Data Networks

Jeremy Howard (fast.ai | USF | doc.ai and platform.ai)

Average rating:

(4.80, 5 ratings)

Jeremy Howard describes how to leverage the latest research from the deep learning and HCI communities to train neural networks from scratch—without code or preexisting labels. He then shares case studies in fashion, retail and ecommerce, travel, and agriculture where these approaches have been used. Read more.

Scaling visualization for big data and analytics in the cloud

11:00am–11:40am Wednesday, 03/27/2019

Session

Strata Business Summit, Visualization and UX
Location: 2018

Secondary topics: AI and Data technologies in the cloud, Visualization, Design, and UX

Jaipaul Agonus (FINRA), Daniel Monteiro do Carmo Rosa (FINRA)

Average rating:

(3.40, 5 ratings)

Jaipaul Agonus and Daniel Monteiro do Carmo Rosa detail big data analytics and visualization practices and tools used by FINRA to support machine learning and other surveillance activities that the Market Regulation Department conducts in the AWS cloud. Read more.

Executive Briefing: From the edge to AI—Taking control of your data for fun and profit

11:00am–11:40am Wednesday, 03/27/2019

Session

Executive Briefing and best practices, Strata Business Summit
Location: 2020

Secondary topics: AI and Data technologies in the cloud, AI and machine learning in the enterprise, Security and Privacy

Mike Olson (Cloudera)

Average rating:

(3.80, 5 ratings)

It's easier than ever to collect data, but managing it securely in compliance with regulations and legal constraints is harder. Mike Olson discusses the risks and the issues that matter most and explains how an enterprise data cloud that embraces your data center and the public cloud in combination can address them, delivering real business results for your organization. Read more.

Uncovering the next generation of data architecture for insights at the speed of thought (sponsored by Actian)

11:00am–11:40am Wednesday, 03/27/2019

Session

Sponsored
Location: 2022

Raghu Chakravarthi (Actian)

Average rating:

(4.33, 3 ratings)

Raghu Chakravarthi explores key considerations when building an Agile data warehouse and outlines a reference architecture for hybrid data. Read more.

Recommendation engines and mobile gaming

11:00am–11:40am Wednesday, 03/27/2019

Session

Case studies, Strata Business Summit
Location: 2024

Secondary topics: Media, Marketing, Advertising

Bysshe Easton (KIXEYE), Thomas Dobbs (KIXEYE)

Average rating:

(4.50, 2 ratings)

As a fully closed model economy, games offer a unique opportunity to use analytics to create unique purchase opportunities for customers. Bysshe Easton and Thomas Dobbs explain how KIXEYE uses machine learning to create personalized offer recommendations for its customers, resulting in significantly increased monetization and retention. Read more.

Machine learning on encrypted data: Challenges and opportunities

11:00am–11:40am Wednesday, 03/27/2019

Session

Data Science, Machine Learning & AI, Expo Hall
Location: Expo Hall

Secondary topics: AI and Data technologies in the cloud, Security and Privacy

Alon Kaufman (Duality), Vinod Vaikuntanathan (MIT and Duality Technologies)

Average rating:

(3.75, 4 ratings)

Alon Kaufman and Vinod Vaikuntanathan discuss the challenges and opportunities of machine learning on encrypted data and describe the state of the art in this space. Read more.

11:50am

How Intuit reduced time to reliable insights for data pipelines

11:50am–12:30pm Wednesday, 03/27/2019

Session

Data Engineering & Architecture
Location: 2001

Secondary topics: Data Integration and Data Pipelines, Financial Services

Sandeep U (Intuit)

Average rating:

(4.57, 7 ratings)

How efficient is your data platform? The single metric Intuit uses is time to reliable insights: the total of time spent to ingest, transform, catalog, analyze, and publish. Sandeep Uttamchandani shares three design patterns/frameworks Intuit implemented to deal with three challenges to determining time to reliable insights: time to discover, time to catalog, and time to debug for data quality. Read more.

The journey toward a self-service data platform at Netflix

11:50am–12:30pm Wednesday, 03/27/2019

Session

Data Engineering & Architecture
Location: 2002

Secondary topics: Data Platforms, Media, Marketing, Advertising

Kurt Brown (Netflix)

Average rating:

(4.22, 9 ratings)

The Netflix data platform is a massive-scale, cloud-only suite of tools and technologies. It includes big data tech (Spark and Flink), enabling services (federated metadata management), and machine learning support. But with power comes complexity. Kurt Brown explains how Netflix is working toward an easier, "self-service" data platform without sacrificing any enabling capabilities. Read more.

From data to discovery: The power of choice and control (sponsored by SAS)

11:50am–12:30pm Wednesday, 03/27/2019

Session

Sponsored
Location: 2003

Sarah Gates (SAS)

Average rating:

(3.50, 2 ratings)

SAS empowers you with choice and control, helping you uncover insights from any data for better, faster decisions regardless of language. Sarah Gates shares methods for accelerating the analytics lifecycle; improving data preparation, quality, and governance; automating and speeding up time-consuming tasks; and quickly creating, selecting, and deploying models, be it one or thousands. Read more.

Accelerating analytical antelopes: Integrating Apache Kudu's RPC into Apache Impala

11:50am–12:30pm Wednesday, 03/27/2019

Session

Data Engineering & Architecture
Location: 2004

Secondary topics: Streaming, realtime analytics, and IoT

Lars Volker (Cloudera), Michael Ho (Cloudera)

Average rating:

(4.50, 6 ratings)

In recent years, Apache Impala has been deployed to clusters that are large enough to hit architectural limitations in the stack. Lars Volker and Michael Ho cover the efforts to address the scalability limitations in the now legacy Thrift RPC framework by using Apache Kudu's RPC, which was built from the ground up to support asynchronous communication, multiplexed connections, TLS, and Kerberos. Read more.

Serverless analytics in AWS Glue (sponsored by Amazon Web Services)

11:50am–12:30pm Wednesday, 03/27/2019

Session

Sponsored
Location: 2005

Mehul Shah (Amazon Web Services)

Average rating:

(5.00, 2 ratings)

Mehul Shah offers an overview of serverless computing and details AWS Glue's serverless analytics features for data science, data discovery, data cleaning and transformation, and data lake management. Read more.

Enabling insights and analytics with data streaming architectures and pipelines using Kafka and Hadoop

11:50am–12:30pm Wednesday, 03/27/2019

Session

Data Engineering & Architecture
Location: 2006

Secondary topics: Data Integration and Data Pipelines, Data Platforms, Health and Medicine, Streaming, realtime analytics, and IoT

Mohammad Quraishi (Cigna)

Average rating:

(4.60, 5 ratings)

In a large global health services company, streaming data for processing and sharing comes with its own challenges. Data science and analytics platforms need data fast, from relevant sources, to act on this data quickly and share the insights with consumers with the same speed and urgency. Join Mohammad Quraishi to learn why streaming data architectures are a necessity—Kafka and Hadoop are key. Read more.

Transforming AI, ML, and BI on big data at Verizon (sponsored by Kyvos Insights)

11:50am–12:30pm Wednesday, 03/27/2019

Session

Sponsored
Location: 2014

Secondary topics: Jupyter

Omoju Miller (GitHub)

Average rating:

(3.50, 10 ratings)

GitHub has a relatively nascent ML group. Its major challenge is to integrate ML product building processes into a mature product engineering org. This means that it's responsible for building end-to-end ML products, from ETL to production. Omoju Miller details the many roles Jupyter notebooks play in the building of ML products. Read more.

Artificial intelligence on human behavior: New insights into customer segmentation

11:50am–12:30pm Wednesday, 03/27/2019

Session

Data Science, Machine Learning & AI
Location: 2016

Secondary topics: AI and machine learning in the enterprise, Deep Learning, Media, Marketing, Advertising, Retail and e-commerce

Melinda Han Williams (Dstillery)

Average rating:

(4.86, 14 ratings)

Customer segmentation based on coarse survey data is a staple of traditional market research. Melinda Han Williams explains how Dstillery uses neural networks to model the digital pathways of 100M consumers and uses the resulting embedding space to cluster customer populations into fine-grained behavioral segments and inform smarter consumer insights—in the process, creating a map of the internet. Read more.

Yay, we are going to deploy an AI/ML-powered app. But wait! Where do I deploy?

11:50am–12:30pm Wednesday, 03/27/2019

Session

Strata Business Summit
Location: 2018

Swatee Singh (American Express)

Average rating:

(4.00, 3 ratings)

Organizations developing artificial intelligence and machine learning (AI/ML)-powered applications face two existential questions: Should they consider a fully or partially hybrid cloud environment for AI/ML deployments, and which public cloud will give them the most features and capabilities? Swatee Singh discusses available options for companies facing these challenges. Read more.

The ethics of analytics

11:50am–12:30pm Wednesday, 03/27/2019

Session

Law and Ethics, Strata Business Summit
Location: 2020

Secondary topics: Ethics

Bill Franks (International Institute For Analytics)

Average rating:

(4.67, 3 ratings)

Concerns are constantly being raised today about what data is appropriate to collect and how (or if) it should be analyzed. There are many ethical, privacy, and legal issues to consider, and no clear standards exist in many cases as to what is fair and what is foul. Bill Franks explores a variety of dilemmas and provides some guidance on how to approach them. Read more.

High-performance data lakes for AI workloads using object storage (sponsored by Minio)

11:50am–12:30pm Wednesday, 03/27/2019

Session

Sponsored
Location: 2022

Scott Mcclellan (PRGX)

Average rating:

(5.00, 1 rating)

Recently, Scott Mcclellan's team—which analyzes over six petabytes of data using Hadoop technology—created a high-performance data lake using object storage for consumption by big data workloads. Scott shares his experience deploying object storage for AI workloads. Read more.

Shortcuts that short-circuit talent pipelines: Data-driven optimization of hiring

11:50am–12:30pm Wednesday, 03/27/2019

Session

Culture and organization, Strata Business Summit
Location: 2024

Secondary topics: AI and machine learning in the enterprise

Maryam Jahanshahi (TapRecruit)

Average rating:

(4.80, 5 ratings)

Hiring teams largely rely on both intuition and experience to scout talent for data science and data engineering roles. Drawing on results from analyzing over 15 million jobs and their outcomes, Maryam Jahanshahi interrogates these “common sense” judgments to determine whether they help or hurt hiring of data scientists and engineers. Read more.

Applying deep learning at Google for recommendations

11:50am–12:30pm Wednesday, 03/27/2019

Session

Data Science, Machine Learning & AI, Expo Hall
Location: Expo Hall

Secondary topics: AI and Data technologies in the cloud, Deep Learning, Media, Marketing, Advertising, Retail and e-commerce

Ron Bodkin (Google)

Average rating:

(4.33, 6 ratings)

Google uses deep learning extensively in new and existing products. Join Ron Bodkin to learn how Google has used deep learning for recommendations at YouTube, in the Play store, and for customers in Google Cloud. You'll explore the role of embeddings, recurrent networks, contextual variables, and wide and deep learning and discover how to do candidate generation and ranking with deep learning. Read more.

12:30pm

Better Together Diversity Networking Lunch (sponsored by Walmart Labs)

12:30pm–2:40pm Wednesday, 03/27/2019

Event

Location: 3016

Average rating:

(5.00, 4 ratings)

If you’d like to make new professional connections and hear ideas for supporting diversity in the tech community, come to the diversity and inclusion networking lunch on Wednesday. Read more.

Wednesday Topic Tables at Lunch

12:30pm–2:40pm Wednesday, 03/27/2019

Event

Location: Expo Hall (Exhibit Hall - Level 1)

Average rating:

(5.00, 1 rating)

Topic Table discussions are a great way to informally network with people in similar industries or interested in the same topics. Read more.

Wednesday Business Summit Lunch

12:30pm–2:40pm Wednesday, 03/27/2019

Event

Location: Expo Hall

Average rating:

(5.00, 1 rating)

Join fellow executives, business leaders, and strategists for a networking lunch on Wednesday for Strata Business Summit attendees and speakers. Read more.

2:40pm

Adaptive ETL to optimize query performance at Lyft

2:40pm–3:20pm Wednesday, 03/27/2019

Session

Data Engineering & Architecture
Location: 2001

Secondary topics: Data Integration and Data Pipelines, Transportation and Logistics

James Taylor (Lyft)

Average rating:

(3.56, 9 ratings)

James Taylor offers an overview of an automated feedback loop at Lyft to adapt ETL based on the aggregate cost of queries run across the cluster. He also discusses future work to enhance the system through the use of materialized views to reduce the number of ad hoc joins and sorting performed by the most expensive queries by transparently rewriting queries when possible. Read more.

Goodbye, data lake: Why continuous analytics yield higher ROI

2:40pm–3:20pm Wednesday, 03/27/2019

Session

Data Engineering & Architecture
Location: 2002

Secondary topics: AI and Data technologies in the cloud, Data Integration and Data Pipelines

Yaron Haviv (iguazio)

Average rating:

(4.00, 2 ratings)

Faced with the need to handle increasing volumes of data, alternative datasets ("alt data"), and AI, many enterprises are working to design or redesign their big data architectures, but traditional batch platforms fail to generate sufficient ROI. Yaron Haviv shares a continuous analytics approach that yields faster answers for the business while remaining simpler and less expensive for IT. Read more.

The new frontier: Marsh’s data voyage into the public cloud (sponsored by Impetus)

2:40pm–3:20pm Wednesday, 03/27/2019

Session

Sponsored
Location: 2003

Stephen Dantu (Marsh)

Average rating:

(4.00, 1 rating)

Stephen Dantu shares insurance broker Marsh’s pioneering journey into the public cloud and explains why this move was necessary to unleash new opportunities and future-proof the company. Read more.

Real-time analytics at Uber: Bring SQL into everything

2:40pm–3:20pm Wednesday, 03/27/2019

Session

Data Engineering & Architecture
Location: 2004

Secondary topics: Data Platforms, Storage, Streaming, realtime analytics, and IoT, Transportation and Logistics

Zhenxiao Luo (Twitter)

Average rating:

(4.09, 11 ratings)

From determining the most convenient rider pickup points to predicting the fastest routes, Uber uses data-driven analytics to create seamless trip experiences. Zhenxiao Luo explains how Uber supports real-time analytics with deep learning on the fly, without any data copying. Read more.

Modernizing AB inBev’s data architecture to improve predictive analytics and forecast (sponsored by Talend)

2:40pm–3:20pm Wednesday, 03/27/2019

Session

Sponsored
Location: 2005

Harinder Singh (AB inBev)

Average rating:

(4.50, 4 ratings)

Harinder Singh explains how, over the course of two years, the world’s largest brewer completely modernized its data architecture and moved it to the cloud. By accelerating data analytics and freeing up the time of its data scientists, AB inBev has been able to better anticipate demand and production, streamline logistics, and develop new beverages that have become best-sellers. Read more.

Cruise Control: Effortless management of Kafka clusters

2:40pm–3:20pm Wednesday, 03/27/2019

Session

Data Engineering & Architecture
Location: 2006

Secondary topics: Streaming, realtime analytics, and IoT

Adem Efe Gencer (LinkedIn)

Average rating:

(3.50, 2 ratings)

Adem Efe Gencer explains how LinkedIn alleviated the management overhead of large-scale Kafka clusters using Cruise Control. Read more.

Strategies for leveraging legacy data for real time, cloud, and cluster (sponsored by Syncsort)

2:40pm–3:20pm Wednesday, 03/27/2019

Session

Sponsored
Location: 2014

Secondary topics: Jupyter

M Pacer (Netflix)

Average rating:

(4.57, 7 ratings)

M Pacer discusses two meanings of "Talking with Jupyter": talking to others with Jupyter notebooks and talking to Jupyter in the language of its standards, formats, and protocols. M describes tools, workflows, and patterns that make both kinds of talking with Jupyter easier while unlocking new ways of interacting with the Jupyter ecosystem. Read more.

Dilated neural networks for time series forecasting

2:40pm–3:20pm Wednesday, 03/27/2019

Session

Data Science, Machine Learning & AI
Location: 2016

Secondary topics: Deep Learning, Temporal data and time-series analytics

Chenhui Hu (Microsoft)

Average rating:

(4.67, 6 ratings)

Dilated neural networks are a class of recently developed neural networks that achieve promising results in time series forecasting. Chenhui Hu discusses representative network architectures of dilated neural networks and demonstrates their advantages in terms of training efficiency and forecast accuracy by applying them to solve sales forecasting and financial time series forecasting problems. Read more.

Understanding the data universe with a data catalog

2:40pm–3:20pm Wednesday, 03/27/2019

Session

Executive Briefing and best practices, Strata Business Summit
Location: 2018

Secondary topics: Data preparation, data governance, and data lineage

John Haddad (Informatica)

Average rating:

(4.60, 5 ratings)

Just like a powerful space telescope that scans the universe, a data catalog scans the data universe to help data scientists and analysts find, collaborate on, and curate data. John Haddad explains how a data catalog can help you find the data you need and trust for analytic and data governance projects. Read more.

The collision between AI and underground infrastructure

2:40pm–3:20pm Wednesday, 03/27/2019

Session

Strata Business Summit
Location: 2020

Greg Quist (SmartCover Systems)

Average rating:

(4.00, 1 rating)

SmartCover Systems has been providing an IoT solution to its customers for 15 years, using techniques honed in defense and remote sensing, gathering more than 200 million hours of sewer data. Greg Quist shares case studies and results from applying the IoT and AI to underground infrastructure. Read more.

Managing globally distributed data for deep learning using TensorFlow on YARN (sponsored by WANdisco)

2:40pm–3:20pm Wednesday, 03/27/2019

Session

Sponsored
Location: 2022

Jagane Sundar (WANdisco)

Average rating:

(4.50, 2 ratings)

Jagane Sundar shares a system for replicating data across geographically distributed data centers and discusses the benefits of consistently replicating data that is used by TensorFlow for training. Read more.

How to make fewer bad decisions

2:40pm–3:20pm Wednesday, 03/27/2019

Session

Culture and organization, Strata Business Summit
Location: 2024

Secondary topics: AI and machine learning in the enterprise

Eric Colson (Stitch Fix), Daragh Sibley (Stitch Fix)

Average rating:

(4.79, 14 ratings)

A/B testing has revealed the fallibility in human intuition that typically drives business decisions. Eric Colson and Daragh Sibley describe some types of systematic errors domain experts commit, explain how cognitive biases arise from heuristic reasoning processes, and share several mechanisms to mitigate these human limitations and improve decision making. Read more.

Natural language understanding in task-oriented conversational AI

2:40pm–3:20pm Wednesday, 03/27/2019

Session

Data Science, Machine Learning & AI, Expo Hall
Location: Expo Hall

Secondary topics: Deep Learning, Text and Language processing and analysis

Sonal Gupta (Facebook)

Average rating:

(4.40, 5 ratings)

Sonal Gupta explores practical systems for building a conversational AI system for task-oriented queries and details a way to do more advanced compositional understanding, which can understand cross-domain queries, using hierarchical representations. Read more.

3:20pm

3:20pm–4:20pm Wednesday, 03/27/2019

Location: Expo Hall (Exhibit Hall - Level 1)

Afternoon break sponsored by IBM (1h)

4:20pm

Managing Uber's data workflows at scale

4:20pm–5:00pm Wednesday, 03/27/2019

Session

Data Engineering & Architecture
Location: 2001

Secondary topics: Data Integration and Data Pipelines, Transportation and Logistics

Alex Kira (Uber)

Average rating:

(4.00, 13 ratings)

Uber operates at scale, with thousands of microservices serving millions of rides a day, leading to 100+ PB of data. Alex Kira details Uber's journey toward a unified and scalable data workflow system used to manage this data and shares the challenges faced and how the company has rearchitected the system to make it highly available and horizontally scalable. Read more.

Reducing stream processing complexity using Apache Pulsar Functions

4:20pm–5:00pm Wednesday, 03/27/2019

Session

Data Engineering & Architecture
Location: 2002

Secondary topics: AI and Data technologies in the cloud, Data Integration and Data Pipelines, Retail and e-commerce

Jowanza Joseph (Pluralsight), Karthik Ramasamy (Streamlio)

Average rating:

(4.00, 1 rating)

After two years of running streaming pipelines through Kinesis and Spark at One Click Retail, Jowanza Joseph and Karthik Ramasamy decided to explore a new platform that would take advantage of Kubernetes and support a simpler data processing DSL. Join in to discover why they chose Apache Pulsar and learn tips and tricks for using Pulsar Functions. Read more.

Break through the limits of your current database (sponsored by MemSQL)

4:20pm–5:00pm Wednesday, 03/27/2019

Session

Sponsored
Location: 2003

Franck Leveneur (Wag!)

Average rating:

(3.00, 1 rating)

MySQL is great but has limits. When you need key-value pair storage with geospatial and JSON support, easy and fast ingestion from various streams, aggregate queries against 100+ million rows in under one second, and more, there's only one solution. Franck Leveneur explains how on-demand dog walking service Wag! uses MemSQL to take its real-time data access and reporting to the next level. Read more.

From flat files to deconstructed databases: The evolution and future of the big data ecosystem

4:20pm–5:00pm Wednesday, 03/27/2019

Session

Data Engineering & Architecture
Location: 2004

Secondary topics: Data Integration and Data Pipelines, Storage, Streaming, realtime analytics, and IoT

Julien Le Dem (WeWork)

Average rating:

(4.83, 6 ratings)

Big data infrastructure has evolved from flat files in a distributed filesystem to an efficient ecosystem to a fully deconstructed and open source database with reusable components. Julien Le Dem discusses the key open source components of the big data ecosystem and explains how they relate to each other and how they make the ecosystem more of a database and less of a filesystem. Read more.

Augmented OLAP for big data from on-premises to multicloud (sponsored by Kyligence)

4:20pm–5:00pm Wednesday, 03/27/2019

Session

Sponsored
Location: 2005

Yang Li (Kyligence)

Average rating:

(4.00, 1 rating)

Augmenting data management and analytics platforms with artificial intelligence and machine learning is game changing for analysts, engineers, and other users. It enables companies to optimize their storage, speed, and spending. Yang Li details the Kyligence platform, which is evolving to the next level with augmented capabilities such as intelligent modeling, smart pushdowns, and more. Read more.

Put Kafka in jail with Strimzi

4:20pm–5:00pm Wednesday, 03/27/2019

Session

Data Engineering & Architecture
Location: 2006

Secondary topics: Streaming, realtime analytics, and IoT

Sean Glover (Lightbend)

Average rating:

(4.00, 1 rating)

The best way to run stateful services with complex operational needs like Kafka is to use the operator pattern. Sean Glover offers an overview of the Strimzi Kafka Operator, a popular new open source Operator-based Apache Kafka implementation on Kubernetes. Read more.

MLflow: An open platform to simplify the machine learning lifecycle

4:20pm–5:00pm Wednesday, 03/27/2019

Session

Data Engineering & Architecture
Location: 2008

Secondary topics: Model lifecycle management

Corey Zumar (Databricks)

Average rating:

(4.89, 9 ratings)

Developing applications that leverage machine learning is difficult. Practitioners need to be able to reproduce their model development pipelines, as well as deploy models and monitor their health in production. Corey Zumar offers an overview of MLflow, which simplifies this process by managing, reproducing, and operationalizing machine learning through a suite of model tracking and deployment APIs. Read more.

Spark NLP: How Roche automates knowledge extraction from pathology and radiology reports

4:20pm–5:00pm Wednesday, 03/27/2019

Session

Data Science, Machine Learning & AI
Location: 2009

Secondary topics: Deep Learning, Health and Medicine, Text and Language processing and analysis

Yogesh Pandit (Roche), Saif Addin Ellafi (John Snow Labs), Vishakha Sharma (Roche Molecular Solutions)

Average rating:

(4.67, 3 ratings)

Yogesh Pandit, Saif Addin Ellafi, and Vishakha Sharma discuss how Roche applies Spark NLP for healthcare to extract clinical facts from pathology and radiology reports. They then detail the design of the deep learning pipelines used to simplify training, optimization, and inference of such domain-specific models at scale. Read more.

Time series forecasting using statistical and machine learning models: When and how

4:20pm–5:00pm Wednesday, 03/27/2019

Session

Data Science, Machine Learning & AI
Location: 2010

Secondary topics: Financial Services, Temporal data and time-series analytics

Ying Yau (Walmart Labs)

Average rating:

(3.29, 7 ratings)

Time series forecasting techniques are applied in a wide range of scientific disciplines, business scenarios, and policy settings. Jeffrey Yau discusses the applications of statistical time series models, such as ARIMA, VAR, and regime-switching models, and machine learning models, such as random forest and neural network-based models, to forecasting problems. Read more.

Scaling model training: From flexible training APIs to resource management with Kubernetes

4:20pm–5:00pm Wednesday, 03/27/2019

Session

Data Science, Machine Learning & AI
Location: 2011

Secondary topics: Automation in data science and big data, Financial Services, Model lifecycle management

Kelley Rivoire (Stripe)

Average rating:

(4.33, 3 ratings)

Production ML applications benefit from reproducible, automated retraining and deployment of ever-more predictive models trained on ever-increasing amounts of data. Kelley Rivoire explains how Stripe built a flexible API for training machine learning models that's used to train thousands of models per week on Kubernetes, supporting automated deployment of new models with improved performance. Read more.

Jupyter Book: Online interactive books with the Jupyter Notebook

4:20pm–5:00pm Wednesday, 03/27/2019

Session

Sponsored
Location: 2014

Secondary topics: Jupyter

Chris Holdgraf (Berkeley Institute for Data Science)

Average rating:

(4.75, 4 ratings)

Chris Holdgraf shares recent tools from the Jupyter project in partnership with UC Berkeley that facilitate communication with Jupyter and get us closer to displaying notebook-style content in a more discoverable and reader-friendly form—allowing you to turn collections of notebooks into an online book and connect this content with the cloud in order to make your online content interactive. Read more.

User-based real-time product recommendations leveraging deep learning using Analytics Zoo on Apache Spark and BigDL

4:20pm–5:00pm Wednesday, 03/27/2019

Session

Data Science, Machine Learning & AI
Location: 2016

Secondary topics: Deep Learning, Retail and e-commerce

Luyang Wang (Restaurant Brands International), Jing (Nicole) Kong (Office Depot), Guoqiong Song (Intel), Maneesha Bhalla (Office Depot)

Average rating:

(4.00, 2 ratings)

User-based real-time recommendation systems have become an important topic in ecommerce. Lu Wang, Nicole Kong, Guoqiong Song, and Maneesha Bhalla demonstrate how to build deep learning algorithms using Analytics Zoo with BigDL on Apache Spark and create an end-to-end system to serve real-time product recommendations. Read more.

Apache Superset: An open source data visualization platform

4:20pm–5:00pm Wednesday, 03/27/2019

Session

Business Analytics and Visualization, Strata Business Summit
Location: 2018

Secondary topics: Visualization, Design, and UX

Maxime Beauchemin (Lyft)

Average rating:

(4.50, 4 ratings)

Maxime Beauchemin offers an overview of Apache Superset, discussing the project's open source development dynamics, security, architecture, and underlying technologies as well as the key items on its roadmap. Read more.

Executive Briefing: Overview of data governance

4:20pm–5:00pm Wednesday, 03/27/2019

Session

Executive Briefing and best practices, Strata Business Summit
Location: 2020

Secondary topics: AI and machine learning in the enterprise, Data preparation, data governance, and data lineage

Paco Nathan (derwen.ai)

Average rating:

(3.67, 6 ratings)

Effective data governance is foundational for AI adoption in enterprise, but it's an almost overwhelming topic. Paco Nathan offers an overview of its history, themes, tools, process, standards, and more. Join in to learn what impact machine learning has on data governance and vice versa. Read more.

Applied AI and NLP for enterprise contract intelligence (sponsored by ThoughtTrace)

4:20pm–5:00pm Wednesday, 03/27/2019

Session

Sponsored
Location: 2022

Joel Hron (ThoughtTrace), Nick Vandivere (ThoughtTrace)

Average rating:

(4.00, 1 rating)

Building a SaaS AI company targeted at enterprise users presents unique challenges, both technical and nontechnical. Joel Hron and Nick Vandivere walk you through ThoughtTrace's journey, highlighting its beginnings as a company and sharing the challenging use cases the company tackled first. Read more.

VC dimension: How and why investors fund AI startups

4:20pm–5:00pm Wednesday, 03/27/2019

Session

Strata Business Summit
Location: 2024

Ashley Fontana (Zetta), Katherine Boyle (General Catalyst), Sarah Catanzaro (Amplify Partners), Arif Janmohamed (Lightspeed Venture Partners), Lan Xuezhao (Basis Set Ventures)

Average rating:

(4.00, 1 rating)

What does it mean to be an AI investor? How is this approach different from traditional venture capital? Ash Fontana and Katherine Boyle share their perspectives on investments in machine intelligence and data science. Read more.

Toward deep and representation learning for talent search at LinkedIn

4:20pm–5:00pm Wednesday, 03/27/2019

Session

Data Science, Machine Learning & AI, Expo Hall
Location: Expo Hall

Secondary topics: Deep Learning, Graph technologies and analytics, Text and Language processing and analysis

Gungor Polatkan (LinkedIn)

Average rating:

(4.33, 3 ratings)

Talent search systems at LinkedIn strive to match the potential candidates to the hiring needs of a recruiter expressed in terms of a search query. Gungor Polatkan shares the results of the company's deployment of deep learning models on a real-world production system serving 500M+ users through LinkedIn Recruiter. Read more.

5:10pm

Cloud native data pipelines with Apache Kafka

5:10pm–5:50pm Wednesday, 03/27/2019

Session

Data Engineering & Architecture
Location: 2001

Secondary topics: AI and Data technologies in the cloud, Data Integration and Data Pipelines

Gwen Shapira (Confluent)

Average rating:

(4.64, 11 ratings)

As microservices, data services, and serverless APIs proliferate, data engineers need to collect and standardize data in an increasingly complex and diverse system. Gwen Shapira discusses how data engineering requirements have changed in a cloud native world and shares architectural patterns that are commonly used to build flexible, scalable, and reliable data pipelines. Read more.

Serverless workflows for orchestration hybrid cluster-based and serverless processing

5:10pm–5:50pm Wednesday, 03/27/2019

Session

Data Engineering & Architecture
Location: 2002

Secondary topics: AI and Data technologies in the cloud, Data Integration and Data Pipelines

Rustem Feyzkhanov (Instrumental)

Average rating:

(3.50, 8 ratings)

Serverless implementation of core processing is quickly becoming a production-ready solution. However, companies with existing processing pipelines may find it hard to go completely serverless. Serverless workflows unite the serverless and cluster worlds, with the benefits of both approaches. Rustem Feyzkhanov demonstrates how serverless workflows change your perception of software architecture. Read more.

Go serverless with Elasticsearch: Eliminate scaling and performance bottlenecks for faster data search (sponsored by Vizion.ai)

5:10pm–5:50pm Wednesday, 03/27/2019

Session

Sponsored
Location: 2003

Geoff Tudor (Vizion.ai)

Average rating:

(1.00, 2 ratings)

Elasticsearch is powerful. In its current form, it's also nontrivial and rather expensive to deploy. Not very "elastic." Fortunately, innovations like serverless and microservices are eliminating these barriers, lowering upfront costs, and reducing complexity. Geoff Tudor explains how this is unfolding in the market. Read more.

When SQL users run wild: Resource management features and techniques to tame Apache Impala

5:10pm–5:50pm Wednesday, 03/27/2019

Session

Data Engineering & Architecture
Location: 2004

Tim Armstrong (Cloudera)

Average rating:

(4.80, 5 ratings)

As the popularity and utilization of Apache Impala deployments increase, clusters often become victims of their own success when demand for resources exceeds the supply. Tim Armstrong dives into the latest resource management features in Impala to maintain high cluster availability and optimal performance and provides examples of how to configure them in your Impala deployment. Read more.

Critical turbine maintenance: Monitoring and diagnosing planes and power plants in real time

5:10pm–5:50pm Wednesday, 03/27/2019

Session

Data Engineering & Architecture, Streaming and IoT
Location: 2006

Secondary topics: Streaming, realtime analytics, and IoT, Transportation and Logistics

June Andrews (GE), John Rutherford (GE)

Average rating:

(4.50, 2 ratings)

GE produces a third of the world's power and 60% of its airplane engines—a critical portion of the world's infrastructure that requires meticulous monitoring of the hundreds of sensors streaming data from each turbine. June Andrews and John Rutherford explain how GE's monitoring and diagnostics teams released the first real-time ML systems used to determine turbine health into production. Read more.

Persistent storage for machine learning in KubeFlow

5:10pm–5:50pm Wednesday, 03/27/2019

Session

Data Engineering & Architecture
Location: 2008

Secondary topics: AI and Data technologies in the cloud, Model lifecycle management, Storage

Skyler Thomas (MapR), Terry He (MapR Technologies)

Average rating:

(4.75, 4 ratings)

KubeFlow separates compute and storage to provide the ability to deploy best-of-breed open source systems for machine learning to any cluster running Kubernetes, whether on-premises or in the cloud. Skyler Thomas and Terry He explore the problems of state and storage and explain how distributed persistent storage can logically extend the compute flexibility provided by KubeFlow. Read more.

The magic behind your Lyft ride prices: A case study on machine learning and streaming

5:10pm–5:50pm Wednesday, 03/27/2019

Session

Data Science, Machine Learning & AI
Location: 2009

Secondary topics: Data Platforms, Streaming, realtime analytics, and IoT, Transportation and Logistics

Rakesh Kumar (Lyft), Thomas Weise (Lyft)

Average rating:

(4.00, 3 ratings)

Rakesh Kumar and Thomas Weise explore how Lyft dynamically prices its rides with a combination of various data sources, ML models, and streaming infrastructure for low latency, reliability, and scalability—allowing the pricing system to be more adaptable to real-world changes. Read more.

Federated learning

5:10pm–5:50pm Wednesday, 03/27/2019

Session

Data Science, Machine Learning & AI
Location: 2010

Secondary topics: Security and Privacy

Mike Lee Williams (Cloudera Fast Forward Labs)

Average rating:

(4.00, 1 rating)

Imagine building a model whose training data is collected on edge devices such as cell phones or sensors. Each device collects data unlike any other, and the data cannot leave the device because of privacy concerns or unreliable network access. This challenging situation is known as federated learning. Mike Lee Williams discusses the algorithmic solutions and the product opportunities. Read more.

Talking to the machines: Monitoring production machine learning systems

5:10pm–5:50pm Wednesday, 03/27/2019

Session

Data Science, Machine Learning & AI
Location: 2011

Secondary topics: Automation in data science and big data, Model lifecycle management, Temporal data and time-series analytics

Ting-Fang Yen (DataVisor)

Average rating:

(4.00, 3 ratings)

Ting-Fang Yen details an approach for monitoring production machine learning systems that handle billions of requests daily by discovering detection anomalies, such as spurious false positives, as well as gradual concept drifts when the model no longer captures the target concept. Join in to explore new tools for detecting undesirable model behaviors early in large-scale online ML systems. Read more.

From Jupyter to production: Accelerating solutions to business problems in production

5:10pm–5:50pm Wednesday, 03/27/2019

Session

Sponsored
Location: 2014

Secondary topics: Jupyter

Manu Mukerji (8x8), Justin Driemeyer (8x8)

Average rating:

(3.43, 7 ratings)

Project Jupyter is very popular for data science, data exploration, and visualization. Manu Mukerji and Justin Driemeyer explain how to use it for AI/ML in a production environment. Read more.

Real-time analytics on deep learning: When TensorFlow met Presto at Uber

5:10pm–5:50pm Wednesday, 03/27/2019

Session

Data Science, Machine Learning & AI
Location: 2016

Secondary topics: Data Platforms, Deep Learning, Streaming, realtime analytics, and IoT

Zhenxiao Luo (Twitter)

Average rating:

(4.00, 4 ratings)

From determining the most convenient rider pickup points to predicting the fastest routes, Uber uses data-driven analytics to create seamless trip experiences. Inside Uber, analysts are using deep learning and big data to train models, make predictions, and run analytics in real time. Zhenxiao Luo explains how Uber runs real-time analytics with deep learning. Read more.

An alternative approach to adding data science to an organization: Use Jupyter and start with the domain experts

5:10pm–5:50pm Wednesday, 03/27/2019

Session

Culture and organization, Strata Business Summit
Location: 2018

Secondary topics: AI and machine learning in the enterprise

Dave Stuart (Department of Defense)

Average rating:

(4.38, 8 ratings)

Many organizations look to add data science to their skill portfolios through the hiring of data science experts. Dave Stuart shares a complementary way to build a data science-savvy workforce that nets tremendous value by using Jupyter to add introductory data science practices to domain experts and business analysts. Read more.

Executive Briefing: Upskilling your business teams to scale analytics in your organization

5:10pm–5:50pm Wednesday, 03/27/2019

Session

Executive Briefing and best practices, Strata Business Summit
Location: 2020

Secondary topics: AI and machine learning in the enterprise

Barkha Gvalani (GV)

Average rating:

(2.50, 4 ratings)

How do you decide if you should invest in upskilling business teams? The question is no longer "if" but "when" and "how." Barkha Gvalani shares a framework for developing and delivering analytics training to nontechnical users. Read more.

IBM and Cloudera: Bringing AI and ML to the governed data lake (sponsored by IBM)

5:10pm–5:50pm Wednesday, 03/27/2019

Session

Location: 2022

Satheesh Bandaram (IBM), Saumitra Buragohain (Cloudera)

Average rating:

(4.00, 1 rating)

Satheesh Bandaram and Saumitra Buragohain detail how IBM and Cloudera are advancing AI and ML for their customers with solutions to build on-premises or cloud-based secure governed data lakes. Read more.

Purchase, play, and upgrade data for video game players

5:10pm–5:50pm Wednesday, 03/27/2019

Session

Case studies, Strata Business Summit
Location: 2024

Secondary topics: Media, Marketing, Advertising

Eric Bradlow (The Wharton School), Zachery Anderson (Electronic Arts)

Average rating:

(3.00, 1 rating)

Eric Bradlow and Zachery Anderson discuss the Wharton Customer Analytics Initiative research opportunity process and explain how EA solved some of its business problems by sharing its data with 11 teams of researchers from around the world. Read more.

Point, click, predict

5:10pm–5:50pm Wednesday, 03/27/2019

Session

Data Science, Machine Learning & AI, Expo Hall
Location: Expo Hall

Secondary topics: AI and Data technologies in the cloud, AI and machine learning in the enterprise, Automation in data science and big data, Data Platforms, Model lifecycle management

Kevin Moore (Salesforce)

Average rating:

(4.50, 2 ratings)

Kevin Moore walks you through how TransmogrifAI, Salesforce's open source AutoML library built on Spark, automatically generates models customized to a company's dataset and use case and provides insights into why each model makes the predictions it does. Read more.

5:50pm

Booth Crawl

5:50pm–6:50pm Wednesday, 03/27/2019

Event

Location: Expo Hall (Exhibit Hall - Level 1)

Average rating:

(5.00, 2 ratings)

Make your way from booth to booth while you check out all the exhibitors in the Expo Hall on Wednesday after sessions end. Read more.

6:50pm

6:50pm–7:30pm Wednesday, 03/27/2019

Location: On your own

Dinner (40m)

7:30pm

Data After Dark

7:30pm–9:30pm Wednesday, 03/27/2019

Event

Location: SPIN, 690 Folsom St., San Francisco

Average rating:

(5.00, 2 ratings)

Don't miss an exciting evening filled with cocktails, food, and entertainment at Data After Dark at Strata San Francisco. Read more.

Thursday, 03/28/2019

8:00am

8:00am–8:45am Thursday, 03/28/2019

Location: 3rd floor lobby

Break (45m)

Strata Data Awards: Winners Announced

9:35am–9:45am Thursday, 03/28/2019

Keynote

Location: Ballroom

Average rating:

(2.62, 8 ratings)

The Strata Data Awards recognize the most innovative startups, leaders, and data science projects from Strata sponsors and exhibitors around the world. Join us during keynotes for the announcement of the winners. Read more.

9:45am

Forecasting uncertainty at Airbnb

9:45am–9:55am Thursday, 03/28/2019

Keynote

Location: Ballroom

Secondary topics: Data Platforms, Temporal data and time-series analytics

Theresa Johnson (Airbnb)

Average rating:

(4.22, 18 ratings)

Airbnb uses AI and machine learning in many parts of its user-facing business. But it's also advancing the state of AI-powered internal tools. Theresa Johnson details the AI powering Airbnb's next-generation end-to-end metrics forecasting platform, which leverages machine learning, Bayesian inference, TensorFlow, Hadoop, and web technology. Read more.

9:55am

It’s in the game: A rare look into how EA brought data science into the creative process of game design

9:55am–10:10am Thursday, 03/28/2019

Keynote

Location: Ballroom

Zachery Anderson (Electronic Arts)

Average rating:

(4.54, 24 ratings)

Developing games at EA is where creativity meets AI, analytics, and machine learning, combining an understanding of player motivations with the means to improve the game design process. Zachery Anderson leads a tour of EA’s history combining data with development, taking you through the early days of balancing gameplay to the future of personalized games for everyone. Read more.

10:10am

Likewar: How social media is changing the world…and how the world is changing social media

10:10am–10:25am Thursday, 03/28/2019

Keynote

Location: Ballroom

Secondary topics: Security and Privacy

Peter Singer (New America)

Average rating:

(4.80, 20 ratings)

Terrorists live-stream their attacks, “Twitter wars” sell music albums and produce real-world casualties, and viral misinformation alters not just the result of battles but the very fate of nations. The result is that war, tech, and politics have blurred into a new kind of battle space that plays out on our smartphones. P. W. Singer explains. Read more.

10:30am

10:30am–11:00am Thursday, 03/28/2019

Location: Expo Hall (Exhibit Hall - Level 1)

Morning break sponsored by Google Cloud (30m)

11:00am

Disrupting data discovery

11:00am–11:40am Thursday, 03/28/2019

Session

Data Engineering & Architecture
Location: 2001

Secondary topics: Data preparation, data governance, and data lineage, Transportation and Logistics

Mark Grover (Lyft), Tao Feng (Lyft)

Average rating:

(4.40, 10 ratings)

Lyft has reduced the time it takes to discover data by 10x by building its own data portal, Amundsen. Mark Grover and Tao Feng offer a demo of Amundsen and lead a deep dive into its architecture, covering how it leverages centralized metadata, PageRank, and a comprehensive data graph to achieve its goal. They also explore the future roadmap, unsolved problems, and its collaboration model. Read more.

ML and AI at scale at PayPal

11:00am–11:40am Thursday, 03/28/2019

Session

Data Engineering & Architecture
Location: 2002

Secondary topics: Data Platforms, Data preparation, data governance, and data lineage, Financial Services

Subhadra Tatavarti (PayPal), Chen Kovacs (PayPal)

Average rating:

(4.12, 8 ratings)

The PayPal data ecosystem is large, with 250+ PB of data transacting in 200+ countries. Given this massive scale and complexity, discovering and accessing the right datasets in a frictionless environment is a challenge. Subhadra Tatavarti and Chen Kovacs explain how PayPal’s data platform team is helping solve this problem with a combination of self-service integrated and interoperable products. Read more.

Presto: Tuning performance of SQL-on-anything analytics

11:00am–11:40am Thursday, 03/28/2019

Session

Data Engineering & Architecture
Location: 2004

Secondary topics: Storage, Streaming, realtime analytics, and IoT

Kamil Bajda-Pawlikowski (Starburst), Martin Traverso (Presto Software Foundation)

Average rating:

(3.33, 3 ratings)

Kamil Bajda-Pawlikowski and Martin Traverso explore Presto's recently introduced cost-based optimizer, which must account for heterogeneous inputs with differing and often incomplete data statistics, and detail use cases for Presto across several industries. They also share recent Presto advancements, such as geospatial analytics at scale, and the project roadmap going forward. Read more.

Walmart's journey from business intelligence to artificial intelligence (sponsored by Walmart Labs)

11:00am–11:40am Thursday, 03/28/2019

Session

Sponsored
Location: 2005

Prakhar Mehrotra (Walmart Labs)

Average rating:

(4.14, 7 ratings)

Prakhar Mehrotra shares Walmart’s digital transformation journey and explains how the company is using recent advancements in machine learning to power core retail operations like pricing, assortment, and replenishment. Along the way, Prakhar demonstrates how to leverage human expertise and use it as feedback to improve your algorithms. Read more.

How Zhaopin.com built its enterprise event bus using Apache Pulsar

11:00am–11:40am Thursday, 03/28/2019

Session

Data Engineering & Architecture, Streaming and IoT
Location: 2006

Secondary topics: Data Platforms, Media, Marketing, Advertising, Streaming, realtime analytics, and IoT

Sijie Guo (StreamNative), Penghui Li (Zhaopin)

Average rating:

(4.00, 1 rating)

Using a messaging system to build an event bus is very common. However, certain use cases demand a messaging system with a certain set of features. Sijie Guo and Penghui Li discuss the event bus requirements for Zhaopin.com, one of China's biggest online recruitment services providers, and explain why the company chose Apache Pulsar. Read more.

Cloud programming simplified: A Berkeley view on serverless computing

11:00am–11:40am Thursday, 03/28/2019

Session

Data Engineering & Architecture
Location: 2007

Secondary topics: AI and Data technologies in the cloud, Data Integration and Data Pipelines

Eric Jonas (UC Berkeley)

Average rating:

(4.50, 2 ratings)

Eric Jonas offers a quick history of cloud computing, including an accounting of the predictions of the 2009 "Berkeley View of Cloud Computing" paper, explains the motivation for serverless computing, describes applications that stretch the current limits of serverless, and then lists obstacles and research opportunities required for serverless computing to fulfill its full potential. Read more.

Optimizing computing cluster resource utilization with an in-memory distributed filesystem

11:00am–11:40am Thursday, 03/28/2019

Session

Data Engineering & Architecture
Location: 2008

Secondary topics: Data Platforms, Retail and e-commerce, Storage

Yue Li (MemVerge), Shouwei Chen (Rutgers University)

Average rating:

(5.00, 4 ratings)

JD.com recently designed a brand-new architecture to optimize Spark computing clusters. Yue Li and Shouwei Chen detail the problems the team faced when building it and explain how the company benefits from the in-memory distributed filesystem now. Read more.

Creating a bionic newsroom

11:00am–11:40am Thursday, 03/28/2019

Session

Data Science, Machine Learning & AI
Location: 2009

Secondary topics: Media, Marketing, Advertising

Boris Yakubchik (Forbes), Salah Zalatimo (Forbes)

Average rating:

(4.50, 2 ratings)

Boris Yakubchik and Salah Zalatimo offer an overview of Bertie, Forbes's new publishing platform—an AI assistant that learns from writers and suggests improvements—and detail Bertie’s features, architecture, and ultimate goals, paying special attention to how the company implemented an ensemble of machine learning models that, together, make up the AI assistant's skill set and personality. Read more.

Framework to quantitatively assess ML safety: Technical implementation and best practices

11:00am–11:40am Thursday, 03/28/2019

Session

Data Science, Machine Learning & AI
Location: 2010

Secondary topics: AI and machine learning in the enterprise, Health and Medicine, Security and Privacy

Ram Shankar Siva Kumar (Microsoft (Azure Security))

Average rating:

(4.33, 3 ratings)

How can we guarantee that the ML system we develop is adequately protected from adversarial manipulation? Ram Shankar Siva Kumar shares a framework and corresponding best practices to quantitatively assess the safety of your ML systems. Read more.

Applications of mixed effects random forests

11:00am–11:40am Thursday, 03/28/2019

Session

Data Science, Machine Learning & AI
Location: 2011

Sourav Dey (Manifold)

Average rating:

(4.75, 4 ratings)

Clustered data is all around us. The best way to attack it? Mixed effect models. Sourav Dey explains how the mixed effects random forests (MERF) model and Python package marry the world of classical mixed effect modeling with modern machine learning algorithms and shows how MERF can be extended for use with other advanced modeling techniques like gradient boosting machines and deep learning. Read more.

The future of the firm: Starting now

11:00am–11:40am Thursday, 03/28/2019

Session

Future of the Firm
Location: 2014

Josh Bersin (Bersin by Deloitte)

Average rating:

(5.00, 3 ratings)

Josh Bersin explains how firms are transforming for the digital era, covering the death of the traditional organizational hierarchy, new models of leadership and management, changes in the way people learn and progress, new models of pay, and the importance of trust and transparency as a central business value. Read more.

Detecting coordinated fraud attacks using deep learning

11:00am–11:40am Thursday, 03/28/2019

Session

Data Science, Machine Learning & AI
Location: 2016

Secondary topics: Deep Learning, Security and Privacy

Fang Yu (DataVisor)

Average rating:

(3.75, 4 ratings)

Online fraud flourishes as online services become ubiquitous in our daily life. Fang Yu explains how DataVisor leverages cutting-edge deep learning technologies to address the challenges in large-scale fraud detection. Read more.

Data Science University: Transforming a Fortune 5 workforce

11:00am–11:40am Thursday, 03/28/2019

Session

Culture and organization, Strata Business Summit
Location: 2018

Secondary topics: AI and machine learning in the enterprise, Health and Medicine

Marc Paradis (UnitedHealth Group)

Average rating:

(4.75, 4 ratings)

Data Science University (DSU) was established to bring analytics education to UnitedHealth Group, the world’s largest healthcare company, with over 270,000 employees. Marc Paradis explains how DSU was built out over time in an era of rapidly changing analytics technology and capabilities in an industry ripe for disruption, covering the challenges faced and lessons learned. Read more.

Executive Briefing: Forcing the legal and ethical hands of companies that collect, use, and analyze data

11:00am–11:40am Thursday, 03/28/2019

Session

Law and Ethics, Strata Business Summit
Location: 2020

Secondary topics: Financial Services, Security and Privacy

Nick Curcuru (Mastercard)

Average rating:

(4.50, 2 ratings)

Data—in part, harvested personal data—brings industries unprecedented insights about customer behavior. We know more about our customers and neighbors than at any other time in history, but we need to avoid crossing the "creepy" line. Nick Curcuru discusses how ethical behavior drives trust, especially in today's IoT age. Read more.

How to protect big data in a containerized environment

11:00am–11:40am Thursday, 03/28/2019

Session

Data Engineering & Architecture
Location: 2024

Secondary topics: AI and Data technologies in the cloud, Security and Privacy

Thomas Phelan (HPE BlueData)

Average rating:

(4.50, 2 ratings)

Recent headline-grabbing data breaches demonstrate that protecting data is essential for every enterprise. The best-of-breed approach for big data is HDFS configured with Transparent Data Encryption (TDE). But TDE is difficult to configure and manage—particularly when run in Docker containers. Thomas Phelan discusses these challenges and explains how to overcome them. Read more.

The future of machine learning is decentralized

11:00am–11:40am Thursday, 03/28/2019

Session

Data Science, Machine Learning & AI, Expo Hall
Location: Expo Hall

Secondary topics: Security and Privacy, Storage

Alex Ingerman (Google)

Average rating:

(4.67, 12 ratings)

Federated learning is an approach for training ML models across a fleet of participating devices without collecting their data in a central location. Alex Ingerman offers an overview of federated learning, compares traditional and federated ML workflows, and explores the current and upcoming use cases for decentralized machine learning, with examples from Google's deployment of this technology. Read more.

11:50am

Journey to the cloud: Architecting for the cloud through customer stories

11:50am–12:30pm Thursday, 03/28/2019

Session

Data Engineering & Architecture
Location: 2001

Secondary topics: AI and Data technologies in the cloud, Data Platforms, Storage

Jason Wang (Cloudera), Sushant Rao (Cloudera)

Average rating:

(4.00, 2 ratings)

Jason Wang and Sushant Rao offer an overview of cloud architecture, then go into detail on core cloud paradigms like compute (virtual machines), cloud storage, authentication and authorization, and encryption and security. They conclude by bringing these concepts together through customer stories to demonstrate how real-world companies have leveraged the cloud for their big data platforms. Read more.

Building Rakuten analytics: A story of evolutions

11:50am–12:30pm Thursday, 03/28/2019

Session

Data Engineering & Architecture
Location: 2002

Secondary topics: Data Platforms, Retail and e-commerce

Juan Paulo Gutierrez (Rakuten)

Average rating:

(4.75, 4 ratings)

Juan Paulo Gutierrez explains how a small team in Tokyo went through several evolutions as they built an analytics service to help 200+ businesses accelerate their decision-making process. Join in to hear about the background, challenges, architecture, success stories, and best practices as they built and productionalized Rakuten Analytics. Read more.

Solving the enterprise data dilemma (sponsored by erwin)

11:50am–12:30pm Thursday, 03/28/2019

Session

Sponsored
Location: 2003

Adam Famularo (erwin, Inc.)

Average rating:

(4.00, 1 rating)

Adam Famularo showcases erwin's combination of data management and data governance to produce actionable insights. Erwin customer Nasdaq then shares a real-world use case. You'll learn how to answer tough data questions, how to maintain a metadata landscape, and how to use data management and governance to produce actionable insights. Read more.

Flink SQL in action

11:50am–12:30pm Thursday, 03/28/2019

Session

Data Engineering & Architecture, Streaming and IoT
Location: 2004

Secondary topics: Data Integration and Data Pipelines, Streaming, realtime analytics, and IoT

Fabian Hueske (Ververica)

Average rating:

(4.30, 10 ratings)

Processing streaming data with SQL is becoming increasingly popular. Fabian Hueske explains why SQL queries on streams should have the same semantics as SQL queries on static data. He then shares a selection of common use cases and demonstrates how easily they can be addressed with Flink SQL. Read more.

Intelligent design patterns for cloud-based analytics and BI (sponsored by Arcadia Data)

11:50am–12:30pm Thursday, 03/28/2019

Session

Sponsored
Location: 2005

Priyank Patel (Arcadia Data)

Average rating:

(4.00, 1 rating)

With cloud object storage, you'd expect business intelligence (BI) applications to benefit from the scale of data and real-time analytics. However, traditional BI in the cloud surfaces non-obvious challenges. Priyank Patel reviews service-oriented cloud design (storage, compute, catalog, security, SQL) and shows how native cloud BI provides analytic depth, low cost, and high performance. Read more.

How Netflix measures app performance on 250 million unique devices across 190 countries

11:50am–12:30pm Thursday, 03/28/2019

Session

Data Engineering & Architecture
Location: 2006

Secondary topics: Data Platforms, Media, Marketing, Advertising, Streaming, realtime analytics, and IoT

Vivek Pasari (Netflix), Jitender Aswani (Netflix)

Average rating:

(3.14, 7 ratings)

Netflix has over 125 million members spread across 191 countries. Each day its members interact with its client applications on 250 million+ devices under highly variable network conditions. These interactions result in over 200 billion daily data points. Vivek Pasari dives into the data engineering and architecture that enables application performance measurement at this scale. Read more.

Serverless for data and AI

11:50am–12:30pm Thursday, 03/28/2019

Session

Data Engineering & Architecture, Data Science, Machine Learning & AI, Streaming and IoT
Location: 2007

Secondary topics: AI and Data technologies in the cloud, Data Integration and Data Pipelines, Data Platforms

Avner Braverman (Binaris)

Average rating:

(4.00, 3 ratings)

What is serverless, and how can it be utilized for data analysis and AI? Avner Braverman outlines the benefits and limitations of serverless with respect to data transformation (ETL), AI inference and training, and real-time streaming. This is a technical talk, so expect demos and code. Read more.

Scanner: Efficient video analysis at scale

11:50am–12:30pm Thursday, 03/28/2019

Session

Data Engineering & Architecture
Location: 2008

Secondary topics: Deep Learning, Media, Marketing, Advertising

Fait Poms (Stanford University), Will Crichton (Stanford University)

Average rating:

(4.75, 4 ratings)

Video is now the largest source of data on the internet, so we need tools to make it easier to process and analyze. Fait Poms and Will Crichton offer an overview of Scanner, the first open source distributed system for building large-scale video processing applications, and explore real-world use cases. Read more.

Deploying data science for national economic statistics

11:50am–12:30pm Thursday, 03/28/2019

Session

Data Science, Machine Learning & AI
Location: 2009

Secondary topics: Temporal data and time-series analytics

Jeff Chen (US Bureau of Economic Analysis)

Average rating:

(4.50, 2 ratings)

Jeff Chen shares strategies for overcoming time series challenges at the intersection of macroeconomics and data science, drawing from machine learning research conducted at the Bureau of Economic Analysis aimed at improving its flagship product, the gross domestic product. Read more.

Masquerading malicious DNS traffic

11:50am–12:30pm Thursday, 03/28/2019

Session

Data Science, Machine Learning & AI
Location: 2010

Secondary topics: Security and Privacy, Temporal data and time-series analytics

David Rodriguez (Cisco Systems)

Average rating:

(4.50, 2 ratings)

Malicious DNS traffic patterns are inconsistent and typically thwart anomaly detection. David Rodriguez explains how Cisco uses Apache Spark and Stripe’s Bayesian inference software, Rainier, to fit the underlying time series distribution for millions of domains and outlines techniques to identify artificial traffic volumes related to spam, malvertising, and botnets (masquerading traffic). Read more.

Infinite segmentation: Scalable mutual information ranking on real-world graphs

11:50am–12:30pm Thursday, 03/28/2019

Session

Data Science, Machine Learning & AI
Location: 2011

Secondary topics: AI and Data technologies in the cloud, AI and machine learning in the enterprise, Media, Marketing, Advertising

Ken Johnston (Microsoft), Ankit Srivastava (Microsoft)

Average rating:

(4.50, 2 ratings)

Today, normal growth isn't enough; you need hockey-stick levels of growth. Sales and marketing orgs are looking to AI to "growth hack" their way to new markets and segments. Ken Johnston and Ankit Srivastava explain how to use mutual information at scale across massive data sources to help filter out noise and share critical insights with new cohorts of users, businesses, and networks. Read more.

The brave new world of computational propaganda

11:50am–12:30pm Thursday, 03/28/2019

Session

Future of the Firm
Location: 2014

Renee DiResta (New Knowledge)

Average rating:

(5.00, 1 rating)

Renee DiResta, lead author of the US Senate report about Russian disinformation operations, discusses how influence operations are manifesting in 2019 as they've moved beyond politics. Read more.

Modern techniques for building robust deep networks

11:50am–12:30pm Thursday, 03/28/2019

Session

Data Science, Machine Learning & AI
Location: 2016

Secondary topics: Deep Learning, Temporal data and time-series analytics

Sricharan Kumar (Intuit)

Average rating:

(4.29, 7 ratings)

Machine learning is delivering immense value across industries. However, in some instances, machine learning models can produce overconfident results—with the potential for catastrophic outcomes. Kumar Sricharan explains how to address this challenge through Bayesian machine learning and highlights real-world examples to illustrate its benefits. Read more.

Scaling data infrastructure in the fashion world; or, “What is this? Business intelligence for ants?”

11:50am–12:30pm Thursday, 03/28/2019

Session

Culture and organization, Strata Business Summit
Location: 2018

Secondary topics: Data Platforms, Retail and e-commerce

Francesco Mucio (Francescomuc.io)

Average rating:

(4.00, 2 ratings)

Francesco Mucio tells the story of how Zalando went from an old-school BI company to an AI-driven company built on a solid data platform. Along the way, he shares what Zalando learned in the process and the challenges that still lie ahead. Read more.

Executive Briefing: Why machine-learned models crash and burn in production and what to do about it

11:50am–12:30pm Thursday, 03/28/2019

Session

Executive Briefing and best practices, Strata Business Summit
Location: 2020

Secondary topics: Model lifecycle management

David Talby (Pacific AI)

Average rating:

(4.90, 10 ratings)

Machine learning and data science systems often fail in production in unexpected ways. David Talby shares real-world case studies showing why this happens and explains what you can do about it, covering best practices and lessons learned from a decade of experience building and operating such systems at Fortune 500 companies across several industries. Read more.

Automation of root cause analysis for big data stack applications

11:50am–12:30pm Thursday, 03/28/2019

Session

Data Engineering & Architecture
Location: 2024

Secondary topics: Automation in data science and big data, Deep Learning

Alkis Simitsis (Micro Focus), Shivnath Babu (Unravel Data Systems | Duke University)

Average rating:

(2.67, 3 ratings)

Alkis Simitsis and Shivnath Babu share an automated, deep learning-based technique for root cause analysis (RCA) of big data stack applications, demonstrated with Spark and Impala. The concepts they discuss apply generally to the big data stack. Read more.

Decentralized governance of data

11:50am–12:30pm Thursday, 03/28/2019

Data Science, Machine Learning & AI, Expo Hall
Location: Expo Hall

Secondary topics: Open Data, Data Generation and Data Networks, Security and Privacy

Roger Chen (Computable)

Average rating:

(2.00, 1 rating)

Data remains a linchpin of success for machine learning yet too often is a scarce resource. And even when data is available, trust issues arise about the quality and ethics of collection. Roger Chen explores new models for generating and governing training data for AI applications. Read more.

12:30pm

Thursday Topic Tables at Lunch

12:30pm–1:50pm Thursday, 03/28/2019

Event

Location: Expo Hall (Exhibit Hall - Level 1)

Average rating:

(5.00, 1 rating)

Topic Table discussions are a great way to informally network with people in similar industries or interested in the same topics. Read more.

Thursday Business Summit Lunch

12:30pm–1:50pm Thursday, 03/28/2019

Event

Location: Expo Hall

Average rating:

(3.50, 2 ratings)

Join Strata Business Summit speakers and attendees for a networking lunch on Thursday. Read more.

1:50pm

Challenges in addressing bias, fairness, and transparency in AI

1:50pm–2:30pm Thursday, 03/28/2019

Session

Data Engineering & Architecture
Location: 2001

Krishna Gade (Fiddler Labs)

Average rating:

(4.67, 3 ratings)

Join Krishna Gade to learn how to address engineering and organizational challenges for AI fairness and operationalize these concepts in a production AI system—and crucially, create a culture of trust in AI. Read more.

Loosely coupled data with Apache Arrow Flight

1:50pm–2:30pm Thursday, 03/28/2019

Session

Data Engineering & Architecture
Location: 2002

Jacques Nadeau (Dremio)

Average rating:

(4.60, 5 ratings)

Apache Arrow Flight is a new initiative focused on providing high-performance communication within data engineering and data science infrastructure. Jacques Nadeau explains how Flight works and where it has been integrated. He also discusses how Flight can be used to abstract physical data management from logical access and shares benchmarks of workloads that have been improved by Flight. Read more.

Rethinking big data analytics with Google Cloud (sponsored by Google Cloud)

1:50pm–2:30pm Thursday, 03/28/2019

Session

Sponsored
Location: 2003

Jordan Tigani (Google)

Average rating:

(4.00, 3 ratings)

Google Cloud Platform combines powerful serverless solutions for enterprise data warehousing, streaming analytics, managed Spark and Hadoop, modern BI, planet-scale data lake, and AI. Jordan Tigani details Google Cloud’s vision and engineering strategy, which can help you move big data analytics solutions to the next level of benefits. Read more.

Spark adaptive execution: Unleash the power of Spark SQL

1:50pm–2:30pm Thursday, 03/28/2019

Session

Data Engineering & Architecture
Location: 2004

Secondary topics: Streaming, realtime analytics, and IoT

Haifeng Chen (Intel)

Average rating:

(4.00, 3 ratings)

Spark SQL is widely used, but it still suffers from stability and performance challenges in highly dynamic environments with large-scale data. Haifeng Chen shares a Spark adaptive execution engine built to address these challenges. It can handle task parallelism, join conversion, and data skew dynamically during runtime, guaranteeing the best plan is chosen using runtime statistics. Read more.

Performant time series data management and analytics with Postgres

1:50pm–2:30pm Thursday, 03/28/2019

Session

Data Engineering & Architecture
Location: 2006

Secondary topics: Streaming, realtime analytics, and IoT

Matvey Arye (TimescaleDB)

Average rating:

(3.75, 4 ratings)

Matvey Arye offers an overview of two newly released features of TimescaleDB—automated adaptation of time-partitioning intervals and continuous aggregations in near real time—and discusses how these capabilities ease time series data management. Along the way, he also shares real-world use cases, including TimescaleDB's use with other technologies such as Kafka. Read more.

Ludwig, a code-free deep learning toolbox

1:50pm–2:30pm Thursday, 03/28/2019

Session

Data Science, Machine Learning & AI
Location: 2007

Secondary topics: Deep Learning, Transportation and Logistics

Piero Molino (Uber AI)

Average rating:

(4.60, 5 ratings)

Piero Molino offers an overview of Ludwig, a deep learning toolbox that allows you to train models and use them for prediction without the need to write code. It's unique in its ability to help make deep learning easier to understand for nonexperts and enable faster model improvement iteration cycles for experienced machine learning developers and researchers alike. Read more.

Faster ML over joins of tables

1:50pm–2:30pm Thursday, 03/28/2019

Session

Data Engineering & Architecture
Location: 2008

Secondary topics: Automation in data science and big data, Storage, Streaming, realtime analytics, and IoT

Arun Kumar (University of California, San Diego)

Average rating:

(4.00, 2 ratings)

Arun Kumar details recent techniques to accelerate ML over data that is the output of joins of multiple tables. Using ideas from query optimization and learning theory, Arun demonstrates how to avoid joins before ML to reduce runtimes and memory and storage footprints. Along the way, he explores open source software prototypes and sample ML code in both R and Python. Read more.

Use the Jupyter Notebook to integrate adversarial attacks into a model training pipeline to detect vulnerabilities

1:50pm–2:30pm Thursday, 03/28/2019

Session

Data Science, Machine Learning & AI
Location: 2009

Secondary topics: Security and Privacy

Animesh Singh (IBM), Tommy Li (IBM)

Average rating:

(4.50, 2 ratings)

Animesh Singh and Tommy Li explain how to implement state-of-the-art methods for attacking and defending classifiers using the open source Adversarial Robustness Toolbox. The library provides AI developers with interfaces that support the composition of comprehensive defense systems using individual methods as building blocks. Read more.

Using graph metrics to detect lateral movement in enterprise cybersecurity data

1:50pm–2:30pm Thursday, 03/28/2019

Session

Data Science, Machine Learning & AI
Location: 2010

Secondary topics: Graph technologies and analytics, Security and Privacy

Louis DiValentin (Accenture), Dillon Cullinan (Accenture)

Average rating:

(3.00, 3 ratings)

Louis DiValentin and Dillon Cullinan explain how Accenture's Cyber Security Lab built security analytics models to detect attempted lateral movement in networks by transforming enterprise-scale security data into a graph format, generating graph analytics for individual users, and building time series detection models that visualize the changing graph metrics for security operators. Read more.

How to determine the optimal anomaly detection method for your application

1:50pm–2:30pm Thursday, 03/28/2019

Session

Data Science, Machine Learning & AI
Location: 2011

Secondary topics: Temporal data and time-series analytics

Jonathan Merriman (Verint Intelligent Self-Service), Cynthia Freeman (Verint Intelligent Self-Service)

Average rating:

(3.89, 9 ratings)

Anomaly detection has many applications, such as tracking business KPIs or fraud spotting in credit card transactions. Unfortunately, there's no one best way to detect anomalies across a variety of domains. Jonathan Merriman and Cynthia Freeman introduce a framework to determine the best anomaly detection method for the application based on time series characteristics. Read more.

The conscience of a company

1:50pm–2:30pm Thursday, 03/28/2019

Session

Future of the Firm
Location: 2014

Moderated by:

Tim O'Reilly (O'Reilly Media)

Panelists:

Janet Haven (Data & Society), Catherine Bracy (TechEquity Collaborative)

Average rating:

(3.67, 3 ratings)

Tim O'Reilly will be joined by Janet Haven, executive director of Data & Society, and Catherine Bracy, director of the TechEquity Collaborative, to discuss ways in which tech employees are flexing their muscles as the conscience of their companies. Read more.

On a deep journey toward five nines

1:50pm–2:30pm Thursday, 03/28/2019

Session

Data Science, Machine Learning & AI
Location: 2016

Secondary topics: Deep Learning, Financial Services, Temporal data and time-series analytics

Aashish Sheshadri (PayPal)

Average rating:

(4.50, 2 ratings)

Deep learning using sequence-to-sequence networks (Seq2Seq) has demonstrated unparalleled success in neural machine translation. A less explored but highly sought-after area of forecasting can leverage recent gains made in Seq2Seq networks. Aashish Sheshadri explains how PayPal has applied deep networks to monitoring and alerting intelligence. Read more.

What the reproducibility problem means for your business

1:50pm–2:30pm Thursday, 03/28/2019

Session

Strata Business Summit
Location: 2018

Secondary topics: AI and machine learning in the enterprise

Stuart Buck (Arnold Ventures)

Average rating:

(4.50, 4 ratings)

Academic research has been plagued by a reproducibility crisis in fields ranging from medicine to psychology. Stuart Buck explains how to take precautions in your data analysis and experiments so as to avoid those reproducibility problems. Read more.

Executive Briefing: The 6 keys to successful data spelunking

1:50pm–2:30pm Thursday, 03/28/2019

Session

Executive Briefing and best practices, Strata Business Summit
Location: 2020

Secondary topics: Open Data, Data Generation and Data Networks

Ken Johnston (Microsoft), Ankit Srivastava (Microsoft)

Average rating:

(4.80, 5 ratings)

At the rate data sources are multiplying, business value can often be developed faster by joining data sources rather than mining a single source to the very end. Ken Johnston and Ankit Srivastava share four years of hands-on practical experience sourcing and integrating massive numbers of data sources to build the Microsoft Business Intelligence Graph (M360 BIG). Read more.

How EPFL captured the feel of the Montreux Jazz Festival with its immersive 3D VR to three-geo archive

1:50pm–2:30pm Thursday, 03/28/2019

Session

Visualization and UX
Location: 2024

Secondary topics: AI and Data technologies in the cloud, Visualization, Design, and UX

Stefaan Vervaet (Western Digital Corporation), Alain Dufaux (École Polytechnique Fédérale de Lausanne (EPFL))

Average rating:

(5.00, 1 rating)

The École Polytechnique Fédérale de Lausanne (EPFL) spearheaded the official digital archival of 15,000+ hours of A/V content captured from the Montreux Jazz Festival since 1967. Stefaan Vervaet and Alain Dufaux explain how EPFL created an immersive 3D VR experience. From capture and store to delivery and experience, they detail the evolution of the workflow that made it all possible. Read more.

2:40pm

Apache Spark 2.4 and beyond

2:40pm–3:20pm Thursday, 03/28/2019

Session

Data Engineering & Architecture
Location: 2001

Xiao Li (Databricks), Wenchen Fan (Databricks)

Average rating:

(3.25, 4 ratings)

Xiao Li and Wenchen Fan offer an overview of the major features and enhancements in Apache Spark 2.4 and give insight into upcoming releases. Then you'll get the chance to ask all your burning Spark questions. Read more.

Transforming behavioral analytics at Atlassian

2:40pm–3:20pm Thursday, 03/28/2019

Session

Data Engineering & Architecture
Location: 2002

Secondary topics: Data Platforms, Data preparation, data governance, and data lineage

Rohan Dhupelia (Atlassian), Jimmy Li (Atlassian)

Average rating:

(4.67, 3 ratings)

Analytics is easy, but good analytics is hard. Atlassian knows this all too well. Rohan Dhupelia and Jimmy Li explain how the company's push to become truly data driven has transformed the way it thinks about behavioral analytics, from how it defined its events to how it ingests and analyzes them. Read more.

How to survive future data warehousing challenges with the help of a hybrid cloud

2:40pm–3:20pm Thursday, 03/28/2019

Session

Data Engineering & Architecture
Location: 2004

Secondary topics: AI and Data technologies in the cloud

Eva Andreasson (Cloudera), Mark Brine (Cloudera), Michael Kohs (Cloudera)

Average rating:

(2.00, 3 ratings)

Michael Kohs, Eva Andreasson, and Mark Brine explain how Cloudera’s Finance Department used a hybrid model to speed up report delivery and reduce the cost of end-of-quarter reporting. They also share guidelines for deploying modern data warehousing in a hybrid cloud environment, outlining when you should choose a private cloud service over a public one, the available options, and some dos and don'ts. Read more.

Bullet: Querying streaming data in transit with sketches

2:40pm–3:20pm Thursday, 03/28/2019

Session

Data Engineering & Architecture
Location: 2006

Secondary topics: Storage, Streaming, realtime analytics, and IoT

Akshai Sarma (Yahoo), Nathan Speidel (Yahoo)

Average rating:

(3.67, 3 ratings)

Akshai Sarma and Nathan Speidel offer an overview of Bullet, a scalable, pluggable, lightweight, multitenant query system that runs on any data flowing through a streaming system without storing it. Bullet efficiently supports intractable operations like top K, count distincts, and windowing without any storage using sketch-based algorithms. Read more.

Creating a data engineering culture at USAA

2:40pm–3:20pm Thursday, 03/28/2019

Session

Culture and organization
Location: 2007

Secondary topics: AI and machine learning in the enterprise, Financial Services

Jesse Anderson (Big Data Institute), Thomas Goolsby (USAA)

Average rating:

(3.67, 6 ratings)

What happens when you have a data science organization but no data engineering organization? Jesse Anderson and Thomas Goolsby explain what happened at USAA without data engineering, how they fixed it, and the results since. Read more.

Clusters in Kubernetes on a cluster: Building a multitenant environment for the field

2:40pm–3:20pm Thursday, 03/28/2019

Session

Data Engineering & Architecture
Location: 2008

Secondary topics: AI and Data technologies in the cloud, Storage

Paul Curtis (Weaveworks)

Average rating:

(4.50, 2 ratings)

What do you do when your technology doesn’t easily fit on a single laptop and consists of many components? Paul Curtis explains how MapR Technologies rolled out a containerized, scalable, globally available, and easily updatable environment using a combination of Kubernetes to orchestrate, shared data fabric to store and persist, and AppLariat to provide the user interface. Read more.

Machine learning prediction of blood alcohol content: A digital signature of behavior

2:40pm–3:20pm Thursday, 03/28/2019

Session

Data Science, Machine Learning & AI
Location: 2009

Secondary topics: Health and Medicine

Kirstin Aschbacher (UCSF Cardiology)

Average rating:

(4.20, 5 ratings)

Some people use digital devices to track their blood alcohol content (BAC). A BAC-tracking app that could anticipate when a person is likely to have a high BAC could offer coaching in a time of need. Kirstin Aschbacher shares a machine learning approach that predicts user BAC levels with good precision based on minimal information, thereby enabling targeted interventions. Read more.

How to train your model (and catch label leakage)

2:40pm–3:20pm Thursday, 03/28/2019

Session

Data Science, Machine Learning & AI
Location: 2010

Secondary topics: AI and Data technologies in the cloud, AI and machine learning in the enterprise, Automation in data science and big data

Till Bergmann (Salesforce)

Average rating:

(3.67, 6 ratings)

A problem in predictive modeling data is label leakage. At enterprise companies such as Salesforce, this problem takes on monstrous proportions as the data is populated by diverse business processes, making it hard to distinguish cause from effect. Till Bergmann explains how Salesforce—which needs to churn out thousands of customer-specific models for any given use case—tackled this problem. Read more.

Personalizing the guest-booking experience at Airbnb

2:40pm–3:20pm Thursday, 03/28/2019

Session

Data Science, Machine Learning & AI
Location: 2011

Secondary topics: Retail and e-commerce

Kapil Gupta (Airbnb)

Average rating:

(3.50, 4 ratings)

Kapil Gupta explains how Airbnb approaches the personalization of travelers’ booking experiences using machine learning. Read more.

Future of the firm: How are executives preparing now?

2:40pm–3:20pm Thursday, 03/28/2019

Session

Future of the Firm
Location: 2014

Moderated by:

Josh Bersin (Bersin by Deloitte)

Panelists:

Nancy Vitale (Genentech), Josh Alwitt (Publicis Sapient), Erin Flynn (Optimizely)

Average rating:

(4.50, 2 ratings)

In this panel session, executives will discuss how their companies are adapting to the workforce, business, and economic trends shaping the future of business. Read more.

Anomaly detection using deep learning to measure the quality of large datasets

2:40pm–3:20pm Thursday, 03/28/2019

Session

Data Science, Machine Learning & AI
Location: 2016

Secondary topics: Data preparation, data governance, and data lineage, Deep Learning

Sridhar Alla (BlueWhale), Syed Nasar (Cloudera)

Average rating:

(2.86, 7 ratings)

Any business, big or small, depends on analytics, whether the goal is revenue generation, churn reduction, or sales and marketing. No matter the algorithm and the techniques used, the result depends on the accuracy and consistency of the data being processed. Sridhar Alla and Syed Nasar share techniques used to evaluate the quality of data and the means to detect anomalies in the data. Read more.

Community and regional data sharing policy frameworks: Frontier stories

2:40pm–3:20pm Thursday, 03/28/2019

Session

Case studies, Strata Business Summit
Location: 2018

Secondary topics: Health and Medicine, Open Data, Data Generation and Data Networks

Mei Fung (People Centered Internet)

Average rating:

(4.67, 3 ratings)

Data sharing requires stakeholders and populations of people to come together to learn the benefits, risks, challenges, and known and unknown "unknowns." Data sharing policies and frameworks require increasing levels of trust, which takes time to build. Join Mei Fung for trail-blazing stories from Solano County, California, and ASEAN (SE Asia), which offer important insights. Read more.

Executive Briefing: Big data in the era of heavy worldwide privacy regulations

2:40pm–3:20pm Thursday, 03/28/2019

Session

Executive Briefing and best practices, Strata Business Summit
Location: 2020

Secondary topics: Security and Privacy

Mark Donsky (Okera), Nikki Rouda (Amazon Web Services)

Average rating:

(4.33, 3 ratings)

The implications of new privacy regulations for data management and analytics, such as the General Data Protection Regulation (GDPR) and the upcoming California Consumer Privacy Act (CCPA), can seem complex. Mark Donsky and Nikki Rouda highlight aspects of the rules and outline the approaches that will assist with compliance. Read more.

Building and scaling a security detection platform: A Netflix Original

2:40pm–3:20pm Thursday, 03/28/2019

Session

Data Engineering & Architecture
Location: 2024

Secondary topics: Data Platforms, Media, Marketing, Advertising, Security and Privacy

John Bennett (Netflix), Siamac Mirzaie (Netflix)

Average rating:

(3.33, 3 ratings)

Data has become a foundational pillar for security teams operating in organizations of all shapes and sizes. This new norm has created a need for platforms that enable engineers to harness data for various security purposes. John Bennett and Siamac Mirzaie offer an overview of Netflix's internal platform for quickly deploying data-based detection capabilities in the corporate environment. Read more.

3:20pm

3:20pm–3:50pm Thursday, 03/28/2019

Location: Foyer

Afternoon break (30m)

3:50pm

Scaling Apache Spark on Kubernetes at Lyft

3:50pm–4:30pm Thursday, 03/28/2019

Session

Data Engineering & Architecture
Location: 2001

Secondary topics: Data Integration and Data Pipelines, Data Platforms

Li Gao (Lyft), Bill Graham (Lyft)

Average rating:

(4.00, 2 ratings)

Li Gao and Bill Graham discuss the challenges the Lyft team faced and solutions they developed to support Apache Spark on Kubernetes in production and at scale. Read more.

ROCKSET: The design and implementation of a data system for low-latency queries for search and analytics

3:50pm–4:30pm Thursday, 03/28/2019

Session

Data Engineering & Architecture
Location: 2002

Secondary topics: AI and Data technologies in the cloud, Storage, Streaming, realtime analytics, and IoT

Igor Canadi (Rockset), Dhruba Borthakur (Rockset)

Average rating:

(4.00, 1 rating)

Most existing big data systems prefer sequential scans for processing queries. Igor Canadi and Dhruba Borthakur challenge this view, offering an overview of converged indexing: a single system called ROCKSET that builds inverted, columnar, and document indices. Converged indexing is economically feasible due to the elasticity of cloud resources and write-optimized storage engines. Read more.

Database migrations don't have to be painful, but the road will be bumpy

3:50pm–4:30pm Thursday, 03/28/2019

Session

Data Engineering & Architecture
Location: 2004

Secondary topics: Data Platforms

Adrian Lungu (Adobe), Serban Teodorescu (Adobe)

Average rating:

(4.75, 4 ratings)

Adrian Lungu and Serban Teodorescu explain how—inspired by the green-blue deployment technique—the Adobe Audience Manager team developed an active-passive database migration procedure that allows them to test database clusters in production, minimizing the risks without compromising the innovation. Read more.

Data science at Deutsche Telekom: Predicting global travel patterns and network demand

3:50pm–4:30pm Thursday, 03/28/2019

Session

Data Engineering & Architecture
Location: 2006

Secondary topics: AI and machine learning in the enterprise, Data Platforms, Security and Privacy

Václav Surovec (Deutsche Telekom), Gabor Kotalik (Deutsche Telekom)

Average rating:

(4.00, 1 rating)

Knowledge of customers' location and travel patterns is important for many companies, including German telco service operator Deutsche Telekom. Václav Surovec and Gabor Kotalik explain how a commercial roaming project using Cloudera Hadoop helped the company better analyze the behavior of its customers from 10 countries and provide better predictions and visualizations for management. Read more.

How Walgreens transformed supply chain management with Kyvos, Tableau, and big data

3:50pm–4:30pm Thursday, 03/28/2019

Session

Business Analytics and Visualization
Location: 2007

Neerav Jain (Walgreens), Anne Cruz (Walgreens), Vikas Hardia (Kyvos)

Average rating:

(2.75, 4 ratings)

Walgreens recently faced the challenge of analyzing 466 billion rows of data from 20,000 suppliers and 9,000 stores, a scale and cardinality that strained its existing systems. Neerav Jain, Vikas Hardia, and Anne Cruz describe how they used Kyvos and Tableau to transform Walgreens's supply chain with instant, interactive analysis on two years of data. Read more.

Spark-PMoF: Accelerating big data analytics with Persistent Memory over Fabric

3:50pm–4:30pm Thursday, 03/28/2019

Session

Data Engineering & Architecture
Location: 2008

Secondary topics: Storage, Streaming, realtime analytics, and IoT

Yuan Zhou (Intel), Haodong Tang (Intel), Jian Zhang (Intel)

Average rating:

(3.33, 3 ratings)

Yuan Zhou, Haodong Tang, and Jian Zhang offer an overview of Spark-PMoF and explain how it improves Spark analytics performance. Read more.

Nutrition data science

3:50pm–4:30pm Thursday, 03/28/2019

Session

Data Science, Machine Learning & AI
Location: 2009

Secondary topics: Health and Medicine

Noah Gift (UC Davis), Michelle Davenport (Quantitative Nutrition)

Average rating:

(2.89, 9 ratings)

Noah Gift and Michelle Davenport explore exciting ideas in nutrition using data science; specifically, they analyze the detrimental relationship between sugar and longevity, obesity, and chronic diseases. Read more.

Testing ad content with survey experiments

3:50pm–4:30pm Thursday, 03/28/2019

Session

Data Science, Machine Learning & AI
Location: 2010

Secondary topics: AI and Data technologies in the cloud, AI and machine learning in the enterprise, Media, Marketing, Advertising

Patrick Miller (Civis Analytics)

Average rating:

(3.40, 5 ratings)

Brands that test the content of ads before they are shown to an audience can avoid spending resources on the 11% of ads that cause backlash. Using a survey experiment to choose the best ad typically improves effectiveness of marketing campaigns by 13% on average, and up to 37% for particular demographics. Patrick Miller explores data collection and statistical methods for analysis and reporting. Read more.

The next step in the evolution of data science with RAPIDS

3:50pm–4:30pm Thursday, 03/28/2019

Session

Data Science, Machine Learning & AI
Location: 2011

Secondary topics: Graph technologies and analytics

Bartley Richardson (NVIDIA), Joshua Patterson (NVIDIA)

Average rating:

(4.00, 2 ratings)

RAPIDS is the next big step in data science, combining the ease of use of common APIs and the power and scalability of GPUs. Bartley Richardson and Joshua Patterson offer an overview of RAPIDS and explore cuDF, cuGraph, and cuML, a trio of RAPIDS tools that enable data scientists to work with data in a familiar interface and apply graph analytics and traditional machine learning techniques. Read more.

Digital transformation writ large

3:50pm–4:30pm Thursday, 03/28/2019

Session

Future of the Firm
Location: 2014

Jeffrey Wong (EY)

Average rating:

(3.33, 3 ratings)

Jeffrey Wong explains how an old-world firm leveraged technology to transform everything and thrive in our new world of continuous change—anticipating, scaling, and adapting to meet internal needs and client expectations. Read more.

Analytics Zoo: Distributed TensorFlow in production on Apache Spark

3:50pm–4:30pm Thursday, 03/28/2019

Session

Data Science, Machine Learning & AI
Location: 2016

Secondary topics: Data Platforms, Deep Learning

Yuhao Yang (Intel), Jiao (Jennie) Wang (Intel)

Average rating:

(2.67, 3 ratings)

Yuhao Yang and Jennie Wang demonstrate how to run distributed TensorFlow on Apache Spark with the open source software package Analytics Zoo. Compared to other solutions, Analytics Zoo is built for production environments and encourages more industry users to run deep learning applications within the big data ecosystem. Read more.

The Paradise Papers and West Africa Leaks: Behind the scenes with the ICIJ

3:50pm–4:30pm Thursday, 03/28/2019

Session

Business Analytics and Visualization, Strata Business Summit
Location: 2018

Secondary topics: Graph technologies and analytics, Media, Marketing, Advertising, Storage, Text and Language processing and analysis

Pierre Romera (International Consortium of Investigative Journalists (ICIJ))

Average rating:

(4.67, 6 ratings)

The ICIJ was the team behind the Panama Papers and Paradise Papers. Pierre Romera offers a behind-the-scenes look into the ICIJ's process and explores the challenges in handling 1.4 TB of data (in many different formats)—and making it available securely to journalists all over the world. Read more.

Executive Briefing: What it takes to use machine learning in fast data pipelines

3:50pm–4:30pm Thursday, 03/28/2019

Session

Executive Briefing and best practices, Strata Business Summit
Location: 2020

Secondary topics: Streaming, realtime analytics, and IoT

Dean Wampler (Anyscale)

Average rating:

(4.33, 6 ratings)

Your team is building machine learning capabilities. Dean Wampler demonstrates how to integrate these capabilities in streaming data pipelines so you can leverage the results quickly and update them as needed and covers challenges such as how to build long-running services that are very reliable and scalable and how to combine a spectrum of very different tools, from data science to operations. Read more.

Real-time monitoring of Twitter's network infrastructure with Heron

3:50pm–4:30pm Thursday, 03/28/2019

Session

Data Engineering & Architecture
Location: 2024

Secondary topics: Data Integration and Data Pipelines, Security and Privacy, Streaming, realtime analytics, and IoT

Julien Delange (Twitter), Neng Lu (Twitter)

Average rating:

(2.67, 3 ratings)

Julien Delange and Neng Lu explain how Twitter uses the Heron stream processing engine to monitor and analyze its network infrastructure—implementing a new data pipeline that ingests multiple sources and processes about 1 billion tuples to detect network issues and generate usage statistics. Join in to learn the key technologies used, the architecture, and the challenges Twitter faced. Read more.

4:40pm

Understanding Spark tuning with auto-tuning; or, Magical spells to stop your pager going off at 2:00am

4:40pm–5:20pm Thursday, 03/28/2019

Session

Data Engineering & Architecture
Location: 2001

Secondary topics: Automation in data science and big data

Holden Karau (Independent), Rachel B Warren (Salesforce Einstein)

Average rating:

(4.60, 5 ratings)

Apache Spark is an amazing distributed system, but part of the bargain we've made with the infrastructure daemons involves providing the correct set of magic numbers (a.k.a. tuning) or our jobs may be eaten by Cthulhu. Holden Karau and Rachel Warren explore auto-tuning jobs using systems like Apache Beam, Mahout, and internal Spark ML jobs as workloads, including new settings in 2.4. Read more.

Taming large state to join datasets for personalization

4:40pm–5:20pm Thursday, 03/28/2019

Session

Data Engineering & Architecture
Location: 2002

Secondary topics: Data Integration and Data Pipelines, Data preparation, data governance, and data lineage, Media, Marketing, Advertising

Sonali Sharma (Netflix), Shriya Arora (Netflix)

Average rating:

(3.00, 2 ratings)

With so much data being generated in real time, what if we could combine all these high-volume data streams and provide near real-time feedback for model training, improving personalization and recommendations and taking the customer experience to a whole new level? Sonali Sharma and Shriya Arora explain how to do exactly that, using Flink's keyed state. Read more.

Applying machine learning in fintech startups: Modeling with sensitive customer datasets

4:40pm–5:20pm Thursday, 03/28/2019

Session

Data Science, Machine Learning & AI
Location: 2004

Secondary topics: Financial Services, Security and Privacy

Ji Peng (Earnin)

Average rating:

(4.50, 2 ratings)

As a customer-facing fintech company, Earnin has access to various types of valuable customer data, from bank transactions to GPS location. Ji Peng shares how Earnin uses unique datasets to build machine learning models and navigates the challenges of prioritizing and applying machine learning in the fintech domain. Read more.

Apache Druid autoscale-out/in for streaming data ingestion on Kubernetes

4:40pm–5:20pm Thursday, 03/28/2019

Session

Data Engineering & Architecture, Streaming and IoT
Location: 2006

Secondary topics: AI and Data technologies in the cloud

Jinchul Kim (SK Telecom)

Average rating:

(2.17, 6 ratings)

Druid supports autoscaling for data ingestion, but it's only available on AWS EC2. You can't rely on the feature on your private cloud. Jinchul Kim demonstrates autoscale-out/in on Kubernetes, details the benefits of this approach, and discusses the development of Druid Helm charts, rolling updates, and custom metric usage for horizontal autoscaling. Read more.

Bringing data to life: Combining machine learning and art to tell a data story

4:40pm–5:20pm Thursday, 03/28/2019

Session

Case studies
Location: 2007

Secondary topics: Streaming, realtime analytics, and IoT, Text and Language processing and analysis, Visualization, Design, and UX

Nancy Rausch (SAS)

Average rating:

(4.80, 5 ratings)

For data to be meaningful, it needs to be presented in a way that people can relate to. Nancy Rausch explains how she combined streaming data from a solar array and machine learning techniques to create a live-action art piece—an approach that helped bring the data to life in a fun and compelling way. Read more.

Data processing at the speed of 100 Gbps using Apache Crail

4:40pm–5:20pm Thursday, 03/28/2019

Session

Data Engineering & Architecture
Location: 2008

Secondary topics: Data Integration and Data Pipelines, Storage, Streaming, realtime analytics, and IoT

Patrick Stuedi (IBM Research)

Average rating:

(4.00, 1 rating)

Modern networking and storage technologies like RDMA or NVMe are finding their way into the data center. Patrick Stuedi offers an overview of Apache Crail (incubating), a new project that facilitates running data processing workloads (ML, SQL, etc.) on such hardware. Patrick explains what Crail does and how it benefits workloads based on TensorFlow or Spark. Read more.

Machine learning for preventive maintenance of mining haul trucks

4:40pm–5:20pm Thursday, 03/28/2019

Session

Data Science, Machine Learning & AI
Location: 2009

Secondary topics: Streaming, realtime analytics, and IoT, Temporal data and time-series analytics, Transportation and Logistics

Alex Gorbachev (Pythian), Paul Spiegelhalter (Pythian)

Average rating:

(4.67, 3 ratings)

Alex Gorbachev and Paul Spiegelhalter use the example of a mining haul truck to explain how to map preventive maintenance needs to supervised machine learning problems, create labeled datasets, do feature engineering from sensors and alerts data, evaluate models—then convert it all to a complete AI solution on Google Cloud Platform that's integrated with existing on-premises systems. Read more.

Efficient multi-armed bandit with Thompson sampling for applications with delayed feedback

4:40pm–5:20pm Thursday, 03/28/2019

Session

Data Science, Machine Learning & AI
Location: 2010

Secondary topics: Media, Marketing, Advertising

Shradha Agrawal (Adobe)

Average rating:

(4.17, 6 ratings)

Decision making often struggles with the exploration-exploitation dilemma. Multi-armed bandits (MAB) are a popular reinforcement learning solution, but increasing the number of decision criteria leads to an exponential blowup in complexity, and observational delays don’t allow for optimal performance. Shradha Agrawal offers an overview of MABs and explains how to overcome the above challenges. Read more.

Machine learning and GDPR

4:40pm–5:20pm Thursday, 03/28/2019

Session

Data Science, Machine Learning & AI, Law and Ethics
Location: 2011

Secondary topics: Security and Privacy

Michael Gregory (Cloudera)

Average rating:

(4.25, 4 ratings)

The General Data Protection Regulation (GDPR) enacted by the European Union restricts the use of machine learning practices in many cases. Michael Gregory offers an overview of the regulations, important considerations for both EU and non-EU organizations, and tools and technologies to ensure that you're appropriately using ML applications to drive continued transformation and insights. Read more.

A human-centered approach to AI and machine learning

4:40pm–5:00pm Thursday, 03/28/2019

Session

Future of the Firm
Location: 2014

Cathryn Posey (Capital One)

Average rating:

(4.33, 3 ratings)

Cathryn Posey explains how Capital One—the only bank fully committed to a cloud-based infrastructure—is approaching machine learning with a responsible, human-centered focus. Join in to hear about Capital One's research in areas like explainable AI, how the bank is leveraging the technology, and ways in which it can be used for good. Read more.

Using deep learning to automatically rank millions of hotel images

4:40pm–5:20pm Thursday, 03/28/2019

Session

Data Science, Machine Learning & AI
Location: 2016

Secondary topics: Deep Learning, Retail and e-commerce

Christopher Lennan (idealo.de)

Average rating:

(4.00, 1 rating)

Idealo.de recently trained convolutional neural networks (CNN) for aesthetic and technical image quality predictions. Christopher Lennan shares the training approach, along with some practical insights, and sheds light on what the trained models actually learned by visualizing the convolutional filter weights and output nodes of the trained models. Read more.

Model governance in the enterprise

4:40pm–5:20pm Thursday, 03/28/2019

Session

Data Engineering & Architecture
Location: 2018

Secondary topics: Model lifecycle management

Harish Doddi (Datatron), Jerry Xu (Datatron Technologies)

Average rating:

(4.00, 1 rating)

Harish Doddi and Jerry Xu share the challenges they faced scaling machine learning models and detail the solutions they're building to conquer them. Read more.

Executive Briefing: How organizations scale along the data and AI maturity curve

4:40pm–5:20pm Thursday, 03/28/2019

Session

Executive Briefing and best practices, Strata Business Summit
Location: 2020

Secondary topics: AI and machine learning in the enterprise

Michael Li (The Data Incubator)

Average rating:

(3.75, 4 ratings)

As their data and AI teams scale from one to thousands of employees and the maturity of their analytics capabilities evolves, companies find that the analytics journey is not always smooth. Drawing on experiences gleaned from dozens of clients, Michael Li discusses organizational growing pains and the best practices that successful executives have adopted to scale and grow their teams. Read more.

New directions in record linkage

4:40pm–5:20pm Thursday, 03/28/2019

Session

Data Engineering & Architecture
Location: 2024

Secondary topics: Automation in data science and big data, Data preparation, data governance, and data lineage

Yves Thibaudeau (US Census Bureau)

Average rating:

(3.33, 3 ratings)

The US Census Bureau has been involved in record linkage projects for over 40 years. In that time, computing capabilities have changed considerably and new techniques have emerged, and the Census Bureau is reviewing an inventory of linkage methodologies. Yves Thibaudeau describes the progress made so far in identifying specific record linkage techniques for specific applications. Read more.

5:00pm

Automating yourself out of a job? The problem with knowledge work

5:00pm–5:20pm Thursday, 03/28/2019

Session

Future of the Firm
Location: 2014

James Cham (Bloomberg Beta)

Average rating:

(4.67, 3 ratings)

Missing amid conversations about corporate strategy and innovation is a mostly untapped source of new ideas and efficiency: the people actually doing the work. James Cham explains why this is a problem and suggests some possible solutions. Read more.

Sponsorship Opportunities

For exhibition and sponsorship opportunities, email strataconf@oreilly.com

Partner Opportunities

For information on trade opportunities with O'Reilly conferences, email partners@oreilly.com

Contact Us

View a complete list of Strata Data Conference contacts

©2019, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • confreg@oreilly.com