Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA
 
LL20 A
11:50am Failed experiments in infrastructure security analytics and lessons learned from fixing them Ram Shankar Siva Kumar (Microsoft (Azure Security Data Science))
1:50pm Magellan: Scalable and fast geospatial analytics Ram Sriharsha (Databricks)
LL20 C
11:00am Explaining machine learning models Evan Kriminger (ZestFinance)
1:50pm Humans versus the machines: Using human-based computation to improve machine learning Veronica Mapes (Pinterest), Garner Chung (Pinterest)
LL20 D
11:00am Data science at Slack Josh Wills (Slack)
11:50am Approaching the pricing problem at Lyft Ashivni Shekhawat (Lyft)
1:50pm The science of patchy data Jennifer Prendki (Figure Eight)
2:40pm Building career advisory tools for the tech sector using machine learning Simon Hughes (Dice.com), Yuri Bykov (Dice.com)
LL21 B
11:00am Using computer vision to combat stolen credit card fraud Karthik Ramasamy (Google), Lenny Evans (Uber)
11:50am Automatic 3D MRI knee damage classification with 3D CNN using BigDL on Spark Jiao(Jennie) Wang (Intel), Valentina Pedoia (UCSF), Berk Norman (UCSF), Yulia Tell (Intel)
LL21 C/D
4:20pm HDFS on Kubernetes: Tech deep dive on locality and security Kimoon Kim (Pepperdata), Ilan Filonenko (Bloomberg LP)
LL21 E/F
11:50am Hive as a service Szehon Ho (Criteo), Pawel Szostek (Criteo)
2:40pm Achieving GDPR compliance and data privacy using blockchain technology Ajay Kumar Mothukuri (Sapient), Vijay Agneeswaran (Walmart Labs)
230 A
11:50am Effectively once, exactly once, and more in Heron Karthik Ramasamy (Streamlio), Sanjeev Kulkarni (Streamlio)
2:40pm Unified and elastic batch and stream processing with Pravega and Apache Flink Fabian Hueske (data Artisans), Flavio Junqueira (Dell EMC)
230 C
11:50am Building a contacts graph from activity data Alexis Roos (Salesforce), Noah Burbank (Salesforce)
1:50pm Playing well together: Big data beyond the JVM with Spark and friends Holden Karau (Independent), Rachel Warren (Salesforce Einstein)
2:40pm Data reflections: Making data fast and easy to use without making copies Tomer Shiran (Dremio), Jacques Nadeau (Dremio)
4:20pm Cuttlefish: Lightweight primitives for online tuning Tomer Kaftan (University of Washington)
210 A/E
2:40pm Executive Briefing: The rise of the ecosystem Anjali Thakur (Accenture)
210 C/G
11:00am Understanding metadata Michael Schrenk (Self-Employed)
210 D/H
11:00am Fighting sex trafficking with data science Ruben van der Dussen (Thorn)
11:50am Architecting an open source enterprise data lake Sagar Kewalramani (Cloudera)
2:40pm Detecting retail fraud with data wrangling and machine learning Matt Derda (Trifacta), Harrison Lynch (Consensus Corporation)
4:20pm Data-driven fuel management at Ryanair Marcin Pilarczyk (Ryanair)
Expo Hall 1
1:50pm The real-time journey from raw streaming data to AI-based analytics Roy Ben Alta (Amazon Web Services), Ira Cohen (Anodot)
2:40pm Building ML and AI pipelines with Spark and TensorFlow Chris Fregly (Amazon Web Services)
212 A-B
11:50am Ask Me Anything: Deep learning-based search and recommendation systems using TensorFlow Vijay Agneeswaran (Walmart Labs), Abhishek Kumar (Publicis Sapient)
1:50pm Ask Me Anything: Managing data science in the enterprise Nick Elprin (Domino Data Lab)
LL20 B
11:50am The state of Postgres Umur Cubukcu (Citus Data)
1:50pm Data-driven ecosystems in the automotive industry Josef Viehhauser (BMW Group), Tobias Burger (BMW Group)
2:40pm When tests cry wolf (sponsored by Pure Storage) Ivan Jibaja (Pure Storage)
LL21 A
230 B
Grand Ballroom 220
8:45am Thursday keynote welcome Ben Lorica (O'Reilly), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
9:10am Differentiating via data science Eric Colson (Stitch Fix)
9:30am Automating decisions with data in the cloud Amr Awadallah (Cloudera), Sangeeth Ponathil (Pizza Hut)
9:40am What separates the clouds? (sponsored by Google Cloud) William Vambenepe (Google)
9:45am Inclusivity for the greater good Ajey Gore (GO-JEK)
10:05am Lessons in Google Search data Seth Stephens-Davidowitz (New York Times)
10:30am Morning break | Room: Hall 1, 2, 3
12:30pm Lunch sponsored by MapR Thursday Topic Tables at Lunch | Room: Hall 1, 2, 3
12:30pm Thursday Business Summit Lunch | Room: San Jose Ballroom, Marriott
3:20pm Afternoon break | Room: Hall 1, 2, 3
8:00am Speed Networking | Room: Concourse foyer
11:00am-11:40am (40m) Data science and machine learning, Data-driven business management
The limits of inference: What data scientists can learn from the reproducibility crisis in science
Clare Gollnick (NS1)
At the heart of the reproducibility crisis in the sciences is the widespread misapplication of statistics. Data science relies on the same statistical methodology as these scientific fields. Can we avoid the same crisis of integrity? Clare Gollnick considers the philosophy of data science and shares a framework that explains (and even predicts) the likelihood of success of a data project.
11:50am-12:30pm (40m) Data science and machine learning, Platform security and cybersecurity Graphs and Time-series
Failed experiments in infrastructure security analytics and lessons learned from fixing them
Ram Shankar Siva Kumar (Microsoft (Azure Security Data Science))
How should you best debug a security data science system: change the ML approach, redefine the security scenario, or start over from scratch? Ram Shankar answers this question by sharing the results of failed experiments and the lessons learned when building ML detections for cloud lateral movement, identifying anomalous executables, and automating the incident response process.
1:50pm-2:30pm (40m) Big data and data science in the cloud, Data science and machine learning
Magellan: Scalable and fast geospatial analytics
Ram Sriharsha (Databricks)
How do you scale geospatial analytics on big data? And while you're at it, can you make it easy to use while achieving state-of-the-art performance on a single node? Ram Sriharsha offers an overview of Magellan—a geospatial optimization engine that seamlessly integrates with Spark—and explains how it provides scalability and performance without sacrificing simplicity.
2:40pm-3:20pm (40m) Data science and machine learning
sparklyr, implyr, and more: dplyr interfaces to large-scale data
Ian Cook (Cloudera)
The popular R package dplyr provides a consistent grammar for data manipulation that can abstract over diverse data sources. Ian Cook shows how you can use dplyr to query large-scale data using different processing engines including Spark and Impala. He demonstrates the R package sparklyr (from RStudio) and the new R package implyr (from Cloudera) and shares tips for making dplyr code portable.
4:20pm-5:00pm (40m) Data science and machine learning
Cataloging the visible universe through Bayesian inference at petascale in Julia
Keno Fischer (Julia Computing)
Julia is rapidly becoming a popular language at the forefront of scientific discovery. Keno Fischer explores one of the most ambitious use cases for Julia: using machine learning to derive a catalog of astronomical objects from multiterabyte astronomical image datasets. This work was a collaboration between MIT, UC Berkeley, LBNL, and Julia Computing.
11:00am-11:40am (40m) Data science and machine learning
Explaining machine learning models
Evan Kriminger (ZestFinance)
What does it mean to explain a machine learning model, and why is it important? Mike Ruberry offers an overview of several modern explainability methods, including traditional feature contributions, LIME, and DeepLift. Each of these techniques presents a different perspective, and their clever application can reveal new insights and solve business requirements.
11:50am-12:30pm (40m) Data science and machine learning
Enough data engineering for a data scientist; or, How I learned to stop worrying and love the data scientists
Stephen O'Sullivan (Data Whisperers)
Stephen O'Sullivan takes you along the data science journey, from onboarding data (using a number of data/object stores) to understanding and choosing the right data format for the data assets to using query engines (and basic query tuning). You'll learn some new skills to help you be more productive and reduce contention with the data engineering team.
1:50pm-2:30pm (40m) Data science and machine learning, Data-driven business management
Humans versus the machines: Using human-based computation to improve machine learning
Veronica Mapes (Pinterest), Garner Chung (Pinterest)
Veronica Mapes and Garner Chung detail the human evaluation platform Pinterest developed to better serve its deep learning and operational teams when its needs grew beyond platforms like Mechanical Turk. Along the way, they cover tricks for increasing data reliability and judgement reproducibility and explain how Pinterest integrated end-user-sourced judgements into its in-house platform.
2:40pm-3:20pm (40m) Data science and machine learning, Law, ethics, and governance
Human in the loop: Bayesian rules enabling explainable AI
Pramit Choudhary (h2o.ai)
Pramit Choudhary explores the usefulness of a generative approach that applies Bayesian inference to generate human-interpretable decision sets in the form of "if...and else" statements. These human-interpretable decision lists with high posterior probabilities might be the right way to balance model interpretability, performance, and computation.
11:00am-11:40am (40m) Data science and machine learning
Data science at Slack
Josh Wills (Slack)
Josh Wills describes recent data science and machine learning projects at Slack.
11:50am-12:30pm (40m) Data science and machine learning
Approaching the pricing problem at Lyft
Ashivni Shekhawat (Lyft)
Ashivni Shekhawat explains how Lyft uses a mix of online learning, optimization, and control theory to operate its ride-sharing marketplace at an efficient price point.
1:50pm-2:30pm (40m) Data science and machine learning, Law, ethics, and governance, Streaming systems and real-time applications
The science of patchy data
Jennifer Prendki (Figure Eight)
Jennifer Prendki explains how to develop machine learning models even if the data is protected by privacy and compliance laws and cannot be used without anonymizing, covering techniques ranging from contextual bandits to document vector representation.
2:40pm-3:20pm (40m) Data science and machine learning
Building career advisory tools for the tech sector using machine learning
Simon Hughes (Dice.com), Yuri Bykov (Dice.com)
Dice.com recently released several free career advisory tools for technology professionals, including a salary predictor, a tool that recommends the next skills to learn, and a career path explorer. Simon Hughes and Yuri Bykov offer an overview of the machine learning algorithms behind these tools and the technologies used to build, deploy, and monitor these solutions in production.
4:20pm-5:00pm (40m) Big data and data science in the cloud, Data science and machine learning
Not your parents' machine learning: How to ship an XGBoost churn prediction app in under four weeks
Goodman Gu (Cogito)
Machine learning is a pivotal technology. However, bringing an ML application to life often requires overcoming bottlenecks not just in the model code but in operationalizing the end-to-end system itself. Goodman Gu shares a case study from a leading SaaS company that quickly and easily built, trained, optimized, and deployed an XGBoost churn prediction ML app at scale with Amazon SageMaker.
11:00am-11:40am (40m) Data science and machine learning
Using computer vision to combat stolen credit card fraud
Karthik Ramasamy (Google), Lenny Evans (Uber)
Stolen credit cards are a major problem faced by many companies, including Uber. Karthik Ramasamy and Lenny Evans detail a new weapon against stolen credit cards that uses computer vision to scan credit cards, verifying possession of the physical card with basic fake card detection capabilities.
11:50am-12:30pm (40m) Big data and data science in the cloud, Data science and machine learning
Automatic 3D MRI knee damage classification with 3D CNN using BigDL on Spark
Jiao(Jennie) Wang (Intel), Valentina Pedoia (UCSF), Berk Norman (UCSF), Yulia Tell (Intel)
Damage to the meniscus is a physically limiting injury that can lead to further medical complications. Automatically classifying this damage at the time of an MRI scan would allow quicker and more accurate diagnosis. Jennie Wang, Valentina Pedoia, Berk Norman, and Yulia Tell offer an overview of their classification system built with 3D convolutional neural networks using BigDL on Apache Spark.
2:40pm-3:20pm (40m) Data science and machine learning
Word embeddings under the hood: How neural networks learn from language
Patrick Harrison (S&P Global)
Word vector embeddings are everywhere, but relatively few understand how they produce their remarkable results. Patrick Harrison opens up the black box of a popular word embedding algorithm and walks you through how it works its magic. Patrick also covers core neural network concepts, including hidden layers, loss gradients, backpropagation, and more.
4:20pm-5:00pm (40m) Data science and machine learning
Using deep learning to solve challenging problems
Jeff Dean (Google)
The Google Brain team conducts research on difficult problems in artificial intelligence and builds large-scale computer systems for machine learning research, both of which have been applied to dozens of Google products. Jeff Dean highlights some of Google Brain's projects with an eye toward how they can be used to solve challenging problems.
11:00am-11:40am (40m) Data engineering and architecture
Operationalize deep learning: How to deploy and consume your LSTM networks for predictive maintenance scenarios
Francesca Lazzeri (Microsoft), Fidan Boylu Uz (Microsoft)
Francesca Lazzeri and Fidan Boylu Uz explain how to operationalize LSTM networks to predict the remaining useful life of aircraft engines. They use simulated aircraft sensor values to predict when an aircraft engine will fail in the future so that maintenance can be planned in advance.
11:50am-12:30pm (40m) Big data and data science in the cloud, Data engineering and architecture
Distributed deep learning with containers on heterogeneous GPU clusters
Dong Meng (MapR)
Deep learning model performance relies on underlying data. Dong Meng offers an overview of a converged data platform that serves as a data infrastructure, providing a distributed filesystem, key-value storage, and streams, with Kubernetes as an orchestration layer that manages the containers used to train and deploy deep learning models on GPU clusters.
1:50pm-2:30pm (40m) Data engineering and architecture, Streaming systems and real-time applications
Machine-learned model quality monitoring in fast data and streaming applications
Emre Velipasaoglu (Lightbend)
Most machine learning algorithms are designed to work on stationary data, but real-life streaming data is rarely stationary. Models lose prediction accuracy over time if they are not retrained. Without model quality monitoring, retraining decisions are suboptimal and costly. Emre Velipasaoglu evaluates monitoring methods for applicability in modern fast data and streaming applications.
2:40pm-3:20pm (40m) Big data and data science in the cloud, Data engineering and architecture
Continuous delivery for NLP on Kubernetes: Lessons learned
Michelle Casbon (Google)
Michelle Casbon explains how to speed up the development of ML models by using open source tools such as Kubernetes, Docker, Scala, Apache Spark, and Weave Flux, detailing how to build resilient systems so that you can spend more of your time on product improvement rather than triage and uptime.
4:20pm-5:00pm (40m) Data engineering and architecture, Data science and machine learning, Streaming systems and real-time applications
HDFS on Kubernetes: Tech deep dive on locality and security
Kimoon Kim (Pepperdata), Ilan Filonenko (Bloomberg LP)
There is growing interest in running Spark natively on Kubernetes, and Spark data is often stored in HDFS. Kimoon Kim and Ilan Filonenko explain how to make Spark on Kubernetes work seamlessly with HDFS by addressing challenges such as HDFS data locality and secure HDFS support.
11:00am-11:40am (40m) Big data and data science in the cloud, Data engineering and architecture
Analytics in the cloud: Building a modern cloud-based big data warehouse
Greg Rahn (Cloudera)
For many organizations, the cloud will likely be the destination of their next big data warehouse. Greg Rahn shares considerations when evaluating the cloud for analytics and big data warehousing in order to help you get the most from the cloud. You'll leave with an understanding of different architectural approaches and impacts for moving analytic workloads to the cloud.
11:50am-12:30pm (40m) Big data and data science in the cloud, Data engineering and architecture, Media, entertainment, and advertising
Hive as a service
Szehon Ho (Criteo), Pawel Szostek (Criteo)
Hive is the main data transformation tool at Criteo, and hundreds of analysts and thousands of automated jobs run Hive queries every day. Szehon Ho and Pawel Szostek discuss the evolution of Criteo's Hive platform from an error-prone add-on installed on some spare machines to a best-in-class installation capable of self-healing and automatically scaling to handle its growing load.
1:50pm-2:30pm (40m) Data engineering and architecture
Crafting data products for the augmented writing experience
Chris Harland (Textio)
The number of resources explaining how to build a machine learning model from data greatly overshadows information on how to make real data products from such models, creating a gap between what machine learning engineers and data scientists know is possible and what users experience. Using examples from Textio's augmented writing platform, Chris Harland walks you through building a data product.
2:40pm-3:20pm (40m) Data engineering and architecture, Data-driven business management, Law, ethics, and governance
Achieving GDPR compliance and data privacy using blockchain technology
Ajay Kumar Mothukuri (Sapient), Vijay Agneeswaran (Walmart Labs)
Ajay Mothukuri and Vijay Srinivas Agneeswaran explain how to use open source blockchain technologies such as Hyperledger to implement the European Union's General Data Protection Regulation (GDPR).
4:20pm-5:00pm (40m) Big data and data science in the cloud, Data engineering and architecture
Lyft's analytics pipeline: From Redshift to Apache Hive and Presto
Shenghu Yang (Lyft)
Lyft’s business has grown over 100x in the past four years. Shenghu Yang explains how Lyft’s data pipeline has evolved over the years to serve its ever-growing analytics use cases, migrating from the world's largest AWS Redshift clusters to Apache Hive and Presto to overcome hard limits on scalability and concurrency.
11:00am-11:40am (40m) Data engineering and architecture, Streaming systems and real-time applications
Foundations of streaming SQL; or, How I learned to love stream and table theory
Tyler Akidau (Google)
What does it mean to execute streaming queries in SQL? What is the relationship of streaming queries to classic relational queries? Are streams and tables the same thing? And how does all of this relate to the programmatic frameworks we’re all familiar with? Tyler Akidau answers these questions and more as he walks you through key concepts underpinning data processing in general.
11:50am-12:30pm (40m) Data engineering and architecture, Streaming systems and real-time applications
Effectively once, exactly once, and more in Heron
Karthik Ramasamy (Streamlio), Sanjeev Kulkarni (Streamlio)
Stream processing systems must support a number of different types of processing semantics due to the diverse nature of streaming applications. Karthik Ramasamy and Sanjeev Kulkarni explore effectively once, exactly once, and other types of stateful processing techniques, explain how they are implemented in Heron, and demonstrate how your applications will benefit from using them.
1:50pm-2:30pm (40m) Data engineering and architecture Graphs and Time-series
TimescaleDB: Reengineering PostgreSQL as a time series database
Michael Freedman (TimescaleDB)
Michael Freedman offers an overview of TimescaleDB, a new open source scale-out database designed for time series workloads and engineered as a plugin to Postgres. Unlike most time series newcomers, TimescaleDB supports full SQL while achieving fast ingest and complex queries.
2:40pm-3:20pm (40m) Data engineering and architecture, Streaming systems and real-time applications Graphs and Time-series
Unified and elastic batch and stream processing with Pravega and Apache Flink
Fabian Hueske (data Artisans), Flavio Junqueira (Dell EMC)
Flavio Junqueira and Fabian Hueske detail an open source streaming data stack consisting of Pravega (stream storage) and Apache Flink (computation on streams). The stack handles “everything as a stream”: it provides unbounded streaming storage and a unified batch and streaming abstraction, and it dynamically accommodates workload variations.
4:20pm-5:00pm (40m) Data engineering and architecture, Streaming systems and real-time applications
Effectively once in Apache Pulsar, the next-generation messaging system
Matteo Merli (Streamlio)
Traditionally, messaging systems have offered at-least-once delivery semantics, leaving the task of implementing idempotent processing to the application developers. Matteo Merli explains how to add effectively once semantics to Apache Pulsar using a message deduplication layer that can ensure those stricter semantics with guaranteed accuracy and no performance penalty.
11:00am-11:40am (40m) Data engineering and architecture
The secret sauce behind LinkedIn's self-managing Kafka clusters
Jiangjie Qin (LinkedIn)
LinkedIn runs more than 1,800 Kafka brokers that deliver more than two trillion messages a day. Running Kafka at such a scale makes automated operations a necessity. Jiangjie Qin shares lessons learned from operating Kafka at scale with minimum human intervention.
11:50am-12:30pm (40m) Data engineering and architecture Graphs and Time-series
Building a contacts graph from activity data
Alexis Roos (Salesforce), Noah Burbank (Salesforce)
In the customer age, being able to extract relevant communications information in real time and cross-reference it with context is key. Alexis Roos and Noah Burbank explain how Salesforce uses data science and engineering to enable salespeople to monitor their emails in real time and surface insights and recommendations using a graph that models contextual data.
1:50pm-2:30pm (40m) Data engineering and architecture, Streaming systems and real-time applications
Playing well together: Big data beyond the JVM with Spark and friends
Holden Karau (Independent), Rachel Warren (Salesforce Einstein)
Holden Karau and Rachel Warren explore the state of the current big data ecosystem and explain how to best work with it in non-JVM languages. While much of the focus will be on Python + Spark, the talk will also include interesting anecdotes about how these lessons apply to other systems (including Kafka).
2:40pm-3:20pm (40m) Big data and data science in the cloud, Data engineering and architecture
Data reflections: Making data fast and easy to use without making copies
Tomer Shiran (Dremio), Jacques Nadeau (Dremio)
Most organizations manage 5 to 15 copies of their data in multiple systems and formats to support different analytical use cases. Tomer Shiran and Jacques Nadeau introduce a new approach called data reflections, which dramatically reduces the need for data copies, demonstrate an open source implementation built with Apache Calcite, and explore two production case studies.
4:20pm-5:00pm (40m) Big data and data science in the cloud, Data engineering and architecture
Cuttlefish: Lightweight primitives for online tuning
Tomer Kaftan (University of Washington)
Tomer Kaftan offers an overview of Cuttlefish, a lightweight framework prototyped in Apache Spark that helps developers adaptively improve the performance of their data processing applications by inserting a few library calls into their code. These calls construct tuning primitives that use reinforcement learning to adaptively modify execution as they observe application performance over time.
11:00am-11:40am (40m) Strata Business Summit
Executive Briefing: Machine learning—Why you need it, why it's hard, and what to do about it
Mike Olson (Cloudera)
Mike Olson shares examples of real-world machine learning applications, explores a variety of challenges in putting these capabilities into production (including the speed with which technology is moving, cloud versus in-data-center consumption, security and regulatory compliance, and skills and agility in getting data and answers into the right hands), and outlines proven ways to meet them.
11:50am-12:30pm (40m) Data-driven business management, Strata Business Summit
Executive Briefing: Why machine-learned models crash and burn in production and what to do about it
David Talby (Pacific AI)
Machine learning and data science systems often fail in production in unexpected ways. David Talby shares real-world case studies showing why this happens and explains what you can do about it, covering best practices and lessons learned from a decade of experience building and operating such systems at Fortune 500 companies across several industries.
1:50pm-2:30pm (40m) Law, ethics, and governance, Strata Business Summit
Executive Briefing: GDPR—Getting your data ready for heavy, new EU privacy regulations
Mark Donsky (Okera), Steven Ross (Cloudera)
In May 2018, the General Data Protection Regulation (GDPR) goes into effect for firms doing business in the EU, but many companies aren't prepared for the strict regulation or fines for noncompliance (up to €20 million or 4% of global annual revenue). Mark Donsky and Steven Ross outline the capabilities your data environment needs to simplify compliance with GDPR and future regulations.
2:40pm-3:20pm (40m) Data-driven business management, Strata Business Summit
Executive Briefing: The rise of the ecosystem
Anjali Thakur (Accenture)
Whether you are a technology or a services provider, understanding your value in the ecosystem and focusing on the right partners to reach your market goals is critical. Anjali Thakur shares examples of teaming models and leading practices for accelerating value from your ecosystem strategy.
4:20pm-5:00pm (40m) Data-driven business management, Strata Business Summit
Executive Briefing: What does an exec need to know about architecture and why
Jesse Anderson (Big Data Institute)
There's been an explosion of new architectures, but is this because engineers love new things or is there a good business reason for these changes? Jesse Anderson explores new architectures and the actual business problems they solve. You may find out that your team would be far more productive if you moved to these architectures.
11:00am-11:40am (40m) Big data and data science in the cloud, Data-driven business management, Strata Business Summit Graphs and Time-series
Understanding metadata
Michael Schrenk (Self-Employed)
Big data becomes much more powerful when it has context. Fortunately, creative data scientists can create needed context through the use of metadata. Michael Schrenk explains how metadata is created and used to gain competitive advantages, predict troop strength, or even guess Social Security numbers.
11:50am-12:30pm (40m) Data-driven business management, Strata Business Summit
Human in the loop: A design pattern for managing teams working with machine learning
Paco Nathan (derwen.ai)
Human in the loop (HITL) has emerged as a key design pattern for managing teams where people and machines collaborate. Such systems are mostly automated, with exceptions referred to human experts, who help train the machines further. Paco Nathan offers an overview of HITL from the perspective of a business manager, focusing on use cases within O'Reilly Media.
1:50pm-2:30pm (40m) Strata Business Summit
Workplace culture in the age of algorithmic management: The information networks Uber drivers built
Ar Ro (Data & Society Research Institute)
Ride-hail drivers work alone, but they’re banding together online to compare notes, uncover new policies, and help each other navigate a workplace characterized by information scarcity. Alex Rosenblat explores how ride-hail workers are using online forums to create their own workplace culture as employment relationships grow more remote and algorithms replace human managers.
2:40pm-3:20pm (40m) Data-driven business management, Strata Business Summit
Trapped by the present: Estimating long-term impact from A/B experiments
Brian Karfunkel (Pinterest)
When software companies use A/B tests to evaluate product changes and fail to accurately estimate the long-term impact of such experiments, they risk optimizing for the users they have at the expense of the users they want to have. Brian Karfunkel explains how to estimate an experiment’s impact over time, thus mitigating this risk and giving full credit to experiments targeted at noncore users.
4:20pm-5:00pm (40m) Big data and data science in the cloud, Data engineering and architecture, Data science and machine learning, Data-driven business management, Strata Business Summit, Streaming systems and real-time applications, Visualization and user experience
Big data insights equal big money: Stories from the trenches at GoDaddy
Felix Gorodishter (GoDaddy)
GoDaddy ingests and analyzes over 100,000 data points per second. Felix Gorodishter discusses the company's big data journey from ingest to automation, how it is evolving its systems to scale to over 10 TB of new data per day, and how it uses tools like anomaly detection to produce valuable insights, such as the worth of a reminder email.
11:00am-11:40am (40m) Law, ethics, and governance, Strata Business Summit
Fighting sex trafficking with data science
Ruben van der Dussen (Thorn)
Sugreev Chawla offers an overview of Spotlight, a tool created by Thorn, a nonprofit that uses technology to fight online child sexual exploitation. It allows law enforcement to process millions of escort ads per month in an effort to fight sex trafficking, using graph analysis, time series analysis, and NLP techniques to surface important networks of ads and characterize their behavior over time.
11:50am-12:30pm (40m) Data-driven business management, Strata Business Summit
Architecting an open source enterprise data lake
Sagar Kewalramani (Cloudera)
With so many business intelligence tools in the Hadoop ecosystem and no common measure to identify the efficiency of each tool, where do you begin to build or modify your enterprise data lake strategy? Sagar Kewalramani shares real-world BI problems and how they were resolved with Hadoop tools and demonstrates how to build an effective data lake strategy with open source tools and components.
1:50pm-2:30pm (40m) Data-driven business management, Strata Business Summit
Lessons on driving data science and analytics transformation
Chris Chapo (Gap Inc.)
Chris Chapo walks you through real-world examples of companies that are driving transformational change by leveraging data science and analytics, paying particular attention to established organizations where these capabilities are newer concepts.
2:40pm-3:20pm (40m) Data-driven business management, Strata Business Summit
Detecting retail fraud with data wrangling and machine learning
Matt Derda (Trifacta), Harrison Lynch (Consensus Corporation)
Matt Derda and Harrison Lynch explain how Consensus leverages the combined power of data wrangling and machine learning to more efficiently identify and reduce retail fraud and how adopting data wrangling technology has helped Trifacta reduce time spent data wrangling from six weeks to one week.
4:20pm-5:00pm (40m) Data-driven business management, Strata Business Summit
Data-driven fuel management at Ryanair
Marcin Pilarczyk (Ryanair)
Managing fuel at a company flying 120 million passengers yearly is not a trivial task. Marcin Pilarczyk explores the main aspects of fuel management of a modern airline and offers an overview of machine learning methods supporting long-term planning and daily decisions.
11:00am-11:40am (40m) Data engineering and architecture, Streaming systems and real-time applications Expo Hall
Kafka streaming applications with Akka Streams and Kafka Streams
Dean Wampler (Anyscale)
Dean Wampler compares and contrasts data processing with Akka Streams and Kafka Streams, microservice streaming applications based on Kafka. Dean discusses the strengths and weaknesses of each tool for particular design needs and contrasts them with Spark Streaming and Flink, so you'll know when to choose them instead.
11:50am-12:30pm (40m)
Session
1:50pm-2:30pm (40m) Data engineering and architecture, Data science and machine learning, Streaming systems and real-time applications Expo Hall, Graphs and Time-series
The real-time journey from raw streaming data to AI-based analytics
Roy Ben Alta (Amazon Web Services), Ira Cohen (Anodot)
Many domains, such as mobile, web, the IoT, ecommerce, and more, have turned to analyzing streaming data. However, this presents challenges both in transforming the raw data into metrics and in automatically analyzing those metrics to produce insights. Roy Ben-Alta and Ira Cohen share a solution implemented using Amazon Kinesis as the real-time pipeline feeding Anodot's anomaly detection solution.
2:40pm-3:20pm (40m) Big data and data science in the cloud, Data engineering and architecture, Data science and machine learning, Streaming systems and real-time applications Expo Hall
Building ML and AI pipelines with Spark and TensorFlow
Chris Fregly (Amazon Web Services)
Chris Fregly demonstrates how to extend existing Spark-based data pipelines to include TensorFlow model training and deployment and offers an overview of TensorFlow’s TFRecord format, including libraries for converting to and from other popular file formats such as Parquet, CSV, JSON, and Avro stored in HDFS and S3.
11:00am-11:40am (40m) Ask Me Anything
Ask Me Anything: Big data and machine learning techniques to drive and grow business
Burcu Baran (LinkedIn), Wei Di (LinkedIn)
Join Burcu Baran and Wei Di to discuss big data in business analytics, machine learning in business analytics, and achieving actionable insights from big data.
11:50am-12:30pm (40m) Ask Me Anything
Ask Me Anything: Deep learning-based search and recommendation systems using TensorFlow
Vijay Agneeswaran (Walmart Labs), Abhishek Kumar (Publicis Sapient)
Join Vijay Srinivas Agneeswaran and Abhishek Kumar to discuss recommender systems—particularly deep learning-based recommender systems in TensorFlow—or ask any other questions you have about deep learning.
1:50pm-2:30pm (40m) Ask Me Anything
Ask Me Anything: Managing data science in the enterprise
Nick Elprin (Domino Data Lab)
Join Nick Elprin to discuss the challenges associated with evolving from random acts of data science to data science as a core competency; common pitfalls and best practices for implementing process, hiring people, and deploying diverse technology; designing and running data science organizations; and more.
2:40pm-3:20pm (40m) Ask Me Anything
Ask Me Anything: Streaming architectures and applications (Kafka, Spark, Akka, and microservices)
Dean Wampler (Anyscale), Boris Lublinsky (Lightbend)
Join Dean Wampler and Boris Lublinsky to discuss all things streaming, from architecture and implementation to streaming engines and frameworks. Be sure to bring your questions about techniques for serving machine learning models in production, traditional big data systems, or software architecture in general.
11:00am-11:40am (40m) Data science and machine learning Graphs and Time-series
Graph analysis of 200,000 tweets from Russian Twitter trolls
Ryan Boyd (Neo4j)
Ryan Boyd explains how he and his team reconstructed a subset of the Twitter network of Russian troll accounts and applied graph analytics to the data using the Neo4j graph database to uncover how these accounts were spreading fake news.
11:50am-12:30pm (40m) Data engineering and architecture
The state of Postgres
Umur Cubukcu (Citus Data)
PostgreSQL is often regarded as the world’s most advanced open source database—and it’s on fire. Umur Cubukcu moves beyond the typical list of features in the next release to explore why so many new projects “just use Postgres” as their system of record (or system of engagement) at scale. Along the way, you'll learn how PostgreSQL's extension APIs are fueling innovations in relational databases.
1:50pm-2:30pm (40m) Data engineering and architecture
Data-driven ecosystems in the automotive industry
Josef Viehhauser (BMW Group), Tobias Burger (BMW Group)
The BMW Group IT team drives the usage of data-driven technologies and forms the nucleus of a data-centric culture inside of the organization. Josef Viehhauser and Tobias Bürger discuss the E-to-E relationship of data and models and share best practices for scaling applications in real-world environments.
2:40pm-3:20pm (40m) Sponsored
When tests cry wolf (sponsored by Pure Storage)
Ivan Jibaja (Pure Storage)
Pure Storage redefined QA testing. Using open source technologies like Spark and Kafka, the company deployed a streaming big data analytics pipeline that processes over 70 billion events per day to prioritize, classify, deduplicate, and understand test failures. Ivan Jibaja discusses use cases for big data analytics technologies, the underlying elastic infrastructure, and lessons learned.
11:00am-11:40am (40m) Sponsored
The changing role of the CDO: Three keys for success (sponsored by MapR)
Jim Scott (NVIDIA)
The value of data is not strictly a function of its size but rather of what can be extracted from it. Jim Scott explains how to identify the right data to leverage to monitor the pulse of fast-changing business environments, the best way to integrate analytics into your business processes, and the importance of cross-application data flows.
11:50am-12:30pm (40m) Data engineering and architecture
20 Netflix-style principles and practices to get the most out of your data platform
Kurt Brown (Netflix)
Kurt Brown explains how to get the most out of your data infrastructure with 20 principles and practices used at Netflix. Kurt covers each in detail and explores how they relate to the technologies used at Netflix, including S3, Spark, Presto, Druid, R, Python, and Jupyter.
1:50pm-2:30pm (40m) Sponsored
Harnessing the cloud to enable connected systems and self-service and accelerate business growth (sponsored by Talend)
Jeff Smits (RingCentral)
Jeff Smits explains how RingCentral is utilizing the cloud, data integration, self-service, and APIs to harvest the immense potential of connected systems.
2:40pm-3:20pm (40m) Sponsored
Get a farm-to-table view of your data: Track data lineage from source to analytics (sponsored by Syncsort)
Tendu Yogurtcu (Syncsort)
Chefs must be able to trust the authenticity, quality, and origin of their ingredients; data analysts must be able to do the same of their data—and what happens to it along the way. Tendü Yoğurtçu explains how to seamlessly track the lineage and quality of your data—on and off the cluster, on-premises or in the cloud—to deliver meaningful insights and meet regulatory compliance requirements.
11:00am-11:40am (40m) Sponsored
Building the bridge from big data to machine learning and artificial intelligence (sponsored by Google Cloud)
Ryan Lippert (Google Cloud)
If your company isn't good at analytics, it's not ready for AI. Ryan Lippert explains how the right data strategy can set you up for success in machine learning and artificial intelligence—the new ground for gaining competitive edge and creating business value.
2:40pm-3:20pm (40m) Sponsored
On-device deep learning: Trends, technologies, and challenges (sponsored by TalkingData)
Andreas Pfadler (TalkingData)
Andreas Pfadler offers an overview of current technological trends for on-device deep learning and edge computing. Along the way, Andreas explores major players and platforms and computational challenges and solutions. Andreas concludes with a discussion of TalkingData's vision for the future of mobile deep learning.
8:45am-8:50am (5m)
Thursday keynote welcome
Ben Lorica (O'Reilly), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.
8:50am-9:00am (10m)
Sprouted clams and stanky bean: When machine learning makes mistakes
Janelle Shane (aiweirdness.com)
At AIweirdness.com, Janelle Shane posts the results of neural network experiments gone delightfully wrong. But machine learning mistakes can also be very embarrassing or even dangerous. Using silly datasets as examples, Janelle talks about some ways that algorithms fail.
9:00am-9:10am (10m) Sponsored keynote
The case for a deliberate data strategy in today’s attention-deficit economy (sponsored by MapR)
Anoop Dawar (MapR Technologies)
We are inundated with ideas and technology news in today’s data-rich but attention-deficit economy. In this environment, competitive advantage comes not from what is abundant (i.e., data) but from what is scarce—the ability to deploy insights in real time. Anoop Dawar explains how your peers are succeeding in shrinking the insight-to-action cycle and achieving great results.
9:10am-9:30am (20m) Data-driven business management, Media, entertainment, and advertising
Differentiating via data science
Eric Colson (Stitch Fix)
While companies often use data science as a supportive function, the emergence of new business models has made it possible for some companies to differentiate via data science. Eric Colson explores what it means to differentiate by data science and explains why companies must now think very differently about the role and placement of data science in the organization.
9:30am-9:40am (10m)
Automating decisions with data in the cloud
Amr Awadallah (Cloudera), Sangeeth Ponathil (Pizza Hut)
Amr Awadallah explains why the cloud requires a different approach to machine learning and analytics and what you can do about it.
9:40am-9:45am (5m) Sponsored keynote
What separates the clouds? (sponsored by Google Cloud)
William Vambenepe (Google)
William Vambenepe explains how a pivot toward machine learning and artificial intelligence has created clearer separation among clouds than ever before. William walks you through an interesting use case of machine learning in action and discusses the central role AI will play in big data analysis moving forward.
9:45am-10:05am (20m)
Inclusivity for the greater good
Ajey Gore (GO-JEK)
Ajey Gore details GO-JEK's evolution from a small bike-hailing startup to a technology-focused unicorn in the areas of transportation, lifestyle, payments, and social enterprise and explains how the company is focusing its attention beyond urban Indonesia to impact more than a million people across the country's rural areas.
10:05am-10:25am (20m)
Lessons in Google Search data
Seth Stephens-Davidowitz (New York Times)
Seth Stephens-Davidowitz explains how to use Google searches to uncover behaviors or attitudes that may be hidden from traditional surveys, such as racism, sexuality, child abuse, and abortion.
10:30am-11:00am (30m)
Break: Morning break
12:30pm-1:50pm (1h 20m)
Thursday Topic Tables at Lunch
Topic Table discussions are a great way to informally network with people in similar industries or interested in the same topics.
12:30pm-1:50pm (1h 20m)
Thursday Business Summit Lunch
Join Strata Business Summit speakers and attendees for a networking lunch on Thursday.
3:20pm-4:20pm (1h)
Break: Afternoon break
8:00am-8:30am (30m)
Speed Networking
Gather before keynotes on Thursday morning for a speed networking event. Enjoy casual conversation while meeting fellow attendees. If one of your goals at Strata is to meet new people, this session will jump-start your networking.