Sep 23–26, 2019
 
3B - Expo Hall
11:20am Unified Tooling for Machine Learning Interpretability Harsha Nori (Microsoft), Samuel Jenkins (Microsoft), Rich Caruana (Microsoft)
1:15pm Feature engineering with Spark NLP to accelerate clinical trial recruitment Saif Addin Ellafi (John Snow Labs), Scott Hoch (Deep6.ai)
2:55pm Towards More Fine-Grained Sentiment and Emotion Analysis of Text Gerard de Melo (Rutgers University)
4:35pm Search Logs + Machine Learning = Auto-Tagged Inventory John Berryman (Eventbrite)
5:25pm Alexa, Do Men Talk Too Much? Sireesha Muppala (Amazon Web Services), Shelbee Eigenbrode (Amazon Web Services), Emily Webber (Amazon Web Services)
1A 06/07
11:20am Time Series Forecasting using Deep Learning with PyTorch Ying Yau (AllianceBernstein)
1:15pm Improving OCR Quality of Documents using Generative Adversarial Networks Nagendra Shishodia (EXL), Chaithanya Manda (EXL Service), Solmaz Torabi (EXL Service)
2:05pm Real time Anomaly detection on observability data using neural networks Keshav Peswani (Expedia Group), Ashish Aggarwal (Expedia Group)
2:55pm Introducing a new anomaly detection algorithm (SR-CNN) inspired by Computer Vision Tony Xing (Microsoft), Bixiong Xu (Microsoft), Congrui Huang (Microsoft), Qun Ying (Microsoft)
4:35pm DEEP LEARNING ON MOBILE Siddha Ganju (Nvidia), Meher Kasam (Square)
1A 08/10
11:20am Data Science vs Engineering: Does it really have to be this way? Ann Spencer (Domino Data Lab), Paco Nathan (Derwen, Inc.), Amy Heineike (Primer), Pete Warden (TensorFlow)
1:15pm Machine Learning and Large Scale Data Analysis On Centralized Platform James Tang (WalmartLabs), Yiyi Zeng (WalmartLabs), Linhong Kang (WalmartLabs)
2:05pm We run, we improve, we scale - XGBoost story in Uber Nan Zhu (Uber), Felix Cheung (Uber)
2:55pm Building a Machine Learning Framework to Measure TV Advertising Attribution Fei Wang (CarGurus), Michael Brautbar (CarGurus)
4:35pm From Whiteboard to Production: a Demand Forecasting System for an Online Grocery Shop Robert Pesch (inovex GmbH), Robin Senge (inovex GmbH)
5:25pm Data Science and the Business of Major League Baseball Aaron Owen (Major League Baseball), Matt Horton (Major League Baseball), Josh Hamilton (MLB)
1A 12/14
11:20am Practical Feature Engineering Ted Dunning (MapR)
1:15pm Learning with Limited Labeled Data Shioulin Sam (Cloudera Fast Forward Labs)
2:05pm Fair, privacy preserving, and secure ML Mikio Braun (Zalando SE)
2:55pm How Machine Learning Meets Optimization Jari Koister (FICO)
1A 15/16
2:05pm Stream Processing beyond Streaming Data Stephan Ewen (Ververica), Aljoscha Krettek (data Artisans)
2:55pm How China Telecom combat financial frauds over 50M transactions a day using Apache Pulsar Weisheng Xie (China Telecom BestPay Co., Ltd), Sijie Guo (ASF)
4:35pm Trill: The Crown Jewel of Microsoft’s Streaming Pipeline Explained James Terwilliger (Microsoft Corporation), Badrish Chandramouli (Microsoft Research), Jonathan Goldstein (Microsoft Research)
5:25pm Fast Data with the KISSS stack Bas Geerdink (ING)
1A 21/22
11:20am Data platform architecture principles Julien Le Dem (WeWork)
1:15pm Productive Data Science Platform - Beyond Hosted notebooks solution at LinkedIn Swasti Kakker (LinkedIn), Manu Ram Pandit (LinkedIn), Vidya Ravivarma (LinkedIn)
2:05pm From raw data to informed intelligence: democratizing data science and ML at Uber Atul Gupte (Uber Technologies Inc.), Nikhil Joshi (Uber)
4:35pm Downscaling: The Achilles heel of autoscaling Spark Clusters Prakhar Jain (Qubole), Sourabh Goyal (Qubole)
5:25pm Improving Spark by taking advantage of disaggregated architecture Chenzhao Guo (Intel Asia-Pacific Research & Development Ltd.), Carson Wang (Intel)
1A 23/24
2:05pm The Evolution of Metadata: LinkedIn’s story Shirshanka Das (LinkedIn), Mars Lan (LinkedIn)
2:55pm Turning Big Data into Knowledge: Managing metadata and data-relationships at Uber scale Kaan Onuk (Uber), Luyao Li (Uber), Atul Gupte (Uber)
4:35pm The case for a common Metadata Layer for Machine Learning Platforms Max Neunhöffer (ArangoDB), Joerg Schad (Suki)
5:25pm Finding your needle in a Haystack Naghman Waheed (Bayer Crop Science), John Cooper (Bayer)
1E 07/08
11:20am Kubernetes for Stateful MPP systems Paige Roberts (Vertica), Deepak Majeti (Vertica)
2:55pm Time-travel for Data Pipelines: Solving the mystery of what changed? Shradha Ambekar (Intuit), Sunil Goplani (Intuit), Sandeep Uttamchandani (Intuit)
4:35pm Apache Hadoop 3.x State of The Union and Upgrade Guidance Wangda Tan (Cloudera), Jitendra Pandey (Hortonworks)
5:25pm
1E 09
11:20am Data Security & Privacy Anti-Patterns Steven Touw (Immuta)
2:05pm Securing your cloud data lake with a "defense in depth" approach Tomer Shiran (Dremio), Jacques Nadeau (Dremio)
5:25pm Secured Computation – Analyzing Sensitive Data using Homomorphic Encryption Matt Carothers (Cox Communications), Jignesh Patel (Cox Communications)
1E 10/11
1:15pm Executive Briefing: Top 10 Big Data Blunders Michael Stonebraker (Tamr, Inc.)
2:05pm
2:55pm Executive Briefing: Understanding the Cult of Prediction Farrah Bostic (The Difference Engine)
4:35pm Executive Briefing: Data Catalogs - Concepts, Capabilities and Key Platforms Andrew Brust (ZDNet | Blue Badge Insights)
1E 12/13
1:15pm Turning petabytes of data from millions of vehicles into open data with Geotab Felipe Hoffa (Google), Bob Bradley (Geotab)
2:05pm
2:55pm Enabling 5G use cases through Location Intelligence Tim McKenzie (Pitney Bowes)
1E 14
11:20am
1:15pm War Stories from the Front Lines of ML Andrew Burt (Immuta), Brenda Leong (Future of Privacy Forum)
2:05pm Regulations and the Future of Data Andrew Burt (Immuta), Brenda Leong (Future of Privacy Forum)
2:55pm Are Your Privacy Practices Auditor-Approved? Mark Hinely (KirkpatrickPrice)
3E
8:45am Wednesday keynotes Ben Lorica (O'Reilly Media), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
10:50am Morning break sponsored by Intel | Room: Expo Hall - 3B
12:00pm Lunch sponsored by Google Cloud | Room: Expo Hall - 3B
12:00pm Wednesday Topic Tables at Lunch | Room: Expo Hall - 3B
3:35pm Afternoon break sponsored by MemSQL | Room: Expo Hall - 3B
6:05pm Booth Crawl | Room: Expo Hall - 3B
8:15am Speed Networking | Room: Keynote Foyer
12:00pm Wednesday Business Summit Lunch | Room: Expo Hall - 3D
7:30pm Data After Dark | Room: 54 West 21st Street, New York, NY 10010
11:20am-12:00pm (40m) Data Science, Machine Learning, & AI Ethics
Unified Tooling for Machine Learning Interpretability
Harsha Nori (Microsoft), Samuel Jenkins (Microsoft), Rich Caruana (Microsoft)
Understanding decisions made by machine learning systems is critical for sensitive uses, ensuring fairness, and debugging production models. Interpretability is a maturing field of research that presents many options for trying to understand model decisions. Microsoft is releasing new tools to help you train powerful, interpretable models and interpret decisions of existing blackbox systems.
1:15pm-1:55pm (40m) Data Science, Machine Learning, & AI Health and Medicine, Text and Language processing and analysis
Feature engineering with Spark NLP to accelerate clinical trial recruitment
Saif Addin Ellafi (John Snow Labs), Scott Hoch (Deep6.ai)
Recruiting patients for clinical trials is a major challenge in drug development. This talk explains how Deep6 utilizes Spark NLP to scale its training and inference pipelines to millions of patients while achieving state-of-the-art accuracy. It covers the technical challenges, the architecture of the full solution, and lessons learned.
2:05pm-2:45pm (40m) Data Science, Machine Learning, & AI Text and Language processing and analysis
Mind the Semantic Gap: How “talking semantics” can help you perform better data science
Panos Alexopoulos (Textkernel BV)
In an era where discussions among data scientists are monopolized by the latest trends in Machine Learning, the role of Semantics in Data Science is often underplayed. In this talk, I present real-world cases where making fine, seemingly pedantic distinctions in the meaning of data science tasks and their related data has significantly improved their effectiveness and value.
2:55pm-3:35pm (40m) Data Science, Machine Learning, & AI Text and Language processing and analysis
Towards More Fine-Grained Sentiment and Emotion Analysis of Text
Gerard de Melo (Rutgers University)
What kinds of sentiment and emotions do consumers associate with a text? With new data-driven approaches, organizations can better pay attention to what is being said about them in different markets. We can also consider the fonts and color palettes best-suited to convey specific emotions, so that organizations can make informed choices when presenting information to consumers.
4:35pm-5:15pm (40m) Data Science, Machine Learning, & AI Text and Language processing and analysis
Search Logs + Machine Learning = Auto-Tagged Inventory
John Berryman (Eventbrite)
Eventbrite is exploring a new machine learning approach that allows us to harvest data from customer search logs and automatically tag events based upon their content. The results have allowed us to provide users with a better inventory browsing experience.
5:25pm-6:05pm (40m) Data Science, Machine Learning, & AI Culture and Organization, Text and Language processing and analysis
Alexa, Do Men Talk Too Much?
Sireesha Muppala (Amazon Web Services), Shelbee Eigenbrode (Amazon Web Services), Emily Webber (Amazon Web Services)
Mansplaining. Know it? Hate it? Want to make it go away? In this session we tackle the chronic problem of men talking over or down to women and its negative impact on career progression for women. We will also demonstrate an Alexa skill that uses deep learning techniques on incoming audio feeds. We discuss ownership of the problem for both women and men, and suggest helpful strategies.
11:20am-12:00pm (40m) Data Science, Machine Learning, & AI Deep Learning, Financial Services, Temporal data and time-series analytics
Time Series Forecasting using Deep Learning with PyTorch
Ying Yau (AllianceBernstein)
Time series forecasting techniques can be applied in a wide range of scientific disciplines, business scenarios, and policy settings. This session discusses the application of deep learning techniques to time series forecasting and compares them to time series statistical models when forecasting time series with trends, multiple seasonality, regime switch, and exogenous series.
1:15pm-1:55pm (40m) Data Science, Machine Learning, & AI Deep Learning, Financial Services, Health and Medicine
Improving OCR Quality of Documents using Generative Adversarial Networks
Nagendra Shishodia (EXL), Chaithanya Manda (EXL Service), Solmaz Torabi (EXL Service)
Every NLP-based document processing solution depends on converting scanned documents or images to machine-readable text using an OCR solution. However, the accuracy of OCR solutions is limited by the quality of the scanned images. We show that generative adversarial networks can bring significant efficiencies to any document processing solution by enhancing resolution and de-noising scanned images.
2:05pm-2:45pm (40m) Data Science, Machine Learning, & AI Deep Learning, Temporal data and time-series analytics, Transportation and Logistics
Real time Anomaly detection on observability data using neural networks
Keshav Peswani (Expedia Group), Ashish Aggarwal (Expedia Group)
Observability is key in modern architectures for quickly detecting and repairing problems in microservices. Modern observability platforms have evolved beyond simple application logs and now include distributed tracing systems such as Zipkin and Haystack. Combining them with real-time intelligent alerting mechanisms that produce accurate alerts helps automate the detection of these problems.
2:55pm-3:35pm (40m) Data Science, Machine Learning, & AI Deep Learning, Temporal data and time-series analytics
Introducing a new anomaly detection algorithm (SR-CNN) inspired by Computer Vision
Tony Xing (Microsoft), Bixiong Xu (Microsoft), Congrui Huang (Microsoft), Qun Ying (Microsoft)
Anomaly detection may sound old-fashioned, yet it is very important in many industry applications. How about doing it in a computer vision way? Come to our talk to learn about a novel anomaly detection algorithm based on Spectral Residual (SR) and Convolutional Neural Network (CNN), and how this method was applied in the monitoring system supporting Microsoft AIOps and business incident prevention.
4:35pm-5:15pm (40m) Data Science, Machine Learning, & AI Data Integration and Data Processing, Deep Learning, Financial Services
DEEP LEARNING ON MOBILE
Siddha Ganju (Nvidia), Meher Kasam (Square)
Optimizing deep neural nets to run efficiently on mobile devices.
5:25pm-6:05pm (40m) Data Science, Machine Learning, & AI Deep Learning, Model Development, Governance, Operations
Deploying End-to-End Deep Learning Pipelines with ONNX
Nick Pentreath (IBM)
The common perception of deep learning is that it results in a fully self-contained model. However, in most cases these models have similar requirements for data pre-processing as more "traditional" machine learning. Despite this, there are few standard solutions for deploying end-to-end deep learning. In this talk, I show how the ONNX format and ecosystem are addressing this challenge.
11:20am-12:00pm (40m) Data Science, Machine Learning, & AI Culture and Organization
Data Science vs Engineering: Does it really have to be this way?
Ann Spencer (Domino Data Lab), Paco Nathan (Derwen, Inc.), Amy Heineike (Primer), Pete Warden (TensorFlow)
Are you a data scientist who has wondered "why does it take so long to deploy my model into production?" Are you an engineer who has ever thought "data scientists have no idea what they want"? You are not alone. Join us for a lively panel discussion with industry veterans about best practices and insights on how to increase collaboration when developing and deploying models.
1:15pm-1:55pm (40m) Data Science, Machine Learning, & AI Data, Analytics, and AI Architecture, Financial Services, Retail and e-commerce
Machine Learning and Large Scale Data Analysis On Centralized Platform
James Tang (WalmartLabs), Yiyi Zeng (WalmartLabs), Linhong Kang (WalmartLabs)
How the No. 1 retailer provides a secure and seamless shopping experience through machine learning and large-scale data analysis on a centralized platform.
2:05pm-2:45pm (40m) Data Science, Machine Learning, & AI Deep dive into specific tools, platforms, or frameworks, Transportation and Logistics
We run, we improve, we scale - XGBoost story in Uber
Nan Zhu (Uber), Felix Cheung (Uber)
XGBoost has been widely deployed in companies across the industry. This talk begins by introducing the internals of distributed training in XGBoost and then demonstrates how XGBoost solves business problems at Uber, scaling to thousands of workers and tens of terabytes of training data.
2:55pm-3:35pm (40m) Data Science, Machine Learning, & AI Media and Advertising, Retail and e-commerce
Building a Machine Learning Framework to Measure TV Advertising Attribution
Fei Wang (CarGurus), Michael Brautbar (CarGurus)
This session will present the case study for the CarGurus TV Attribution Model. Attendees will learn how the creation of a causal inference model can be leveraged to calculate cost per acquisition (CPA) of TV spend and measure effectiveness when compared to CPA of Digital Performance Marketing spend.
4:35pm-5:15pm (40m) Data Science, Machine Learning, & AI Retail and e-commerce, Temporal data and time-series analytics
From Whiteboard to Production: a Demand Forecasting System for an Online Grocery Shop
Robert Pesch (inovex GmbH), Robin Senge (inovex GmbH)
In this talk, we outline the development process, the statistical modeling, the data-driven decision making, and the components needed for productionizing a fully automated and highly scalable demand forecasting system for an online grocery shop for a billion-dollar retail group in Europe.
5:25pm-6:05pm (40m) Data Science, Machine Learning, & AI Media and Advertising
Data Science and the Business of Major League Baseball
Aaron Owen (Major League Baseball), Matt Horton (Major League Baseball), Josh Hamilton (MLB)
Utilizing SAS, Python, and AWS Sagemaker, MLB’s data science team discusses how it predicts ticket purchasers’ likelihoods to purchase again, evaluates prospective season schedules, estimates customer lifetime value, optimizes promotion schedules, quantifies the strength of fan avidity, and monitors the health of monthly subscriptions to its game-streaming service.
11:20am-12:00pm (40m) Data Science, Machine Learning, & AI
Practical Feature Engineering
Ted Dunning (MapR)
Feature engineering is generally the section that gets left out of machine learning books, but it is also the most critical part in practice. I will provide a variety of techniques, a few well known, but some rarely spoken of outside the tribal lore of top teams, including how to handle categorical inputs, natural language, transactions and more all in the context of modern machine learning.
1:15pm-1:55pm (40m) Data Science, Machine Learning, & AI Deep Learning
Learning with Limited Labeled Data
Shioulin Sam (Cloudera Fast Forward Labs)
Supervised machine learning requires large labeled datasets, a prohibitive limitation in many real-world applications. What if machines could learn with few labeled examples? This talk explores and demonstrates an algorithmic solution that relies on collaboration between humans and machines to label smartly, and discusses product possibilities.
2:05pm-2:45pm (40m) Data Science, Machine Learning, & AI, Security and Privacy Ethics, Privacy and Security, Retail and e-commerce
Fair, privacy preserving, and secure ML
Mikio Braun (Zalando SE)
With ML becoming more and more mainstream, the side effects of using machine learning and AI on our lives become more and more visible. One has to take extra measures to make machine learning models fair and unbiased. In addition, awareness of the need to preserve privacy in ML models is rapidly growing.
2:55pm-3:35pm (40m) Data Science, Machine Learning, & AI Financial Services
How Machine Learning Meets Optimization
Jari Koister (FICO)
Machine Learning and Constraint-based Optimization are both used to solve critical business problems. They come from distinct research communities and have traditionally been treated separately. This talk describes how they are similar, how they differ and how they can be used to solve complex problems with amazing results.
4:35pm-5:15pm (40m) Data Science, Machine Learning, & AI Media and Advertising, Temporal data and time-series analytics
Predicting Criteo’s Internet traffic load using Bayesian structural time series model.
Hamlet Jesse Medina Ruiz (Criteo)
Criteo’s infrastructure provides capacity and connectivity to host Criteo’s platform and applications. The evolution of our infrastructure is driven by the ability to forecast Criteo’s traffic demand. In this talk, we explain how Criteo uses Bayesian Dynamic time series models to accurately forecast its traffic load and optimize hardware resources across data centers.
5:25pm-6:05pm (40m) Data Science, Machine Learning, & AI Retail and e-commerce
Causal inference 101: Answering the crucial ‘why’ in your analysis.
Subhasish Misra (Walmart)
Causal questions are ubiquitous. Randomized tests are considered the gold standard for answering them. However, such tests are not always feasible, and then one has only observational data from which to draw causal insights. Techniques such as matching offer a solution in such cases. This talk offers a take on these aspects and shares practical tips for inferring causal effects.
11:20am-12:00pm (40m) Data Engineering and Architecture Data Integration and Data Processing, Data, Analytics, and AI Architecture, Retail and e-commerce, Streaming and IoT
Building a multi-tenant data processing and model inferencing platform with Kafka Streams
Navinder Pal Singh Brar (Walmart Labs)
Each week 275 million people shop at Walmart, generating multiple terabytes of interaction and transaction data. In the Customer Backbone team, we enable the extraction, transformation, and storage of customer data to be served to teams such as Ads and Personalisation. At 5 billion events per day, our Kafka Streams cluster processes events from various channels and maintains a uniform identity of a customer.
1:15pm-1:55pm (40m) Data Engineering and Architecture Data, Analytics, and AI Architecture, Deep dive into specific tools, platforms, or frameworks
Now You See Me, Now You Compute: Building Event-driven Architectures with Apache Kafka
Michael Noll (Confluent)
Would you cross the street with traffic information that is a minute old? Certainly not! Modern businesses have the same needs. In this talk we cover why and how you can use Kafka and its growing ecosystem to build elastic event-driven architectures. Specifically, we look at Kafka as the storage layer, at Kafka Connect for data integration, and at Kafka Streams and KSQL as the compute layer.
2:05pm-2:45pm (40m) Data Engineering and Architecture, Streaming and IoT Deep dive into specific tools, platforms, or frameworks, Streaming and IoT
Stream Processing beyond Streaming Data
Stephan Ewen (Ververica), Aljoscha Krettek (data Artisans)
The talk discusses how stream processing is becoming a "grand unifying paradigm" for data processing and the newest developments in Apache Flink to support this trend: New cross-batch-streaming Machine Learning algorithms, State-of-the-art batch performance, and new building blocks for data-driven applications and application consistency.
2:55pm-3:35pm (40m) Data Engineering and Architecture Data, Analytics, and AI Architecture, Financial Services, Streaming and IoT, Telecom
How China Telecom combat financial frauds over 50M transactions a day using Apache Pulsar
Weisheng Xie (China Telecom BestPay Co., Ltd), Sijie Guo (ASF)
For a fintech company of China Telecom with half a billion registered users and 41 million monthly active users, risk control decision deployment has been critical to the success of the business. In this talk, we share how we leverage Apache Pulsar to boost the efficiency of our risk control decision development for combating financial fraud over 50 million transactions a day.
4:35pm-5:15pm (40m) Data Engineering and Architecture, Streaming and IoT Cloud Platforms and SaaS, Data Integration and Data Processing, Media and Advertising, Streaming and IoT
Trill: The Crown Jewel of Microsoft’s Streaming Pipeline Explained
James Terwilliger (Microsoft Corporation), Badrish Chandramouli (Microsoft Research), Jonathan Goldstein (Microsoft Research)
Trill has been open-sourced, making the streaming engine behind services like the multi-billion-dollar Bing Ads platform available for all to use and extend. We give a brief history of streaming data at Microsoft and lessons learned. We then demonstrate how its API can power complex application logic, and the performance that gives the engine its name: a trillion events per day per node.
5:25pm-6:05pm (40m) Data Engineering and Architecture Data, Analytics, and AI Architecture, Streaming and IoT
Fast Data with the KISSS stack
Bas Geerdink (ING)
Streaming Analytics (or Fast Data processing) is the field of making predictions on real-time data. In this talk, I'll present a fast data architecture that covers many use cases and follows a 'pipes and filters' pattern. This architecture can be used to create enterprise-grade solutions with a diversity of technology options. The stack is Kafka, Impala, and Spark Structured Streaming (KISSS).
11:20am-12:00pm (40m) Data Engineering and Architecture Data, Analytics, and AI Architecture
Data platform architecture principles
Julien Le Dem (WeWork)
Big Data is crucial to organizations. It is big not only in volume but also in the multitude of data sources and teams using them. Having central data teams do all the work is outdated: the entire organization becomes an ecosystem and central teams become enablers. We will discuss the principles of a data platform that enables the entire organization to build data-centric products.
1:15pm-1:55pm (40m) Data Engineering and Architecture Data, Analytics, and AI Architecture, Media and Advertising
Productive Data Science Platform - Beyond Hosted notebooks solution at LinkedIn
Swasti Kakker (LinkedIn), Manu Ram Pandit (LinkedIn), Vidya Ravivarma (LinkedIn)
Come hear about the infrastructure and features offered by the flexible and scalable hosted data science platform at LinkedIn. The platform lets users develop seamlessly in multiple languages, enforce developer best practices and governance policies, and execute and visualize solutions, and it supports efficient knowledge management and collaboration, all of which improve developer productivity.
2:05pm-2:45pm (40m) Data Engineering and Architecture Data, Analytics, and AI Architecture, Transportation and Logistics
From raw data to informed intelligence: democratizing data science and ML at Uber
Atul Gupte (Uber Technologies Inc.), Nikhil Joshi (Uber)
At Uber, we’re changing the way people think about transportation. As an integral part of the logistical fabric in 65+ countries around the world, we’re using ML and advanced data science to power every aspect of the Uber experience - from dispatch to customer support. In this talk, we’ll explore how we enable teams at Uber to transform insights into intelligence and facilitate critical workflows.
2:55pm-3:35pm (40m) Automation in data science and data, Data Engineering and Architecture Data, Analytics, and AI Architecture, Deep Learning
Large-scale Deep Learning offline platform: Bing's approach
Kai Liu (Microsoft (BING))
Running a large number of deep learning projects in parallel requires some effort and innovation. Bing now runs a deployment of thousands of servers to address this challenge. We provide training services, offline data processing, vector hosting, and inferencing services in an offline fashion to help data scientists through all steps of the project life cycle.
4:35pm-5:15pm (40m) Data Engineering and Architecture Deep dive into specific tools, platforms, or frameworks
Downscaling: The Achilles heel of autoscaling Spark Clusters
Prakhar Jain (Qubole), Sourabh Goyal (Qubole)
Autoscaling of resources aims to achieve low latency for a big data application while reducing resource costs. Upscaling a cluster in the cloud is fairly easy compared to downscaling nodes, so the overall total cost of ownership (TCO) goes up. We will talk about a new design for efficient downscaling, which helps achieve better resource utilization and thus lower TCO.
5:25pm-6:05pm (40m) Data Engineering and Architecture Data, Analytics, and AI Architecture, Deep dive into specific tools, platforms, or frameworks
Improving Spark by taking advantage of disaggregated architecture
Chenzhao Guo (Intel Asia-Pacific Research & Development Ltd.), Carson Wang (Intel)
Shuffle in Spark requires the shuffle data to be persisted on local disks. However, the assumptions of collocated storage do not always hold in today’s data centers. We implemented a new Spark shuffle manager, which writes shuffle data to a remote cluster with different storage backends. This makes life easier for those customers who want to leverage the latest storage hardware, and HPC customers
11:20am-12:00pm (40m) Automation in data science and data, Data Engineering and Architecture Data, Analytics, and AI Architecture
Building an AI platform – key principles and lessons learned
Moty Fania (Intel)
In this session, Moty Fania will share Intel’s IT experience of implementing a Sales AI platform. This platform is based on a streaming, microservices architecture with a message bus backbone. It was designed for real-time data extraction and reasoning. The platform processes millions of website pages and is capable of sifting through millions of tweets per day.
1:15pm-1:55pm (40m) Data Engineering and Architecture Data quality, data governance and data lineage, Deep dive into specific tools, platforms, or frameworks
Sharing is caring: using Egeria to establish true enterprise metadata governance
Wim Stoop (Cloudera)
Establishing enterprise-wide security and governance remains a challenge for most organisations. Integrations and exchanges across their landscape are costly to manage and maintain, and typically work in one direction only. In this session, we'll discuss how ODPi's Egeria standard and framework removes these challenges and is leveraged by Cloudera and partners alike to deliver value for customers.
2:05pm-2:45pm (40m) Data Engineering and Architecture Data quality, data governance and data lineage, Media and Advertising
The Evolution of Metadata: LinkedIn’s story
Shirshanka Das (LinkedIn), Mars Lan (LinkedIn)
How do you scale metadata to an organization of 10,000 employees, 1M+ data assets and an AI-enabled company that ships code to the site three times a day? We describe the journey of LinkedIn’s metadata from a two-person back-office team to a central hub powering data discovery, AI productivity and automatic data privacy. Different metadata strategies and our battle scars will be revealed!
2:55pm-3:35pm (40m) Data Engineering and Architecture Data quality, data governance and data lineage, Transportation and Logistics
Turning Big Data into Knowledge: Managing metadata and data-relationships at Uber scale
Kaan Onuk (Uber), Luyao Li (Uber), Atul Gupte (Uber)
At Uber’s scale and pace of growth, a robust system for discovering and managing various entities, from datasets to services to pipelines, and their relevant metadata is not just nice to have: it is absolutely integral to making data useful at Uber. In this talk, we will explore the current state of metadata management and end-to-end data flow solutions at Uber and what’s coming next.
4:35pm-5:15pm (40m) Data Engineering and Architecture Data quality, data governance and data lineage
The case for a common Metadata Layer for Machine Learning Platforms
Max Neunhöffer (ArangoDB), Joerg Schad (Suki)
Machine Learning Platforms being built are becoming more complex with different components each producing their own metadata. Currently, most components provide their own way of storing metadata. In this talk, we propose a first draft of a common Metadata API and demo a first implementation of this API in Kubeflow using ArangoDB, which is a native multi-model database.
5:25pm-6:05pm (40m) Data Engineering and Architecture Data quality, data governance and data lineage, Data, Analytics, and AI Architecture
Finding your needle in a Haystack
Naghman Waheed (Bayer Crop Science), John Cooper (Bayer)
As the complexity of data systems has grown at Bayer, so has the difficulty of locating and understanding which data sets are available for consumption. To address this challenge, a custom metadata management tool was recently deployed as a new capability at Bayer. The system is cloud enabled and uses multiple open source components, including machine learning and natural language processing, to aid search.
11:20am-12:00pm (40m) Data Engineering and Architecture Cloud Platforms and SaaS, Data Management and Storage, Data, Analytics, and AI Architecture
Kubernetes for Stateful MPP systems
Paige Roberts (Vertica), Deepak Majeti (Vertica)
Analytics experts, GoodData, needed to auto-recover from node failures and scale rapidly when workloads spike on their MPP database in the cloud. Kubernetes could solve that, but K8 is for stateless micro-services, not a stateful MPP database that needs 100s of containers. In order to merge the power of an MPP database with the flexibility of Kubernetes, a lot of hurdles had to be overcome.
1:15pm-1:55pm (40m) Data Engineering and Architecture Cloud Platforms and SaaS, Data Integration and Data Processing
Your easy move to serverless computing and radically simplified data processing
Gil Vernik (IBM)
Most analytic flows can benefit from serverless, from simple cases to complex data preparations for AI frameworks like TensorFlow. To address the challenge of how to easily integrate serverless without major disruptions to your system, we present a “push to the cloud” experience. This ability dramatically simplifies using serverless for different big data processing frameworks.
2:05pm-2:45pm (40m) Data Engineering and Architecture Cloud Platforms and SaaS, Data, Analytics, and AI Architecture
Orchestrating Data Workflows Using a Fully Serverless Architecture
Tomer Levi (Fundbox)
Use of data workflows is a fundamental functionality of any data engineering team. Nonetheless, designing an easy-to-use, scalable, and flexible data workflow platform is a complex undertaking. In this talk, attendees will learn how the data engineering team at Fundbox uses AWS serverless technologies to address this problem, and how it enables data scientists, BI devs and engineers to move faster.
2:55pm-3:35pm (40m) Data Engineering and Architecture Data Integration and Data Processing, Data quality, data governance and data lineage
Time-travel for Data Pipelines: Solving the mystery of what changed?
Shradha Ambekar (Intuit), Sunil Goplani (Intuit), Sandeep Uttamchandani (Intuit)
Imagine a business insight showing a sudden spike. Debugging data pipelines is non-trivial, and finding the root cause can take hours or even days! We’ll share how Intuit built a self-serve tool that automatically discovers data pipeline lineage and tracks every change that impacts a pipeline. This helps debug pipeline issues in minutes, establishing trust in data while improving developer productivity.
4:35pm-5:15pm (40m) Data Engineering and Architecture Deep dive into specific tools, platforms, or frameworks
Apache Hadoop 3.x State of The Union and Upgrade Guidance
Wangda Tan (Cloudera), Jitendra Pandey (Hortonworks)
In this talk, we’ll start with the current status of the Apache Hadoop community, then move on to the exciting present and future of Hadoop 3.x. We will cover new features like erasure coding, GPU support, namenode federation, Docker, long-running services support, powerful container placement constraints, data node disk balancing, etc. We will also talk about upgrade guidance from 2.x to 3.x.
5:25pm-6:05pm (40m) Data Engineering and Architecture, Streaming and IoT Deep dive into specific tools, platforms, or frameworks
Session
11:20am-12:00pm (40m) Data Engineering and Architecture, Security and Privacy Data Management and Storage, Privacy and Security
Data Security & Privacy Anti-Patterns
Steven Touw (Immuta)
Anti-patterns are behaviors that take bad problems and lead to even worse solutions. In the world of data security and privacy, they’re everywhere. Over the past 4 years we’ve seen data security and privacy anti-patterns consistently emerge across 100s of customers and industry verticals - there has been an obvious trend. We’ll cover 5 anti-patterns and more importantly, the solutions for them.
1:15pm-1:55pm (40m) Data Engineering and Architecture, Security and Privacy Deep dive into specific tools, platforms, or frameworks, Health and Medicine, Privacy and Security
Parquet Modular Encryption: Confidentiality and Integrity of Sensitive Column Data
Gidon Gershinsky (IBM)
The Apache Parquet community is working on a column encryption mechanism that protects sensitive data and enables access control for table columns. Many companies are involved, and the mechanism specification has recently been signed off by the community management committee. I will present the basics of Parquet encryption technology, its usage model and a number of use cases.
2:05pm-2:45pm (40m) Data Engineering and Architecture Cloud Platforms and SaaS, Privacy and Security
Securing your cloud data lake with a "defense in depth" approach
Tomer Shiran (Dremio), Jacques Nadeau (Dremio)
With cheap and infinitely scalable storage services such as S3 and ADLS, it has never been easier to dump data into a cloud data lake. But how do you secure that data and make sure it doesn't leak? In this talk we explore numerous capabilities for securing a cloud data lake, including authentication, access control, encryption (in motion and at rest) and auditing, as well as network protections.
2:55pm-3:35pm (40m) Data Engineering and Architecture, Security and Privacy Privacy and Security
When Machines Fight Machines: Cyber Battles & the New Frontier of Artificial Intelligence
Justin Fier (Darktrace)
Cyber security must find what it doesn’t know to look for. AI technologies have led to the emergence of self-learning, self-defending networks that achieve this – detecting and autonomously responding to in-progress attacks in real time. These cyber immune systems enable the security team to focus on high-value tasks, can counter even machine-speed threats, and work in all environments.
4:35pm-5:15pm (40m) Data Engineering and Architecture Health and Medicine, Privacy and Security
Protecting the Healthcare Enterprise from PHI Breaches using Streaming and NLP
Jeff Zemerick (Mountain Fog)
This talk describes how open source technologies can be used to identify and remove PHI from streaming text in an enterprise healthcare environment.
5:25pm-6:05pm (40m) Data Engineering and Architecture, Security and Privacy Media and Advertising, Privacy and Security
Secured Computation – Analyzing Sensitive Data using Homomorphic Encryption
Matt Carothers (Cox Communications), Jignesh Patel (Cox Communications)
Organizations often work with sensitive information such as social security numbers and credit card information. Although this data is stored in encrypted form, most analytical operations, ranging from data analysis to advanced machine learning algorithms, require data decryption for computation. This creates unwanted exposure to theft or unauthorized reads.
11:20am-12:00pm (40m) Model Development, Governance, Operations
Executive Briefing: Why machine-learned models crash and burn in production and what to do about it
David Talby (Pacific AI)
Machine learning and data science systems often fail in production in unexpected ways. David Talby shares real-world case studies showing why this happens and explains what you can do about it, covering best practices and lessons learned from a decade of experience building and operating such systems at Fortune 500 companies across several industries.
1:15pm-1:55pm (40m) Executive Briefing and best practices, Strata Business Summit Culture and Organization, Data Management and Storage, Data, Analytics, and AI Architecture
Executive Briefing: Top 10 Big Data Blunders
Michael Stonebraker (Tamr, Inc.)
As a steward for your enterprise’s data and digital transformation initiatives, you’re tasked with making the right choice. But before you can make those decisions, it’s important to understand what NOT to do when planning for your organization’s Big Data initiatives. Dr Michael Stonebraker, Adjunct Professor, MIT, & Co-Founder/CTO, Tamr will discuss his Top 10 Big Data Blunders.
2:05pm-2:45pm (40m)
Session
2:55pm-3:35pm (40m) Executive Briefing and best practices, Strata Business Summit Ethics
Executive Briefing: Understanding the Cult of Prediction
Farrah Bostic (The Difference Engine)
We are living in a culture obsessed with predictions. In politics and business, we collect data in service of the obsession. But our need for certainty and control leads some organizations to be duped by unproven technology or pseudo-science - often with unforeseen societal consequences. This talk looks at historical - and sometimes funny! - examples of sacrificing understanding for 'data'.
4:35pm-5:15pm (40m) Executive Briefing and best practices, Strata Business Summit Data quality, data governance and data lineage
Executive Briefing: Data Catalogs - Concepts, Capabilities and Key Platforms
Andrew Brust (ZDNet | Blue Badge Insights)
A primer on data catalogs and review of the major vendors and platforms in the market. Includes discussion on the use of data catalogs with classic and newer data repositories, including data warehouses, data lakes, cloud object storage and even software/applications. Coverage of AI's role in the data catalog world and analysis of data catalog futures will be provided.
5:25pm-6:05pm (40m) Executive Briefing and best practices, Strata Business Summit Data, Analytics, and AI Architecture, Privacy and Security
Executive Briefing: Making Intelligent Insights at the Edge — the Demise of Big Data?
Alasdair Allan (Babilim Light Industries)
The arrival of a new generation of smart embedded hardware may cause the demise of large-scale data harvesting. In its place, smart devices will allow us to process data at the edge, extracting insights from the data without storing potentially privacy- and GDPR-infringing data. The current age where privacy is no longer "a social norm" may not long survive the coming of the Internet of Things.
11:20am-12:00pm (40m) Executive Briefing and best practices, Strata Business Summit Culture and Organization
Improve Your Data Science ROI with a Portfolio and Risk Management Lens
Brian Dalessandro (SparkBeyond)
While Data Science value is well recognized within Tech, our experience with leaders across industries shows that the ability to realize and measure business impact is not universal. A core issue is that DS programs face unique risks that many leaders aren’t trained to hedge against. This talk addresses these risks and advocates for new ways to think about and manage data science programs.
1:15pm-1:55pm (40m) Case studies, Strata Business Summit BI, Interactive Analytics and Visualization, Cloud Platforms and SaaS, Streaming and IoT
Turning petabytes of data from millions of vehicles into open data with Geotab
Felipe Hoffa (Google), Bob Bradley (Geotab)
Geotab is one of the world's leading asset tracking companies, with millions of vehicles under service every day. In the first part of this talk we are going to review their challenges and solutions in creating an ML- and GIS-enabled petabyte-scale data warehouse leveraging Google Cloud. Then we are going to review their process for publishing open data, how to access it, and how cities are using it.
2:05pm-2:45pm (40m) Case studies, Strata Business Summit
Session
2:55pm-3:35pm (40m) Case studies, Strata Business Summit Streaming and IoT, Telecom, Transportation and Logistics
Enabling 5G use cases through Location Intelligence
Tim McKenzie (Pitney Bowes)
Planning a 5G network rollout and associated services requires a good understanding of location-based data. Accurate addressing and linking consumers to property parcels or points of interest allows data enrichment with property attributes, demographics and social data. Companies use location to organize and analyze network and customer data in order to understand where to target new services.
4:35pm-5:15pm (40m) Case studies, Strata Business Summit Text and Language processing and analysis
What Does The Public Say? A Computational Analysis of Regulatory Comments
Vlad Eidelman (FiscalNote)
While regulations affect your life every day, and millions of public comments are submitted to regulatory agencies in response to their proposals, analyzing the comments has traditionally been reserved for legal experts. In this talk, we show how natural language processing and machine learning can be used to automate the process by analyzing over 10 million publicly released comments.
5:25pm-6:05pm (40m) Case studies, Strata Business Summit Data, Analytics, and AI Architecture, Health and Medicine
How Brazil deployed a 160-million people biometric identification system: challenges, benefits, and lessons learned
Thiago Ribeiro (Griaule)
Brazil deployed a national biometric system to register all Brazilian voters using multiple biometric modalities and to ensure that a person does not enroll twice. This session highlights how a large-scale biometric system works and the main architecture decisions that one has to take into consideration.
11:20am-12:00pm (40m)
Session
1:15pm-1:55pm (40m) Law and Ethics, Strata Business Summit Ethics, Privacy and Security
War Stories from the Front Lines of ML
Andrew Burt (Immuta), Brenda Leong (Future of Privacy Forum)
Machine learning techniques are being deployed across almost every industry and sector. But this adoption comes with real, and oftentimes underestimated, privacy and security risks. In this session, Immuta and the Future of Privacy Forum will convene leading industry representatives and experts to talk about real life examples of when ML goes wrong, and the lessons they learned.
2:05pm-2:45pm (40m) Security and Privacy, Strata Business Summit Ethics, Privacy and Security
Regulations and the Future of Data
Andrew Burt (Immuta), Brenda Leong (Future of Privacy Forum)
From the EU to California and China, more and more of the world is regulating how data can be used. In this session, Immuta and the Future of Privacy Forum will convene leading experts on law and data science for a deep dive into ways to regulate the use of AI and advanced analytics. Come learn why these laws are being proposed, how they’ll impact data, and what the future has in store.
2:55pm-3:35pm (40m) Security and Privacy, Strata Business Summit Privacy and Security
Are Your Privacy Practices Auditor-Approved?
Mark Hinely (KirkpatrickPrice)
The fear that comes along with new compliance requirements is overwhelming. Organizations don’t know where to start, what to fix, or what an auditor expects to see. In this session, learn what an auditor’s perspective is on the newest security and privacy regulations, how your business can prepare for compliance, and what the audit looks like from their perspective.
4:35pm-5:15pm (40m) Business Analytics and Visualization, Strata Business Summit BI, Interactive Analytics and Visualization, Data Management and Storage, Deep dive into specific tools, platforms, or frameworks
Supercharging Elasticsearch for extended Knowledge Graph use cases
Giovanni Tummarello (Siren)
Elasticsearch allows extremely quick search and drilldowns on large amounts of semistructured data. Elasticsearch, however, does not have relational join capabilities. In this presentation I'll introduce a plugin for ES that adds cluster distributed joins and demonstrate how it enables an exciting array of use cases dealing with interconnected or "Knowledge Graph" enterprise data.
5:25pm-6:05pm (40m) Case studies, Strata Business Summit Data quality, data governance and data lineage, Ethics, Health and Medicine
Looking Beyond the Binary: How the lack of a gender data collection standard impacts users
Brindaalakshmi K (Independent Consultant)
There is no standard for the collection of gender data. This session looks at the implications of that gap in the context of a developing country like India, the exclusion of individuals beyond the binary genders of male and female, and how this exclusion permeates beyond the public sector into private sector services.
8:45am-10:45am (2h)
Wednesday keynotes
Ben Lorica (O'Reilly Media), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.
10:50am-11:20am (30m)
Break: Morning break sponsored by Intel
12:00pm-1:15pm (1h 15m)
Break: Lunch sponsored by Google Cloud
12:00pm-1:15pm (1h 15m)
Wednesday Topic Tables at Lunch
Topic Table discussions are a great way to informally network with people in similar industries or interested in the same topics.
12:00pm-1:15pm (1h 15m)
Better Together Diversity Networking Lunch
If you’d like to make new professional connections and hear ideas for supporting diversity in the tech community, come to the diversity and inclusion networking lunch on Wednesday.
3:35pm-4:35pm (1h)
Break: Afternoon break sponsored by MemSQL
6:05pm-7:05pm (1h)
Booth Crawl
Make your way from booth to booth while you check out all the exhibitors in the Expo Hall on Wednesday after sessions end.
8:15am-8:45am (30m)
Speed Networking
Gather before keynotes on Wednesday morning to enjoy casual conversation while meeting fellow attendees. If one of your goals at Strata is to meet new people, this session will jumpstart your networking with other attendees.
12:00pm-1:15pm (1h 15m)
Wednesday Business Summit Lunch
Join fellow executives, business leaders, and strategists for a networking lunch on Wednesday for Strata Business Summit attendees and speakers.
7:30pm-10:30pm (3h)
Data After Dark
Don't miss an exciting evening filled with cocktails, food, and entertainment at Data After Dark at Strata in New York.
