Sep 23–26, 2019
 
3B - Expo Hall
11:20am Unified tooling for machine learning interpretability Harsha Nori (Microsoft), Samuel Jenkins (Microsoft), Rich Caruana (Microsoft)
1:15pm Feature engineering with Spark NLP to accelerate clinical trial recruitment Saif Addin Ellafi (John Snow Labs), Scott Hoch (BlackBox Engineering)
2:55pm Toward more fine-grained sentiment and emotion analysis of text Gerard de Melo (Rutgers University)
4:35pm Search logs + machine learning = autotagged inventory John Berryman (Eventbrite)
5:25pm Alexa, do men talk too much? Sireesha Muppala (Amazon Web Services), Shelbee Eigenbrode (Amazon Web Services), Emily Webber (Amazon Web Services)
1A 06/07
1:15pm Improving OCR quality of documents using generative adversarial networks Nagendra Shishodia (EXL), Chaithanya Manda (EXL), Solmaz Torabi (EXL)
2:05pm Real-time anomaly detection on observability data using neural networks Keshav Peswani (Expedia Group), Ashish Aggarwal (Expedia Group)
2:55pm Introducing a new anomaly detection algorithm (SR-CNN) inspired by computer vision Tony Xing (Microsoft), Congrui Huang (Microsoft), Qiyang Li (Microsoft), Wenyi Yang (Microsoft)
4:35pm Deep learning on mobile Anirudh Koul (Microsoft), Meher Kasam (Square)
1A 08/10
11:20am We run, we improve, we scale: The XGBoost story at Uber Nan Zhu (Uber), Felix Cheung (Uber)
1:15pm Machine learning and large-scale data analysis on a centralized platform James Tang (Walmart Labs), Yiyi Zeng (Walmart Labs), Linhong Kang (Walmart Labs)
2:05pm Data science versus engineering: Does it really have to be this way? Ann Spencer (Domino), Amy Heineike (Primer), Paco Nathan (derwen.ai), Chris Wiggins (NYT | Columbia)
5:25pm Data science and the business of Major League Baseball Aaron Owen (Major League Baseball), Matthew Horton (Major League Baseball), Josh Hamilton (Major League Baseball)
1A 12/14
11:20am Practical feature engineering Ted Dunning (MapR)
1:15pm Learning with limited labeled data Shioulin Sam (Cloudera Fast Forward Labs)
2:05pm Fair, privacy-preserving, and secure ML Mikio Braun (Zalando)
2:55pm How machine learning meets optimization Jari Koister (FICO)
1A 15/16
2:05pm
2:55pm How Orange Financial combats financial fraud over 50M transactions a day using Apache Pulsar Weisheng Xie (Orange Financial), Jia Zhai (streamnative)
4:35pm Trill: The crown jewel of Microsoft’s streaming pipeline explained James Terwilliger (Microsoft Corporation), Badrish Chandramouli (Microsoft Research), Jonathan Goldstein (Microsoft Research)
5:25pm Fast data with the KISSS stack Bas Geerdink (Aizonic)
1A 21/22
11:20am Scaling data engineers Evgeny Vinogradov (Yandex.Money)
1:15pm A productive data science platform: Beyond a hosted-notebooks solution at LinkedIn Swasti Kakker (LinkedIn), Manu Ram Pandit (LinkedIn), Vidya Ravivarma (LinkedIn)
2:55pm Creating a data engineering culture Jesse Anderson (Big Data Institute)
4:35pm Downscaling: The Achilles heel of autoscaling Spark clusters Prakhar Jain (Qubole), Sourabh Goyal (Qubole)
5:25pm Improving Spark by taking advantage of disaggregated architecture Chenzhao Guo (Intel), Carson Wang (Intel)
1A 23/24
1:15pm Sharing is caring: Using Egeria to establish true enterprise metadata governance Wim Stoop (Cloudera), Srikanth Venkat (Cloudera)
2:05pm The evolution of metadata: LinkedIn’s story Shirshanka Das (LinkedIn), Mars Lan (LinkedIn)
2:55pm Turning big data into knowledge: Managing metadata and data relationships at Uber's scale Kaan Onuk (Uber), Luyao Li (Uber), Atul Gupte (Uber)
4:35pm The case for a common metadata layer for machine learning platforms Max Neunhöffer (ArangoDB), Joerg Schad (ArangoDB)
5:25pm Finding your needle in a haystack Naghman Waheed (Bayer Crop Science), John Cooper (Bayer)
1E 07/08
11:20am Kubernetes for stateful MPP systems Paige Roberts (Vertica), Deepak Majeti (Vertica)
2:55pm Time travel for data pipelines: Solving the mystery of what changed Shradha Ambekar (Intuit), Sunil Goplani (Intuit), Sandeep Uttamchandani (Intuit)
4:35pm Apache Hadoop 3.x state of the union and upgrade guidance Wangda Tan (Cloudera), Wei-Chiu Chuang (Cloudera)
5:25pm HBase 2.0 and beyond Krishna Maheshwari (Cloudera)
1E 09
11:20am Data security and privacy anti-patterns Steven Touw (Immuta)
2:05pm Building a best-in-class data lake on AWS and Azure Tomer Shiran (Dremio), Jacques Nadeau (Dremio)
5:25pm Secured computation: Analyzing sensitive data using homomorphic encryption Matt Carothers (Cox Communications), Jignesh Patel (Cox Communications), Harry Tang (Cox Communications)
1E 10/11
1:15pm Executive Briefing: Top 10 big data blunders Michael Stonebraker (Tamr)
2:55pm Executive Briefing: Understanding the cult of prediction Farrah Bostic (The Difference Engine)
4:35pm Executive Briefing: Data catalogs—Concepts, capabilities, and key platforms Andrew Brust (Blue Badge Insights | ZDNet)
1E 12/13
1:15pm Turning petabytes of data from millions of vehicles into open data with Geotab Felipe Hoffa (Google), Bob Bradley (Geotab)
2:55pm Enabling 5G use cases through location intelligence Tim McKenzie (Pitney Bowes)
1E 14
11:20am Embrace complexity: The new rules of AI Janet Haven (Data & Society)
1:15pm War stories from the front lines of ML Andrew Burt (Immuta), Brenda Leong (Future of Privacy Forum), David Florsek (IDEMIA NSS), Alex Beutel (Google Brain), Chris Wheeler (Mastercard)
2:05pm Regulations and the future of data Andrew Burt (Immuta), Brenda Leong (Future of Privacy Forum), Boris Segalis (Cooley), Susan Israel (Loeb & Loeb, LLP)
2:55pm Are your privacy practices auditor approved? Mark Hinely (KirkpatrickPrice)
5:25pm Looking beyond the binary: How data for development impacts gender justice? Brindaalakshmi K (Independent Consultant)
1A 01/02
1A 03
5:25pm The why and how of data lineage Neelesh Salian (Stitch Fix)
1A 04/05
1:15pm Running AI workloads in containers (sponsored by BMC Software) See-Kit Lam (Malwarebytes), Darren Chinen (Malwarebytes)
1E 06
2:05pm Mastercard and Pitney Bowes: Creating a data-driven business (sponsored by Pitney Bowes) Olga Lagunova (Pitney Bowes), John Derrico (Mastercard)
4:35pm Semantics and graph data models in the enterprise data fabric (sponsored by Cambridge Semantics) Barbara Petrocelli (Cambridge Semantics), Peter Ball (Consultant)
5:25pm Challenges faced in machine learning infrastructure in traditional large enterprises Venkata Gunnu (Comcast), Harish Doddi (Datatron)
1E 17
3E
8:45am Wednesday keynotes Ben Lorica (O'Reilly), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
8:50am The road to an enterprise cloud Mick Hollison (Cloudera), Hillery Hunter (IBM)
9:15am Everything is connected and the clock is ticking: AI and big ag data for food security Sara Menker (Gro Intelligence), Nemo Semret (Gro Intelligence)
9:40am AI isn't magic. It’s computer science. Robert Thomas (IBM), Tim O'Reilly (O'Reilly Media)
10:30am Interactive sports analytics Patrick Lucey (Stats Perform)
10:50am Morning break sponsored by Intel | Room: Expo Hall - 3B
12:00pm Lunch sponsored by Google Cloud | Room: Expo Hall - 3B
12:00pm Wednesday Topic Tables at Lunch | Room: Expo Hall - 3B
3:35pm Afternoon break sponsored by MemSQL | Room: Expo Hall - 3B
6:05pm Booth Crawl | Room: Expo Hall - 3B
8:00am Speed Networking | Room: Keynote Foyer
8:30am Early morning coffee (8:00am - 8:45am) | Room: Keynote Foyer
12:00pm Wednesday Business Summit Lunch | Room: Expo Hall - 3D
7:30pm Data After Dark | Room: Slate NYC (54 West 21st Street)
11:20am-12:00pm (40m) Data Science, Machine Learning, & AI, Expo Hall Ethics
Unified tooling for machine learning interpretability
Harsha Nori (Microsoft), Samuel Jenkins (Microsoft), Rich Caruana (Microsoft)
Understanding decisions made by machine learning systems is critical for sensitive uses, ensuring fairness, and debugging production models. Interpretability presents options for trying to understand model decisions. Harsha Nori, Samuel Jenkins, and Rich Caruana explore the tools Microsoft is releasing to help you train powerful, interpretable models and interpret existing black box systems.
1:15pm-1:55pm (40m) Data Science, Machine Learning, & AI, Expo Hall Health and Medicine, Text and Language processing and analysis
Feature engineering with Spark NLP to accelerate clinical trial recruitment
Saif Addin Ellafi (John Snow Labs), Scott Hoch (BlackBox Engineering)
Recruiting patients for clinical trials is a major challenge in drug development. Saif Addin Ellafi and Scott Hoch explain how Deep 6 uses Spark NLP to scale its training and inference pipelines to millions of patients while achieving state-of-the-art accuracy. They dive into the technical challenges, the architecture of the full solution, and the lessons the company learned.
2:05pm-2:45pm (40m) Data Science, Machine Learning, & AI, Expo Hall Text and Language processing and analysis
Mind the semantic gap: How "talking semantics" can help you perform better data science
Panos Alexopoulos (Textkernel)
In an era where discussions among data scientists are monopolized by the latest trends in machine learning, the role of semantics in data science is often underplayed. Panos Alexopoulos presents real-world cases where making fine, seemingly pedantic, distinctions in the meaning of data science tasks and the related data has helped significantly improve effectiveness and value.
2:55pm-3:35pm (40m) Data Science, Machine Learning, & AI, Expo Hall Text and Language processing and analysis
Toward more fine-grained sentiment and emotion analysis of text
Gerard de Melo (Rutgers University)
Gerard de Melo takes a deep dive into the kinds of sentiment and emotion consumers associate with a text. With new data-driven approaches, organizations can better pay attention to what's being said about them in different markets. And you can consider fonts and palettes best suited to convey specific emotions, so organizations can make informed choices when presenting information to consumers.
4:35pm-5:15pm (40m) Data Science, Machine Learning, & AI, Expo Hall Text and Language processing and analysis
Search logs + machine learning = autotagged inventory
John Berryman (Eventbrite)
Eventbrite is exploring a new machine learning approach that allows it to harvest data from customer search logs and automatically tag events based upon their content. John Berryman dives into the results and how they have allowed the company to provide users with a better inventory-browsing experience.
5:25pm-6:05pm (40m) Data Science, Machine Learning, & AI, Expo Hall Culture and Organization, Text and Language processing and analysis
Alexa, do men talk too much?
Sireesha Muppala (Amazon Web Services), Shelbee Eigenbrode (Amazon Web Services), Emily Webber (Amazon Web Services)
Mansplaining. Know it? Hate it? Want to make it go away? Sireesha Muppala, Shelbee Eigenbrode, and Emily Webber tackle the problem of men talking over or down to women and its impact on career progression for women. They also demonstrate an Alexa skill that uses deep learning techniques on incoming audio feeds, examine ownership of the problem for women and men, and suggest helpful strategies.
11:20am-12:00pm (40m) Data Science, Machine Learning, & AI Temporal data and time-series analytics
Lightning-fast time series modeling and prediction: (S)ARIMA on steroids
Meir Toledano (Anodot)
ARIMA has been used for time series modeling for decades. In practice, most time series collected from human activities exhibit seasonal patterns, but the estimation of seasonal ARIMA ((S)ARIMA) models was inefficient for decades. Meir Toledano explains how Anodot was able to apply the technique for forecasting and anomaly detection for millions of time series every day.
1:15pm-1:55pm (40m) Data Science, Machine Learning, & AI Deep Learning, Financial Services, Health and Medicine
Improving OCR quality of documents using generative adversarial networks
Nagendra Shishodia (EXL), Chaithanya Manda (EXL), Solmaz Torabi (EXL)
Every NLP-based document-processing solution depends on converting scanned documents and images to machine-readable text using an OCR solution, limited by the quality of scanned images. Nagendra Shishodia, Chaithanya Manda, and Solmaz Torabi explore how GANs can bring significant efficiencies to any document-processing solution by enhancing resolution and denoising scanned images.
2:05pm-2:45pm (40m) Data Science, Machine Learning, & AI Deep Learning, Temporal data and time-series analytics, Transportation and Logistics
Real-time anomaly detection on observability data using neural networks
Keshav Peswani (Expedia Group), Ashish Aggarwal (Expedia Group)
Observability is the key in modern architecture to quickly detect and repair problems in microservices. Modern observability platforms have evolved beyond simple application logs and include distributed tracing systems like Zipkin and Haystack. Keshav Peswani and Ashish Aggarwal explore how combining them with real-time, intelligent alerting mechanisms helps in the automated detection of problems.
2:55pm-3:35pm (40m) Data Science, Machine Learning, & AI Deep Learning, Temporal data and time-series analytics
Introducing a new anomaly detection algorithm (SR-CNN) inspired by computer vision
Tony Xing (Microsoft), Congrui Huang (Microsoft), Qiyang Li (Microsoft), Wenyi Yang (Microsoft)
Anomaly detection may sound old fashioned, yet it's super important in many industry applications. Tony Xing, Congrui Huang, Qiyang Li, and Wenyi Yang detail a novel anomaly-detection algorithm based on spectral residual (SR) and a convolutional neural network (CNN) and how this method was applied in the monitoring system supporting Microsoft AIOps and business incident prevention.
4:35pm-5:15pm (40m) Data Science, Machine Learning, & AI Data Integration and Data Processing, Deep Learning, Financial Services
Deep learning on mobile
Anirudh Koul (Microsoft), Meher Kasam (Square)
Over the last few years, convolutional neural networks (CNNs) have risen in popularity, especially in the area of computer vision. Anirudh Koul and Meher Kasam take you through how you can get deep neural nets to run efficiently on mobile devices.
5:25pm-6:05pm (40m) Data Science, Machine Learning, & AI Deep Learning, Model Development, Governance, Operations
Deploying end-to-end deep learning pipelines with ONNX
Nick Pentreath (IBM)
The common perception of deep learning is that it results in a fully self-contained model. However, in most cases, these models have similar requirements for data preprocessing as does more "traditional" machine learning. Despite this, there are few standard solutions for deploying end-to-end deep learning. Nick Pentreath explores how the ONNX format and ecosystem addresses this challenge.
11:20am-12:00pm (40m) Data Science, Machine Learning, & AI Deep dive into specific tools, platforms, or frameworks, Transportation and Logistics
We run, we improve, we scale: The XGBoost story at Uber
Nan Zhu (Uber), Felix Cheung (Uber)
XGBoost has been widely deployed in companies across the industry. Nan Zhu and Felix Cheung dive into the internals of distributed training in XGBoost and demonstrate how XGBoost solves business problems at Uber at a scale of thousands of workers and tens of terabytes of training data.
1:15pm-1:55pm (40m) Data Science, Machine Learning, & AI Data, Analytics, and AI Architecture, Financial Services, Retail and e-commerce
Machine learning and large-scale data analysis on a centralized platform
James Tang (Walmart Labs), Yiyi Zeng (Walmart Labs), Linhong Kang (Walmart Labs)
James Tang, Yiyi Zeng, and Linhong Kang outline how Walmart provides a secure and seamless shopping experience through machine learning and large-scale data analysis on a centralized platform.
2:05pm-2:45pm (40m) Data Science, Machine Learning, & AI Culture and Organization
Data science versus engineering: Does it really have to be this way?
Ann Spencer (Domino), Amy Heineike (Primer), Paco Nathan (derwen.ai), Chris Wiggins (NYT | Columbia)
If, as a data scientist, you've wondered why it takes so long to deploy your model into production or, as an engineer, thought data scientists have no idea what they want, you're not alone. Join a lively discussion with industry veterans Ann Spencer, Paco Nathan, Amy Heineike, and Chris Wiggins to find best practices or insights on increasing collaboration when developing and deploying models.
2:55pm-3:35pm (40m) Data Science, Machine Learning, & AI Media and Advertising, Retail and e-commerce
Building a machine learning framework to measure TV advertising attribution
Fei Wang (CarGurus)
Fei Wang takes a deep dive into a case study for the CarGurus TV Attribution Model. You'll understand how you can leverage the creation of a causal inference model to calculate cost per acquisition (CPA) of TV spend and measure effectiveness when compared to CPA of digital performance marketing spend.
4:35pm-5:15pm (40m) Data Science, Machine Learning, & AI Retail and e-commerce, Temporal data and time-series analytics
From whiteboard to production: A demand forecasting system for an online grocery shop
Robert Pesch (inovex), Robin Senge (inovex)
Data-driven software is revolutionizing the world and enabling the intelligent services we interact with daily. Robert Pesch and Robin Senge outline the development process, statistical modeling, data-driven decision making, and components needed for productionizing a fully automated and highly scalable demand forecasting system for an online grocery shop for a billion-dollar retail group in Europe.
5:25pm-6:05pm (40m) Data Science, Machine Learning, & AI Media and Advertising
Data science and the business of Major League Baseball
Aaron Owen (Major League Baseball), Matthew Horton (Major League Baseball), Josh Hamilton (Major League Baseball)
Using SAS, Python, and AWS SageMaker, Major League Baseball's (MLB's) data science team outlines how it predicts ticket purchasers’ likelihood to purchase again, evaluates prospective season schedules, estimates customer lifetime value, optimizes promotion schedules, quantifies the strength of fan avidity, and monitors the health of monthly subscriptions to its game-streaming service.
11:20am-12:00pm (40m) Data Science, Machine Learning, & AI
Practical feature engineering
Ted Dunning (MapR)
Feature engineering is generally the section that gets left out of machine learning books, but it's also the most critical part in practice. Ted Dunning explores techniques, a few well known, but some rarely spoken of outside the institutional knowledge of top teams, including how to handle categorical inputs, natural language, transactions, and more in the context of machine learning.
1:15pm-1:55pm (40m) Data Science, Machine Learning, & AI Deep Learning
Learning with limited labeled data
Shioulin Sam (Cloudera Fast Forward Labs)
Supervised machine learning requires large labeled datasets, a prohibitive limitation in many real-world applications. But this could be avoided if machines could learn with a few labeled examples. Shioulin Sam explores and demonstrates an algorithmic solution that relies on collaboration between human and machine to label smartly, and she outlines product possibilities.
2:05pm-2:45pm (40m) Data Science, Machine Learning, & AI, Security and Privacy Ethics, Privacy and Security, Retail and e-commerce
Fair, privacy-preserving, and secure ML
Mikio Braun (Zalando)
With ML becoming more mainstream, the side effects of machine learning and AI on our lives become more visible. You have to take extra measures to make machine learning models fair and unbiased. And awareness of the need to preserve privacy in ML models is rapidly growing. Mikio Braun explores techniques and concepts around fairness, privacy, and security when it comes to machine learning models.
2:55pm-3:35pm (40m) Data Science, Machine Learning, & AI Financial Services
How machine learning meets optimization
Jari Koister (FICO)
Machine learning and constraint-based optimization are both used to solve critical business problems. They come from distinct research communities and have traditionally been treated separately. But Jari Koister examines how they're similar, how they're different, and how they can be used to solve complex problems with amazing results.
4:35pm-5:15pm (40m) Data Science, Machine Learning, & AI Media and Advertising, Temporal data and time-series analytics
Predicting Criteo’s internet traffic load using Bayesian structural time series models
Hamlet Jesse Medina Ruiz (Criteo)
Criteo’s infrastructure provides the capacity and connectivity to host Criteo’s platform and applications. The evolution of this infrastructure is driven by the ability to forecast Criteo’s traffic demand. Hamlet Jesse Medina Ruiz explains how Criteo uses Bayesian dynamic time series models to accurately forecast its traffic load and optimize hardware resources across data centers.
5:25pm-6:05pm (40m) Data Science, Machine Learning, & AI Retail and e-commerce
Causal inference 101: Answering the crucial "why" in your analysis
Subhasish Misra (Walmart Labs)
Causal questions are ubiquitous, and randomized tests are considered the gold standard. However, such tests are not always feasible, and then you just have observational data to get to causal insights. But techniques such as matching offer an opportunity to solve this. Subhasish Misra explores these techniques and shares practical tips for inferring causal effects.
11:20am-12:00pm (40m) Data Engineering and Architecture Data Integration and Data Processing, Data, Analytics, and AI Architecture, Retail and e-commerce, Streaming and IoT
Building a multitenant data processing and model inferencing platform with Kafka Streams
Navinder Pal Singh Brar (Walmart Labs)
Each week 275 million people shop at Walmart, generating interaction and transaction data. Navinder Pal Singh Brar explains how the customer backbone team enables extraction, transformation, and storage of customer data to be served to other teams. At 5 billion events per day, the Kafka Streams cluster processes events from various channels and maintains a uniform identity of a customer.
1:15pm-1:55pm (40m) Data Engineering and Architecture Data, Analytics, and AI Architecture, Deep dive into specific tools, platforms, or frameworks
Now you see me; now you compute: Building event-driven architectures with Apache Kafka
Michael Noll (Confluent)
Would you cross the street with traffic information that's a minute old? Certainly not. Modern businesses have the same needs. Michael Noll explores why and how you can use Kafka and its growing ecosystem to build elastic event-driven architectures. Specifically, you look at Kafka as the storage layer, at Kafka Connect for data integration, and at Kafka Streams and KSQL as the compute layer.
2:05pm-2:45pm (40m)
Session
2:55pm-3:35pm (40m) Data Engineering and Architecture Data, Analytics, and AI Architecture, Financial Services, Streaming and IoT, Telecom
How Orange Financial combats financial fraud over 50M transactions a day using Apache Pulsar
Weisheng Xie (Orange Financial), Jia Zhai (streamnative)
Orange Financial, a fintech company of China Telecom, has half a billion registered users and 41 million monthly active users, and risk control decision deployment has been critical to its success. Weisheng Xie and Jia Zhai explore how the company leverages Apache Pulsar to boost the efficiency of its risk control decision development for combating financial fraud across more than 50 million transactions a day.
4:35pm-5:15pm (40m) Data Engineering and Architecture, Streaming and IoT Cloud Platforms and SaaS, Data Integration and Data Processing, Media and Advertising, Streaming and IoT
Trill: The crown jewel of Microsoft’s streaming pipeline explained
James Terwilliger (Microsoft Corporation), Badrish Chandramouli (Microsoft Research), Jonathan Goldstein (Microsoft Research)
Trill has been open-sourced, making the streaming engine behind services like the Bing Ads platform available for all to use and extend. James Terwilliger, Badrish Chandramouli, and Jonathan Goldstein dive into the history of and insights from streaming data at Microsoft. They demonstrate how its API can power complex application logic and the performance that gives the engine its name.
5:25pm-6:05pm (40m) Data Engineering and Architecture Data, Analytics, and AI Architecture, Streaming and IoT
Fast data with the KISSS stack
Bas Geerdink (Aizonic)
Streaming analytics (or fast data processing) is the field of making predictions based on real-time data. Bas Geerdink presents a fast data architecture that covers many use cases that follow a "pipes and filters" pattern. This architecture can be used to create enterprise-grade solutions with a diversity of technology options. The stack is Kafka, Ignite, and Spark Structured Streaming (KISSS).
11:20am-12:00pm (40m) Data Engineering and Architecture Culture and Organization, Financial Services, Model Development, Governance, Operations
Scaling data engineers
Evgeny Vinogradov (Yandex.Money)
With a microservice architecture, a data warehouse is the first place where all the data meets. It's supplied by many different data sources and used for many purposes—from near-online transactional processing (OLTP) to model fitting and real-time classifying. Evgeny Vinogradov details his experience in managing and scaling data for support of 20+ product teams.
1:15pm-1:55pm (40m) Data Engineering and Architecture Data, Analytics, and AI Architecture, Media and Advertising
A productive data science platform: Beyond a hosted-notebooks solution at LinkedIn
Swasti Kakker (LinkedIn), Manu Ram Pandit (LinkedIn), Vidya Ravivarma (LinkedIn)
Join Swasti Kakker, Manu Ram Pandit, and Vidya Ravivarma to explore what's offered by a flexible and scalable hosted data science platform at LinkedIn. It provides features to seamlessly develop in multiple languages, enforce developer best practices and governance policies, execute and visualize solutions, manage knowledge efficiently, and collaborate to improve developer productivity.
2:05pm-2:45pm (40m) Data Engineering and Architecture Data, Analytics, and AI Architecture, Transportation and Logistics
From raw data to informed intelligence: Democratizing data science and ML at Uber
Atul Gupte (Uber)
Uber is changing the way people think about transportation. As an integral part of the logistical fabric in 65+ countries around the world, it uses ML and advanced data science to power every aspect of the Uber experience—from dispatch to customer support. Atul Gupte and Nikhil Joshi explore how Uber enables teams to transform insights into intelligence and facilitate critical workflows.
2:55pm-3:35pm (40m) Culture and Organization
Creating a data engineering culture
Jesse Anderson (Big Data Institute)
In this talk, we will cover the most common reasons why data engineering teams fail and how to correct them. This will include ways to get your management to understand that data engineering is really complex and time consuming. It is not data warehousing with new names. Management needs to understand that you can’t compare a data engineering team to the web development team, for example.
4:35pm-5:15pm (40m) Data Engineering and Architecture Deep dive into specific tools, platforms, or frameworks
Downscaling: The Achilles heel of autoscaling Spark clusters
Prakhar Jain (Qubole), Sourabh Goyal (Qubole)
Autoscaling of resources aims to achieve low latency for a big data application while reducing resource costs. Upscaling a cluster in the cloud is fairly easy compared to downscaling nodes, and so the overall total cost of ownership (TCO) goes up. Prakhar Jain and Sourabh Goyal examine a new design to get efficient downscaling, which helps achieve better resource utilization and lower TCO.
5:25pm-6:05pm (40m) Data Engineering and Architecture Data, Analytics, and AI Architecture, Deep dive into specific tools, platforms, or frameworks
Improving Spark by taking advantage of disaggregated architecture
Chenzhao Guo (Intel), Carson Wang (Intel)
Shuffle in Spark requires the shuffle data to be persisted on local disks. However, the assumptions of collocated storage do not always hold in today’s data centers. Chenzhao Guo and Carson Wang outline the implementation of a new Spark shuffle manager, which writes shuffle data to a remote cluster with different storage backends, making life easier for customers.
11:20am-12:00pm (40m) Automation in data science and data, Data Engineering and Architecture Data, Analytics, and AI Architecture
Building an AI platform: Key principles and lessons learned
Moty Fania (Intel)
Moty Fania details Intel’s IT experience of implementing a sales AI platform. This platform is based on a streaming, microservices architecture with a message bus backbone. Designed for real-time data extraction and reasoning, it processes millions of website pages and can sift through millions of tweets per day.
1:15pm-1:55pm (40m) Data Engineering and Architecture Data quality, data governance and data lineage, Deep dive into specific tools, platforms, or frameworks
Sharing is caring: Using Egeria to establish true enterprise metadata governance
Wim Stoop (Cloudera), Srikanth Venkat (Cloudera)
Establishing enterprise-wide security and governance remains a challenge for most organizations. Integrations and exchanges across the landscape are costly to manage and maintain, and typically work in one direction only. Wim Stoop and Srikanth Venkat explore how ODPi's Egeria standard and framework remove these challenges and are leveraged by Cloudera and partners alike to deliver value.
2:05pm-2:45pm (40m) Data Engineering and Architecture Data quality, data governance and data lineage, Media and Advertising
The evolution of metadata: LinkedIn’s story
Shirshanka Das (LinkedIn), Mars Lan (LinkedIn)
Imagine scaling metadata to an organization of 10,000 employees, 1M+ data assets, and an AI-enabled company that ships code to the site three times a day. Shirshanka Das and Mars Lan dive into LinkedIn’s metadata journey from a two-person back-office team to a central hub powering data discovery, AI productivity, and automatic data privacy. They reveal metadata strategies and the battle scars.
2:55pm-3:35pm (40m) Data Engineering and Architecture Data quality, data governance and data lineage, Transportation and Logistics
Turning big data into knowledge: Managing metadata and data relationships at Uber's scale
Kaan Onuk (Uber), Luyao Li (Uber), Atul Gupte (Uber)
Uber takes data driven to the next level. A robust system for discovering and managing various entities, from datasets to services to pipelines, and their relevant metadata isn't just nice to have; it's absolutely integral to making data useful. Kaan Onuk, Luyao Li, and Atul Gupte explore the current state of metadata management, end-to-end data flow solutions at Uber, and what’s coming next.
4:35pm-5:15pm (40m) Data Engineering and Architecture Data quality, data governance and data lineage
The case for a common metadata layer for machine learning platforms
Max Neunhöffer (ArangoDB), Joerg Schad (ArangoDB)
Machine learning platforms are becoming more complex, with different components each producing their own metadata and their own way of storing metadata. Max Neunhöffer and Joerg Schad propose a first draft of a common metadata API and demonstrate a first implementation of this API in Kubeflow using ArangoDB, a native multimodel database.
5:25pm-6:05pm (40m) Data Engineering and Architecture Data quality, data governance and data lineage, Data, Analytics, and AI Architecture
Finding your needle in a haystack
Naghman Waheed (Bayer Crop Science), John Cooper (Bayer)
As the complexity of data systems has grown at Bayer, so has the difficulty of locating and understanding which datasets are available for consumption. Naghman Waheed and John Cooper outline a custom metadata management tool recently deployed at Bayer. The system is cloud-enabled and uses multiple open source components, including machine learning and natural language processing to aid searches.
11:20am-12:00pm (40m) Data Engineering and Architecture Cloud Platforms and SaaS, Data Management and Storage, Data, Analytics, and AI Architecture
Kubernetes for stateful MPP systems
Paige Roberts (Vertica), Deepak Majeti (Vertica)
GoodData needed to autorecover from node failures and scale rapidly when workloads spiked on their MPP database in the cloud. Kubernetes could solve it, but it's for stateless microservices, not a stateful MPP database that needs hundreds of containers. Paige Roberts and Deepak Majeti detail the hurdles GoodData needed to overcome in order to merge the power of the database with Kubernetes.
1:15pm-1:55pm (40m) Data Engineering and Architecture Cloud Platforms and SaaS, Data Integration and Data Processing
Your easy move to serverless computing and radically simplified data processing
Gil Vernik (IBM)
Most analytic flows can benefit from serverless, starting with simple cases and moving to complex data preparations for AI frameworks like TensorFlow. To address the challenge of how to easily integrate serverless without major disruptions to your system, Gil Vernik explores the “push to the cloud” experience, which dramatically simplifies serverless for big data processing frameworks.
2:05pm-2:45pm (40m) Data Engineering and Architecture Cloud Platforms and SaaS, Data, Analytics, and AI Architecture
Orchestrating data workflows using a fully serverless architecture
Tomer Levi (Fundbox)
Use of data workflows is a fundamental functionality of any data engineering team. Nonetheless, designing an easy-to-use, scalable, and flexible data workflow platform is a complex undertaking. Tomer Levi walks you through how the data engineering team at Fundbox uses AWS serverless technologies to address this problem and how it enables data scientists, BI devs, and engineers to move faster.
2:55pm-3:35pm (40m) Data Engineering and Architecture Data Integration and Data Processing, Data quality, data governance and data lineage
Time travel for data pipelines: Solving the mystery of what changed
Shradha Ambekar (Intuit), Sunil Goplani (Intuit), Sandeep Uttamchandani (Intuit)
A business insight shows a sudden spike. It can take hours, or days, to debug data pipelines to find the root cause. Shradha Ambekar, Sunil Goplani, and Sandeep Uttamchandani outline how Intuit built a self-service tool that automatically discovers data pipeline lineage and tracks every change, helping debug the issues in minutes—establishing trust in data while improving developer productivity.
4:35pm-5:15pm (40m) Data Engineering and Architecture Deep dive into specific tools, platforms, or frameworks
Apache Hadoop 3.x state of the union and upgrade guidance
Wangda Tan (Cloudera), Wei-Chiu Chuang (Cloudera)
Wangda Tan and Wei-Chiu Chuang outline the current status of the Apache Hadoop community and dive into the present and future of Hadoop 3.x. You'll get a peek at new features like erasure coding, GPU support, NameNode federation, Docker, long-running services support, powerful container placement constraints, data node disk balancing, and more. They also walk you through upgrade guidance from 2.x to 3.x.
5:25pm-6:05pm (40m) Data Engineering and Architecture, Streaming and IoT Deep dive into specific tools, platforms, or frameworks
HBase 2.0 and beyond
Krishna Maheshwari (Cloudera)
Krishna Maheshwari provides an overview of the major features and enhancements in the HBase 2.0 release, upcoming releases, and the future of HBase. You'll be able to ask her questions at the end. Apache HBase 2.0 comes packed with a lot of new functionalities: off-heap read paths, multitier bucket cache, new finite state machine-based assignment manager, etc.
11:20am-12:00pm (40m) Data Engineering and Architecture, Security and Privacy Data Management and Storage, Privacy and Security
Data security and privacy anti-patterns
Steven Touw (Immuta)
Anti-patterns are behaviors that take bad problems and lead to even worse solutions. In the world of data security and privacy, they’re everywhere. Over the past four years, data security and privacy anti-patterns have emerged across hundreds of customers and industry verticals—there's been an obvious trend. Steven Touw details five anti-patterns and, more importantly, the solutions for them.
1:15pm-1:55pm (40m) Data Engineering and Architecture, Security and Privacy Deep dive into specific tools, platforms, or frameworks, Health and Medicine, Privacy and Security
Parquet modular encryption: Confidentiality and integrity of sensitive column data
Gidon Gershinsky (IBM)
The Apache Parquet community is working on a column encryption mechanism that protects sensitive data and enables access control for table columns. Many companies are involved, and the mechanism specification has recently been signed off on by the community management committee. Gidon Gershinsky explores the basics of Parquet encryption technology, its usage model, and a number of use cases.
2:05pm-2:45pm (40m) Business Analytics and Visualization, Data Engineering and Architecture BI, Interactive Analytics and Visualization, Cloud Platforms and SaaS, Data Management and Storage
Building a best-in-class data lake on AWS and Azure
Tomer Shiran (Dremio), Jacques Nadeau (Dremio)
Data lakes have become a key ingredient in the data architecture of most companies. In the cloud, object storage systems such as S3 and ADLS make it easier than ever to operate a data lake. Tomer Shiran and Jacques Nadeau explain how you can build best-in-class data lakes in the cloud, leveraging open source technologies and the cloud's elasticity to run and optimize workloads simultaneously.
2:55pm-3:35pm (40m) Data Engineering and Architecture, Security and Privacy Privacy and Security
When machines fight machines: Cyberbattles and the new frontier of artificial intelligence
Marcus Fowler (Darktrace)
Cybersecurity must find what it doesn’t know to look for. AI technologies led to the emergence of self-learning, self-defending networks that achieve this, detecting and autonomously responding to in-progress attacks in real time. Marcus Fowler examines how these cyber-immune systems enable the security team to focus on high-value tasks, counter even machine-speed threats, and work in all environments.
4:35pm-5:15pm (40m) Data Engineering and Architecture Health and Medicine, Privacy and Security
Protecting the healthcare enterprise from PHI breaches using streaming and NLP
Jeff Zemerick (Mountain Fog)
Hospitals small and large are adopting cloud technologies, and many are in hybrid environments. These distributed environments pose challenges, none of which are more critical than the protection of protected health information (PHI). Jeff Zemerick explores how open source technologies can be used to identify and remove PHI from streaming text in an enterprise healthcare environment.
5:25pm-6:05pm (40m) Data Engineering and Architecture, Security and Privacy Media and Advertising, Privacy and Security
Secured computation: Analyzing sensitive data using homomorphic encryption
Matt Carothers (Cox Communications), Jignesh Patel (Cox Communications), Harry Tang (Cox Communications)
Organizations often work with sensitive information such as social security and credit card numbers. Although this data is stored in encrypted form, most analytical operations require data decryption for computation. This creates unwanted exposures to theft or unauthorized read by undesirables. Matt Carothers, Jignesh Patel, and Harry Tang explain how homomorphic encryption prevents fraud.
11:20am-12:00pm (40m) Strata Business Summit Model Development, Governance, Operations
Executive Briefing: Why machine-learned models crash and burn in production and what to do about it
David Talby (Pacific AI)
Machine learning and data science systems often fail in production in unexpected ways. David Talby outlines real-world case studies showing why this happens and explains what you can do about it, covering best practices and lessons learned from a decade of experience building and operating such systems at Fortune 500 companies across several industries.
1:15pm-1:55pm (40m) Executive Briefing and best practices, Strata Business Summit Culture and Organization, Data Management and Storage, Data, Analytics, and AI Architecture
Executive Briefing: Top 10 big data blunders
Michael Stonebraker (Tamr)
As a steward for your enterprise’s data and digital transformation initiatives, you’re tasked with making the right choices. But before you can make those decisions, it’s important to understand what not to do when planning for your organization’s big data initiatives. Michael Stonebraker shares his top 10 big data blunders.
2:05pm-2:45pm (40m) Strata Business Summit Culture and Organization, Financial Services
Executive Briefing: Building a data-assisted organization
Arup Nanda (Capital One)
Every organization wants to use data more effectively and as a weapon, but few succeed. Arup Nanda explores how Priceline started on this journey and how it was successful using different techniques and tools. Join in to learn how to streamline data assets, make it easier for end users, define KPIs, create value from data, and build sponsorships to build a data organization.
2:55pm-3:35pm (40m) Executive Briefing and best practices, Strata Business Summit Ethics
Executive Briefing: Understanding the cult of prediction
Farrah Bostic (The Difference Engine)
We're living in a culture obsessed with predictions. In politics and business, we collect data in service of the obsession. But our need for certainty and control leads some organizations to be duped by unproven technology or pseudoscience—often with unforeseen societal consequences. Farrah Bostic looks at historical—and sometimes funny—examples of sacrificing understanding for "data."
4:35pm-5:15pm (40m) Executive Briefing and best practices, Strata Business Summit Data quality, data governance and data lineage
Executive Briefing: Data catalogs—Concepts, capabilities, and key platforms
Andrew Brust (Blue Badge Insights | ZDNet)
Andrew Brust provides a primer on data catalogs and a review of the major vendors and platforms in the market. He examines the use of data catalogs with classic and newer data repositories, including data warehouses, data lakes, cloud object storage, and even software and applications. You'll learn about AI's role in the data catalog world and get an analysis of data catalog futures.
5:25pm-6:05pm (40m) Executive Briefing and best practices, Strata Business Summit Data, Analytics, and AI Architecture, Privacy and Security
Executive Briefing: Making intelligent insights at the edge—The demise of big data?
Alasdair Allan (Babilim Light Industries)
The arrival of a new generation of smart embedded hardware may cause the demise of large-scale data harvesting. In its place, smart devices will let us process data at the edge and extract insights without storing potentially privacy- and GDPR-infringing data. Join Alasdair Allan to learn why the current age where privacy is no longer "a social norm" may not long survive the coming of the IoT.
11:20am-12:00pm (40m) Executive Briefing and best practices, Strata Business Summit Culture and Organization
Improve your data science ROI with a portfolio and risk management lens
Brian Dalessandro (Capital One)
While data science value is well recognized within tech, experience across industries shows that the ability to realize and measure business impact is not universal. A core issue is that data science programs face unique risks many leaders aren’t trained to hedge against. Brian Dalessandro addresses these risks and advocates for new ways to think about and manage data science programs.
1:15pm-1:55pm (40m) Case studies, Strata Business Summit BI, Interactive Analytics and Visualization, Cloud Platforms and SaaS, Streaming and IoT
Turning petabytes of data from millions of vehicles into open data with Geotab
Felipe Hoffa (Google), Bob Bradley (Geotab)
Geotab is a world-leading asset-tracking company with millions of vehicles under service every day. Felipe Hoffa and Bob Bradley examine the challenges and solutions involved in creating an ML- and geographic information system (GIS)-enabled petabyte-scale data warehouse leveraging Google Cloud. They also dive into the process of publishing open data, how you can access it, and how cities are using it.
2:05pm-2:45pm (40m) Strata Business Summit
Executive Briefing: Usable machine learning—Lessons from Stanford and beyond
Peter Bailis (Sisu | Stanford University)
Despite a meteoric rise in data volumes within modern enterprises, enabling nontechnical users to put this data to work in diagnostic and predictive tasks remains a fundamental challenge. Peter Bailis details the lessons learned in building new systems to help users leverage the data at their disposal, drawing on production experience from Facebook, Microsoft, and the Stanford DAWN project.
2:55pm-3:35pm (40m) Case studies, Strata Business Summit Streaming and IoT, Telecom, Transportation and Logistics
Enabling 5G use cases through location intelligence
Tim McKenzie (Pitney Bowes)
Tim McKenzie examines why planning 5G network rollout and associated services requires a good understanding of location-based data. Accurate addressing and linking consumers to property or points of interest allows data enrichment with attributes, demographics and social data. Companies use location to organize and analyze network and customer data to understand where to target new services.
4:35pm-5:15pm (40m) Case studies, Strata Business Summit Text and Language processing and analysis
What does the public say? A computational analysis of regulatory comments
Vlad Eidelman (FiscalNote)
While regulations affect your life every day, and millions of public comments are submitted to regulatory agencies in response to their proposals, analyzing the comments has traditionally been reserved for legal experts. Vlad Eidelman outlines how natural language processing (NLP) and machine learning can be used to automate the process by analyzing over 10 million publicly released comments.
5:25pm-6:05pm (40m) Case studies, Strata Business Summit Data, Analytics, and AI Architecture, Health and Medicine
How Brazil deployed a 160 million-person biometric identification system: Challenges, benefits, and lessons learned
Thiago Ribeiro (Griaule)
Brazil deployed a national biometric system to register all Brazilian voters using multiple biometric modalities and to ensure that a person does not enroll twice. This session highlights how a large-scale biometric system works and the main architecture decisions that one has to take into consideration.
11:20am-12:00pm (40m) Strata Business Summit
Embrace complexity: The new rules of AI
Janet Haven (Data & Society)
Join Data & Society Research Institute Executive Director Janet Haven for a deep dive into research, case studies and emerging governance approaches to creating the rules of ethical AI.
1:15pm-1:55pm (40m) Law and Ethics, Strata Business Summit Ethics, Privacy and Security
War stories from the front lines of ML
Andrew Burt (Immuta), Brenda Leong (Future of Privacy Forum), David Florsek (IDEMIA NSS), Alex Beutel (Google Brain), Chris Wheeler (Mastercard)
Machine learning techniques are being deployed across almost every industry and sector. But this adoption comes with real, and oftentimes underestimated, privacy and security risks. Andrew Burt and Brenda Leong convene a panel of experts including David Florsek, Chris Wheeler, and Alex Beutel to detail real-life examples of when ML goes wrong, and the lessons they learned.
2:05pm-2:45pm (40m) Security and Privacy, Strata Business Summit Ethics, Privacy and Security
Regulations and the future of data
Andrew Burt (Immuta), Brenda Leong (Future of Privacy Forum), Boris Segalis (Cooley), Susan Israel (Loeb & Loeb, LLP)
From the EU to California and China, more of the world is regulating how data can be used. Andrew Burt and Brenda Leong convene leading experts on law and data science for a deep dive into ways to regulate the use of AI and advanced analytics. Come learn why these laws are being proposed, how they’ll impact data, and what the future has in store.
2:55pm-3:35pm (40m) Security and Privacy, Strata Business Summit Privacy and Security
Are your privacy practices auditor approved?
Mark Hinely (KirkpatrickPrice)
The fear that comes along with new compliance requirements is overwhelming. Organizations don’t know where to start, what to fix, or what an auditor expects to see. Mark Hinely gives you an auditor's perspective on the newest security and privacy regulations, how your business can prepare for compliance, and what the audit looks like to an auditor.
4:35pm-5:15pm (40m) Business Analytics and Visualization, Strata Business Summit BI, Interactive Analytics and Visualization, Data Management and Storage, Deep dive into specific tools, platforms, or frameworks
Supercharging Elasticsearch for extended Knowledge Graph use cases
Giovanni Tummarello (Siren)
Elasticsearch (ES) allows extremely quick search and drilldowns on large amounts of semistructured data. Elasticsearch, however, does not have relational join capabilities. Giovanni Tummarello examines a plug-in for ES that adds cluster distributed joins and demonstrates how it enables an exciting array of use cases dealing with interconnected or "Knowledge Graph" enterprise data.
5:25pm-6:05pm (40m) Case studies, Strata Business Summit Data quality, data governance and data lineage, Ethics, Health and Medicine
Looking beyond the binary: How data for development impacts gender justice?
Brindaalakshmi K (Independent Consultant)
There's a lack of standards for the collection of gender data. Brindaalakshmi K takes a look at the implications of such a lack in the context of a developing country like India, the exclusion of individuals beyond the binary genders of male and female, and how this exclusion permeates beyond the public sector into private sector services.
11:20am-12:00pm (40m) Sponsored
Mass migration: Tales of moving on-premises Hadoop to Google Cloud (sponsored by Google Cloud)
James Malone (Google)
James Malone takes a deep dive into how customers across the world partner with Google Cloud to reimagine big data processing and data lakes while generating incredible business value.
1:15pm-1:55pm (40m) Sponsored
The ugly truth about making analytics actionable (sponsored by SAS)
Diana Shaw (SAS)
Companies today are working to adopt data-driven mind-sets, strategies, and cultures. Yet the ugly truth is many still struggle to make analytics actionable. Diana Shaw outlines a simple, powerful, and automated solution to operationalize all types of analytics at scale. You'll learn how to put analytics into action while providing model governance and data scalability to drive real results.
2:05pm-2:45pm (40m) Sponsored
Bringing together machine and human intelligence in business applications at enterprise scale (sponsored by SAP)
Kevin Poskitt (SAP), Andreas Wesselmann (SAP)
Oftentimes there's a fracture between the highly governed data of enterprise IT systems and the comprehensive but often ungoverned world of large-scale data lakes and streams of data from blogs, system logs, sensors, IoT devices, and more. Kevin Poskitt and Andreas Wesselmann walk you through how AI needs to connect to all of this data, as well as image, video, audio, and text data sources.
2:55pm-3:35pm (40m) Sponsored
Take the bias out of big data insights with augmented analytics (sponsored by Kyligence)
Dong Li (Kyligence), Hongbin Ma (Kyligence)
Your analytics are biased. Efforts to extract meaning by manually scrubbing, indexing, and parsing big data are limited by time, cost, and human assumptions. Dong Li and Hongbin Ma offer an overview of augmented analytics. It takes OLAP into the future with AI, ensuring objective and unique insights that cover all relevant scenarios found in petabytes of multidimensional and variable data.
4:35pm-5:15pm (40m) Sponsored
DevOps in the cloud: Deploy, monitor, manage and automate (sponsored by Impetus)
Amit Assudani (Impetus)
Data lakes and analytical processing on the cloud are a reality. This presents new challenges for DevOps with respect to governance, continuous integration and deployment, and more. This session will present our views on how to maintain sanity in your development organization while implementing the many dimensions of building an efficient cloud-based data platform and application development environment.
5:25pm-6:05pm (40m) Data Science, Machine Learning, & AI Transportation and Logistics
Harnessing graph-native algorithms to enhance machine learning: A primer
Brandy Freitas (Pitney Bowes)
Brandy Freitas examines the interplay between graph analytics and machine learning, improved feature engineering with graph native algorithms, and how to harness the power of graph structure for machine learning through node embedding.
11:20am-12:00pm (40m) Sponsored
Navigating the Transition to a Data First Enterprise: an Intel perspective (sponsored by Intel)
Jeremy Rader (Intel)
This session will reveal firsthand insights of an Intel analytics practitioner, share Intel IT’s own data maturity journey, and provide actionable best known methods (BKMs) for enterprises in the midst of transforming into intelligent, data-first businesses.
1:15pm-1:55pm (40m) Sponsored
Low-latency computing and stream processing for financial systems (sponsored by Hazelcast)
John DesJardins (Hazelcast)
In this talk, we will explore the challenges with integrating real-time stream processing and machine learning into banking and capital markets applications.
2:05pm-2:45pm (40m) Sponsored
Solving for enterprise scale analytics and agile data operations (sponsored by Infoworks)
Amar Arsikere (infoworks.io)
The breakneck pace of business change and its insatiable appetite for data and analytics to drive digital transformation make agile use of data an imperative.
2:55pm-3:35pm (40m) Sponsored
Migrating Apache Spark and Hive from on-premises to Amazon EMR (sponsored by Amazon Web Services)
Radhika Ravirala (Amazon Web Services)
Radhika Ravirala explains how to migrate your workloads to Amazon EMR. Join in to learn the key motivations and benefits from a move to the cloud, along with the architectural changes required and best practices you can use right away.
4:35pm-5:15pm (40m) Sponsored
Clean the swamp: Gain greater visibility, speed, and governance with data ops (sponsored by Hitachi Vantara)
Chuck Yarbrough (Hitachi Vantara)
According to Gartner, over 80% of data lake projects were deemed inefficient. Data lakes come and go. Swamps happen. Data agility is fleeting. Chuck Yarbrough walks you through how data ops practices and a modern data architecture bring greater visibility and allow faster data access with proper governance.
5:25pm-6:05pm (40m) Data Engineering and Architecture Data quality, data governance and data lineage, Retail and e-commerce
The why and how of data lineage
Neelesh Salian (Stitch Fix)
Every data team has to build an ecosystem that sustains the data, the users, and the use of the data itself. This data ecosystem comes with its own challenges during the building phase, maintenance, and enhancement. Neelesh Salian dives into the importance of data lineage for an organization. You'll explore how to go about building such a system.
11:20am-12:00pm (40m) Sponsored
Building a fast, scalable, efficient operational analytics and reporting application using MemSQL, Docker, Airflow, and Prometheus (sponsored by MemSQL)
Praveen Chitrada (Akamai Technologies)
Praveen Chitrada walks you through how Akamai uses MemSQL, Docker, Airflow, Prometheus, and other technologies as an enabler to streamline and accelerate data ingestion and calculation to generate usage metrics for billing, reporting, and analytics at massive scale.
1:15pm-1:55pm (40m) Sponsored
Running AI workloads in containers (sponsored by BMC Software)
See-Kit Lam (Malwarebytes), Darren Chinen (Malwarebytes)
Developing, deploying and managing AI and anomaly detection models is tough business. See-Kit Lam details how Malwarebytes has leveraged containerization, scheduling, and orchestration to build a behavioral detection platform and a pipeline to bring models from concept to production.
2:05pm-2:45pm (40m) Sponsored
ALDO’s data strategy to create the right customer experience for its consumers (sponsored by Talend)
Aaron Swanson (Talend)
Winning the hearts and minds of millennials and Gen Z is not an easy task. ALDO has devised a data-driven strategy to create the best consumer experience. Today ALDO relies on Talend and AWS. Aaron Swanson explains the choices made for its data architecture and the hurdles the teams had to solve to turn the vision into reality.
2:55pm-3:35pm (40m) Sponsored
Architecting a data analytics service both in the public cloud and in the on-premise private cloud: ETL, BI, and machine learning (sponsored by SK Holdings)
Jungwook Seo (SK Holdings)
Jungwook Seo walks you through a data analytics platform in the cloud by the name of AccuInsight+ with eight data analytic services in the CloudZ (one of the biggest cloud service providers in Korea), which SK Holdings announced in January 2019.
5:25pm-6:05pm (40m) Sponsored
The future of Hadoop in an era of exponentially growing data (sponsored by SQream)
David Leichner (SQream)
What started as an asset for data scientists and BI professionals has become a poorly performing problem. David Leichner explores the Hadoop ecosystem and relational databases from an analytics perspective—reviewing the current landscape, what Hadoop was designed for, and how a Hadoop-based infrastructure can be improved to support a new era of exponentially growing data.
11:20am-12:00pm (40m) Sponsored
Operationalizing AI and ML with Cisco Data Intelligence Platform (sponsored by Cisco)
Han Yang (Cisco), Karthik Kulkarni (Cisco)
Artificial intelligence and machine learning are well beyond the laboratory exploratory stage of deployment. In fact, the speed of AI and ML deployment has a huge impact on an organization’s financial income. Han Yang and Karthik Kulkarni explore how the Cisco Data Intelligence Platform can help bridge the gap between AI and ML and big data.
1:15pm-1:55pm (40m) Sponsored
Data science isn't just another job (sponsored by Anaconda)
Peter Wang (Anaconda)
Peter Wang explores why data science shouldn’t be seen as merely another technical job within the business and why open source is such a critical aspect of innovation in the field of data science.
2:05pm-2:45pm (40m) Sponsored
Mastercard and Pitney Bowes: Creating a data-driven business (sponsored by Pitney Bowes)
Olga Lagunova (Pitney Bowes), John Derrico (Mastercard)
Mastercard and Pitney Bowes have overcome many challenges on their journey to accelerate innovation, achieve efficiencies, and improve the overall customer experience. Olga Lagunova and John Derrico share lessons learned as the data strategy evolved and highlight pitfalls and solutions from data science projects across several industries, from finance to cross-border shipping logistics.
2:55pm-3:35pm (40m) Sponsored
See what others can’t with spatial analysis and data science (sponsored by Esri)
Shannon Kalisky (Esri), Alberto Nieto (Esri)
Digital location data is a crucial part of data science. The "where" matters as much to an analysis as the "what" and the "why." Shannon Kalisky and Alberto Nieto explore tools that help you apply a range of geospatial techniques in your data science workflows to get deeper insights.
4:35pm-5:15pm (40m) Sponsored
Semantics and graph data models in the enterprise data fabric (sponsored by Cambridge Semantics)
Barbara Petrocelli (Cambridge Semantics), Peter Ball (Consultant)
Join industry consultant Peter Ball, of Liminal Innovation, and Barbara Petrocelli, VP Field Operations of Cambridge Semantics, to learn how enterprise data fabrics are reshaping the modern data management landscape.
5:25pm-6:05pm (40m) Automation in data science and data, Data Engineering and Architecture, Data Science, Machine Learning, & AI Data, Analytics, and AI Architecture, Media and Advertising, Model Development, Governance, Operations
Challenges faced in machine learning infrastructure in traditional large enterprises
Venkata Gunnu (Comcast), Harish Doddi (Datatron)
Machine learning infrastructure is key to the success of AI at scale in enterprises, with many challenges when you want to bring machine learning models to a production environment, given the legacy of the enterprise environment. Venkata Gunnu and Harish Doddi explore some key insights, what worked, what didn't work, and best practices that helped the data engineering and data science teams.
11:20am-12:00pm (40m) Sponsored
The future? Data, AI, and multicloud: It’s time to modernize (sponsored by IBM)
Madhu Kochar (IBM)
An economic revolution is underway, driven by advancements in AI and multicloud technologies. Businesses are crafting strategic plans to modernize their data architecture for this emerging reality, and at the top of their wish list is the ability to virtualize all their data regardless of where it lives. Madhu Kochar explores the data advancements on the horizon.
1:15pm-1:55pm (40m) Sponsored
AI/ML on Oracle Cloud with Kinetica and H2O.ai (sponsored by Oracle Cloud Infrastructure)
Ben Lackey (Oracle)
Learn about running AI/ML solutions like H2O.ai and Kinetica on Oracle Cloud. The session will include a live demo of Terraform, Oracle Cloud Infrastructure, GPUs and Oracle Marketplace. We’ll discuss other leading Data and AI products including Cloudera, DataStax and Confluent.
2:55pm-3:35pm (40m) Sponsored
How to deploy large-scale distributed data analytics and machine learning on containers (sponsored by HPE (BlueData))
Anant Chintamaneni (HPE (BlueData)), Matt Maccaux (HPE (BlueData))
Anant Chintamaneni and Matt Maccaux explore whether the combination of containers with large-scale distributed data analytics and machine learning applications is like combining oil and water, or like peanut butter and chocolate.
4:35pm-5:15pm (40m) Sponsored
Solve tomorrow’s business challenges with a modern data warehouse (sponsored by Matillion)
Daniel D'Orazio (Matillion)
According to Forrester, insight-driven companies are on pace to make $1.8 trillion annually by 2021. Daniel D'Orazio wants to know how fast your team can collect, process, and analyze data to solve present—and future—business challenges. You'll gain actionable tips and lessons learned from cloud data warehouse modernizations at companies like DocuSign that you can take back to your business.
5:25pm-6:05pm (40m) Sponsored
How Nuveen rapidly integrated ESG data to advance its platform value (sponsored by Zaloni)
Ben Sharma (Zaloni), Santanu Sengupta (Nuveen)
Ben Sharma and Santanu Sengupta walk you through how to quickly integrate and accelerate environmental, social, and governance (ESG) data and third-party data into your environment to provide governed, trusted, and traceable data to portfolio managers and analysts in a self-service manner.
8:45am-8:50am (5m)
Wednesday keynotes
Ben Lorica (O'Reilly), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.
8:50am-9:05am (15m)
The road to an enterprise cloud
Mick Hollison (Cloudera), Hillery Hunter (IBM)
Learn how IBM and Cloudera are fueling innovation in IoT, streaming, data warehouse, and machine learning, and making their customers’ digital transformation journey easier, faster, and safer.
9:05am-9:15am (10m) Data Engineering and Architecture, Data Science, Machine Learning, & AI
Recent trends in data and machine learning technologies
Ben Lorica (O'Reilly)
Ben Lorica dives into emerging technologies for building data infrastructures and machine learning platforms.
9:15am-9:30am (15m)
Everything is connected and the clock is ticking: AI and big ag data for food security
Sara Menker (Gro Intelligence), Nemo Semret (Gro Intelligence)
Sara Menker, CEO, Gro Intelligence
9:30am-9:40am (10m) Sponsored
The future of Google Cloud data processing (sponsored by Google Cloud)
James Malone (Google)
Open source has always been a core pillar of Google Cloud’s data and analytics strategy. James Malone examines how, as the community continues to set industry standards, the company continues to integrate those standards into its services so organizations around the world can unlock the value of data faster.
9:40am-10:00am (20m)
AI isn't magic. It’s computer science.
Robert Thomas (IBM), Tim O'Reilly (O'Reilly Media)
AI has the potential to add $16 trillion to the global economy by 2030, but adoption has been slow. While we understand the power of AI, many of us aren’t sure how to fully unleash its potential. Join Robert Thomas and Tim O'Reilly to learn that the reality is AI isn't magic. It’s hard work.
10:00am-10:05am (5m) Sponsored
Unleash the power of data at scale (sponsored by Intel)
Jeremy Rader (Intel)
Data analytics is the long-standing but constantly evolving science that companies leverage for insight, innovation, and competitive advantage. Jeremy Rader explores Intel’s end-to-end data pipeline software strategy designed and optimized for a modern and flexible data-centric infrastructure that allows for the easy deployment of unified advanced analytics and AI solutions at scale.
10:05am-10:20am (15m)
How disruptive tech is reshaping the financial services industry
Swatee Singh (American Express)
The financial services industry is increasingly using disruptive technology—including AI and machine learning, edge computing, blockchain, mobile and mixed reality, virtual assistants, and quantum computing to name a few—to enhance the customer experience and personalize their interactions with customers. Swatee Singh outlines how the same is true at American Express.
10:20am-10:25am (5m) Sponsored
It’s not you; it’s your database: How to unlock the full potential of your operational data (sponsored by MemSQL)
Nikita Shamgunov (MemSQL)
Data is now the world’s most valuable resource, with winners and losers decided every day by how well we collect, analyze, and act on data. However, most companies struggle to unlock the full value of their data, using outdated, outmoded data infrastructure. Nikita Shamgunov examines how businesses use data, the new demands on data infrastructure, and what you should expect from your tools.
10:25am-10:30am (5m) Sponsored
Cisco Data Intelligence Platform (sponsored by Cisco)
Siva Sivakumar (Cisco)
Siva Sivakumar explains the Cisco Data Intelligence Platform (CDIP), which is a cloud-scale architecture that brings together big data, AI and compute farm, and storage tiers to work together as a single entity, while also being able to scale independently to address the IT issues in the modern data center.
10:30am-10:45am (15m)
Interactive sports analytics
Patrick Lucey (Stats Perform)
Imagine watching sports and being able to immediately find all plays that are similar to what just happened. Better still, imagine being able to draw a play with Xs and Os on an interface, the way a coach draws on a chalkboard, and instantaneously finding all the similar plays and conducting analytics on them. Join Patrick Lucey to see how this is possible.
10:50am-11:20am (30m)
Break: Morning break sponsored by Intel
12:00pm-1:15pm (1h 15m)
Break: Lunch sponsored by Google Cloud
12:00pm-1:15pm (1h 15m)
Wednesday Topic Tables at Lunch
Topic Table discussions are a great way to informally network with people in similar industries or interested in the same topics.
12:00pm-1:15pm (1h 15m)
Better Together Diversity Networking Lunch
If you’d like to make new professional connections and hear ideas for supporting diversity in the tech community, come to the diversity and inclusion networking lunch on Wednesday.
12:30pm-1:10pm (40m) Sponsored
10 things to know about running and migrating Hadoop to GCP (sponsored by Google Cloud)
Blake DuBois (Google)
Taking advantage of cloud infrastructure and analytic services is a must for any digital enterprise. Join Google Cloud as they discuss 10 things you should know about running and migrating on-prem Hadoop deployments to GCP.
3:35pm-4:35pm (1h)
Break: Afternoon break sponsored by MemSQL
6:05pm-7:05pm (1h)
Booth Crawl
Make your way from booth to booth while you check out all the exhibitors in the Expo Hall on Wednesday after sessions end.
8:00am-8:30am (30m)
Speed Networking
Gather before keynotes on Wednesday morning to enjoy casual conversation while meeting fellow attendees. If one of your goals at Strata is to meet new people, this session will jumpstart your networking with other attendees.
8:30am-8:45am (15m)
Break: Early morning coffee (8:00am - 8:45am)
12:00pm-1:15pm (1h 15m)
Wednesday Business Summit Lunch
Join fellow executives, business leaders, and strategists for a networking lunch on Wednesday for Strata Business Summit attendees and speakers.
7:30pm-10:30pm (3h)
Data After Dark
Don't miss an exciting evening filled with cocktails, food, and entertainment at Data After Dark at Strata in New York.

    Contact us

    confreg@oreilly.com

    For conference registration information and customer service

    partners@oreilly.com

    For more information on community discounts and trade opportunities with O’Reilly conferences

    strataconf@oreilly.com

    For information on exhibiting or sponsoring a conference

    pr@oreilly.com

    For media/analyst press inquiries