Sep 23–26, 2019

Schedule

Topics

3B - Expo Hall

11:20am ML is not enough: Decision automation in the real world Brian Keng (Rubikloud)

1:15pm Handtrack.js: Building gesture-based interactions in the browser using TensorFlow Victor Dibia (Cloudera Fast Forward Labs)

2:05pm Machine learning for streaming data: Practical insights Heitor Murilo Gomes (Télécom ParisTech), Albert Bifet (Télécom ParisTech)

1A 06/07

11:20am Getting to know the elephant: Real-time debugging and visualization for deep learning Shital Shah (Microsoft Research)

1:15pm Scaling Apache Spark at Facebook Sameer Agarwal (Facebook), Ankit Agarwal (Facebook Inc.)

2:05pm Learning asset naming patterns to find risky unmanaged devices Ryan Foltz (Exabeam)

3:45pm Deep learning on Apache Spark at CERN’s Large Hadron Collider with Analytics Zoo Sajan Govindan (Intel)

4:35pm Deep learning technologies for giant hogweed eradication Naoto Umemori (NTT DATA), Masaru Dobashi (NTT DATA)

1A 08/10

11:20am Working with time series: Denoising and imputation frameworks to improve data density Anjali Samani (CircleUp)

1:15pm Handling data gaps in time series using imputation Alfred Whitehead (Klick), Clare Jeon (Klick)

2:05pm When Holt-Winters is better than machine learning Anais Dotis (InfluxData)

3:45pm Soss: Lightweight probabilistic programming in Julia Chad Scherrer (Metis)

4:35pm Scalable anomaly detection with Spark and SOS Jeroen Janssens (Data Science Workshops)

1A 12/14

11:20am A practical guide to algorithmic bias and explainability in machine learning Alejandro Saucedo (The Institute for Ethical AI & Machine Learning)

1:15pm Data need not be a moat: Mixed formal learning enables zero- and low-shot learning Sandra Carrico (GLYNT)

2:05pm Automating ML model training and deployments via metadata-driven data, infrastructure, feature engineering, and model management Mumin Ransom (Comcast), Nick Pinckernell (Comcast)

3:45pm An introduction to machine learning on graphs David Mack (Octavian)

1A 15/16

11:20am Your cloud, your ML, but more and more scale? How SurveyMonkey did it Jing Huang (SurveyMonkey), Jessica Mong (SurveyMonkey)

1:15pm Managing your Kafka in an explosive growth environment Alon Gavra (AppsFlyer)

2:05pm Posttransaction processing using Apache Pulsar at Narvar Davor Bonaci (Kaskada), Anand Madhavan (Narvar)

3:45pm SK Telecom's 5G network monitoring and 3D visualization on streaming technologies Jonghyok Lee (SK Telecom), Chon Yong Lee (SK Telecom)

1A 21/22

11:20am Online machine learning in streaming applications Stavros Kontopoulos (Lightbend), Debasish Ghosh (Lightbend)

1:15pm Problems taking AI to production and how to fix them Jim Scott (NVIDIA)

2:05pm The new SDLC: CI/CD in the age of machine learning Diego Oppenheimer (Algorithmia)

3:45pm ML ops: Applying DevOps practices to machine learning workloads Sireesha Muppala (Amazon Web Services), Shelbee Eigenbrode (Amazon Web Services), Randall DeFauw (Amazon Web Services)

1A 23/24

11:20am Performant time series data management and analytics with PostgreSQL Michael Freedman (TimescaleDB | Princeton University)

1:15pm How to performance-tune Spark applications in large clusters Omkar Joshi (Uber), Bo Yang (Uber)

2:05pm Creating an extensible 100+ PB real-time big data platform by unifying storage and serving Reza Shiftehfar (Uber)

3:45pm Enabling big data and AI workloads on the object store at DBS Bank Vitaliy Baklikov (DBS Bank), Dipti Borkar (Alluxio)

4:35pm Bridging the gap between big data computing and high-performance computing Supun Kamburugamuve (Indiana University)

1E 07/08

11:20am Using Spark for crunching astronomical data on the LSST scale Petar Zecevic (SV Group)

1:15pm The hitchhiker’s guide to the cloud: Architecting for the cloud through customer stories Sushant Rao (Cloudera)

2:05pm Fuzzy matching and deduplicating data: Techniques for advanced data prep Nikki Rouda (Amazon Web Services), Janisha Anand (Amazon Web Services)

3:45pm Lessons learned from scaling the tech stack of a modern analytics platform Scott Castle (Sisense)

4:35pm Spark on Kubernetes for data science Jordan Volz (Dataiku)

1E 09

11:20am Where's my lookup table? Modeling relational data in a denormalized world Rick Houlihan (Amazon Web Services)

1:15pm Intelligent design patterns for cloud-based analytics and BI Shant Hovsepian (Arcadia Data)

2:05pm Securing your cloud data lake with a "defense in depth" approach Tomer Shiran (Dremio), Jacques Nadeau (Dremio)

3:45pm Protect your private data in your Hadoop clusters with ORC column encryption Owen O'Malley (Cloudera)

4:35pm Using Spark to speed up the diagnosis performance for big data applications Ruixin Xu (Microsoft), Long Tian (Microsoft), Yu Zhou (Microsoft)

1E 10/11

11:20am Executive Briefing: Creating a center for data science from scratch—Lessons from nonprofit research Gayle Bieler (RTI International)

1:15pm Executive Briefing: Lessons from the front lines—Building a responsible AI/ML program in the enterprise Keegan Hines (Capital One)

2:05pm Executive Briefing: Unpacking AutoML Paco Nathan (derwen.ai)

3:45pm Executive Briefing: Building a culture of self-service from predeployment to continued engagement Jonathan Tudor (GE Aviation), Ross Schalmo (GE Aviation)

4:35pm Executive Briefing: What it takes to use machine learning in fast data pipelines Dean Wampler (Anyscale)

1E 12/13

11:20am Executive Briefing: Say what? The ethical challenges of designing for humanlike interaction Jonathan Foster (Microsoft)

1:15pm An in-depth look at the data science career: Defining roles, assessing skills Usama Fayyad (Open Insights & OODA Health, Inc.), Hamit Hamutcu (Analytics Center)

2:05pm T-Mobile's journey to turn crowdsourced big data into actionable insights Alex Yoon (T-Mobile)

3:45pm Migrating millions of users from voice- and email-based customer support to a chatbot Madhu Gopinathan (MakeMyTrip), Sanjay Mohan (MakeMyTrip)

4:35pm Combining creativity and analytics David Boyle (Audience Strategies)

1E 14

11:20am How Deutsche Bank industrialized AI and machine learning John Allen (Deutsche Bank)

1:15pm Communication breakdown: Facing machine learning’s all-too-human failure James Kotecki (Infinia ML)

2:05pm ThirdEye: LinkedIn’s business-wide monitoring platform Akshay Rai (LinkedIn)

3:45pm Purposefully designing technology for civic engagement Audrey Lobo-Pulo (Phoensight), Annette Hester (National Energy Board, Canada)

4:35pm Executive Briefing: Big data in the era of heavy worldwide privacy regulations Mark Donsky (Okera)

1A 01/02

11:20am The key to climbing the AI ladder (sponsored by IBM) Daniel Hernandez (IBM)

1:15pm So you built a model; now what? (sponsored by Dataiku) Jed Dougherty (Dataiku)

2:05pm Powering the future with data intelligence (sponsored by Collibra) Jim Cushman (Collibra), Piyush Jain (Progressive)

1A 03

11:20am Deliver personalized experiences and content like Xbox with Cognitive Services Personalizer (sponsored by Microsoft Azure) Edward Jezierski (Microsoft), Jackie Nichols (Microsoft)

1:15pm Migrating Hadoop analytics to Spark in the cloud without disruption (sponsored by WANdisco) Paul Scott-Murphy (WANdisco)

2:05pm Stream processing beyond streaming data Stephan Ewen (Ververica)

1A 04/05

11:20am Organizing the chaos of healthcare with smart data discovery (sponsored by Io-Tahoe) Charles Boicey (Clearsense)

1:15pm Next-generation serverless data architecture for insights at the speed of thought (sponsored by Actian) Paul Wolmering (Actian)

2:05pm Getting clinical trial data ready for analysis: How IQVIA wrangled its way to success (sponsored by Trifacta) Matt Derda (Trifacta), Yogesh Prasad (IQVIA)

1E 06

11:20am Transforming Financial Reporting Services with Massively Scalable OLAP (sponsored by Kyvos Insights) Ajay Anand (Kyvos Insights)

1:15pm The end of applications: How data collaboration is changing everything (sponsored by Cinchy) Dan DeMers (Cinchy)

3E

8:45am Thursday keynotes Ben Lorica (O'Reilly), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)

8:55am Staying safe in the AI era Cassie Kozyrkov (Google)

9:15am Unlocking the value of your data (sponsored by IBM) Daniel Hernandez (IBM)

9:25am Delivering the enterprise data cloud Arun Murthy (Cloudera)

9:35am Postrevolutionary big data: Promoting the general welfare (sponsored by Io-Tahoe) Barbara Eckman (Comcast)

9:40am RL in real life: Bringing reinforcement learning to the enterprise (sponsored by Microsoft Azure) Edward Jezierski (Microsoft)

9:45am Strata Data Awards: Winners announced

9:55am Say what? The ethical challenges of designing for humanlike interaction Jonathan Foster (Microsoft)

10:15am Data Science Pioneers: Conquering the next frontier, a documentary investigating the future of data science (sponsored by Dataiku) Jed Dougherty (Dataiku)

10:20am Data sonification: Making music from the yield curve Alan Smith (Financial Times)

10:40am Closing remarks Ben Lorica (O'Reilly), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)

10:50am Morning break sponsored by Cisco | Room: Expo Hall - 3B

12:00pm Break | Room: Expo Hall - 3B

12:00pm Thursday Topic Tables at Lunch (sponsored by IBM) | Room: Expo Hall - 3B

12:30pm Why AI fails: Overcoming AI challenges (sponsored by IBM) | Room: 3B - Expo Hall Brittany Bogle (IBM)

2:45pm Afternoon break sponsored by Io-Tahoe | Room: Expo Hall - 3B

8:00am Speed Networking | Room: Keynote Foyer

8:30am Early morning coffee (8:00am - 8:45am) | Room: Keynote Foyer

12:00pm Thursday Business Summit Lunch | Room: Expo Hall - 3D

11:20am-12:00pm (40m) Data Science, Machine Learning, & AI, Expo Hall Culture and Organization, Retail and e-commerce, Transportation and Logistics

ML is not enough: Decision automation in the real world

Brian Keng (Rubikloud)

Automating decisions requires a system to consider more than just a data-driven prediction. Real-world decisions require additional constraints and fuzzy objectives to ensure they're robust and consistent with business goals. Brian Keng takes a deep dive into how to leverage modern machine learning methods and traditional mathematical optimization techniques for decision automation.

1:15pm-1:55pm (40m) Data Science, Machine Learning, & AI, Expo Hall Deep dive into specific tools, platforms, or frameworks, Deep Learning

Handtrack.js: Building gesture-based interactions in the browser using TensorFlow

Victor Dibia (Cloudera Fast Forward Labs)

Recent advances in machine learning frameworks for the browser, such as TensorFlow, provide the opportunity to craft truly novel experiences within frontend applications. Victor Dibia explores the state of the art for machine learning in the browser using TensorFlow and outlines its use in the design of Handtrack.js, a library for prototyping real-time hand detection in the browser.

2:05pm-2:45pm (40m) Data Science, Machine Learning, & AI, Expo Hall Streaming and IoT, Telecom, Temporal data and time-series analytics

Machine learning for streaming data: Practical insights

Heitor Murilo Gomes (Télécom ParisTech), Albert Bifet (Télécom ParisTech)

Heitor Murilo Gomes and Albert Bifet introduce you to a machine learning pipeline for streaming data using the streamDM framework. You'll also learn how to use streamDM for supervised and unsupervised learning tasks, see examples of online preprocessing methods, and discover how to expand the framework by adding new learning algorithms or preprocessing methods.

11:20am-12:00pm (40m) Data Science, Machine Learning, & AI

Getting to know the elephant: Real-time debugging and visualization for deep learning

Shital Shah (Microsoft Research)

Taming massive deep learning models, data, and training times requires a new way of thinking. Shital Shah explores new tools and methods to better understand AI. Explaining the decisions made by AI not only helps us accelerate its development but also makes it safer and more trustworthy.

1:15pm-1:55pm (40m) Data Science, Machine Learning, & AI

Scaling Apache Spark at Facebook

Sameer Agarwal (Facebook), Ankit Agarwal (Facebook Inc.)

Apache Spark is the largest compute engine at Facebook by CPU. Sameer Agarwal dives into the story of how Facebook optimized, tuned, and scaled Apache Spark to run on clusters of tens of thousands of machines, processing hundreds of petabytes of data, and being used by thousands of data scientists, engineers, and product analysts every day.

2:05pm-2:45pm (40m) Data Science, Machine Learning, & AI Deep Learning, Streaming and IoT

Learning asset naming patterns to find risky unmanaged devices

Ryan Foltz (Exabeam)

Unmanaged and foreign devices in corporate networks pose a security risk, and the first step toward reducing this risk is the ability to identify them. Ryan Foltz walks you through a comprehensive device management machine learning model based on deep learning that performs anomaly detection using only device names to flag devices that do not follow naming structures.

3:45pm-4:25pm (40m) Data Science, Machine Learning, & AI Deep Learning

Deep learning on Apache Spark at CERN’s Large Hadron Collider with Analytics Zoo

Sajan Govindan (Intel)

Sajan Govindan outlines CERN’s research on deep learning in high energy physics experiments as an alternative to customized rule-based methods with an example of topology classification to improve real-time event selection at the Large Hadron Collider. CERN uses deep learning pipelines on Apache Spark using BigDL and Analytics Zoo open source software on Intel Xeon-based clusters.

4:35pm-5:15pm (40m) Data Science, Machine Learning, & AI Deep Learning

Deep learning technologies for giant hogweed eradication

Naoto Umemori (NTT DATA), Masaru Dobashi (NTT DATA)

Giant hogweed is a highly toxic plant. Naoto Umemori and Masaru Dobashi aim to automate the process of detecting the plant with technologies like drones and image recognition and detection using machine learning. You'll see how they designed the architecture and took advantage of big data and machine and deep learning technologies (e.g., Hadoop, Spark, and TensorFlow), along with the lessons they learned.

11:20am-12:00pm (40m) Data Science, Machine Learning, & AI Financial Services, Temporal data and time-series analytics

Working with time series: Denoising and imputation frameworks to improve data density

Anjali Samani (CircleUp)

The application of smoothing and imputation strategies is common practice in predictive modeling and time series analysis. With a technique-agnostic approach, Anjali Samani provides qualitative and quantitative frameworks that address questions related to smoothing and imputation of missing values to improve data density.

1:15pm-1:55pm (40m) Data Science, Machine Learning, & AI Temporal data and time-series analytics

Handling data gaps in time series using imputation

Alfred Whitehead (Klick), Clare Jeon (Klick)

Time series forecasts depend on sensors or measurements made in the real, messy world. The sensors flake out, get turned off, disconnect, and otherwise conspire to cause missing signals, signals that may tell you what tomorrow's temperature will be or what your blood glucose levels are before bed. Alfred Whitehead and Clare Jeon explore methods for handling data gaps and when to consider which.
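
One of the simplest ways to handle a data gap is linear interpolation between the nearest known readings. The following is a minimal sketch in plain Python, not the speakers' method; the function name `interpolate_gaps` and the use of `None` to mark a missing reading are assumptions made for illustration.

```python
from typing import List, Optional


def interpolate_gaps(values: List[Optional[float]]) -> List[Optional[float]]:
    """Fill interior runs of None by linear interpolation.

    Leading and trailing gaps are left as None because they have
    a known neighbor on only one side.
    """
    filled = list(values)
    # Indices of all readings that are actually present.
    known = [i for i, v in enumerate(values) if v is not None]
    # Walk over each pair of consecutive known readings.
    for left, right in zip(known, known[1:]):
        gap = right - left
        if gap > 1:
            step = (values[right] - values[left]) / gap
            for k in range(1, gap):
                filled[left + k] = values[left] + step * k
    return filled


readings = [1.0, None, None, 4.0, None, 6.0, None]
print(interpolate_gaps(readings))
```

Linear interpolation is reasonable for short gaps in slowly varying signals; longer gaps or strongly seasonal data usually call for a model-based approach.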

2:05pm-2:45pm (40m) Data Science, Machine Learning, & AI Temporal data and time-series analytics

When Holt-Winters is better than machine learning

Anais Dotis (InfluxData)

Machine learning (ML) gets a lot of hype, but its classical predecessors are still immensely powerful, especially in the time series space, and classical algorithms outperform machine learning methods in time series forecasting. Anais Dotis dives into how she used the Holt-Winters forecasting algorithm to predict water levels in a creek.
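
For readers unfamiliar with Holt-Winters, the sketch below shows additive triple exponential smoothing in plain Python. It is an illustration only, not the speaker's or InfluxData's implementation; the function name, the initialization scheme, and the parameter values are assumptions.

```python
from typing import List


def holt_winters_additive(series: List[float], season_len: int,
                          alpha: float, beta: float, gamma: float,
                          horizon: int) -> List[float]:
    """Forecast `horizon` steps ahead with additive Holt-Winters.

    alpha, beta, and gamma smooth the level, trend, and seasonal
    components respectively (each between 0 and 1).
    """
    if len(series) < 2 * season_len:
        raise ValueError("need at least two full seasons of data")

    # Assumed initialization: level is the mean of the first season,
    # trend is the average per-step change between the first two seasons.
    first = series[:season_len]
    second = series[season_len:2 * season_len]
    level = sum(first) / season_len
    trend = (sum(second) - sum(first)) / (season_len ** 2)
    seasonals = [y - level for y in first]

    for t, y in enumerate(series):
        s = seasonals[t % season_len]
        prev_level = level
        level = alpha * (y - s) + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonals[t % season_len] = gamma * (y - level) + (1 - gamma) * s

    n = len(series)
    return [level + h * trend + seasonals[(n + h - 1) % season_len]
            for h in range(1, horizon + 1)]


# A purely seasonal toy series with period 3 and no trend.
history = [10.0, 20.0, 30.0] * 3
print(holt_winters_additive(history, season_len=3,
                            alpha=0.5, beta=0.5, gamma=0.5, horizon=3))
```

On this toy input the forecast repeats the seasonal pattern; on real data the three smoothing parameters are normally fitted by minimizing forecast error.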

3:45pm-4:25pm (40m) Data Science, Machine Learning, & AI Deep dive into specific tools, platforms, or frameworks

Soss: Lightweight probabilistic programming in Julia

Chad Scherrer (Metis)

Chad Scherrer explores the basic ideas in Soss, a new probabilistic programming library for Julia. Soss allows a high-level representation of the kinds of models often written in PyMC3 or Stan, and offers a way to programmatically specify and apply model transformations like approximations or reparameterizations.

4:35pm-5:15pm (40m) Data Science, Machine Learning, & AI Temporal data and time-series analytics

Scalable anomaly detection with Spark and SOS

Jeroen Janssens (Data Science Workshops)

Jeroen Janssens dives into stochastic outlier selection (SOS), an unsupervised algorithm for detecting anomalies in large, high-dimensional data. SOS has been implemented in Python, R, and, most recently, Spark. He illustrates the idea and intuition behind SOS, demonstrates the implementation of SOS on top of Spark, and applies SOS to a real-world use case.

11:20am-12:00pm (40m) Data Science, Machine Learning, & AI Ethics

A practical guide to algorithmic bias and explainability in machine learning

Alejandro Saucedo (The Institute for Ethical AI & Machine Learning)

Alejandro Saucedo demystifies AI explainability through a hands-on case study, where the objective is to automate a loan-approval process by building and evaluating a deep learning model. He introduces motivations through the practical risks that arise with undesired bias and black box models and shows you how to tackle these challenges using tools from the latest research and domain knowledge.

1:15pm-1:55pm (40m) Data Science, Machine Learning, & AI Text and Language processing and analysis

Data need not be a moat: Mixed formal learning enables zero- and low-shot learning

Sandra Carrico (GLYNT)

Sandra Carrico explains mixed formal learning and outlines one machine learning example that previously used large numbers of examples and now learns with either zero or a handful of training examples. She maps apparently idiosyncratic techniques to mixed formal learning, a general AI architecture that you can use in your projects.

2:05pm-2:45pm (40m) Automation in data science and data, Data Science, Machine Learning, & AI Data quality, data governance and data lineage, Media and Advertising, Model Development, Governance, Operations

Automating ML model training and deployments via metadata-driven data, infrastructure, feature engineering, and model management

Mumin Ransom (Comcast), Nick Pinckernell (Comcast)

Mumin Ransom gives an overview of the data management and privacy challenges around automating ML model (re)deployments and stream-based inferencing at scale.

3:45pm-4:25pm (40m) Data Science, Machine Learning, & AI Financial Services

An introduction to machine learning on graphs

David Mack (Octavian)

Graphs are a powerful way to represent knowledge. Organizations, in fields such as biosciences and finance, are starting to amass large knowledge graphs, but they lack the machine learning tools to extract insights from them. David Mack offers an overview of what insights are possible and surveys the most popular approaches.

11:20am-12:00pm (40m) Data Engineering and Architecture Cloud Platforms and SaaS, Data, Analytics, and AI Architecture, Media and Advertising

Your cloud, your ML, but more and more scale? How SurveyMonkey did it

Jing Huang (SurveyMonkey), Jessica Mong (SurveyMonkey)

You're a SaaS company operating on a cloud infrastructure prior to the machine learning (ML) era and you need to successfully extend your existing infrastructure to leverage the power of ML. Jing Huang and Jessica Mong detail a case study with critical lessons from SurveyMonkey’s journey of expanding its ML capabilities with its rich data repo and hybrid cloud infrastructure.

1:15pm-1:55pm (40m) Data Engineering and Architecture Data Management and Storage, Deep dive into specific tools, platforms, or frameworks

Managing your Kafka in an explosive growth environment

Alon Gavra (AppsFlyer)

Kafka is often just a piece of the production stack that no one wants to touch, because it just works. Alon Gavra outlines how Kafka sits at the core of AppsFlyer's infrastructure, which processes billions of events daily.

2:05pm-2:45pm (40m) Data Engineering and Architecture, Streaming and IoT Data Integration and Data Processing, Data, Analytics, and AI Architecture, Retail and e-commerce, Streaming and IoT

Posttransaction processing using Apache Pulsar at Narvar

Davor Bonaci (Kaskada), Anand Madhavan (Narvar)

Narvar provides a next-generation posttransaction experience for over 500 retailers. Karthik Ramasamy and Anand Madhavan take you on the journey of how Narvar moved away from using a slew of technologies for its platform and consolidated its use cases using Apache Pulsar.

3:45pm-4:25pm (40m) Data Engineering and Architecture, Streaming and IoT Data Integration and Data Processing, Data, Analytics, and AI Architecture, Streaming and IoT, Telecom

SK Telecom's 5G network monitoring and 3D visualization on streaming technologies

Jonghyok Lee (SK Telecom), Chon Yong Lee (SK Telecom)

Jonghyok Lee and Chon Yong Lee discuss T-CORE, SK Telecom’s monitoring and service analytics platform, which collects system and application data from several thousand servers and applications and provides a 3D visualization of the real-time status of the whole network. Join in to hear lessons learned during development.

11:20am-12:00pm (40m) Data Engineering and Architecture, Streaming and IoT Streaming and IoT, Temporal data and time-series analytics

Online machine learning in streaming applications

Stavros Kontopoulos (Lightbend), Debasish Ghosh (Lightbend)

Stavros Kontopoulos and Debasish Ghosh explore online machine learning algorithm choices for streaming applications, especially those with resource-constrained use cases like IoT and personalization. They dive into Hoeffding Adaptive Trees, classic sketch data structures, and drift detection algorithms from implementation to production deployment, describing the pros and cons of each of them.
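
As background for the Hoeffding trees mentioned above, the sketch below computes the standard Hoeffding bound and uses it for the usual split test: a node splits once the gap between the two best attributes exceeds the bound. It is a minimal illustration, not the speakers' code; the function names and example numbers are assumptions.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon such that, with probability 1 - delta, the true mean
    of a variable with the given range lies within epsilon of the mean
    observed over n independent samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float,
                 value_range: float, delta: float, n: int) -> bool:
    """Split when the best attribute beats the runner-up by more than epsilon."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


# The bound shrinks as more stream examples reach the node.
for n in (100, 1_000, 10_000):
    print(n, round(hoeffding_bound(1.0, 1e-7, n), 4))
```

Because the bound depends only on the sample count and not on the data distribution, the tree can decide from a small window of the stream, which suits resource-constrained settings.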

1:15pm-1:55pm (40m) Data Engineering and Architecture Model Development, Governance, Operations

Problems taking AI to production and how to fix them

Jim Scott (NVIDIA)

Data scientists create and test hundreds or thousands more models than in the past. Models require support from both real-time and static data sources. As data becomes enriched, and parameters tuned and explored, there's a need for versioning everything, including the data. Jim Scott examines the very specific problems and approaches to fix them.

2:05pm-2:45pm (40m) Automation in data science and data, Data Engineering and Architecture Model Development, Governance, Operations

The new SDLC: CI/CD in the age of machine learning

Diego Oppenheimer (Algorithmia)

Machine learning (ML) will fundamentally change the way we build and maintain applications. Diego Oppenheimer dives into how you can adapt your infrastructure, operations, staffing, and training to meet the challenges of the new software development life cycle (SDLC) without throwing away everything that already works.

3:45pm-4:25pm (40m) Automation in data science and data, Data Engineering and Architecture Cloud Platforms and SaaS, Deep dive into specific tools, platforms, or frameworks, Model Development, Governance, Operations

ML ops: Applying DevOps practices to machine learning workloads

Sireesha Muppala (Amazon Web Services), Shelbee Eigenbrode (Amazon Web Services), Randall DeFauw (Amazon Web Services)

As an increasing level of automation becomes available to data science, the balance between automation and quality needs to be maintained. Applying DevOps practices to machine learning workloads brings models to the market faster and maintains the quality and integrity of those models. Sireesha Muppala, Shelbee Eigenbrode, and Randall DeFauw explore applying DevOps practices to ML workloads.

11:20am-12:00pm (40m) Data Engineering and Architecture Data Management and Storage, Streaming and IoT

Performant time series data management and analytics with PostgreSQL

Michael Freedman (TimescaleDB | Princeton University)

Leveraging polyglot solutions for your time series data can lead to issues including engineering complexity, operational challenges, and even referential integrity concerns. Michael Freedman explains why, by re-engineering PostgreSQL to serve as a general data platform, your high-volume time series workloads will be better streamlined, resulting in more actionable data and greater ease of use.

1:15pm-1:55pm (40m) Data Engineering and Architecture Deep dive into specific tools, platforms, or frameworks, Transportation and Logistics

How to performance-tune Spark applications in large clusters

Omkar Joshi (Uber), Bo Yang (Uber)

Omkar Joshi and Bo Yang offer an overview of how Uber’s ingestion (Marmaray) and observability team improved performance of Apache Spark applications running on thousands of cluster machines and across hundreds of thousands of applications and how the team methodically tackled these issues. They also cover how they used Uber’s open-sourced jvm-profiler for debugging issues at scale.

2:05pm-2:45pm (40m) Data Engineering and Architecture Data Integration and Data Processing, Data Management and Storage, Data, Analytics, and AI Architecture, Transportation and Logistics

Creating an extensible 100+ PB real-time big data platform by unifying storage and serving

Reza Shiftehfar (Uber)

Building a reliable big data platform is extremely challenging when it has to store and serve hundreds of petabytes of data in real time. Reza Shiftehfar reflects on the challenges faced and proposes architectural solutions to scale a big data platform to ingest, store, and serve 100+ PB of data with minute-level latency while efficiently utilizing the hardware and meeting security needs.

3:45pm-4:25pm (40m) Data Engineering and Architecture Cloud Platforms and SaaS, Data Management and Storage, Data, Analytics, and AI Architecture, Financial Services

Enabling big data and AI workloads on the object store at DBS Bank

Vitaliy Baklikov (DBS Bank), Dipti Borkar (Alluxio)

Vitaliy Baklikov and Dipti Borkar explore how DBS Bank built a modern big data analytics stack leveraging an object store even for data-intensive workloads like ATM forecasting and how it uses Alluxio to orchestrate data locality and data access for Spark workloads.

4:35pm-5:15pm (40m) Data Engineering and Architecture Data, Analytics, and AI Architecture

Bridging the gap between big data computing and high-performance computing

Supun Kamburugamuve (Indiana University)

Big data computing and high-performance computing (HPC) evolved over the years as separate paradigms. With the explosion of the data and the demand for machine learning algorithms, these two paradigms increasingly embrace each other for data management and algorithms. Supun Kamburugamuve explores the possibilities and tools available for getting the best of HPC and big data.

11:20am-12:00pm (40m) Data Engineering and Architecture Data Integration and Data Processing

Using Spark for crunching astronomical data on the LSST scale

Petar Zecevic (SV Group)

The Large Synoptic Survey Telescope (LSST) is one of the most important future surveys. Its unique design allows it to cover large regions of the sky and obtain images of the faintest objects. After 10 years of operation, it will produce about 80 PB of data in images and catalog data. Petar Zecevic explains AXS, a system built for fast processing and cross-matching of survey catalog data.

1:15pm-1:55pm (40m) Data Engineering and Architecture Cloud Platforms and SaaS, Data, Analytics, and AI Architecture

The hitchhiker’s guide to the cloud: Architecting for the cloud through customer stories

Sushant Rao (Cloudera)

Jason Wang and Sushant Rao offer an overview of cloud architecture, then go into detail on core cloud paradigms like compute (virtual machines), cloud storage, authentication and authorization, and encryption and security. They conclude by bringing these concepts together through customer stories to demonstrate how real-world companies have leveraged the cloud for their big data platforms.

2:05pm-2:45pm (40m) Data Engineering and Architecture Data Integration and Data Processing, Data quality, data governance and data lineage

Fuzzy matching and deduplicating data: Techniques for advanced data prep

Nikki Rouda (Amazon Web Services), Janisha Anand (Amazon Web Services)

Nikki Rouda and Janisha Anand demonstrate how to deduplicate or link records in a dataset, even when the records don’t have a common unique identifier and no fields match exactly. You'll also learn how to link customer records across different databases, match external product lists against your own catalog, and solve tough challenges to prepare and cleanse data for analysis.
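
To make the idea of matching records with no exact-match field concrete, here is a minimal sketch using only Python's standard-library `difflib`. It is not the technique or service the speakers demonstrate; the helper names, the normalization step, and the 0.85 threshold are assumptions for illustration.

```python
from difflib import SequenceMatcher
from typing import List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences vanish."""
    return " ".join(text.lower().split())


def similarity(a: str, b: str) -> float:
    """Similarity score in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def dedupe(records: List[str], threshold: float = 0.85) -> List[str]:
    """Keep the first record of each group of near-duplicates."""
    kept: List[str] = []
    for rec in records:
        if not any(similarity(rec, k) >= threshold for k in kept):
            kept.append(rec)
    return kept


names = ["Acme Corp", "ACME  corp", "Acme Corp.", "Globex Inc"]
print(dedupe(names))
```

This pairwise comparison is quadratic in the number of records, so larger datasets typically need a blocking or indexing step to limit which pairs are compared.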

3:45pm-4:25pm (40m) Data Engineering and Architecture Data, Analytics, and AI Architecture

Lessons learned from scaling the tech stack of a modern analytics platform

Scott Castle (Sisense)

In this session, Scott Castle, General Manager at Sisense and former VP of Product at Periscope Data, will discuss lessons learned from scaling up Periscope Data to support incredibly large volumes of data and queries from its data teams.

4:35pm-5:15pm (40m) Data Science, Machine Learning, & AI

Spark on Kubernetes for data science

Jordan Volz (Dataiku)

Spark on Kubernetes is a winning combination for data science that stitches together a flexible platform harnessing the best of both worlds. Jordan Volz gives a brief overview of Spark and Kubernetes, the Spark on Kubernetes project, why it’s an ideal fit for data scientists who may have been dissatisfied with other iterations of Spark in the past, and some applications.

11:20am-12:00pm (40m) Data Engineering and Architecture Cloud Platforms and SaaS, Data Management and Storage

Where's my lookup table? Modeling relational data in a denormalized world

Rick Houlihan (Amazon Web Services)

Data has always been and will always be relational. NoSQL databases are gaining in popularity, but that doesn't change the fact that the data is still relational; it just changes how we have to model the data. Rick Houlihan dives deep into how real entity relationship models can be efficiently modeled in a denormalized manner using schema examples from real application services.

1:15pm-1:55pm (40m) Business Analytics and Visualization, Data Engineering and Architecture BI, Interactive Analytics and Visualization

Intelligent design patterns for cloud-based analytics and BI

Shant Hovsepian (Arcadia Data)

With cloud object storage (e.g., S3, ADLS) one expects business intelligence (BI) applications to benefit from the scale of data and real-time analytics. However, traditional BI in the cloud surfaces nonobvious challenges. Shant Hovsepian examines service-oriented cloud design (storage, compute, catalog, security, SQL) and how native cloud BI provides analytic depth, low cost, and performance.

2:05pm-2:45pm (40m) Data Engineering and Architecture Cloud Platforms and SaaS, Privacy and Security

Securing your cloud data lake with a "defense in depth" approach

Tomer Shiran (Dremio), Jacques Nadeau (Dremio)

With cheap and scalable storage services such as S3 and ADLS, it's never been easier to dump data into a cloud data lake. But you still need to secure that data and be sure it doesn't leak. Tomer Shiran and Jacques Nadeau explore capabilities for securing a cloud data lake, including authentication, access control, encryption (in motion and at rest), and auditing, as well as network protections.

3:45pm-4:25pm (40m) Data Engineering and Architecture, Security and Privacy Deep dive into specific tools, platforms, or frameworks, Privacy and Security

Protect your private data in your Hadoop clusters with ORC column encryption

Owen O'Malley (Cloudera)

Fine-grained data protection at a column level in data lake environments has become a mandatory requirement to demonstrate compliance with multiple local and international regulations across many industries today. Owen O'Malley dives into how column encryption in ORC files enables both fine-grained protection and audits of who accessed the private data.

4:35pm-5:15pm (40m) Data Engineering and Architecture

Using Spark to speed up the diagnosis performance for big data applications

Ruixin Xu (Microsoft), Long Tian (Microsoft), Yu Zhou (Microsoft)

Ruixin Xu, Long Tian, and Yu Zhou explore an experiment run using Spark and Jupyter notebooks as a replacement for existing IDE-based tools for internal DevOps. The Spark-based solution significantly improved diagnosis performance, especially for a complex job with a large profile, and the Jupyter notebooks bring the benefits of fast iteration and easy knowledge sharing.

11:20am-12:00pm (40m) Culture and organization, Strata Business Summit Culture and Organization

Executive Briefing: Creating a center for data science from scratch—Lessons from nonprofit research

Gayle Bieler (RTI International)

Gayle Bieler explains how she built a thriving center for data science within a large, well-respected nonprofit research institute and shares some of its most impactful projects and best adventures to date, which have solved important national problems, improved local communities, and transformed research.

1:15pm-1:55pm (40m) Executive Briefing and best practices, Strata Business Summit Culture and Organization, Ethics

Executive Briefing: Lessons from the front lines—Building a responsible AI/ML program in the enterprise

Keegan Hines (Capital One)

This talk will explore some of the philosophy around the concept of explaining a model, given that the colloquial definition is partially recursive. It will cover the lens that banking regulation places on this philosophical basis and expand into techniques used for these well-governed aspects.

2:05pm-2:45pm (40m) Strata Business Summit

Executive Briefing: Unpacking AutoML

Paco Nathan (derwen.ai)

Paco Nathan outlines the history and landscape for vendors, open source projects, and research efforts related to AutoML. Starting from the perspective of an AI expert practitioner who speaks business fluently, Paco unpacks the ground truth of AutoML—translating from the hype into business concerns and practices in a vendor-neutral way.

3:45pm-4:25pm (40m) Culture and organization, Strata Business Summit Culture and Organization, Transportation and Logistics

Executive Briefing: Building a culture of self-service from predeployment to continued engagement

Jonathan Tudor (GE Aviation), Ross Schalmo (GE Aviation)

Jonathan Tudor and Ross Schalmo explore how GE Aviation made it a mission to implement self-service data. To ensure success beyond the initial implementation of tools, the data engineering and analytics teams created initiatives to foster engagement, from an ongoing partnership with each part of the business, to the gamification of tagging data in a data catalog, to forming a published dataset council.

4:35pm-5:15pm (40m) Executive Briefing and best practices, Strata Business Summit Data, Analytics, and AI Architecture, Streaming and IoT

Executive Briefing: What it takes to use machine learning in fast data pipelines

Dean Wampler (Anyscale)

Dean Wampler dives into how (and why) to integrate ML into production streaming data pipelines and to serve results quickly; how to bridge data science and production environments with different tools, techniques, and requirements; how to build reliable and scalable long-running services; and how to update ML models without downtime.

11:20am-12:00pm (40m) Strata Business Summit

Executive Briefing: Say what? The ethical challenges of designing for humanlike interaction

Jonathan Foster (Microsoft)

Language shapes our thinking, our relationships, our sense of self. Conversation connects us in powerful, intimate, and often unconscious ways. Jonathan Foster explains why, as we design for natural language interactions and more humanlike digital experiences, language—as design material, conversation, and design canvas—reveals ethical challenges we couldn't encounter with GUI-powered experiences.

1:15pm-1:55pm (40m) Culture and organization, Strata Business Summit Culture and Organization

An in-depth look at the data science career: Defining roles, assessing skills

Usama Fayyad (Open Insights & OODA Health, Inc.), Hamit Hamutcu (Analytics Center)

If you've ever been confused about what it takes to be a data scientist or curious about how companies recruit, train, and manage analytics resources, Usama Fayyad and Hamit Hamutcu are here to explore insights from the most comprehensive research effort to date on the data analytics profession and propose a framework for the standardization of roles and methods for assessing skills.

2:05pm-2:45pm (40m) Case studies, Strata Business Summit BI, Interactive Analytics and Visualization, Telecom

T-Mobile's journey to turn crowdsourced big data into actionable insights

Alex Yoon (T-Mobile)

T-Mobile successfully improved the quality of voice calling by analyzing crowdsourced big data from mobile devices. Alex Yoon walks you through how engineers from multiple backgrounds collaborated to achieve a 10% improvement in voice quality and why the analysis of big data was key to bringing better voice call quality to millions of end users.

3:45pm-4:25pm (40m) Case studies, Strata Business Summit Text and Language processing and analysis, Transportation and Logistics

Migrating millions of users from voice- and email-based customer support to a chatbot

Madhu Gopinathan (MakeMyTrip), Sanjay Mohan (MakeMyTrip)

At MakeMyTrip, customers were using voice or email to contact agents for postsale support. To improve agent efficiency and the customer experience, MakeMyTrip developed a chatbot, Myra, using some of the latest advances in deep learning. Madhu Gopinathan and Sanjay Mohan explain the high-level architecture and the business impact Myra created.

4:35pm-5:15pm (40m) Strata Business Summit

Combining creativity and analytics

David Boyle (Audience Strategies)

Companies that harness creativity and data in tandem have growth rates twice as high as companies that don’t. David Boyle shares lessons from his successes and failures in trying to do just that across presidential politics, with pop stars, and with power brands in the world of luxury goods. Join in to find out how analysts can work differently to build these partnerships and unlock this growth.

11:20am-12:00pm (40m) Data Science, Machine Learning, & AI, Strata Business Summit

How Deutsche Bank industrialized AI and machine learning

John Allen (Deutsche Bank)

As an early adopter of data science, machine learning, and AI, Deutsche Bank's analytics function is trailblazing new ways to drive revenues, lower costs, and reduce risk across all areas of the group. John Allen shares how his team combines commercial offerings with open source technologies to revolutionize legacy processes and transform the way the bank uses technology to drive innovation.

1:15pm-1:55pm (40m) Executive Briefing and best practices, Strata Business Summit

Communication breakdown: Facing machine learning’s all-too-human failure

James Kotecki (Infinia ML)

Miscommunication between business leaders and technical experts can doom even the best data science project. Don’t let it drive you insane! In this session, we’ll dissect many flavors of communication failure, from goal misalignment to technical misunderstanding. Then, we’ll explore practical ways to bridge these gaps.

2:05pm-2:45pm (40m) Business Analytics and Visualization, Strata Business Summit BI, Interactive Analytics and Visualization, Media and Advertising, Temporal data and time-series analytics

ThirdEye: LinkedIn’s business-wide monitoring platform

Akshay Rai (LinkedIn)

Failures or issues in a product or service can negatively affect the business. Detecting issues in advance and recovering from them is crucial to keeping the business alive. Join Akshay Rai to learn more about LinkedIn's next-generation open source monitoring platform, an integrated solution for real-time alerting and collaborative analysis.

3:45pm-4:25pm (40m) Law and Ethics, Strata Business Summit BI, Interactive Analytics and Visualization, Ethics

Purposefully designing technology for civic engagement

Audrey Lobo-Pulo (Phoensight), Annette Hester (National Energy Board, Canada)

As new digital platforms emerge and governments look at new ways to engage with citizens, there's an increasing awareness of the role these platforms play in shaping public participation and democracy. Audrey Lobo-Pulo, Annette Hester, and Ryan Hum examine the design attributes of civic engagement technologies and their ensuing impacts, along with an NEB Canada case study.

4:35pm-5:15pm (40m) Executive Briefing and best practices, Strata Business Summit Privacy and Security

Executive Briefing: Big data in the era of heavy worldwide privacy regulations

Mark Donsky (Okera)

California is following the EU's GDPR with the California Consumer Privacy Act (CCPA) in 2020. There are penalties for noncompliance, but many companies aren't prepared for this strict regulation. This session will explore the capabilities your data environment needs in order to simplify CCPA and GDPR compliance, as well as compliance with other regulations.

11:20am-12:00pm (40m) Sponsored

The key to climbing the AI ladder (sponsored by IBM)

Daniel Hernandez (IBM)

AI isn't magic. It’s still hard work. Daniel Hernandez explains why having the technology alone isn't enough; it requires a thoughtful and well-architected approach.

1:15pm-1:55pm (40m) Sponsored

So you built a model; now what? (sponsored by Dataiku)

Jed Dougherty (Dataiku)

Jed Dougherty takes a deep dive into an often overlooked aspect of the data science lifecycle: model deployment. Once they’ve constructed a data science model that does a good job accurately predicting their test set, many data scientists think the job is over. But really, it’s just begun.

2:05pm-2:45pm (40m) Sponsored

Powering the future with data intelligence (sponsored by Collibra)

Jim Cushman (Collibra), Piyush Jain (Progressive)

Transforming data into a trusted business asset that informs decision making requires giving teams access to a powerful platform that makes it easy to harness data across the enterprise. Jim Cushman and Piyush Jain detail how Progressive uses Collibra to transform the way data is managed and used across the organization, driving real business value.

11:20am-12:00pm (40m) Sponsored

Deliver personalized experiences and content like Xbox with Cognitive Services Personalizer (sponsored by Microsoft Azure)

Edward Jezierski (Microsoft), Jackie Nichols (Microsoft)

Edward Jezierski and Jackie Nichols demonstrate how Cognitive Services Personalizer works with your content and data, how it autonomously learns to make optimal decisions, how you can add it to your app with two lines of code, and what’s under the hood. Then they share the results Personalizer achieved on the Xbox One home page as well as best practices for applying it in your applications today.

1:15pm-1:55pm (40m) Sponsored

Migrating Hadoop analytics to Spark in the cloud without disruption (sponsored by WANdisco)

Paul Scott-Murphy (WANdisco)

Paul Scott-Murphy dives into the options that exist for cloud migration and their advantages and disadvantages, what cloud vendors do and don't offer to support large-scale migration, the business risks associated with large-scale cloud migration, and how to migrate analytics data at scale for immediate use in Spark without disrupting on-premises operations.

2:05pm-2:45pm (40m) Data Engineering and Architecture, Streaming and IoT Deep dive into specific tools, platforms, or frameworks, Streaming and IoT

Stream processing beyond streaming data

Stephan Ewen (Ververica)

Stephan Ewen details how stream processing is becoming a "grand unifying paradigm" for data processing and the newest developments in Apache Flink to support this trend: new cross-batch-streaming machine learning algorithms, state-of-the-art batch performance, and new building blocks for data-driven applications and application consistency.

11:20am-12:00pm (40m) Sponsored

Organizing the chaos of healthcare with smart data discovery (sponsored by Io-Tahoe)

Charles Boicey (Clearsense)

Healthcare’s reliance on comprehensible data is critical to the mission of providing optimal and affordable care. Charles Boicey takes a deep dive into how the application of technology, such as machine learning, is paramount to the modernization of healthcare that provides its professionals with fully integrated and complete medical records.

1:15pm-1:55pm (40m) Sponsored

Next-generation serverless data architecture for insights at the speed of thought (sponsored by Actian)

Paul Wolmering (Actian)

Paul Wolmering explores the key characteristics for building an Agile data warehouse and defines a reference architecture for hybrid data.

2:05pm-2:45pm (40m) Sponsored

Getting clinical trial data ready for analysis: How IQVIA wrangled its way to success (sponsored by Trifacta)

Matt Derda (Trifacta), Yogesh Prasad (IQVIA)

Clinical trial data analysis can be a complex process. The data is typically hand-coded and formatted differently and is required to be delivered in an FDA-approved format. Matt Derda and Yogesh Prasad explain how IQVIA built its Clean Patient Tracker and how it enabled agility and flexibility for end users of the platform, from data acquisition to reporting and analytics.

11:20am-12:00pm (40m) Sponsored

Transforming financial reporting services with massively scalable OLAP (sponsored by Kyvos Insights)

Ajay Anand (Kyvos Insights)

Learn how you can overcome the challenges of traditional OLAP solutions and scale BI to deliver quick insights to business users across your enterprise.

1:15pm-1:55pm (40m) Sponsored

The end of applications: How data collaboration is changing everything (sponsored by Cinchy)

Dan DeMers (Cinchy)

After 40 years of apps, enterprise companies now realize that building or buying an application for every use case has become a major threat to their ability to leverage and protect their core data assets. Dan DeMers provides a live demo of Cinchy, the world’s first data collaboration platform.

8:45am-8:55am (10m)

Thursday keynotes

Ben Lorica (O'Reilly), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

8:55am-9:15am (20m)

Staying safe in the AI era

Cassie Kozyrkov (Google)

Machine learning and artificial intelligence are no longer science fiction, so now you have to address what it takes to harness their potential effectively, responsibly, and reliably. Based on lessons learned at Google, Cassie Kozyrkov offers actionable advice to help you find opportunities to take advantage of machine learning, navigate the AI era, and stay safe as you innovate.

9:15am-9:25am (10m) Sponsored

Unlocking the value of your data (sponsored by IBM)

Daniel Hernandez (IBM)

Daniel Hernandez takes a deep dive into how, with a unified, prescriptive information architecture, organizations can successfully unlock the value of their data for an AI and multicloud world.

9:25am-9:35am (10m)

Delivering the enterprise data cloud

Arun Murthy (Cloudera)

In this keynote, we’ll introduce you to the new 100% open source Cloudera Data Platform (CDP), the world’s first enterprise data cloud. CDP is hybrid and multi-cloud, delivering the speed, agility, and scale you need to secure and govern your data anywhere from the edge to AI.

9:35am-9:40am (5m) Sponsored

Postrevolutionary big data: Promoting the general welfare (sponsored by Io-Tahoe)

Barbara Eckman (Comcast)

Barbara Eckman shares lessons learned from early big data mistakes and the progress her team at Comcast is making toward a postrevolutionary big data vision.

9:40am-9:45am (5m) Sponsored

RL in real life: Bringing reinforcement learning to the enterprise (sponsored by Microsoft Azure)

Edward Jezierski (Microsoft)

Microsoft has an ecosystem spanning research, gaming, and the cloud that's advancing reinforcement learning (RL) and putting it into everyday use. Join Edward Jezierski to see where RL is used practically across Microsoft and imagine the opportunities that exist for your business today.

9:45am-9:55am (10m)

Strata Data Awards: Winners announced

The Strata Data Awards recognize the most innovative startups, leaders, and data science projects from Strata sponsors and exhibitors around the world. Join us during keynotes for the announcement of the winners.

9:55am-10:15am (20m)

Say what? The ethical challenges of designing for humanlike interaction

Jonathan Foster (Microsoft)

10:15am-10:20am (5m) Sponsored

Data Science Pioneers: Conquering the next frontier, a documentary investigating the future of data science (sponsored by Dataiku)

Jed Dougherty (Dataiku)

Jed Dougherty presents the trailer of the upcoming _Data Science Pioneers_ documentary about the passionate data scientists driving us toward technological revolution. Cut through the hype with _Data Science Pioneers_ and see what it really means to be a data scientist.

10:20am-10:40am (20m)

Data sonification: Making music from the yield curve

Alan Smith (Financial Times)

Based on a critical evaluation of the iconic yield curve chart, Alan Smith argues that combining visualization (data to pixels) with sonification (data to pitch) offers not only the potential to improve aesthetic multimedia experiences but also an opportunity to take the presentation of data into the rapidly expanding universe of screenless devices and products.

10:40am-10:45am (5m)

Closing remarks

Ben Lorica (O'Reilly), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll offer closing remarks.

10:50am-11:20am (30m)

Break: Morning break sponsored by Cisco

12:00pm-1:15pm (1h 15m)

Break

12:00pm-1:15pm (1h 15m)

Thursday Topic Tables at Lunch (sponsored by IBM)

Topic Table discussions are a great way to informally network with people in similar industries or interested in the same topics.

12:30pm-1:10pm (40m) Expo Hall

Why AI fails: Overcoming AI challenges (sponsored by IBM)

Brittany Bogle (IBM)

AI will be the most disruptive class of technologies over the next decade, fueled by near-endless amounts of data and unprecedented advances in deep learning. Brittany Bogle walks you through how to address some of the major AI challenges, like trust, talent, and data.

2:45pm-3:45pm (1h)

Break: Afternoon break sponsored by Io-Tahoe

8:00am-8:30am (30m)

Speed Networking

Gather before keynotes on Thursday morning to enjoy casual conversation while meeting fellow attendees. If one of your goals at Strata is to meet new people, this session will jumpstart your networking with other attendees.

8:30am-8:45am (15m)

Break: Early morning coffee (8:00am - 8:45am)

12:00pm-1:15pm (1h 15m)

Thursday Business Summit Lunch

Join Strata Business Summit speakers and attendees for a networking lunch on Thursday.