Presented By O’Reilly and Cloudera
Make Data Work
September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY
 
1A 06/07
1:10pm Spark NLP in action: How SelectData uses AI to better understand home health patients David Talby (Pacific AI), Alberto Andreotti (John Snow Labs), Stacy Ashworth (SelectData), Tawny Nichols (SelectData)
2:00pm Big data at speed Ted Malaska (Capital One), Mark Grover (Lyft)
3:30pm Modeling time series in R Jared Lander (Lander Analytics)
4:20pm Analytics maturity: Industry trends and financial impacts Bill Franks (International Institute For Analytics)
1A 08
1:10pm Scalable machine learning for data cleaning Ihab Ilyas (University of Waterloo)
2:00pm Let the machines learn to improve data quality Archana Anandakrishnan (American Express)
1A 12/14
11:20am The Vega project: Building an ecosystem of tools for interactive visualization Jeffrey Heer (Trifacta | University of Washington)
1:10pm Augmented reality: Going beyond plots in 3D Bob Levy (Virtual Cove, Inc.)
3:30pm Data visualization in mixed reality with Python Anna Nicanorova (Annalect)
4:20pm UX strategies for underperforming analytics services and data products Brian O'Neill (Designing for Analytics)
1A 15/16
11:20am Democratizing deep learning with transfer learning Lars Hulstaert (Microsoft)
2:00pm Job recommendations leveraging deep learning using Analytics Zoo on Apache Spark and BigDL Guoqiong Song (Intel), Wenjing Zhan (Talroo), Jacob Eisinger (Talroo)
3:30pm Classifying job execution using deep learning Ash Munshi (Pepperdata)
4:20pm Deep learning on audio in Azure to detect sounds in real time Swetha Machanavajhala (Microsoft), Xiaoyong Zhu (Microsoft)
1A 10
11:20am TonY: Native support of TensorFlow on Hadoop Jonathan Hung (LinkedIn), Keqiu Hu (LinkedIn), Zhe Zhang (LinkedIn)
3:30pm Managing data chaos in the world of microservices Oleksii Kachaiev (Attendify)
1A 21/22
11:20am Understanding Spark tuning with auto-tuning; or, Magical spells to stop your pager going off at 2:00am Holden Karau (Independent), Rachel Warren (Salesforce Einstein), Anya Bida (Salesforce)
3:30pm Self-service modern analytics on the GovCloud Ramesh Krishnan (Lockheed Martin), Steven Morgan (Lockheed Martin)
4:20pm Building turnkey recommendations for 5% of internet video Nir Yungster (JW Player), Kamil Sindi (JW Player)
1A 23/24
11:20am Progress for big data in Kubernetes Ted Dunning (MapR, now part of HPE)
3:30pm Cassandra versus cloud databases Jonathan Ellis (DataStax)
1E 07/08
11:20am Near-real-time anomaly detection at Lyft Thomas Weise (Lyft), Mark Grover (Lyft)
1:10pm A deep dive into Kafka controller Jun Rao (Confluent)
2:00pm High-performance messaging with Apache Pulsar Karthik Ramasamy (Streamlio), Matteo Merli (Streamlio)
3:30pm Machine learning for nonstationary streaming data using Structured Streaming and StreamDM Heitor Murilo Gomes (Télécom ParisTech), Albert Bifet (Télécom ParisTech)
1E 09
2:00pm Using Alluxio as a fault-tolerant pluggable optimization component of JD.com's compute frameworks Tao Huang (JD.com), Mang Zhang (JD.com), Bing Bai (JD.com)
3:30pm Kafka at PayPal: Enabling 400 billion messages a day Kevin Lu (PayPal), Maulin Vasavada (PayPal), Na Yang (PayPal)
4:20pm TuneIn: How to get your jobs tuned while you are sleeping Manoj Kumar (LinkedIn), Pralabh Kumar (LinkedIn), Arpan Agrawal (LinkedIn)
1E 10/11
4:20pm Real-time machine intelligence in IndyCar and Tour de France Yasuyuki Kataoka (NTT Innovation Institute, Inc.)
1E 12/13
11:20am Data and privacy at scale at Wikipedia Nuria Ruiz (Wikimedia)
2:00pm Digging for gold: Developing AI in healthcare against unstructured text data Chiny Driscoll (MetiStream), Jawad Khan (Rush University Medical Center)
4:20pm A day in the life of a data scientist: How do we train our teams to get started with AI? Francesca Lazzeri (Microsoft), Jaya Susan Mathew (Microsoft)
1E 14
Expo Hall
11:20am Data at Netflix: See what’s next Michelle Ufford (Netflix)
1:10pm The state of Postgres Umur Cubukcu (Citus Data)
1A 01/02
1:10pm Quick, reliable, and cost-effective ways to operationalize big data apps (sponsored by Unravel) Shivnath Babu (Unravel Data Systems | Duke University), Madhusudan Tumma (TIAA)
1A 03/04/05
11:20am Building the bridge from big data to ML, featuring Geotab (sponsored by Google Cloud) Bob Bradley (Geotab), Chad W. Jennings (Google)
3:30pm Stochastic field theory for time series Revant Nayar (FMI Technologies LLC)
1E 06
3E
8:50am Thursday keynotes Ben Lorica (O'Reilly), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
9:00am Sound design and the future of experience Amber Case (MIT Media Lab)
9:20am Practical ML today and tomorrow Hilary Mason (Cloudera Fast Forward Labs)
9:35am Quantifying forgiveness Julia Angwin (ProPublica)
10:05am Brain-based human-machine interfaces: New developments, legal and ethical issues, and potential uses Amanda Pustilnik (University of Maryland School of Law | Center for Law, Brain & Behavior, Mass. General Hospital)
10:20am The data imperative (sponsored by Zaloni) Ben Sharma (Zaloni)
10:25am Black box: How AI will amplify the best and worst of humanity Jacob Ward (CNN | Al Jazeera | PBS)
10:50am Morning break sponsored by IBM | Room: 3B | Expo Hall
2:30pm Afternoon break sponsored by Google Cloud | Room: 3B | Expo Hall
8:00am Morning Coffee | Room: 3E Foyer
8:00am Speed Networking | Room: Crystal Palace
12:00pm Lunch sponsored by MemSQL Thursday Business Summit Lunch | Room: 3D 09
12:00pm Thursday Topic Tables at Lunch | Room: Expo Hall (Hall 3B)
11:20am-12:00pm (40m) Data science and machine learning Media, Marketing, Advertising, Text and Language processing and analysis
Applying petabyte-scale analytics and machine learning to billions of news reading sessions
Andrew Montalenti (Parse.ly)
What can we learn from a one-billion-person live poll of the internet? Andrew Montalenti explains how Parse.ly has gathered a unique dataset of news reading sessions of billions of devices, peaking at over two million sessions per minute on thousands of high-traffic news and information websites, and how the company uses this data to unearth the secrets behind online content.
1:10pm-1:50pm (40m) Data science and machine learning Health and Medicine, Text and Language processing and analysis
Spark NLP in action: How SelectData uses AI to better understand home health patients
David Talby (Pacific AI), Alberto Andreotti (John Snow Labs), Stacy Ashworth (SelectData), Tawny Nichols (SelectData)
David Talby, Alberto Andreotti, Stacy Ashworth, and Tawny Nichols outline a question-answering system for accurately extracting facts from free-text patient records and share best practices for training domain-specific deep learning NLP models. The solution is based on Spark NLP, an extension of Spark ML that provides state-of-the-art performance and accuracy for natural language understanding.
2:00pm-2:40pm (40m) Data engineering and architecture Transportation and Logistics
Big data at speed
Ted Malaska (Capital One), Mark Grover (Lyft)
Many details go into building a big data system for speed, from determining an acceptable latency for data access and deciding where to store the data to solving multiregion problems, or even knowing just what data you have and where stream processing fits in. Mark Grover and Ted Malaska share challenges, best practices, and lessons learned doing big data processing and analytics at scale and at speed.
3:30pm-4:10pm (40m) Data science and machine learning Temporal data and time-series analytics
Modeling time series in R
Jared Lander (Lander Analytics)
Temporal data is being produced in ever-greater quantity, but fortunately our time series capabilities are keeping pace. Jared Lander explores techniques for modeling time series, from traditional methods such as ARMA to more modern tools such as Prophet and machine learning models like XGBoost and neural nets. Along the way, Jared shares theory and code for training these models.
4:20pm-5:00pm (40m) Data science and machine learning, Data-driven business management, Strata Business Summit Machine Learning in the enterprise
Analytics maturity: Industry trends and financial impacts
Bill Franks (International Institute For Analytics)
Drawing on a recent study of the analytics maturity level of large enterprises by the International Institute for Analytics, Bill Franks discusses how maturity varies by industry, shares key steps organizations can take to move up the maturity scale, and explains how the research correlates analytics maturity with a wide range of success metrics, including financial and reputational measures.
11:20am-12:00pm (40m) Data science and machine learning Temporal data and time-series analytics
Predicting residential occupancy and hot water usage from high-frequency, multivector utilities data
Cris Lowery (Baringa Partners), Marc Warner (ASI)
In EU households, heating and hot water alone account for 80% of energy usage. Cristobal Lowery and Marc Warner explain how future home energy management systems could improve their energy efficiency by predicting resident needs through utilities data, with a particular focus on the key data features, the need for data compression, and the data quality challenges.
1:10pm-1:50pm (40m) Data science and machine learning Data preparation, governance and privacy
Scalable machine learning for data cleaning
Ihab Ilyas (University of Waterloo)
Machine learning tools promise to help solve data curation problems. While the principles are well understood, the engineering details in configuring and deploying ML techniques are the biggest hurdle. Ihab Ilyas explains why leveraging data semantics and domain-specific knowledge is key in delivering the optimizations necessary for truly scalable ML curation solutions.
2:00pm-2:40pm (40m) Data science and machine learning Data preparation, governance and privacy, Financial Services
Let the machines learn to improve data quality
Archana Anandakrishnan (American Express)
Building accurate machine learning models hinges on the quality of the data. Errors and anomalies get in the way of data scientists doing their best work. Archana Anandakrishnan explains how American Express created an automated, scalable system for measurement and management of data quality. The methods are modular and adaptable to any domain where accurate decisions from ML models are critical.
3:30pm-4:10pm (40m) Data-driven business management, Strata Business Summit Financial Services
InnerSource for reproducible and extensible business analysis
Emily Riederer (Capital One)
Emily Riederer explains how best practices from data science, open source, and open science can solve common business pain points. Using a case example from Capital One, Emily illustrates how designing empathetic analytical tools and fostering a vibrant InnerSource community are keys to developing reproducible and extensible business analysis.
4:20pm-5:00pm (40m) Data engineering and architecture, Data science and machine learning Financial Services, Model lifecycle management
Infrastructure for deploying machine learning to production in large financial institutions: Lessons learned and best practices
Harish Doddi (Datatron), Jerry Xu (Datatron Technologies)
Large financial institutions have many data science teams (e.g., those for fraud, credit risk, and marketing), each often using a diverse set of tools to build predictive models. There are many challenges involved in productionizing these predictive AI models. Harish Doddi and Jerry Xu share challenges and lessons learned deploying AI models to production in large financial institutions.
11:20am-12:00pm (40m) Data science and machine learning, Visualization and user experience
The Vega project: Building an ecosystem of tools for interactive visualization
Jeffrey Heer (Trifacta | University of Washington)
Jeffrey Heer offers an overview of Vega and Vega-Lite—high-level declarative languages for interactive visualization that support exploratory data analysis, communication, and the development of new visualization tools.
1:10pm-1:50pm (40m) Data science and machine learning, Visualization and user experience Ethics and Privacy, Financial Services, Media, Marketing, Advertising
Augmented reality: Going beyond plots in 3D
Bob Levy (Virtual Cove, Inc.)
Augmented reality opens a completely new lens on your data through which you see and accomplish amazing things. Bob Levy explains how to use simple Python scripts to leverage completely new plot types. You'll explore use cases revealing new insight into financial markets data as well as new ways of interacting with data that build trust in otherwise “black box” machine learning solutions.
2:00pm-2:40pm (40m) Data science and machine learning, Visualization and user experience
Stories beat statistics: How to master the art and science of data storytelling
Brent Dykes (Domo)
Companies collect all kinds of data and use advanced tools and techniques to find insights, but they often fail in the last mile: communicating insights effectively to drive change. Brent Dykes discusses the power that stories wield over statistics and explores the art and science of data storytelling—an essential skill in today’s data economy.
3:30pm-4:10pm (40m) Data science and machine learning, Visualization and user experience
Data visualization in mixed reality with Python
Anna Nicanorova (Annalect)
Data visualization is supposed to be our map to information. However, contemporary charting techniques have a few shortcomings, including context reduction, hard numeric grasp, and perceptual dehumanization. Anna Nicanorova explains how augmented reality can solve these issues by presenting an intuitive and interactive environment for data exploration.
4:20pm-5:00pm (40m) Data science and machine learning, Visualization and user experience Machine Learning in the enterprise
UX strategies for underperforming analytics services and data products
Brian O'Neill (Designing for Analytics)
Gartner says 85%+ of big data projects will fail, despite the fact that your company may have invested millions in engineering implementation. Why are customers and employees not engaging with these products and services? Brian O'Neill explains why a "people first, technology second" mission (a design strategy, in other words) enables the best UX and business outcomes possible.
11:20am-12:00pm (40m) Data science and machine learning Deep Learning
Democratizing deep learning with transfer learning
Lars Hulstaert (Microsoft)
Transfer learning allows data scientists to leverage insights from large labeled datasets. The general idea of transfer learning is to use knowledge learned from tasks for which a lot of labeled data is available in settings where little labeled data is available. Lars Hulstaert explains what transfer learning is and how it can boost your NLP or CV pipelines.
1:10pm-1:50pm (40m) Data science and machine learning Data Platforms, Deep Learning
A high-performance system for deep learning inference and visual inspection
Moty Fania (Intel), Sergei Kom (Intel)
Moty Fania and Sergei Kom share their experience and lessons learned implementing an AI inference platform to enable internal visual inspection use cases. The platform is based on open source technologies and was designed for real-time, streaming, and online actuation.
2:00pm-2:40pm (40m) Big data and data science in the cloud, Data science and machine learning Deep Learning, Media, Marketing, Advertising
Job recommendations leveraging deep learning using Analytics Zoo on Apache Spark and BigDL
Guoqiong Song (Intel), Wenjing Zhan (Talroo), Jacob Eisinger (Talroo)
Can the talent industry make the job search/match more relevant and personalized for a candidate by leveraging deep learning techniques? Guoqiong Song, Wenjing Zhan, and Jacob Eisinger demonstrate how to leverage distributed deep learning framework BigDL on Apache Spark to predict a candidate’s probability of applying to specific jobs based on their résumé.
3:30pm-4:10pm (40m) Data science and machine learning Deep Learning
Classifying job execution using deep learning
Ash Munshi (Pepperdata)
Ash Munshi outlines a technique for labeling applications using runtime measurements of CPU, memory, and network I/O along with a deep neural network. This labeling groups the applications into buckets that have understandable characteristics, which can then be used to reason about the cluster and its performance.
4:20pm-5:00pm (40m) Big data and data science in the cloud, Data science and machine learning Deep Learning
Deep learning on audio in Azure to detect sounds in real time
Swetha Machanavajhala (Microsoft), Xiaoyong Zhu (Microsoft)
In this auditory world, the human brain processes and reacts effortlessly to a variety of sounds. While many of us take this for granted, there are over 360 million people in the world who are deaf or hard of hearing. Swetha Machanavajhala and Xiaoyong Zhu explain how to make the auditory world inclusive and meet the great demand in other sectors by applying deep learning on audio in Azure.
11:20am-12:00pm (40m) Data engineering and architecture Data Platforms, Deep Learning
TonY: Native support of TensorFlow on Hadoop
Jonathan Hung (LinkedIn), Keqiu Hu (LinkedIn), Zhe Zhang (LinkedIn)
Jonathan Hung, Keqiu Hu, and Zhe Zhang offer an overview of TensorFlow on YARN (TonY), a framework to natively run TensorFlow on Hadoop. TonY enables running TensorFlow distributed training as a new type of Hadoop application. Its native Hadoop connector, together with other features, aims to run TensorFlow jobs as reliably and flexibly as other first-class citizens on Hadoop.
1:10pm-1:50pm (40m) Data engineering and architecture Data Platforms, Deep Learning, Model lifecycle management
Deep learning on YARN: Running distributed TensorFlow, MXNet, Caffe, and XGBoost on Hadoop clusters
Wangda Tan (Cloudera)
In order to train deep learning and machine learning models, you must leverage applications such as TensorFlow, MXNet, Caffe, and XGBoost. Wangda Tan discusses new features in Apache Hadoop 3.x to better support deep learning workloads and demonstrates how to run these applications on YARN.
2:00pm-2:40pm (40m) Data engineering and architecture Model lifecycle management
Kubeflow explained: Portable machine learning on Kubernetes
Michelle Casbon (Google)
Michelle Casbon demonstrates how to build a machine learning application with Kubeflow. Kubeflow makes it easy for everyone to develop, deploy, and manage portable, scalable ML everywhere and supports the full lifecycle of an ML product, including iteration via Jupyter notebooks. Join Michelle to find out what Kubeflow currently supports and the long-term vision for the project.
3:30pm-4:10pm (40m) Data engineering and architecture
Managing data chaos in the world of microservices
Oleksii Kachaiev (Attendify)
When we talk about microservices, we usually focus on the communication layer. In practice, data is the much harder and often overlooked problem. Splitting applications into independent units leads to increased complexity, such as structural and semantic changes, knowledge sharing, and data discovery. Join Oleksii Kachaiev to explore emerging technologies created to tackle these challenges.
4:20pm-5:00pm (40m) Data engineering and architecture
The move to a modern data platform in the cloud: Pitfalls to avoid and best practices to follow
Amandeep Khurana (Okera)
Amandeep Khurana shares critical data management practices for easy and unified data access that meets security and regulatory compliance requirements, helping you avoid the pitfalls that could lead to complex, expensive architectures.
11:20am-12:00pm (40m) Data engineering and architecture
Understanding Spark tuning with auto-tuning; or, Magical spells to stop your pager going off at 2:00am
Holden Karau (Independent), Rachel Warren (Salesforce Einstein), Anya Bida (Salesforce)
Apache Spark is an amazing distributed system, but part of the bargain we've made with the infrastructure daemons involves providing the correct set of magic numbers (aka tuning) or our jobs may be eaten by Cthulhu. Holden Karau, Rachel Warren, and Anya Bida explore auto-tuning jobs using systems like Apache Beam, Mahout, and internal Spark ML jobs as workloads.
1:10pm-1:50pm (40m) Data engineering and architecture Data Platforms, Transportation and Logistics
A/B testing at Uber: How we built a BYOM (bring your own metrics) platform
Milene Darnis (Uber)
Every new launch at Uber is vetted via robust A/B testing. Given the pace at which Uber operates, the metrics needed to assess the impact of experiments constantly evolve. Milene Darnis explains how the team built a scalable and self-serve platform that lets users plug in any metric to analyze.
2:00pm-2:40pm (40m) Data engineering and architecture Data Platforms, Health and Medicine
Aetna's advanced analytics platform, Data Fabric
Occhio Orsini (Aetna)
Occhio Orsini offers an overview of Aetna's Data Fabric platform. Join in to learn the needs and desires that led to the creation of the advanced analytics platform, explore the platform's architecture, technology, and capabilities, and understand the key technologies and capabilities that made it possible to build a hybrid solution across on-premises and cloud-hosted data centers.
3:30pm-4:10pm (40m) Big data and data science in the cloud, Data engineering and architecture
Self-service modern analytics on the GovCloud
Ramesh Krishnan (Lockheed Martin), Steven Morgan (Lockheed Martin)
Lockheed Martin is a data-driven company with a massive variety and volume of data. To extract the most value from its information assets, the company is constantly exploring ways to enable effective self-service scenarios. Ramesh Krishnan and Steve Morgan discuss Lockheed Martin's journey into modern analytics and explore its analytics platform focused on leveraging AWS GovCloud.
4:20pm-5:00pm (40m) Big data and data science in the cloud, Data engineering and architecture Deep Learning, Media, Marketing, Advertising, Recommendation Systems
Building turnkey recommendations for 5% of internet video
Nir Yungster (JW Player), Kamil Sindi (JW Player)
JW Player—the world’s largest network-independent video platform, representing 5% of global internet video—provides on-demand recommendations as a service to thousands of media publishers. Nir Yungster and Kamil Sindi explain how the company is systematically improving model performance while navigating the many engineering challenges and unique needs of the diverse publishers it serves.
11:20am-12:00pm (40m) Emerging technologies & case studies
Progress for big data in Kubernetes
Ted Dunning (MapR, now part of HPE)
Stateful containers are a well-known anti-pattern, but the standard solution—managing state in a separate storage tier—is costly and complex. Recent developments have changed things dramatically for the better. In particular, you can now manage a high-performance software-defined-storage tier entirely in Kubernetes. Ted Dunning describes what's new and how it makes big data easier on Kubernetes.
1:10pm-1:50pm (40m) Data engineering and architecture
Case study: A Spark-based distributed simulation optimization architecture for portfolio optimization in retail banking
Kaushik Deka (Novantas), Ted Gibson (Novantas)
Kaushik Deka and Ted Gibson share a large-scale optimization architecture in Spark for a consumer product portfolio optimization use case in retail banking. The architecture combines a simulator that distributes computation of complex real-world scenarios and a constraint optimizer that uses business rules as constraints to meet growth targets.
2:00pm-2:40pm (40m) Data engineering and architecture Data Platforms, Financial Services
Using big data to unlock the delivery of personalized, multilingual real-time chat services for global financial service organizations
Timothy Walpole (BJSS)
Financial service clients demand increased data-driven personalization, faster insight-based decisions, and multichannel real-time access. Tim Walpole details how organizations can deliver real-time, vendor-agnostic, personalized chat services and explores issues around security, privacy, legal sign-off, data compliance, and how the internet of things can be used as a delivery platform.
3:30pm-4:10pm (40m) Big data and data science in the cloud
Cassandra versus cloud databases
Jonathan Ellis (DataStax)
Is open source Apache Cassandra still relevant in an era of hosted cloud databases? Jonathan Ellis discusses Cassandra’s strengths and weaknesses relative to Amazon DynamoDB, Microsoft CosmosDB, and Google Cloud Spanner.
4:20pm-5:00pm (40m) Big data and data science in the cloud, Data engineering and architecture Data Integration and Data Pipelines
Best practices for developing an enterprise data hub to collect and analyze 1 TB of data a day from multiple services with Apache Kafka and Google Cloud Platform
Kenji Hayashida (Recruit Lifestyle co., ltd.), Toru Sasaki (NTT DATA Corporation)
Recruit Group and NTT DATA Corporation have developed a platform based on a data hub, utilizing Apache Kafka. This platform can handle around 1 TB/day of application logs generated by a number of services in Recruit Group. Kenji Hayashida and Toru Sasaki share best practices for and lessons learned about topics such as schema evolution and network architecture.
11:20am-12:00pm (40m) Data engineering and architecture, Streaming systems & real-time applications Temporal data and time-series analytics, Transportation and Logistics
Near-real-time anomaly detection at Lyft
Thomas Weise (Lyft), Mark Grover (Lyft)
Thomas Weise and Mark Grover explain how Lyft uses its streaming platform to detect and respond to anomalous events, using data science tools for machine learning and a process that allows for fast and predictable deployment.
1:10pm-1:50pm (40m) Streaming systems & real-time applications
A deep dive into Kafka controller
Jun Rao (Confluent)
The controller is the brain of Apache Kafka and is responsible for maintaining the consistency of the replicas. Jun Rao outlines the main data flow in the controller, then describes recent improvements to the controller that allow it to handle certain edge cases correctly and increase its performance, which allows for more partitions in a Kafka cluster.
2:00pm-2:40pm (40m) Emerging technologies & case studies
High-performance messaging with Apache Pulsar
Karthik Ramasamy (Streamlio), Matteo Merli (Streamlio)
Apache Pulsar is being used for an increasingly broad array of data ingestion tasks. When operating at scale, it's very important to ensure that the system can make use of all the available resources. Karthik Ramasamy and Matteo Merli share insights into the design decisions and the implementation techniques that allow Pulsar to achieve high performance with strong durability guarantees.
3:30pm-4:10pm (40m) Data engineering and architecture Temporal data and time-series analytics
Machine learning for nonstationary streaming data using Structured Streaming and StreamDM
Heitor Murilo Gomes (Télécom ParisTech), Albert Bifet (Télécom ParisTech)
The StreamDM library provides the largest collection of data stream mining algorithms for Spark. Heitor Murilo Gomes and Albert Bifet explain how to use StreamDM and Structured Streaming to develop, apply, and evaluate learning models specifically for nonstationary streams (i.e., those with concept drifts).
4:20pm-5:00pm (40m) Data engineering and architecture
IoT edge processing with Apache NiFi, Apache MiNiFi, and multiple deep learning libraries
Timothy Spann (Cloudera)
Timothy Spann leads a hands-on deep dive into using Apache MiNiFi with Apache MXNet and other deep learning libraries on edge devices.
11:20am-12:00pm (40m) Data engineering and architecture, Law, ethics, governance Data Integration and Data Pipelines, Data preparation, governance and privacy, Media, Marketing, Advertising
Data discovery and lineage: Integrating streaming data in the public cloud with on-prem, classic data stores, and heterogeneous schema types
Barbara Eckman (Comcast)
Comcast’s streaming data platform comprises ingest, transformation, and storage services in the public cloud, with Apache Atlas for data discovery and lineage. Barbara Eckman explains how Comcast recently integrated on-prem data sources, including traditional data warehouses and RDBMSs, which required its data governance strategy to include relational and JSON schemas in addition to Apache Avro.
1:10pm-1:50pm (40m) Data engineering and architecture Transportation and Logistics
How Komatsu is improving mining efficiencies using the IoT and machine learning
Shawn Terry (Komatsu Mining Corp)
Global heavy equipment manufacturer Komatsu is using IoT data to continuously monitor some of the largest mining equipment to ultimately improve mine performance and efficiencies. Shawn Terry details the company's data journey and explains how it is using advanced analytics and predictive modeling to drive insights on terabytes of IoT data from connected mining equipment.
2:00pm-2:40pm (40m) Data engineering and architecture Data Platforms, Retail and e-commerce, Transportation and Logistics
Using Alluxio as a fault-tolerant pluggable optimization component of JD.com's compute frameworks
Tao Huang (JD.com), Mang Zhang (JD.com), Bing Bai (JD.com)
Tao Huang, Mang Zhang, and Bing Bai explain how JD.com uses Alluxio to provide support for ad hoc and real-time stream computing, using Alluxio-compatible HDFS URLs and Alluxio as a pluggable optimization component. To give just one example, one framework, JDPresto, has seen a 10x performance improvement on average.
3:30pm-4:10pm (40m) Streaming systems & real-time applications Data Integration and Data Pipelines, Data Platforms, Financial Services
Kafka at PayPal: Enabling 400 billion messages a day
Kevin Lu (PayPal), Maulin Vasavada (PayPal), Na Yang (PayPal)
PayPal is one of the biggest Kafka users in the industry; it manages and maintains over 40 production Kafka clusters in three geodistributed data centers and supports 400 billion Kafka messages a day. Kevin Lu, Maulin Vasavada, and Na Yang explore the management and monitoring PayPal applies to Kafka, from client-perceived statistics to configuration management, failover, and data loss auditing.
4:20pm-5:00pm (40m) Data engineering and architecture
TuneIn: How to get your jobs tuned while you are sleeping
Manoj Kumar (LinkedIn), Pralabh Kumar (LinkedIn), Arpan Agrawal (LinkedIn)
Have you ever tuned a Spark or MR job? If the answer is yes, you already know how difficult it is to tune more than a hundred parameters to optimize the resources used. Manoj Kumar, Pralabh Kumar, and Arpan Agrawal offer an overview of TuneIn, an auto-tuning tool developed to minimize the resource usage of jobs. Experiments have shown up to a 50% reduction in resource usage.
11:20am-12:00pm (40m) Data-driven business management, Strata Business Summit Machine Learning in the enterprise, Retail and e-commerce, Transportation and Logistics
The care and feeding of data scientists: Concrete tips for retaining your data science team
Michelangelo D'Agostino (ShopRunner)
Data scientists are hard to hire. But too often, companies struggle to find the right talent only to make avoidable mistakes that cause their best data scientists to leave. From org structure and leadership to tooling, infrastructure, and more, Michelangelo D'Agostino shares concrete (and inexpensive) tips for keeping your data scientists engaged, productive, and adding business value.
1:10pm-1:50pm (40m) Sponsored
Best practices for migrating big data workloads to Amazon Web Services (sponsored by Amazon Web Services)
Bruno Faria (Amazon Web Services)
Bruno Faria explains how to identify the components and workflows in your current environment and shares best practices to migrate these workloads to AWS.
2:00pm-2:40pm (40m) Data-driven business management, Strata Business Summit
Building it beautiful: Analyzing the effectiveness of platform products and marketing at scale
Josh Laurito (Squarespace)
Joshua Laurito explores systems Squarespace built for acquiring and enforcing consistency on obtained data and for inferring conclusions from a company’s marketing and product initiatives. Joshua discusses the intricacies of gathering and evaluating marketing and user data, from raising awareness to driving purchases, and shares results of previous analyses.
3:30pm-4:10pm (40m) Data engineering and architecture, Strata Business Summit Data Platforms, Media, Marketing, Advertising, Retail and e-commerce
Scaling data infrastructure in the fashion world; or, “What is this? Business intelligence for ants?”
Francesco Mucio (Francescomuc.io)
Francesco Mucio tells the story of how Zalando went from an old-school BI company to an AI-driven company built on a solid data platform. Along the way, he shares what Zalando learned in the process and the challenges that still lie ahead.
4:20pm-5:00pm (40m) Data-driven business management, Strata Business Summit Transportation and Logistics
Real-time machine intelligence in IndyCar and Tour de France
Yasuyuki Kataoka (NTT Innovation Institute, Inc.)
One of the challenges of sports data analytics is how to deliver machine intelligence beyond a mere real-time monitoring tool. Yasuyuki Kataoka highlights various real-time machine learning models in both IndyCar and Tour de France, sharing real-time data processing architectures, machine learning models, and demonstrations that deliver meaningful insights for players and fans.
11:20am-12:00pm (40m) Law, ethics, governance, Strata Business Summit Ethics and Privacy
Data and privacy at scale at Wikipedia
Nuria Ruiz (Wikimedia)
The Wikipedia community feels strongly that you shouldn’t have to provide personal information to participate in the free knowledge movement. Nuria Ruiz discusses the challenges that this strong privacy stance poses for the Wikimedia Foundation, including how it affects data collection, and details some creative workarounds that allow WMF to calculate metrics in a privacy-conscious way.
1:10pm-1:50pm (40m) Law, ethics, governance, Strata Business Summit Data preparation, governance and privacy, Ethics and Privacy
Enacting Data Subject Access Rights for GDPR with data services and data management
Jean-Michel Franco (Talend)
GDPR is more than another regulation to be handled by your back office. Enacting the GDPR's Data Subject Access Rights (DSAR) requires practical actions. Jean-Michel Franco outlines the practical steps to deploy governed data services.
2:00pm-2:40pm (40m) Strata Business Summit Health and Medicine, Text and Language processing and analysis
Digging for gold: Developing AI in healthcare against unstructured text data
Chiny Driscoll (MetiStream), Jawad Khan (Rush University Medical Center)
Chiny Driscoll and Jawad Khan offer an overview of a solution by Cloudera and MetiStream that lets healthcare providers automate the extraction, processing, and analysis of clinical notes within an electronic health record in batch or real time, improving care, identifying errors, and recognizing efficiencies in billing and diagnoses.
3:30pm-4:10pm (40m) Law, ethics, governance Data preparation, governance and privacy, Ethics and Privacy
Balancing stakeholder interests in personal data governance technology
LaVonne Reimer, JD (Lumenous)
GDPR asks us to rethink personal data systems—viewing UI/UX, consent management, and value-add data services through the eyes of subjects of the data. LaVonne Reimer explains why the opportunity in the $150B credit and risk industry is to deploy data governance technologies that balance the interests of individuals to control their own data with requirements for trusted data.
4:20pm-5:00pm (40m) Data-driven business management, Strata Business Summit Machine Learning in the enterprise
A day in the life of a data scientist: How do we train our teams to get started with AI?
Francesca Lazzeri (Microsoft), Jaya Susan Mathew (Microsoft)
With the growing buzz around data science, many professionals want to learn how to become a data scientist—the role Harvard Business Review called the "sexiest job of the 21st century." Francesca Lazzeri and Jaya Mathew explain what it takes to become a data scientist and how artificial intelligence solutions have started to reinvent businesses.
11:20am-12:00pm (40m) Data-driven business management, Strata Business Summit Machine Learning in the enterprise, Retail and e-commerce
Executive Briefing: From Business to AI—The missing pieces in becoming "AI ready"
Mikio Braun (Zalando)
In order to become "AI ready," an organization not only has to provide the right technical infrastructure for data collection and processing but also must learn new skills. Mikio Braun highlights three pieces companies often miss when trying to become AI ready: making the connection between business problems and AI technology, implementing AI-driven development, and running AI-based projects.
1:10pm-1:50pm (40m) Data-driven business management, Strata Business Summit Machine Learning in the enterprise, Transportation and Logistics
Executive Briefing: Analytics for executives—Building an approachable language to drive data science in your organization
Brandy Freitas (Pitney Bowes)
Data science is an approachable field given the right framing. Often, though, practitioners and executives are describing opportunities using completely different languages. Join Brandy Freitas to develop context and vocabulary around data science topics to help build a culture of data within your organization.
2:00pm-2:40pm (40m) Strata Business Summit, Streaming systems & real-time applications
Executive Briefing: What you need to know about fast data
Dean Wampler (Anyscale)
Streaming data systems, so-called "fast data," promise accelerated access to information, leading to new innovations and competitive advantages. But they aren't just faster versions of big data. They force architecture changes to meet new demands for reliability and dynamic scalability, more like microservices. Dean Wampler shares what you need to know to exploit fast data successfully.
3:30pm-4:10pm (40m) Data-driven business management, Strata Business Summit
Executive Briefing: Best practices for human in the loop—The business case for active learning
Paco Nathan (derwen.ai)
Deep learning works well when you have large labeled datasets, but not every team has those assets. Paco Nathan offers an overview of active learning, an ML variant that incorporates human-in-the-loop computing. Active learning focuses input from human experts, leveraging intelligence already in the system, and provides systematic ways to explore and exploit uncertainty in your data.
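As a rough illustration of the idea of exploiting uncertainty, the sketch below shows uncertainty sampling, one common active learning strategy. It is not taken from the session: the function name, the sample probabilities, and the labeling budget are all hypothetical, and the probabilities stand in for the output of any binary classifier.

```python
def select_for_labeling(probabilities, budget):
    """Pick the items a human expert should label next.

    `probabilities` holds a model's predicted positive-class probability
    for each unlabeled item (hypothetical values). Items closest to 0.5
    are the ones the model is least sure about.
    """
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return ranked[:budget]


# Hypothetical model outputs for five unlabeled items.
scores = [0.95, 0.52, 0.10, 0.45, 0.70]
to_label = select_for_labeling(scores, budget=2)
print(to_label)
```

Here the items scored 0.52 and 0.45 are routed to a human, while confident predictions such as 0.95 are left alone, so expert time goes where the model is least certain.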
4:20pm-5:00pm (40m) Sponsored
Conda, Docker, and Kubernetes: The cloud-native future of data science (sponsored by Anaconda)
Mathew Lodge (Anaconda)
The days of deploying Java code to Hadoop and Spark data lakes for data science and ML are numbered. Welcome to the future. Containers and Kubernetes make great language-agnostic distributed computing clusters: it's just as easy to deploy Python as it is Java. Mathew Lodge shows you how.
11:20am-12:00pm (40m) Data engineering and architecture, Expo Hall Data Platforms
Data at Netflix: See what’s next
Michelle Ufford (Netflix)
Michelle Ufford shares some of the cool things Netflix is doing with data and the big bets the company is making on data infrastructure, covering workflow orchestration, machine learning, interactive notebooks, centralized alerting, event-based processing, platform intelligence, and more.
1:10pm-1:50pm (40m) Data engineering and architecture, Expo Hall
The state of Postgres
Umur Cubukcu (Citus Data)
PostgreSQL is often regarded as the world’s most advanced open source database—and it’s on fire. Umur Cubukcu moves beyond the typical list of features in the next release to explore why so many new projects “just use Postgres” as their system of record (or system of engagement) at scale. Along the way, you’ll learn how PostgreSQL’s extension APIs are fueling innovations in relational databases.
2:00pm-2:40pm (40m) Data engineering and architecture, Expo Hall Model lifecycle management
Building a high-performance model serving engine from scratch using Kubernetes, GPUs, Docker, Istio, and TensorFlow
Chris Fregly (Amazon Web Services)
Chris Fregly details a full-featured, open source end-to-end TensorFlow model training and deployment system, using the latest advancements with Kubernetes, TensorFlow, and GPUs.
11:20am-12:00pm (40m) Sponsored
Assumptions, constraints, and risks: How the wrong assumptions can jeopardize any model (sponsored by IBM)
Jennifer Shin (8 Path Solutions | NYU Stern | IBM)
Common wisdom dictates that we should never make assumptions, but assumptions are essential in the creation of statistical models. Jennifer Shin explores how assumptions fit into the creation of a statistical model, the pitfalls of applying a model to data without taking the underlying assumptions into account, and how to identify datasets where the model and its assumptions are applicable.
1:10pm-1:50pm (40m) Sponsored
Quick, reliable, and cost-effective ways to operationalize big data apps (sponsored by Unravel)
Shivnath Babu (Unravel Data Systems | Duke University), Madhusudan Tumma (TIAA)
Operationalizing big data apps in a quick, reliable, and cost-effective manner remains a daunting task. Shivnath Babu and Madhusudan Tumma outline common problems and their causes and share best practices to find and fix these problems quickly and prevent such problems from happening in the first place.
2:00pm-2:40pm (40m) Sponsored
Getting the most out of advanced analytics with people (sponsored by Alteryx)
Patrick Nussbaumer (Alteryx)
There is a lot of buzz around data science and machine learning in the world today. Unfortunately, to truly innovate with data and advanced capabilities, organizations need to expand their focus beyond just a few specialists. Patrick Nussbaumer details how focusing on people can help improve analytic value and drive innovation.
3:30pm-4:10pm (40m) Data-driven business management
Why the internet of things doesn’t exist but will still reshape your business
Ajay Kulkarni (TimescaleDB)
Ajay Kulkarni explores the underlying changes that are characterizing the next wave of computing and shares several ways in which individual businesses and overall industries will be transformed.
4:20pm-5:00pm (40m) Sponsored
Assumptions, constraints, and risks: How the wrong assumptions can jeopardize any model (sponsored by IBM)
Jennifer Shin (8 Path Solutions | NYU Stern | IBM)
Common wisdom dictates that we should never make assumptions, but assumptions are essential in the creation of statistical models. Jennifer Shin explores how assumptions fit into the creation of a statistical model, the pitfalls of applying a model to data without taking the underlying assumptions into account, and how to identify datasets where the model and its assumptions are applicable.
11:20am-12:00pm (40m) Sponsored
Building the bridge from big data to ML, featuring Geotab (sponsored by Google Cloud)
Bob Bradley (Geotab), Chad W. Jennings (Google)
If your company isn’t good at analytics, it’s not ready for AI. Bob Bradley and Chad W. Jennings explain how the right data strategy can set you up for success in machine learning and artificial intelligence—the new ground for gaining competitive edge and creating business value. You'll then see an in-depth demonstration of Google technology from smart cities innovator Geotab.
1:10pm-1:50pm (40m) Sponsored
On the road to digital transformation, AI is a team sport (sponsored by Oracle + DataScience.com)
Ian Swanson (Oracle)
Ian Swanson explores why and how data scientists and line-of-business leaders must treat AI as a team sport and explains what tools are needed to deploy models and applications that truly inform decision making.
2:00pm-2:40pm (40m) Sponsored
Redis for velocity and volume: Fast data ingest and probabilistic data structures (sponsored by Redis Labs)
Kyle Davis (Redis Labs)
Kyle Davis explains how Redis can be used for ingesting high-velocity data from large-scale platforms and IoT data collections as well as for storing and querying data using probabilistic data structures that trade some precision for both higher speed and lower storage requirements. Along the way, Kyle shares examples and a demo of the solution.
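To make the precision-for-space trade concrete, here is a minimal sketch of a Bloom filter, one well-known probabilistic data structure. It is a plain Python illustration, not Redis code and not the demo from the session. The class name, bit-array size, and hash count are assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Set-membership sketch: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One bit per slot, packed into bytes; size is fixed up front.
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


seen = BloomFilter()
for device_id in ["sensor-1", "sensor-2", "sensor-3"]:
    seen.add(device_id)
print(seen.might_contain("sensor-2"))
```

The filter occupies 128 bytes no matter how many items are added. The cost is that `might_contain` can occasionally answer True for an item that was never added, and that error rate grows as the filter fills.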
3:30pm-4:10pm (40m) Financial Services, Temporal data and time-series analytics
Stochastic field theory for time series
Revant Nayar (FMI Technologies LLC)
Machine learning has so far underperformed in time series prediction (slowness and overfitting), and classical methods are ineffective at capturing nonlinearity. Revant Nayar shares an alternative approach that is faster and more transparent and does not overfit. It can also pick up regime changes in the time series and systematically captures all the nonlinearity of a given dataset.
11:20am-12:00pm (40m) Sponsored
Augmented data engineering: Leveraging machine learning in data profiling and discovery (sponsored by Io-Tahoe)
Arun Murugan (GE Digital), Jeff Miller (GE)
Arun Murugan and Jeff Miller detail how complex relationships are discovered and modeled to simplify analytics while keeping an Agile architecture for data acquisition. You’ll see how GE uses machine learning (powered by Io-Tahoe) in data discovery and profiling to support the development of a standard data model essential to enterprise use cases.
1:10pm-1:50pm (40m) Sponsored
From two weeks in Python to two hours in Pentaho: Building modern big data pipelines for machine learning (sponsored by Hitachi Vantara)
Dave Huh (Hitachi Vantara), Kevin Haas (Hitachi Vantara)
Data in most organizations today is massive, messy, and often found in silos. With so many sources to analyze, data engineers need to construct robust data pipelines using automation and minimize duplicate processes, as computation is costly for big data. David Huh shares strategies to construct data pipelines for machine learning, including one to reduce time to insight from weeks to hours.
2:00pm-2:40pm (40m) Sponsored
From analytic silos to analytic democratization: How (and why) companies make the shift (sponsored by Dataiku)
Deborah Reynolds (Pfizer), Kurt Muehmel (Dataiku)
By creating a collaborative and interactive analytic environment, a forward-thinking company may harness the best capabilities of its business analysts and data scientists to answer the company’s most pressing business questions. Deborah Reynolds and Kurt Muehmel explain how large enterprises can successfully put data at the core of everyday business decisions.
3:30pm-4:10pm (40m) Sponsored
The importance of experimental iteration: A data-centric approach to an AI project (sponsored by Globant)
Antonio Fragoso (Globant)
Antonio Fragoso explores the key aspects of implementing a natural language processing project within your organization and reveals the necessary steps for making it a success. Antonio focuses on how to leverage an iterative process that can pave the way toward building a successful product.
8:50am-9:00am (10m)
Thursday keynotes
Ben Lorica (O'Reilly), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.
9:00am-9:15am (15m)
Sound design and the future of experience
Amber Case (MIT Media Lab)
Amber Case outlines several methods that product designers and managers can use to improve everyday interactions through an understanding and application of sound design.
9:15am-9:20am (5m) Sponsored
Wait. . .pizza is a vegetable? Decoding regulations using machine learning (sponsored by IBM)
Dinesh Nirmal (IBM)
IBM Analytics’s Dinesh Nirmal solves school lunch and the struggle to keep ahead of regulations. With AI tech like deep learning and NLG, supplying meals to California’s kids leaps from enriching metadata for compliance to actionable insights for the business.
9:20am-9:30am (10m)
Practical ML today and tomorrow
Hilary Mason (Cloudera Fast Forward Labs)
Machine learning and artificial intelligence are exciting technologies, but real value comes from marrying those capabilities with the right business problems. Hilary Mason explores the current state of these technologies, investigates what's coming next in applied machine learning, and explains how to identify and execute on the right business opportunities at the right time.
9:30am-9:35am (5m) Sponsored
Derive value from analytics and AI at scale (sponsored by Intel)
马子雅 (Ziya Ma) (Intel)
Data is the fuel for analytics and AI workloads, but the challenges in using it are constant. Ziya Ma discusses how recent innovations from Intel in high-capacity persistent memory and open source software are accelerating production-scale deployments, delivering breakthrough optimizations and faster insights to a wide range of opportunities in the digital enterprise.
9:35am-9:55am (20m)
Quantifying forgiveness
Julia Angwin (ProPublica)
Algorithms are increasingly arbiters of forgiveness. Julia Angwin discusses what she has learned about forgiveness in her series of articles on algorithmic accountability and the lessons we all need to learn for the coming AI future.
9:55am-10:00am (5m) Sponsored
Smarter cities through Geotab with BigQuery ML and geospatial analytics (sponsored by Google Cloud)
Chad W. Jennings (Google)
Cities all over the world are using data and analytics to optimize infrastructure, but city planners are often held back by outdated data gathering methods and legacy analysis tools. Chad Jennings details how Geotab, a leader in IoT fleet logistics, brought BigQuery's unique machine learning and geospatial capabilities to its existing datasets to deliver a more capable solution to city planners.
10:05am-10:20am (15m) Ethics and Privacy
Brain-based human-machine interfaces: New developments, legal and ethical issues, and potential uses
Amanda Pustilnik (University of Maryland School of Law | Center for Law, Brain & Behavior, Mass. General Hospital)
Have you ever dreamed you could read minds? Do telekinesis? Maybe fly a magic carpet by thought alone? Until now, these powers have existed only in the realm of imagination or, more recently, video, AR, and VR games. Join Amanda Pustilnik to learn how brain-based human-machine interfaces are beginning to offer these powers in near-commercially-viable forms.
10:20am-10:25am (5m) Sponsored
The data imperative (sponsored by Zaloni)
Ben Sharma (Zaloni)
Once, a company could live 60-70 years on the S&P 500. Now it averages 15 years. If companies were people, this would be an epidemic on par with the Black Plague. But the same things that dragged humanity out of that dark age can drag companies out of this one.
10:25am-10:45am (20m) Ethics and Privacy
Black box: How AI will amplify the best and worst of humanity
Jacob Ward (CNN | Al Jazeera | PBS)
For most of us, our own mind is a black box—an all-powerful and utterly mysterious device that runs our lives for us, using rules and shortcuts of which we aren’t even aware. Jacob Ward reveals the relationship between the unconscious habits of our minds and the way that AI is poised to amplify them, alter them, maybe even reprogram them.
10:50am-11:20am (30m)
Break: Morning break sponsored by IBM
2:30pm-3:30pm (1h)
Break: Afternoon break sponsored by Google Cloud
8:00am-8:45am (45m)
Break: Morning Coffee
8:00am-8:30am (30m)
Speed Networking
Gather before keynotes on Thursday morning to enjoy casual conversation while meeting fellow attendees. If one of your goals at Strata is to meet new people, this session will jumpstart your networking with other attendees.
12:00pm-1:10pm (1h 10m)
Thursday Business Summit Lunch
Join Strata Business Summit speakers and attendees for a networking lunch on Thursday.
12:00pm-1:10pm (1h 10m)
Thursday Topic Tables at Lunch
Topic Table discussions are a great way to informally network with people in similar industries or interested in the same topics.