Presented By O’Reilly and Cloudera
Make Data Work
September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY
 
1A 01/02
1A 03
9:00am TRAINING Real-time systems with Spark Streaming and Kafka Jesse Anderson (Big Data Institute)
1A 04/05
9:00am TRAINING Minimum-Viable Machine Learning: The Applied Data Science Bootcamp (Sponsored by DXC Technology) Jerry Overton (DXC), Ashim Bose (DXC), Samir Sehovic (DXC)
1A 17
9:00am TRAINING Apache Spark programming
1E 17
9:00am TRAINING Machine Learning with PyTorch Delip Rao (R7 Speech Science)
1E 02
9:00am TRAINING Hands-On Data Science with Python Zachary Glassman (The Data Incubator)
1A 06/07
9:00am Architecting a data platform for enterprise use Mark Madsen (Think Big Analytics), Todd Walter (Teradata)
1:30pm Building your first big data application on AWS Jorge A. Lopez (Amazon Web Services), Radhika Ravirala (Amazon Web Services (AWS)), Paul Sears (Amazon Web Services), Ryan Nienhuis (Amazon Web Services), Randy Ridgley (Amazon Web Services)
1A 08
9:00am Findata Day Alistair Croll (Solve For Interesting), Amro Alkhatib (National Health Insurance Company - Daman), Mridul Mishra (Fidelity Investments), Patrick Angeles (Cloudera), Andreas Kohlmaier (MunichRe), Paul Lashmet (Arcadia Data), Laura Eisenhardt (iKnow Solutions), Robin Way (Corios), Theresa Johnson (Airbnb), Jane Tran (Unqork)
1A 12/14
1:30pm Recurrent Neural Networks for timeseries analysis Bruno Gonçalves (New York University)
1A 15/16
9:00am Machine Learning from Scratch in TensorFlow Dylan Bargteil (The Data Incubator)
1A 10
9:00am Managing Data Science in the Enterprise Nick Elprin (Domino Data Lab)
1:30pm From Theory to Data Product - Applying Data Science Methods to Effect Business Change Janet Forbes (T4G), Danielle Leighton (T4G), Lindsay Brin (T4G)
1A 21/22
9:00am Deep Learning Methods for Natural Language Processing Garrett Hoffman (StockTwits)
1:30pm Natural Language Understanding at Scale with Spark NLP David Talby (Pacific AI), Claudiu Branzan (G2 Web Services), Alexander Thomas (Indeed)
1A 23/24
9:00am Practical Techniques for Interpreting Machine Learning Models Patrick Hall (H2O.ai | George Washington University), Navdeep Gill (H2O.ai), Megan Kurka (H2O.ai), Mark Chan (H2O.ai)
1:30pm How to be fair: a tutorial for beginners Aileen Nielsen (One Drop)
1E 06
1:30pm Data Science with Unix Power Tools Jeroen Janssens (Data Science Workshops B.V.)
1E 07/08
9:00am Learning Machine Learning using Astronomy data sets Viviana Acquaviva (CUNY New York City College of Technology)
1:30pm Leveraging Spark and deep learning frameworks to understand data at scale Vartika Singh (Cloudera), Suyash Ramineni (Cloudera), Juan Yu (Cloudera), Steven Totman (Cloudera), Marton Balassi (Cloudera)
1E 09
9:00am Deep learning-based search and recommendation systems using TensorFlow Dr. Vijay Srinivas Agneeswaran (SapientRazorfish), Abhishek Kumar (SapientRazorfish)
1:30pm From Training to Serving: Deploying Tensorflow Models with Kubernetes Brian Foo (Google), Jay Smith (Google), David Aronchick (Google)
1E 10
9:00am Data Case Studies Alistair Croll (Solve For Interesting), Katharina Warzel (EveryMundo), Mike Berger (Mount Sinai Health System), Sam Helmich (Deere & Company), Stephanie Fischer (datanizing GmbH), Maryam Jahanshahi (TapRecruit), Greg Quist (SmartCover Systems), Ann Nguyen (Whole Whale), Abhimanyu Verma (Novartis), Steve Otto (Navistar), Jennifer Lim (Cerner), Anand S (Gramener)
1E 11
9:00am Getting ready for GDPR: securing and governing hybrid, cloud and on-prem big data deployments Mark Donsky (Cloudera), Syed Rafice (Cloudera), Mubashir Kazia (Cloudera), Ifigeneia Derekli (Cloudera), Camila Hiskey (Cloudera)
1:30pm Architecting a next-generation data platform Ted Malaska (Blizzard Entertainment), Jonathan Seidman (Cloudera)
1E 12/13
9:00am Designing Modern Streaming Data Applications Arun Kejariwal (MZ), Karthik Ramasamy (Streamlio)
1:30pm Hands-on Kafka Streaming Microservices with Akka Streams and Kafka Streams Dean Wampler (Lightbend), Boris Lublinsky (Lightbend)
1E 14
9:00am Stream Processing with Kafka and KSQL Tim Berglund (Confluent)
1:30pm Running multidisciplinary big data workloads in the cloud Sudhanshu Arora (Cloudera), Tony Wu (Cloudera), Stefan Salandy (Cloudera), Suraj Acharya (Cloudera)
1E 15/16
9:00am Model serving and management at scale using open-source tools Dan Crankshaw (UC Berkeley RISELab)
1:30pm Apache Metron: Open Source Cyber Security at Scale Carolyn Duby (Hortonworks)
5:00pm Opening Reception | Room: Expo Hall
10:30am Morning Break | Room: TBD
3:00pm Afternoon Break | Room: TBD
12:30pm Lunch | Room: 3A
9:00am-5:00pm (8h)
Expand Your Data Science and Machine Learning Skills (Python, R, SQL, Spark, TensorFlow)
Ian Cook (Cloudera)
Advancing your career in data science requires learning new languages and frameworks—but learners face an overwhelming array of choices, with different syntaxes, conventions, and terminology. The instructor will simplify the learning process by elucidating the abstractions common to these systems. Through hands-on exercises, participants will overcome obstacles to getting started using new tools.
9:00am-5:00pm (8h)
Real-time systems with Spark Streaming and Kafka
Jesse Anderson (Big Data Institute)
To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data and how will you process that much data? Jesse Anderson explores the latest real-time frameworks and explains how to choose the right one for your company.
9:00am-5:00pm (8h) Sponsored, Strata Business Summit
Minimum-Viable Machine Learning: The Applied Data Science Bootcamp (Sponsored by DXC Technology)
Jerry Overton (DXC), Ashim Bose (DXC), Samir Sehovic (DXC)
Acquiring machine-learning (ML) technology is relatively straightforward, but ML must be applied to be useful. In this one-day boot camp, we teach students how to apply advanced analytics in ways that reshape the enterprise and improve outcomes. This training is equal parts hackathon, presentation, and group participation.
9:00am-5:00pm (8h)
Apache Spark programming
The instructor walks you through the core APIs for using Spark, fundamental mechanisms and basic internals of the framework, SQL and other high-level data access tools, and Spark’s streaming capabilities and machine learning APIs.
9:00am-5:00pm (8h) Deep Learning
Machine Learning with PyTorch
Delip Rao (R7 Speech Science)
Delip Rao explores machine learning and deep learning with PyTorch and walks you through how to build effective models for real-world data.
9:00am-5:00pm (8h)
Hands-On Data Science with Python
Zachary Glassman (The Data Incubator)
The Data Incubator offers a foundation in building intelligent business applications using machine learning. We will walk through all the steps - from prototyping to production - of developing a machine learning pipeline. We’ll look at data cleaning, feature engineering, model building/evaluation, and deployment. Students will extend these models into an application using a real-world dataset.
9:00am-12:30pm (3h 30m) Data engineering and architecture Data Platforms
Architecting a data platform for enterprise use
Mark Madsen (Think Big Analytics), Todd Walter (Teradata)
Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build multi-use data infrastructure that is not subject to past constraints. We will explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure.
1:30pm-5:00pm (3h 30m) Big data and data science in the cloud, Data engineering and architecture
Building your first big data application on AWS
Jorge A. Lopez (Amazon Web Services), Radhika Ravirala (Amazon Web Services (AWS)), Paul Sears (Amazon Web Services), Ryan Nienhuis (Amazon Web Services), Randy Ridgley (Amazon Web Services)
Want to learn how to use Amazon's big data web services to launch your first big data application in the cloud? Jorge Lopez walks you through building a big data application using a combination of open source technologies and AWS managed services.
9:00am-5:00pm (8h)
Findata Day
Alistair Croll (Solve For Interesting), Amro Alkhatib (National Health Insurance Company - Daman), Mridul Mishra (Fidelity Investments), Patrick Angeles (Cloudera), Andreas Kohlmaier (MunichRe), Paul Lashmet (Arcadia Data), Laura Eisenhardt (iKnow Solutions), Robin Way (Corios), Theresa Johnson (Airbnb), Jane Tran (Unqork)
From analyzing risk and detecting fraud to predicting payments and improving customer experience, take a deep dive into the ways data technologies are transforming the financial industry.
9:00am-12:30pm (3h 30m) Data science and machine learning
Building A Large-Scale Machine Learning Application Using Amazon SageMaker and Spark
David Arpin (Amazon Web Services)
Outline:
- What is Amazon SageMaker? A quick product overview of AWS's newest ML platform
- Create a Spark EMR cluster
- Integrate SageMaker algorithms into Spark pipelines
- Ensemble multiple models for a real-time prediction task
1:30pm-5:00pm (3h 30m) Data science and machine learning Deep Learning, Temporal data and time-series analytics
Recurrent Neural Networks for timeseries analysis
Bruno Gonçalves (New York University)
The world is ever changing. As a result, many of the systems and phenomena we are interested in evolve over time, resulting in time-evolving datasets. Timeseries often display many interesting properties and levels of correlation. In this tutorial, we will introduce students to the use of recurrent neural networks and LSTMs to model and forecast different kinds of timeseries.
9:00am-5:00pm (8h) Deep Learning
Machine Learning from Scratch in TensorFlow
Dylan Bargteil (The Data Incubator)
The TensorFlow library provides for the use of data flow graphs for numerical computations, with automatic parallelization across several CPUs or GPUs. This architecture makes it ideal for implementing neural networks and other machine learning algorithms. This training will introduce TensorFlow's capabilities through its Python interface.
9:00am-12:30pm (3h 30m) Data-driven business management, Strata Business Summit Machine Learning in the enterprise
Managing Data Science in the Enterprise
Nick Elprin (Domino Data Lab)
The honeymoon era of data science is ending, and accountability is coming. Not content to wait for results that may or may not arrive, successful data science leaders deliver measurable impact on an increasing share of an enterprise’s KPIs. Nick Elprin details how leading organizations have taken a holistic approach to people, process, and technology to build a sustainable competitive advantage.
1:30pm-5:00pm (3h 30m) Data-driven business management, Strata Business Summit Machine Learning in the enterprise
From Theory to Data Product - Applying Data Science Methods to Effect Business Change
Janet Forbes (T4G), Danielle Leighton (T4G), Lindsay Brin (T4G)
This tutorial is a primer on crafting well-conceived data science projects on course toward uncovering valuable business insights. Using case studies and hands-on skills development, we will teach techniques that are essential for a variety of audiences invested in effecting real business change.
9:00am-12:30pm (3h 30m) Data science and machine learning Deep Learning, Text and Language processing and analysis
Deep Learning Methods for Natural Language Processing
Garrett Hoffman (StockTwits)
This workshop will review deep learning methods used for natural language processing and natural language understanding tasks while working on a live example with StockTwits data using Python and TensorFlow. Methods we review include Word2Vec, recurrent neural networks and their variants (LSTM, GRU), and convolutional neural networks.
1:30pm-5:00pm (3h 30m) Data science and machine learning Text and Language processing and analysis
Natural Language Understanding at Scale with Spark NLP
David Talby (Pacific AI), Claudiu Branzan (G2 Web Services), Alexander Thomas (Indeed)
David Talby, Claudiu Branzan, and Alex Thomas lead a hands-on tutorial for scalable NLP using the highly performant, highly scalable, open source Spark NLP library. You’ll spend about half your time coding as you work through four sections, each with an end-to-end working codebase that you can change and improve.
9:00am-12:30pm (3h 30m) Ethics and Privacy, Health and Medicine
Practical Techniques for Interpreting Machine Learning Models
Patrick Hall (H2O.ai | George Washington University), Navdeep Gill (H2O.ai), Megan Kurka (H2O.ai), Mark Chan (H2O.ai)
Transparency, auditability, and stability are crucial for business adoption and human acceptance of complex machine learning models. This technical tutorial will share practical and productizable approaches for explaining, testing, and visualizing machine learning models through a series of publicly available examples using open source, Python-friendly tools such as GraphViz, H2O, and XGBoost.
1:30pm-5:00pm (3h 30m) Data science and machine learning Ethics and Privacy
How to be fair: a tutorial for beginners
Aileen Nielsen (One Drop)
There is mounting evidence that the widespread deployment of machine learning and artificial intelligence in business and government applications is likely reproducing or even amplifying existing prejudices and social inequalities. This tutorial is designed to give knowledge and tools to data scientists so they can identify and avoid bias and other unfairness in their analyses.
9:00am-12:30pm (3h 30m) Visualization and user experience
Making interactive browser-based visualizations easy in Python
James Bednar (Anaconda)
Python lets you solve data-science problems by stitching together packages from the Python ecosystem, but it can be difficult to assemble the right tools to solve real-world problems. Here we show how to use the 15+ packages covered by the new PyViz.org initiative to make it simple to build interactive plots and dashboards, even for large, streaming, and highly multidimensional data.
1:30pm-5:00pm (3h 30m) Data science and machine learning
Data Science with Unix Power Tools
Jeroen Janssens (Data Science Workshops B.V.)
The Unix command line remains an amazing environment for efficiently performing tedious but essential data science tasks. By combining small, powerful command-line tools, you can quickly scrub, explore, and model your data as well as hack together prototypes. This hands-on workshop is based on the O’Reilly book Data Science at the Command Line, written by instructor Jeroen Janssens.
9:00am-12:30pm (3h 30m) Data science and machine learning
Learning Machine Learning using Astronomy data sets
Viviana Acquaviva (CUNY New York City College of Technology)
We present an intermediate Machine Learning tutorial based on actual problems in Astronomy research. Our strengths are that we use interesting, diverse, publicly available data sets; we feature students' feedback as "best and worst" content; we focus on the customization of algorithms and evaluation metrics required by scientific applications; and we propose open problems to our participants.
1:30pm-5:00pm (3h 30m) Data science and machine learning Deep Learning
Leveraging Spark and deep learning frameworks to understand data at scale
Vartika Singh (Cloudera), Suyash Ramineni (Cloudera), Juan Yu (Cloudera), Steven Totman (Cloudera), Marton Balassi (Cloudera)
Vartika Singh and Juan Yu outline approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, text, etc.) that leverage Spark, its extended ecosystem of libraries, and deep learning frameworks.
9:00am-12:30pm (3h 30m) Data science and machine learning Deep Learning, Recommendation Systems
Deep learning-based search and recommendation systems using TensorFlow
Dr. Vijay Srinivas Agneeswaran (SapientRazorfish), Abhishek Kumar (SapientRazorfish)
Abhishek Kumar and Vijay Srinivas Agneeswaran offer an introduction to deep learning-based recommendation and learning-to-rank systems using TensorFlow. You'll learn how to build a recommender system based on intent prediction using deep learning that is based on a real-world implementation for an ecommerce client.
1:30pm-5:00pm (3h 30m) Data engineering and architecture Model lifecycle management
From Training to Serving: Deploying Tensorflow Models with Kubernetes
Brian Foo (Google), Jay Smith (Google), David Aronchick (Google)
TensorFlow and Keras are popular libraries for training deep models due to hardware accelerator support. Join this tutorial to learn how to bring deep learning models from training to serving in a cloud production environment. You will learn how to unit-test, export, package, deploy, optimize, serve, monitor, and test models using Docker and TensorFlow Serving in Kubernetes.
9:00am-5:00pm (8h)
Data Case Studies
Alistair Croll (Solve For Interesting), Katharina Warzel (EveryMundo), Mike Berger (Mount Sinai Health System), Sam Helmich (Deere & Company), Stephanie Fischer (datanizing GmbH), Maryam Jahanshahi (TapRecruit), Greg Quist (SmartCover Systems), Ann Nguyen (Whole Whale), Abhimanyu Verma (Novartis), Steve Otto (Navistar), Jennifer Lim (Cerner), Anand S (Gramener)
Hear practical insights from household brands and global companies: the challenges they tackled, approaches they took, and the benefits—and drawbacks—of their solutions.
9:00am-12:30pm (3h 30m) Data engineering and architecture, Law, ethics, governance Data preparation, governance and privacy, Ethics and Privacy
Getting ready for GDPR: securing and governing hybrid, cloud and on-prem big data deployments
Mark Donsky (Cloudera), Syed Rafice (Cloudera), Mubashir Kazia (Cloudera), Ifigeneia Derekli (Cloudera), Camila Hiskey (Cloudera)
New regulations such as GDPR are driving new compliance, governance, and security challenges for big data. Infosec and security groups must ensure a consistently secured and governed environment across multiple workloads that span on-prem, private cloud, multi-cloud, and hybrid cloud. We will share hands-on best practices for meeting these challenges, with special attention to GDPR.
1:30pm-5:00pm (3h 30m) Data engineering and architecture Data Platforms
Architecting a next-generation data platform
Ted Malaska (Blizzard Entertainment), Jonathan Seidman (Cloudera)
Using Customer 360 and the Internet of Things as examples, Jonathan Seidman and Ted Malaska explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Flink, Kudu, Spark Streaming, and Spark SQL and modern storage engines to enable new forms of data processing and analytics.
9:00am-12:30pm (3h 30m) Data engineering and architecture, Streaming systems and real-time applications Data Platforms
Designing Modern Streaming Data Applications
Arun Kejariwal (MZ), Karthik Ramasamy (Streamlio)
In this tutorial, we will walk the audience through the landscape of state-of-the-art systems for each stage of an end-to-end data processing pipeline: messaging frameworks, streaming computing frameworks, and storage frameworks for real-time data. We will also walk through case studies from IoT, gaming, and healthcare, and share our experiences operating these systems at Internet scale.
1:30pm-5:00pm (3h 30m) Data engineering and architecture, Streaming systems and real-time applications
Hands-on Kafka Streaming Microservices with Akka Streams and Kafka Streams
Dean Wampler (Lightbend), Boris Lublinsky (Lightbend)
This hands-on tutorial builds streaming apps as microservices using Kafka with Akka Streams and Kafka Streams. We'll assess the strengths and weaknesses of each tool for particular needs, so you'll be better informed when choosing between them. We'll contrast them with Spark Streaming and Flink, including when to choose them instead. The sample apps demonstrate ML model serving ideas.
9:00am-12:30pm (3h 30m) Data engineering and architecture, Streaming systems and real-time applications
Stream Processing with Kafka and KSQL
Tim Berglund (Confluent)
A solid introduction to Apache Kafka as a streaming data platform. We'll cover its internal architecture, APIs, and platform components like Kafka Connect and Kafka Streams—then finish with an exercise processing streaming data using KSQL, the new SQL-like declarative stream processing language for Kafka.
1:30pm-5:00pm (3h 30m) Big data and data science in the cloud, Data engineering and architecture
Running multidisciplinary big data workloads in the cloud
Sudhanshu Arora (Cloudera), Tony Wu (Cloudera), Stefan Salandy (Cloudera), Suraj Acharya (Cloudera)
Organizations now run diverse, multidisciplinary big data workloads that span data engineering, analytic database, and data science applications. Many of these workloads operate on the same underlying data and the workloads themselves can be transient or long-running in nature. One of the challenges we will explore is keeping the data context consistent across these various workloads.
9:00am-12:30pm (3h 30m) Data science and machine learning Model lifecycle management
Model serving and management at scale using open-source tools
Dan Crankshaw (UC Berkeley RISELab)
This tutorial consists of three parts. First, I will present an overview of the current challenges in deploying machine learning applications into production and provide a survey of the current state of prediction serving infrastructure. Next, I will provide a deep dive on the Clipper serving system. Finally, I will run a hands-on workshop for getting started with Clipper.
1:30pm-5:00pm (3h 30m) Data engineering and architecture, Platform security and cybersecurity
Apache Metron: Open Source Cyber Security at Scale
Carolyn Duby (Hortonworks)
Learn how to find the cybersecurity threat needle in your event haystack using Apache Metron: a real-time, horizontally scalable, open-source platform. After this interactive overview of the platform's major features, you will be ready to analyze your own haystack back at the office.
5:00pm-6:30pm (1h 30m)
Opening Reception
Enjoy delicious snacks and beverages with fellow Strata attendees, speakers, and sponsors at the Opening Reception, happening immediately after tutorials on Tuesday.
10:30am-11:00am (30m)
Break: Morning Break
3:00pm-3:30pm (30m)
Break: Afternoon Break
12:30pm-1:30pm (1h)
Break: Lunch