Presented By O'Reilly and Cloudera
Make Data Work
September 26–27, 2016: Training
September 27–29, 2016: Tutorials & Conference
New York, NY
 
1 E 06
9:00am Conquer the time series data pipeline with SMACK Patrick McFadin (DataStax)
1 E 07/1 E 08
9:00am Building machine-learning apps with Spark: MLlib, ML Pipelines, and GraphX Vartika Singh (Cloudera), Jayant Shekhar (Sparkflows Inc.)
1:30pm Deep learning with TensorFlow Martin Wicke (Google), Joshua Gordon (Google)
1 E 10/1 E 11
9:00am Practical machine learning Michael Li (The Data Incubator), Robert Schroll (The Data Incubator)
1 E 12/1 E 13
9:00am Data case studies Jen van der Meer (Reason Street), Jolene Jeffries (GE Digital), David Boyle (Audience Strategies), Josh Laurito (Squarespace), Nitin Kaul (Merck & Co., Inc.), Richard Baumgartner (Merck), Vinee Kumar (Dept of Transportation), Joanne Chen (Truveris), Renee DiResta (New Knowledge), Jaya Kolhatkar (Walmart), Ghazal Badiozamani (Elsevier), Mike Koelemay (Lockheed Martin), Erin Akred (DataKind), Michael Dowd (DataKind), Madhuri Kollu (Sabre), Tara Prakriya (Maana)
1 E 15/1 E 16
9:00am Just enough Scala for Spark Dean Wampler (Anyscale)
1:30pm Developing a modern enterprise data strategy Colette Glaeser (Silicon Valley Data Science), Edd Wilder-James (Google)
3D 12
9:00am Machine learning in Python Andreas Mueller (Columbia University)
1:30pm Interactive data applications in Python Bryan Van de Ven (Continuum Analytics), Sarah Bird (Continuum Analytics)
1B 01/02
9:00am Data 101 Marie Beaugureau (O'Reilly Media, Inc.), Edd Wilder-James (Google), Ben Sharma (Zaloni), Amihai Savir (EMC), Jerry Overton (DXC), Deborah Berebichez (Metis), Julia Rodriguez (Eagle Investment Systems)
1:30pm Deploying and managing Hive, Spark, and Impala in the public cloud Andrei Savu (Cloudera), Vinithra Varadharajan (Cloudera), Jennifer Wu (Cloudera), Matthew Jacobs (Cloudera)
3D 10
9:00am R for big data Garrett Grolemund (RStudio), Nathan Stephens (RStudio, Inc.)
1:30pm Introduction to visualizations using D3 Brian Suda (optional.is)
1B 03/04
9:00am Learn stream processing with Apache Beam Tyler Akidau (Google), Jesse Anderson (Big Data Institute)
1:30pm An introduction to Apache Kafka Ian Wrigley (StreamSets)
Hall 1C
9:00am Architecting a data platform John Akred (Silicon Valley Data Science), Stephen O'Sullivan (Data Whisperers), Mauricio Vacas (Silicon Valley Data Science)
Hall 1B
9:00am Spark camp: Exploring Wikipedia with Spark (sponsored by Databricks) Zoltan Toth (datapao.com)
1 E 09
9:00am A practitioner’s guide to securing your Hadoop cluster Michael Yoder (Cloudera), Benjamin Spivey (Cloudera), Mark Donsky (Okera), Mubashir Kazia (Cloudera)
1:30pm Guerrilla guide to Python and Apache Hadoop Juliet Hougland (Cloudera), Sean Owen (Cloudera)
1 E 14
9:00am FinData day Alistair Croll (Solve For Interesting), Juan Huerta (Goldman Sachs Consumer Lending Group), Robert Passarella (Alpha Features), Giannina Segnini (Journalism School, Columbia University), Mar Cabra (International Consortium of Investigative Journalists), Anand Sanwal (CB Insights), Michael Casey (MIT Media Lab), Diane Chang (Intuit), Jeff McMillan (Morgan Stanley), Tanvi Singh (Credit Suisse), Kelley Yohe (Swift Capital), Michelle Bonat (Data Simply), Susan Woodward (Sand Hill Econometrics)
5:00pm Opening Reception, sponsored by Deloitte Consulting and Waterline Data | Room: Hall 3E
7:00am Coffee Break | 10:30am - 11:00am Morning Break sponsored by DellEMC | Room: Break
12:30pm Lunch sponsored by Google | 3:00pm - 3:30pm Afternoon Break sponsored by Cisco | Room: Break
6:30pm Startup Showcase | Room: South Concourse
9:00am-12:30pm (3h 30m) IoT & real-time
Conquer the time series data pipeline with SMACK
Patrick McFadin (DataStax)
We as an industry are collecting more data every year. IoT, web, and mobile applications send torrents of bits to our data centers that have to be processed and stored, while users expect an always-on experience—leaving little room for error. Patrick McFadin explores how successful companies do this every day with powerful data pipelines built with SMACK: Spark, Mesos, Akka, Cassandra, and Kafka.
1:30pm-5:00pm (3h 30m) Data-driven business
Data science that works: Best practices for designing data-driven improvements, making them real, and driving change in your enterprise
Jerry Overton (DXC)
Join expert Jerry Overton as he explains how to make the business and technical aspects of your data strategy work together for best results.
9:00am-12:30pm (3h 30m) Spark & beyond
Building machine-learning apps with Spark: MLlib, ML Pipelines, and GraphX
Vartika Singh (Cloudera), Jayant Shekhar (Sparkflows Inc.)
Vartika Singh and Jayant Shekhar walk you through techniques for building and tuning machine-learning apps using Spark MLlib and Spark ML Pipelines and graph processing with GraphX.
1:30pm-5:00pm (3h 30m) Data science & advanced analytics
Deep learning with TensorFlow
Martin Wicke (Google), Joshua Gordon (Google)
Martin Wicke and Josh Gordon offer hands-on experience training and deploying a machine-learning system using TensorFlow, a popular open source library. You'll learn how to build machine-learning systems from simple classifiers to complex image-based models as well as how to deploy models in production using TensorFlow Serving.
9:00am-5:00pm (8h) Data science & advanced analytics
Practical machine learning
Michael Li (The Data Incubator), Robert Schroll (The Data Incubator)
Tianhui Li and Robert Schroll of the Data Incubator offer a foundation in building intelligent business applications using machine learning, walking you through all the steps to prototyping and production—data cleaning, feature engineering, model building and evaluation, and deployment—and diving into an application for anomaly detection and a personalized recommendation engine.
9:00am-5:00pm (8h)
Data case studies
Jen van der Meer (Reason Street), Jolene Jeffries (GE Digital), David Boyle (Audience Strategies), Josh Laurito (Squarespace), Nitin Kaul (Merck & Co., Inc.), Richard Baumgartner (Merck), Vinee Kumar (Dept of Transportation), Joanne Chen (Truveris), Renee DiResta (New Knowledge), Jaya Kolhatkar (Walmart), Ghazal Badiozamani (Elsevier), Mike Koelemay (Lockheed Martin), Erin Akred (DataKind), Michael Dowd (DataKind), Madhuri Kollu (Sabre), Tara Prakriya (Maana)
The road to a data-driven business is paved with hard-won lessons, painful mistakes, and clever insights. We're introducing a new Tutorial Day track packed with case studies, where you can hear from practitioners across a wide range of industries.
9:00am-12:30pm (3h 30m) Spark & beyond
Just enough Scala for Spark
Dean Wampler (Anyscale)
Apache Spark is written in Scala. Hence, many if not most data engineers adopting Spark are also adopting Scala, while most data scientists continue to use Python and R. Dean Wampler offers an overview of the core features of Scala you need to use Spark effectively, using hands-on exercises with the Spark APIs.
1:30pm-5:00pm (3h 30m) Data-driven business
Developing a modern enterprise data strategy
Colette Glaeser (Silicon Valley Data Science), Edd Wilder-James (Google)
How do you reconcile the business opportunity of big data and data science with the sea of possible technologies? Fundamentally, data should serve the strategic imperatives of a business—those key aspirations that define an organization’s future vision. Edd Wilder-James and Colette Glaeser explain how to create a modern data strategy that powers data-driven business.
9:00am-12:30pm (3h 30m) Data science & advanced analytics
Machine learning in Python
Andreas Mueller (Columbia University)
Scikit-learn, which provides easy-to-use interfaces to perform advanced analysis and build powerful predictive models, has emerged as one of the most popular open source machine-learning toolkits. Using scikit-learn and Python as examples, Andreas Mueller offers an overview of basic concepts of machine learning, such as supervised and unsupervised learning, cross-validation, and model selection.
1:30pm-5:00pm (3h 30m) Data science & advanced analytics
Interactive data applications in Python
Bryan Van de Ven (Continuum Analytics), Sarah Bird (Continuum Analytics)
Bryan Van de Ven and Sarah Bird demonstrate how to build intelligent apps in a week with Bokeh, Python, and optimization.
9:00am-12:30pm (3h 30m)
Data 101
Marie Beaugureau (O'Reilly Media, Inc.), Edd Wilder-James (Google), Ben Sharma (Zaloni), Amihai Savir (EMC), Jerry Overton (DXC), Deborah Berebichez (Metis), Julia Rodriguez (Eagle Investment Systems)
Data 101 introduces you to core principles of data architecture, teaches you how to build and manage successful data teams, and inspires you to do more with your data through real-world applications. Setting the foundation for deeper dives on the following days of Strata + Hadoop World, Data 101 reinforces data fundamentals and helps you focus on how data can solve your business problems.
1:30pm-5:00pm (3h 30m) Enterprise adoption
Deploying and managing Hive, Spark, and Impala in the public cloud
Andrei Savu (Cloudera), Vinithra Varadharajan (Cloudera), Jennifer Wu (Cloudera), Matthew Jacobs (Cloudera)
Public cloud usage for Hadoop workloads is accelerating. Consequently, Hadoop components have adapted to leverage cloud infrastructure. Andrei Savu, Vinithra Varadharajan, Matthew Jacobs, and Jennifer Wu explore best practices for Hadoop deployments in the public cloud and provide detailed guidance for deploying, configuring, and managing Hive, Spark, and Impala in the public cloud.
9:00am-12:30pm (3h 30m) Data science & advanced analytics
R for big data
Garrett Grolemund (RStudio), Nathan Stephens (RStudio, Inc.)
Garrett Grolemund and Nathan Stephens explore the new sparklyr package by RStudio, which provides a familiar interface between the R language and Apache Spark and communicates with the Spark SQL and the Spark ML APIs so R users can easily manipulate and analyze data at scale.
1:30pm-5:00pm (3h 30m) Visualization & user experience
Introduction to visualizations using D3
Brian Suda (optional.is)
Visualizations are a key part of conveying any dataset. D3 is the most popular, easiest, and most extensible way to get your data online in an interactive way. Brian Suda outlines best practices for good data visualizations and explains how you can build them using D3.
9:00am-12:30pm (3h 30m) IoT & real-time
Learn stream processing with Apache Beam
Tyler Akidau (Google), Jesse Anderson (Big Data Institute)
Come learn the basics of stream processing via a guided walkthrough of the most sophisticated and portable stream processing model on the planet—Apache Beam (incubating). Tyler Akidau and Jesse Anderson cover the basics of robust stream processing (windowing, watermarks, and triggers) with the option to execute exercises on top of the runner of your choice—Flink, Spark, or Google Cloud Dataflow.
1:30pm-5:00pm (3h 30m) IoT & real-time
An introduction to Apache Kafka
Ian Wrigley (StreamSets)
Ian Wrigley demonstrates how to leverage the capabilities of Apache Kafka to collect, manage, and process stream data for both big data projects and general-purpose enterprise data integration. Ian covers system architecture, use cases, and how to write applications that publish data to, and subscribe to data from, Kafka—no prior knowledge of Kafka required.
9:00am-12:30pm (3h 30m) Spark & beyond
Architecting a data platform
John Akred (Silicon Valley Data Science), Stephen O'Sullivan (Data Whisperers), Mauricio Vacas (Silicon Valley Data Science)
What are the essential components of a data platform? John Akred, Mauricio Vacas, and Stephen O'Sullivan explain how the various parts of the Hadoop, Spark, and big data ecosystems fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.
1:30pm-5:00pm (3h 30m) Hadoop use cases
Hadoop application architectures: Architecting a next-generation data platform for real-time ETL, data analytics, and data warehousing
Jonathan Seidman (Cloudera), Mark Grover (Lyft), Ted Malaska (Capital One)
Jonathan Seidman, Gwen Shapira, Mark Grover, and Ted Malaska demonstrate how to architect a modern, real-time big data platform and explain how to leverage components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics such as real-time ETL, change data capture, and machine learning.
9:00am-5:00pm (8h) Spark & beyond
Spark camp: Exploring Wikipedia with Spark
Zoltan Toth (datapao.com)
The real power and value proposition of Apache Spark is in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. Through hands-on examples, Zoltan Toth explores various Wikipedia datasets to illustrate a variety of ideal programming paradigms.
9:00am-12:30pm (3h 30m) Security
A practitioner’s guide to securing your Hadoop cluster
Michael Yoder (Cloudera), Benjamin Spivey (Cloudera), Mark Donsky (Okera), Mubashir Kazia (Cloudera)
Many Hadoop clusters lack even basic security controls. Michael Yoder, Ben Spivey, Mark Donsky, and Mubashir Kazia walk you through securing a Hadoop cluster. You'll start with a cluster with no security and then add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and complete data governance.
1:30pm-5:00pm (3h 30m) Data science & advanced analytics
Guerrilla guide to Python and Apache Hadoop
Juliet Hougland (Cloudera), Sean Owen (Cloudera)
Juliet Hougland and Sean Owen offer a practical overview of the basics of using Python data tools with a Hadoop cluster, covering HDFS connectivity and dealing with raw data files, running SQL queries with a SQL-on-Hadoop system like Apache Hive or Apache Impala (incubating), and using Apache Spark to write more complex analytical jobs.
9:00am-5:00pm (8h)
FinData day
Alistair Croll (Solve For Interesting), Juan Huerta (Goldman Sachs Consumer Lending Group), Robert Passarella (Alpha Features), Giannina Segnini (Journalism School, Columbia University), Mar Cabra (International Consortium of Investigative Journalists), Anand Sanwal (CB Insights), Michael Casey (MIT Media Lab), Diane Chang (Intuit), Jeff McMillan (Morgan Stanley), Tanvi Singh (Credit Suisse), Kelley Yohe (Swift Capital), Michelle Bonat (Data Simply), Susan Woodward (Sand Hill Econometrics)
Finance is information. From analyzing risk and detecting fraud to predicting payments and improving customer experience, data technologies are transforming the financial industry. And we're diving deep into this change with a new day of data-meets-finance talks, tailored for Strata + Hadoop World events in the world's financial hubs.
5:00pm-6:30pm (1h 30m) Event
Opening Reception
Grab a drink, mingle with fellow Strata + Hadoop World attendees, and see the latest technologies and products from leading companies in the data space.
7:00am-9:00am (2h)
Break: Coffee Break | 10:30am - 11:00am Morning Break sponsored by DellEMC
12:30pm-1:30pm (1h)
Break: Lunch sponsored by Google | 3:00pm - 3:30pm Afternoon Break sponsored by Cisco
6:30pm-8:00pm (1h 30m) Event
Startup Showcase
What new companies are at the leading edge of the data space? Meet some of the best, most innovative founders as they demonstrate their game-changing ideas at the Startup Showcase.