Presented By O'Reilly and Cloudera
Make Data Work
September 26–27, 2016: Training
September 27–29, 2016: Tutorials & Conference
New York, NY

Tutorials

On Tuesday, choose from a rich variety of half-day tutorials. These expert-led presentations give you a chance to dive deep into the subject matter. Please note: to attend, your registration package must include tutorials. Tutorial registration does not include access to training courses.

Tuesday, September 27

9:00am–12:30pm Tuesday, 09/27/2016
Location: 3D 12 Level: Intermediate
Tags: pydata
Andreas Mueller (Columbia University)
Average rating: ****.
(4.00, 6 ratings)
Scikit-learn, which provides easy-to-use interfaces to perform advanced analysis and build powerful predictive models, has emerged as one of the most popular open source machine-learning toolkits. Using scikit-learn and Python as examples, Andreas Mueller offers an overview of basic concepts of machine learning, such as supervised and unsupervised learning, cross-validation, and model selection. Read more.
9:00am–12:30pm Tuesday, 09/27/2016
Location: 1 E 15/1 E 16 Level: Intermediate
Dean Wampler (Anyscale)
Average rating: *****
(5.00, 4 ratings)
Apache Spark is written in Scala. Hence, many if not most data engineers adopting Spark are also adopting Scala, while most data scientists continue to use Python and R. Dean Wampler offers an overview of the core features of Scala you need to use Spark effectively, using hands-on exercises with the Spark APIs. Read more.
9:00am–12:30pm Tuesday, 09/27/2016
Location: 1 E 09 Level: Intermediate
Michael Yoder (Cloudera), Benjamin Spivey (Cloudera), Mark Donsky (Okera), Mubashir Kazia (Cloudera)
Average rating: ****.
(4.22, 9 ratings)
Many Hadoop clusters lack even basic security controls. Michael Yoder, Ben Spivey, Mark Donsky, and Mubashir Kazia walk you through securing a Hadoop cluster. You'll start with a cluster with no security and then add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and complete data governance. Read more.
9:00am–12:30pm Tuesday, 09/27/2016
Location: Hall 1C Level: Intermediate
John Akred (Silicon Valley Data Science), Stephen O'Sullivan (Data Whisperers), Mauricio Vacas (Silicon Valley Data Science)
Average rating: ***..
(3.07, 15 ratings)
What are the essential components of a data platform? John Akred, Mauricio Vacas, and Stephen O'Sullivan explain how the various parts of the Hadoop, Spark, and big data ecosystems fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads. Read more.
9:00am–12:30pm Tuesday, 09/27/2016
Location: 1 E 07/1 E 08 Level: Intermediate
Vartika Singh (Cloudera), Jayant Shekhar (Sparkflows Inc.)
Average rating: ***..
(3.11, 19 ratings)
Vartika Singh and Jayant Shekhar walk you through techniques for building and tuning machine-learning apps using Spark MLlib and Spark ML Pipelines and graph processing with GraphX. Read more.
9:00am–12:30pm Tuesday, 09/27/2016
Location: 1B 03/04 Level: Beginner
Tags: real-time
Tyler Akidau (Google), Jesse Anderson (Big Data Institute)
Average rating: ****.
(4.50, 6 ratings)
Come learn the basics of stream processing via a guided walkthrough of the most sophisticated and portable stream processing model on the planet—Apache Beam (incubating). Tyler Akidau and Jesse Anderson cover the basics of robust stream processing (windowing, watermarks, and triggers) with the option to execute exercises on top of the runner of your choice—Flink, Spark, or Google Cloud Dataflow. Read more.
9:00am–12:30pm Tuesday, 09/27/2016
Location: 1 E 06 Level: Intermediate
Tags: real-time
Patrick McFadin (DataStax)
Average rating: *****
(5.00, 1 rating)
We as an industry are collecting more data every year. IoT, web, and mobile applications send torrents of bits to our data centers that have to be processed and stored, while users expect an always-on experience—leaving little room for error. Patrick McFadin explores how successful companies do this every day with powerful data pipelines built with SMACK: Spark, Mesos, Akka, Cassandra, and Kafka. Read more.
9:00am–12:30pm Tuesday, 09/27/2016
Location: 3D 10 Level: Intermediate
Tags: r-lang
Garrett Grolemund (RStudio), Nathan Stephens (RStudio, Inc.)
Average rating: ****.
(4.20, 5 ratings)
Garrett Grolemund and Nathan Stephens explore the new sparklyr package by RStudio, which provides a familiar interface between the R language and Apache Spark and communicates with the Spark SQL and the Spark ML APIs so R users can easily manipulate and analyze data at scale. Read more.
9:00am–5:00pm Tuesday, 09/27/2016
Location: 1 E 12/1 E 13
Jen van der Meer (Reason Street), Jolene Jeffries (GE Digital), David Boyle (Audience Strategies), Josh Laurito (Squarespace), Nitin Kaul (Merck & Co., Inc.), Richard Baumgartner (Merck), Vinee Kumar (Dept of Transportation), Joanne Chen (Truveris), Renee DiResta (New Knowledge), Jaya Kolhatkar (Walmart), Ghazal Badiozamani (Elsevier), Mike Koelemay (Lockheed Martin), Erin Akred (DataKind), Michael Dowd (DataKind), Madhuri Kollu (Sabre), Tara Prakriya (Maana)
The road to a data-driven business is paved with hard-won lessons, painful mistakes, and clever insights. We're introducing a new Tutorial Day track packed with case studies, where you can hear from practitioners across a wide range of industries. Read more.
9:00am–5:00pm Tuesday, 09/27/2016
Location: 1 E 14
Alistair Croll (Solve For Interesting), Juan Huerta (Goldman Sachs Consumer Lending Group), Robert Passarella (Alpha Features), Giannina Segnini (Journalism School, Columbia University), Mar Cabra (International Consortium of Investigative Journalists), Anand Sanwal (CB Insights), Michael Casey (MIT Media Lab), Diane Chang (Intuit), Jeff McMillan (Morgan Stanley), Tanvi Singh (Credit Suisse), Kelley Yohe (Swift Capital), Michelle Bonat (Data Simply), Susan Woodward (Sand Hill Econometrics)
Average rating: ****.
(4.00, 14 ratings)
Finance is information. From analyzing risk and detecting fraud to predicting payments and improving customer experience, data technologies are transforming the financial industry. And we're diving deep into this change with a new day of data-meets-finance talks, tailored for Strata + Hadoop World events in the world's financial hubs. Read more.
9:00am–5:00pm Tuesday, 09/27/2016
Location: Hall 1B
Zoltan Toth (datapao.com)
Average rating: **...
(2.90, 10 ratings)
The real power and value proposition of Apache Spark is in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. Through hands-on examples, Zoltan Toth explores various Wikipedia datasets to illustrate a variety of ideal programming paradigms. Read more.
9:00am–5:00pm Tuesday, 09/27/2016
Location: 1 E 10/1 E 11
Michael Li (The Data Incubator), Robert Schroll (The Data Incubator)
Average rating: ***..
(3.00, 6 ratings)
Tianhui Li and Robert Schroll of the Data Incubator offer a foundation in building intelligent business applications using machine learning, walking you through all the steps to prototyping and production—data cleaning, feature engineering, model building and evaluation, and deployment—and diving into an application for anomaly detection and a personalized recommendation engine. Read more.
9:00am–12:30pm Tuesday, 09/27/2016
Location: 1B 01/02
Data 101 introduces you to core principles of data architecture, teaches you how to build and manage successful data teams, and inspires you to do more with your data through real-world applications. Setting the foundation for deeper dives on the following days of Strata + Hadoop World, Data 101 reinforces data fundamentals and helps you focus on how data can solve your business problems. Read more.
1:30pm–5:00pm Tuesday, 09/27/2016
Location: Hall 1C Level: Intermediate
Jonathan Seidman (Cloudera), Mark Grover (Lyft), Ted Malaska (Capital One)
Average rating: ****.
(4.08, 13 ratings)
Jonathan Seidman, Gwen Shapira, Mark Grover, and Ted Malaska demonstrate how to architect a modern, real-time big data platform and explain how to leverage components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics such as real-time ETL, change data capture, and machine learning. Read more.
1:30pm–5:00pm Tuesday, 09/27/2016
Location: 1 E 09 Level: Intermediate
Tags: pydata
Juliet Hougland (Cloudera), Sean Owen (Cloudera)
Average rating: ***..
(3.67, 3 ratings)
Juliet Hougland and Sean Owen offer a practical overview of the basics of using Python data tools with a Hadoop cluster, covering HDFS connectivity and dealing with raw data files, running SQL queries with a SQL-on-Hadoop system like Apache Hive or Apache Impala (incubating), and using Apache Spark to write more complex analytical jobs. Read more.
1:30pm–5:00pm Tuesday, 09/27/2016
Location: 1 E 15/1 E 16 Level: Intermediate
Colette Glaeser (Silicon Valley Data Science), Edd Wilder-James (Google)
Average rating: ****.
(4.57, 7 ratings)
How do you reconcile the business opportunity of big data and data science with the sea of possible technologies? Fundamentally, data should serve the strategic imperatives of a business—those key aspirations that define an organization’s future vision. Edd Wilder-James and Colette Glaeser explain how to create a modern data strategy that powers data-driven business. Read more.
1:30pm–5:00pm Tuesday, 09/27/2016
Location: 1B 01/02 Level: Intermediate
Tags: cloud
Andrei Savu (Cloudera), Vinithra Varadharajan (Cloudera), Jennifer Wu (Cloudera), Matthew Jacobs (Cloudera)
Public cloud usage for Hadoop workloads is accelerating. Consequently, Hadoop components have adapted to leverage cloud infrastructure. Andrei Savu, Vinithra Varadharajan, Matthew Jacobs, and Jennifer Wu explore best practices for Hadoop deployments in the public cloud and provide detailed guidance for deploying, configuring, and managing Hive, Spark, and Impala in the public cloud. Read more.
1:30pm–5:00pm Tuesday, 09/27/2016
Location: 1B 03/04 Level: Beginner
Tags: real-time
Ian Wrigley (StreamSets)
Average rating: *****
(5.00, 7 ratings)
Ian Wrigley demonstrates how to leverage the capabilities of Apache Kafka to collect, manage, and process stream data for both big data projects and general-purpose enterprise data integration. Ian covers system architecture, use cases, and how to write applications that publish data to, and subscribe to data from, Kafka—no prior knowledge of Kafka required. Read more.
1:30pm–5:00pm Tuesday, 09/27/2016
Location: 1 E 07/1 E 08 Level: Intermediate
Martin Wicke (Google), Joshua Gordon (Google)
Average rating: ***..
(3.47, 15 ratings)
Martin Wicke and Josh Gordon offer hands-on experience training and deploying a machine-learning system using TensorFlow, a popular open source library. You'll learn how to build machine-learning systems from simple classifiers to complex image-based models as well as how to deploy models in production using TensorFlow Serving. Read more.
1:30pm–5:00pm Tuesday, 09/27/2016
Location: 3D 10 Level: Non-technical
Brian Suda (optional.is)
Average rating: *****
(5.00, 2 ratings)
Visualizations are a key part of conveying any dataset. D3 is the most popular, easiest, and most extensible way to get your data online in an interactive way. Brian Suda outlines best practices for good data visualizations and explains how you can build them using D3. Read more.
1:30pm–5:00pm Tuesday, 09/27/2016
Location: 1 E 06 Level: Intermediate
Tags: r-lang
Average rating: ****.
(4.29, 7 ratings)
Join expert Jerry Overton as he explains how to make the business and technical aspects of your data strategy work together for best results. Read more.
1:30pm–5:00pm Tuesday, 09/27/2016
Location: 3D 12 Level: Intermediate
Bryan Van de Ven (Continuum Analytics), Sarah Bird (Continuum Analytics)
Average rating: ****.
(4.00, 4 ratings)
Bryan Van de Ven and Sarah Bird demonstrate how to build intelligent apps in a week with Bokeh, Python, and optimization. Read more.