Presented By O’Reilly and Cloudera

Make Data Work

March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Tutorials

On Tuesday, March 6, choose from all-day and half-day tutorials. These expert-led presentations give you a chance to dive deep into the subject matter. Please note: to attend, you must register for a Gold or Silver pass; these passes do not include access to training courses.

Tuesday, March 6

9:00am–12:30pm Tuesday, March 6, 2018

Big data analytics and machine learning techniques to drive and grow business

Location: 210 A/E

Burcu Baran (LinkedIn), Wei Di (LinkedIn), Michael Li (LinkedIn), Chi-Yi Kuan (LinkedIn)

Average rating: 4.44 (9 ratings)

Burcu Baran, Wei Di, Michael Li, and Chi-Yi Kuan walk you through the big data analytics and data science lifecycle and share their experience and lessons learned leveraging advanced analytics and machine learning techniques such as predictive modeling to drive and grow business at LinkedIn.

9:00am–12:30pm Tuesday, March 6, 2018

A deep dive into running data analytic workloads in the cloud

Location: 210 D/H

Jason Wang (Cloudera), Mala Ramakrishnan (Cloudera), Stefan Salandy (Cloudera), Aishwarya Venkataraman (Cloudera), Vinithra Varadharajan (Cloudera), Aaron Myers (Cloudera)

Average rating: 3.25 (4 ratings)

Aishwarya Venkataraman, Jason Wang, Mala Ramakrishnan, Stefan Salandy, and Vinithra Varadharajan lead a deep dive into running data analytic workloads in a managed service capacity in the public cloud and highlight cloud infrastructure best practices.

9:00am–12:30pm Tuesday, March 6, 2018

Using R and Python for scalable data science, machine learning, and AI

Location: LL21 C/D

Mario Inchiosa (Microsoft), Vanja Paunic (Microsoft), Robert Horton (Microsoft), Debraj GuhaThakurta (Microsoft), Ali-Kazim Zaidi (Microsoft), Tomas Singliar (Microsoft), John-Mark Agosta (Microsoft)

Average rating: 4.00 (4 ratings)

R and Python top the list of languages used in data science and machine learning, and data scientists and engineers fluent in one of these languages are increasingly marketable. Come learn how to build and operationalize machine learning models using distributed functions and do scalable, end-to-end data science in R and Python on single machines, Spark clusters, and cloud-based infrastructure.

9:00am–12:30pm Tuesday, March 6, 2018

Modern real-time streaming architectures

Location: 210 B/F

Secondary topics: Graphs and Time-series

Karthik Ramasamy (Streamlio), Sanjeev Kulkarni (Streamlio), Sijie Guo (StreamNative), Arun Kejariwal (Independent)

Average rating: 5.00 (2 ratings)

Across diverse segments in industry, there has been a shift in focus from big data to fast data. Karthik Ramasamy, Sanjeev Kulkarni, Arun Kejariwal, and Sijie Guo walk you through state-of-the-art streaming architectures, streaming frameworks, and streaming algorithms, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them.

9:00am–12:30pm Tuesday, March 6, 2018

Stream processing with Kafka

Location: 210 C/G

Tim Berglund (Confluent)

Average rating: 4.36 (11 ratings)

Tim Berglund leads a basic architectural introduction to Kafka and walks you through using Kafka Streams and KSQL to process streaming data.

9:00am–12:30pm Tuesday, March 6, 2018

Learning PyTorch by building a recommender system

Location: LL21 A

Secondary topics: Graphs and Time-series

Mo Patel (Independent), Neejole Patel (Virginia Tech)

Average rating: 2.50 (4 ratings)

Since its arrival in early 2017, PyTorch has won over many deep learning researchers and developers due to its dynamic computation framework. Mo Patel and Neejole Patel walk you through using PyTorch to build a content recommendation model.

9:00am–12:30pm Tuesday, March 6, 2018

Building your first big data application on AWS

Location: LL21 B

Jorge Lopez (Amazon Web Services), Radhika Ravirala (Amazon Web Services), Paul Sears (Amazon Web Services), Ryan Nienhuis (Amazon Web Services), Randy Ridgley (Amazon Web Services)

Average rating: 4.50 (2 ratings)

Want to learn how to use Amazon's big data web services to launch your first big data application in the cloud? Jorge Lopez walks you through building a big data application using a combination of open source technologies and AWS managed services.

9:00am–12:30pm Tuesday, March 6, 2018

Getting started with TensorFlow

Location: LL21 E/F

Martin Görner (Google)

Average rating: 5.00 (3 ratings)

Martin Görner walks you through training and deploying a machine learning system using the popular open source library TensorFlow. Martin takes you from a conceptual overview all the way to building complex classifiers and explains how you can apply deep learning to complex problems in science and industry.

9:00am–5:00pm Tuesday, March 6, 2018

Spark camp: Apache Spark 2.0 for analytics and text mining with Spark ML

Location: LL20 D

Joseph Kambourakis (Databricks)

Join Joseph Kambourakis for an introduction to Apache Spark 2.0 core concepts with a focus on Spark's machine learning library, using text mining on real-world data as the primary end-to-end use case.

9:00am–12:30pm Tuesday, March 6, 2018

Getting ready for GDPR: Securing and governing hybrid, cloud, and on-premises big data deployments

Location: LL20 C

Mark Donsky (Okera), Andre Araujo (Cloudera), Syed Rafice (Cloudera), Mubashir Kazia (Cloudera)

Average rating: 2.00 (1 rating)

New regulations are driving compliance, governance, and security challenges for big data, and infosec and security groups must ensure a consistently secured and governed environment across multiple workloads that span a variety of deployments. Mark Donsky, Andre Araujo, Syed Rafice, and Mubashir Kazia walk you through securing a Hadoop cluster, with special attention to GDPR.

9:00am–5:00pm Tuesday, March 6, 2018

Media and Ad Tech Day

Location: LL20 B

David Boyle (Audience Strategies), Violeta Hennessey (Warner Bros.), April Chen (Civis Analytics), Sridhar Alla (BlueWhale), Noah Gift (UC Davis), Blake Irvine (Netflix), Kevin Lyons (Nielsen Marketing Cloud), Jennifer Webb (SuprFanz), Rizwan Patel (Caesars Entertainment), Anthony Accardo (Disney), Amanda Gerdes (Blizzard Entertainment), Aneesh Karve (Quilt), Pete Skomoroch (Workday)

Hear from innovators in ad tech, measurement, automation, and audience engagement about where the media industry is today and where it's likely to go next.

9:00am–5:00pm Tuesday, March 6, 2018

Data Case Studies

Location: LL20 A

Madhav Madaboosi (BP), Meenakshisundaram Thandavarayan (Infosys), Matt Conners (Microsoft), Katie Malone (Civis Analytics), Mike Prorock (mesur.io), Thomas Miller (Northwestern University), Ann Nguyen (Whole Whale), Jennie Shin (Kaiser Permanente), Valentin Bercovici (PencilDATA), Wayde Fleener (General Mills), Joe Dumoulin (Next IT), Jules Malin (GoPro), Taylor Martin (O'Reilly Media), Divya Ramachandran (Captricity)

Hear practical insights from household brands and global companies: the challenges they tackled, the approaches they took, and the benefits (and drawbacks) of their solutions.

1:30pm–5:00pm Tuesday, March 6, 2018

A/B testing at scale: Accelerating software innovation

Location: LL21 C/D

Ronny Kohavi (Microsoft), Alex Deng (Microsoft), Somit Gupta (Microsoft), Paul Raff (Microsoft)

Average rating: 4.00 (3 ratings)

Controlled experiments such as A/B tests have revolutionized the way software is being developed, allowing real users to objectively evaluate new ideas. Ronny Kohavi, Alex Deng, Somit Gupta, and Paul Raff lead an introduction to A/B testing and share lessons learned from one of the largest A/B testing platforms on the planet, running at Microsoft, which executes over 10K experiments a year.

1:30pm–5:00pm Tuesday, March 6, 2018

Time series data: Architecture and use cases

Location: 210 B/F

Secondary topics: Graphs and Time-series

Ted Malaska (Capital One)

Average rating: 2.80 (5 ratings)

If you have data that has a time factor to it, then you need to think in terms of time series datasets. Ted Malaska explores time series in all of its forms, from tumbling windows to sessionization in batch or in streaming. You'll gain exposure to the tools and background you need to be successful in the world of time-oriented data.

1:30pm–5:00pm Tuesday, March 6, 2018

Natural language understanding at scale with spaCy and Spark NLP

Location: LL20 C

David Talby (Pacific AI), Claudiu Branzan (Accenture), Alex Thomas (John Snow Labs)

Average rating: 5.00 (1 rating)

Natural language processing is a key component in many data science systems. David Talby, Claudiu Branzan, and Alex Thomas lead a hands-on tutorial on scalable NLP, using spaCy for building annotation pipelines, Spark NLP for building distributed natural language machine-learned pipelines, and Spark ML and TensorFlow for using deep learning to build and apply word embeddings.

1:30pm–5:00pm Tuesday, March 6, 2018

Deep learning-based search and recommendation systems using TensorFlow

Location: LL21 E/F

Abhishek Kumar (Publicis Sapient), Vijay Agneeswaran (Walmart Labs)

Average rating: 4.00 (3 ratings)

Abhishek Kumar and Vijay Srinivas Agneeswaran offer an introduction to deep learning-based recommendation and learning-to-rank systems using TensorFlow. You'll learn how to build a deep learning recommender system based on intent prediction, drawing on a real-world implementation for an ecommerce client.

1:30pm–5:00pm Tuesday, March 6, 2018

Streaming applications as microservices using Kafka, Akka Streams, and Kafka Streams

Location: 210 C/G

Dean Wampler (Anyscale), Boris Lublinsky (Lightbend)

Average rating: 3.50 (2 ratings)

Join Dean Wampler and Boris Lublinsky to learn how to build two microservice streaming applications based on Kafka using Akka Streams and Kafka Streams for data processing. You'll explore the strengths and weaknesses of each tool for particular design needs and contrast them with Spark Streaming and Flink, so you'll know when to choose them instead.

1:30pm–5:00pm Tuesday, March 6, 2018

Custom interactive visualizations and dashboards for one billion datapoints on a laptop in 30 lines of Python

Location: 210 D/H

James Bednar (Anaconda), Philipp Rudiger (Anaconda)

Average rating: 4.50 (2 ratings)

Python lets you solve data science problems by stitching together packages from its ecosystem, but it can be difficult to choose packages that work well together. James Bednar and Philipp Rudiger walk you through a concise, fast, easily customizable, and fully reproducible recipe for interactive visualization of millions or billions of datapoints, all in just 30 lines of Python code.

1:30pm–5:00pm Tuesday, March 6, 2018

How to use Impala's query plan and profile to fix performance issues

Location: LL21 A

Juan Yu (Cloudera)

Average rating: 4.75 (4 ratings)

Apache Impala (incubating) is an exceptional, best-of-breed massively parallel processing SQL query engine that is a fundamental component of the big data software stack. Juan Yu demystifies the cost model the Impala planner uses and how Impala optimizes queries, then explains how to identify performance bottlenecks through the query plan and profile and how to drive Impala to its full potential.

1:30pm–5:00pm Tuesday, March 6, 2018

Managing data science in the enterprise

Location: 210 A/E

Nick Elprin (Domino Data Lab)

Average rating: 5.00 (2 ratings)

The honeymoon era of data science is ending, and accountability is coming. Not content to wait for results that may or may not arrive, successful data science leaders deliver measurable impact on an increasing share of an enterprise's KPIs. Nick Elprin details how leading organizations have taken a holistic approach to people, process, and technology to build a sustainable competitive advantage.

1:30pm–5:00pm Tuesday, March 6, 2018

Deploying deep learning with TensorFlow

Location: LL21 B

Ron Bodkin (Google), Brian Foo (Google)

Average rating: 3.00 (2 ratings)

TensorFlow and Keras are popular libraries for machine learning because of their support for deep learning and GPU deployment. Join Ron Bodkin and Brian Foo to learn how to execute these libraries in production with vision and recommendation models and how to export, package, deploy, optimize, serve, monitor, and test models using Docker and TensorFlow Serving in Kubernetes.

©2018, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • confreg@oreilly.com