Presented By O’Reilly and Cloudera

San Francisco • London • New York

Make Data Work

September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

1A 17

9:00am TRAINING Apache Spark programming Kenneth Jones (Databricks, Inc.)

1A 06/07

9:00am Architecting a data platform for enterprise use Mark Madsen (Teradata), Todd Walter (Archimedata)

1:30pm Architecting a next-generation data platform Ted Malaska (Capital One), Jonathan Seidman (Cloudera)

1A 08

9:00am Findata Day Alistair Croll (Solve For Interesting), Robert Passarella (Alpha Features), Amro Alkhatib (National Health Insurance Company-Daman), Mridul Mishra (Fidelity Investments), Patrick Angeles (Cloudera), James Psota (Panjiva), Andreas Kohlmaier (Munich Re), Paul Lashmet (Arcadia Data), Nick Curcuru (Mastercard), Robin Way (Corios), Theresa Johnson (Airbnb), Jane Tran (Unqork), Swatee Singh (American Express)

1A 12/14

9:00am Managing data science in the enterprise Joshua Poduska (Domino Data Lab), Patrick Harrison (S&P Global)

1:30pm Recurrent neural networks for time series analysis Bruno Gonçalves (Data For Science)

1A 15/16

9:00am TRAINING Real-time systems with Spark Streaming and Kafka Jesse Anderson (Big Data Institute)

1A 10

9:00am Building a large-scale machine learning application using Amazon SageMaker and Spark David Arpin (Amazon Web Services)

1:30pm Data science with Unix power tools Jeroen Janssens (Data Science Workshops)

1A 21/22

9:00am Deep learning methods for natural language processing Garrett Hoffman (StockTwits)

1:30pm Natural language understanding at scale with Spark NLP David Talby (Pacific AI), Claudiu Branzan (Accenture), Alex Thomas (John Snow Labs)

1A 23/24

9:00am Practical techniques for interpreting machine learning models Patrick Hall (bnh.ai | H2O.ai), Avni Wadhwa (H2O.ai), Mark Chan (H2O.ai)

1:30pm Hands-on Kafka streaming microservices with Akka Streams and Kafka Streams Dean Wampler (Anyscale), Boris Lublinsky (Lightbend)

1E 07/08

9:00am Stream processing with Kafka and KSQL Tim Berglund (Confluent)

1:30pm Leveraging Spark and deep learning frameworks to understand data at scale Vartika Singh (Cloudera), Alan Silva (Cloudera), Alex Bleakley (Cloudera), Steven Totman (Cloudera), Mirko Kämpf (Cloudera), Syed Nasar (Cloudera)

1E 09

9:00am Making interactive browser-based visualizations easy in Python James Bednar (Anaconda)

1:30pm From training to serving: Deploying TensorFlow models with Kubernetes Brian Foo (Google), Holden Karau (Independent), Jay Smith (Google)

1E 10

9:00am Data Case Studies Paco Nathan (derwen.ai), Katharina Warzel (EveryMundo), Mike Berger (Mount Sinai Health System), Sam Helmich (Deere & Company), Stephanie Fischer (datanizing GmbH), Maryam Jahanshahi (TapRecruit), Greg Quist (SmartCover Systems), Ann Nguyen (Whole Whale), Steve Otto (Navistar), Jennifer Lim (Cerner), S Anand (Gramener), Ian Brooks (Cloudera)

1E 11

9:00am Getting ready for GDPR: Securing and governing hybrid, cloud, and on-premises big data deployments, step by step Mark Donsky (Okera), Syed Rafice (Cloudera), Mubashir Kazia (Cloudera), Ifigeneia Derekli (Cloudera), Camila Hiskey (Cloudera)

1:30pm How to be fair: A tutorial for beginners Aileen Nielsen (Skillman Consulting)

1E 12/13

9:00am Designing modern streaming data applications Arun Kejariwal (Independent), Karthik Ramasamy (Streamlio)

1:30pm Building your first big data application on AWS Jorge Lopez (Amazon Web Services), Radhika Ravirala (Amazon Web Services), Paul Sears (Amazon Web Services), Bruno Faria (Amazon Web Services)

1E 14

9:00am Learning machine learning using astronomy datasets Viviana Acquaviva (CUNY New York City College of Technology)

1:30pm Running multidisciplinary big data workloads in the cloud Sudhanshu Arora (Cloudera), Stefan Salandy (Cloudera), Suraj Acharya (Cloudera), Brandon Freeman (Cloudera), Jason Wang (Cloudera), Shravan Pabba (Cloudera)

1E 17

9:00am TRAINING Hands-on data science with Python Zachary Glassman (The Data Incubator)

1E 15/16

9:00am Deep learning-based search and recommendation systems using TensorFlow Vijay Agneeswaran (Walmart Labs), Abhishek Kumar (Publicis Sapient)

1:30pm From theory to data product: Applying data science methods to effect business change Janet Forbes (T4G), Danielle Leighton (T4G), Lindsay Brin (T4G)

1A 01/02

9:00am TRAINING Expand your data science and machine learning skills with Python, R, SQL, Spark, and TensorFlow Ian Cook (Cloudera)

1A 03

9:00am TRAINING Machine learning from scratch in TensorFlow Dylan Bargteil (The Data Incubator)

1A 04/05

9:00am TRAINING Minimum viable machine learning: The applied data science bootcamp (sponsored by DXC Technology) Jerry Overton (DXC), Ashim Bose (DXC), Samir Sehovic (DXC)

1E 06

9:00am Model serving and management at scale using open source tools Dan Crankshaw (UC Berkeley RISELab)

1:30pm Apache Metron: Open source cybersecurity at scale Carolyn Duby (Cloudera)

5:00pm Opening Reception | Room: 3B | Expo Hall

12:30pm Lunch | Room: 3A

10:30am Morning Break | Room: 1A & 1E Halls

3:00pm Afternoon Break | Room: 1A & 1E Halls

9:00am-5:00pm (8h)

Apache Spark programming

Kenneth Jones (Databricks, Inc.)

Ken Jones walks you through the core APIs for using Spark, fundamental mechanisms and basic internals of the framework, SQL and other high-level data access tools, and Spark’s streaming capabilities and machine learning APIs.

9:00am-12:30pm (3h 30m) Data engineering and architecture Data Platforms

Architecting a data platform for enterprise use

Mark Madsen (Teradata), Todd Walter (Archimedata)

Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that is not subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure.

1:30pm-5:00pm (3h 30m) Data engineering and architecture Data Platforms

Architecting a next-generation data platform

Ted Malaska (Capital One), Jonathan Seidman (Cloudera)

Using Customer 360 and the internet of things as examples, Jonathan Seidman and Ted Malaska explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, including components like Kafka, Flink, Kudu, Spark Streaming, and Spark SQL, as well as modern storage engines, to enable new forms of data processing and analytics.

9:00am-5:00pm (8h)

Findata Day

Alistair Croll (Solve For Interesting), Robert Passarella (Alpha Features), Amro Alkhatib (National Health Insurance Company-Daman), Mridul Mishra (Fidelity Investments), Patrick Angeles (Cloudera), James Psota (Panjiva), Andreas Kohlmaier (Munich Re), Paul Lashmet (Arcadia Data), Nick Curcuru (Mastercard), Robin Way (Corios), Theresa Johnson (Airbnb), Jane Tran (Unqork), Swatee Singh (American Express)

From analyzing risk and detecting fraud to predicting payments and improving customer experience, take a deep dive into the ways data technologies are transforming the financial industry.

9:00am-12:30pm (3h 30m) Data-driven business management, Strata Business Summit Machine Learning in the enterprise

Managing data science in the enterprise

Joshua Poduska (Domino Data Lab), Patrick Harrison (S&P Global)

The honeymoon era of data science is ending, and accountability is coming. Successful data science leaders deliver measurable impact on an increasing share of an enterprise’s KPIs. Joshua Poduska and Patrick Harrison detail how leading organizations have taken a holistic approach to people, process, and technology to build a sustainable competitive advantage.

1:30pm-5:00pm (3h 30m) Data science and machine learning Deep Learning, Temporal data and time-series analytics

Recurrent neural networks for time series analysis

Bruno Gonçalves (Data For Science)

Time series are everywhere around us. Understanding them requires taking into account the sequence of values seen in previous steps and even long-term temporal correlations. Join Bruno Gonçalves to learn how to use recurrent neural networks to model and forecast time series and discover the advantages and disadvantages of recurrent neural networks with respect to more traditional approaches.

9:00am-5:00pm (8h)

Real-time systems with Spark Streaming and Kafka

Jesse Anderson (Big Data Institute)

To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data and how will you process that much data? Jesse Anderson explores the latest real-time frameworks and explains how to choose the right one for your company.

9:00am-12:30pm (3h 30m) Data science and machine learning

Building a large-scale machine learning application using Amazon SageMaker and Spark

David Arpin (Amazon Web Services)

David Arpin walks you through building a machine learning application, from data manipulation to algorithm training to deployment to a real-time prediction endpoint, using Spark and Amazon SageMaker.

1:30pm-5:00pm (3h 30m) Data science and machine learning

Data science with Unix power tools

Jeroen Janssens (Data Science Workshops)

The Unix command line remains an amazing environment for efficiently performing tedious but essential data science tasks. By combining small, powerful command-line tools, you can quickly scrub, explore, and model your data as well as hack together prototypes. Join Jeroen Janssens for a hands-on workshop based on his book Data Science at the Command Line.

9:00am-12:30pm (3h 30m) Data science and machine learning Deep Learning, Text and Language processing and analysis

Deep learning methods for natural language processing

Garrett Hoffman (StockTwits)

Garrett Hoffman walks you through deep learning methods for natural language processing and natural language understanding tasks, using a live example in Python and TensorFlow with StockTwits data. Methods include word2vec, recurrent neural networks and variants (LSTM, GRU), and convolutional neural networks.

1:30pm-5:00pm (3h 30m) Data science and machine learning Text and Language processing and analysis

Natural language understanding at scale with Spark NLP

David Talby (Pacific AI), Claudiu Branzan (Accenture), Alex Thomas (John Snow Labs)

David Talby, Claudiu Branzan, and Alex Thomas lead a hands-on tutorial for scalable NLP using the highly performant, highly scalable open source Spark NLP library. You’ll spend about half your time coding as you work through four sections, each with an end-to-end working codebase that you can change and improve.

9:00am-12:30pm (3h 30m) Ethics and Privacy, Health and Medicine

Practical techniques for interpreting machine learning models

Patrick Hall (bnh.ai | H2O.ai), Avni Wadhwa (H2O.ai), Mark Chan (H2O.ai)

Transparency, auditability, and stability are crucial for business adoption and human acceptance of complex machine learning models. Patrick Hall, Avni Wadhwa, and Mark Chan share practical and productizable approaches for explaining, testing, and visualizing machine learning models using open source, Python-friendly tools such as GraphViz, H2O, and XGBoost.

1:30pm-5:00pm (3h 30m) Data engineering and architecture, Streaming systems & real-time applications

Hands-on Kafka streaming microservices with Akka Streams and Kafka Streams

Dean Wampler (Anyscale), Boris Lublinsky (Lightbend)

Dean Wampler and Boris Lublinsky walk you through building streaming apps as microservices using Akka Streams and Kafka Streams. Dean and Boris discuss the strengths and weaknesses of each tool for particular design needs and contrast them with Spark Streaming and Flink, so you'll know when to choose them instead. You'll also discover a few ML model serving ideas along the way.

9:00am-12:30pm (3h 30m) Data engineering and architecture, Streaming systems & real-time applications

Stream processing with Kafka and KSQL

Tim Berglund (Confluent)

Tim Berglund leads this solid introduction to Apache Kafka as a streaming data platform. You'll cover the internal architecture, APIs, and platform components like Kafka Connect and Kafka Streams, then finish with an exercise processing streaming data using KSQL, the new SQL-like declarative stream processing language for Kafka.

1:30pm-5:00pm (3h 30m) Data science and machine learning Deep Learning

Leveraging Spark and deep learning frameworks to understand data at scale

Vartika Singh (Cloudera), Alan Silva (Cloudera), Alex Bleakley (Cloudera), Steven Totman (Cloudera), Mirko Kämpf (Cloudera), Syed Nasar (Cloudera)

Vartika Singh, Alan Silva, Alex Bleakley, Steven Totman, Mirko Kämpf, and Syed Nasar outline approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, text, etc.) that leverage Spark, its extended ecosystem of libraries, and deep learning frameworks.

9:00am-12:30pm (3h 30m) Visualization and user experience

Making interactive browser-based visualizations easy in Python

James Bednar (Anaconda)

Python lets you solve data science problems by stitching together packages from the Python ecosystem, but it can be difficult to assemble the right tools to solve real-world problems. James Bednar walks you through using the 15+ packages covered by the new PyViz.org initiative to make it simple to build interactive plots and dashboards, even for large, streaming, and highly multidimensional data.

1:30pm-5:00pm (3h 30m) Data engineering and architecture Model lifecycle management

From training to serving: Deploying TensorFlow models with Kubernetes

Brian Foo (Google), Holden Karau (Independent), Jay Smith (Google)

TensorFlow and Keras are popular libraries for training deep models due to hardware accelerator support. Brian Foo, Jay Smith, and Holden Karau explain how to bring deep learning models from training to serving in a cloud production environment. You'll learn how to unit-test, export, package, deploy, optimize, serve, monitor, and test models using Docker and TensorFlow Serving in Kubernetes.

9:00am-5:00pm (8h)

Data Case Studies

Paco Nathan (derwen.ai), Katharina Warzel (EveryMundo), Mike Berger (Mount Sinai Health System), Sam Helmich (Deere & Company), Stephanie Fischer (datanizing GmbH), Maryam Jahanshahi (TapRecruit), Greg Quist (SmartCover Systems), Ann Nguyen (Whole Whale), Steve Otto (Navistar), Jennifer Lim (Cerner), S Anand (Gramener), Ian Brooks (Cloudera)

Hear practical insights from household brands and global companies: the challenges they tackled, approaches they took, and the benefits—and drawbacks—of their solutions.

9:00am-12:30pm (3h 30m) Data engineering and architecture, Law, ethics, governance Data preparation, governance and privacy, Ethics and Privacy

Getting ready for GDPR: Securing and governing hybrid, cloud, and on-premises big data deployments, step by step

Mark Donsky (Okera), Syed Rafice (Cloudera), Mubashir Kazia (Cloudera), Ifigeneia Derekli (Cloudera), Camila Hiskey (Cloudera)

New regulations such as GDPR are driving new compliance, governance, and security challenges for big data. Infosec and security groups must ensure a consistently secured and governed environment across multiple workloads. Mark Donsky, Syed Rafice, Mubashir Kazia, Ifigeneia Derekli, and Camila Hiskey share hands-on best practices for meeting these challenges, with special attention paid to GDPR.

1:30pm-5:00pm (3h 30m) Data science and machine learning Ethics and Privacy

How to be fair: A tutorial for beginners

Aileen Nielsen (Skillman Consulting)

There is mounting evidence that the widespread deployment of machine learning and artificial intelligence in business and government applications is reproducing or even amplifying existing prejudices and social inequalities. Aileen Nielsen demonstrates how to identify and avoid bias and other unfairness in your analyses.

9:00am-12:30pm (3h 30m) Data engineering and architecture, Streaming systems & real-time applications Data Platforms

Designing modern streaming data applications

Arun Kejariwal (Independent), Karthik Ramasamy (Streamlio)

Arun Kejariwal and Karthik Ramasamy lead a journey through the landscape of state-of-the-art systems for each stage of an end-to-end data processing pipeline, covering messaging frameworks, streaming computing frameworks, storage frameworks for real-time data, and more. They also share case studies from IoT, gaming, and healthcare, along with their experience operating these systems at internet scale.

1:30pm-5:00pm (3h 30m) Big data and data science in the cloud, Data engineering and architecture

Building your first big data application on AWS

Jorge Lopez (Amazon Web Services), Radhika Ravirala (Amazon Web Services), Paul Sears (Amazon Web Services), Bruno Faria (Amazon Web Services)

Want to learn how to use Amazon's big data web services to launch your first big data application in the cloud? Jorge Lopez, Radhika Ravirala, Paul Sears, and Bruno Faria walk you through building a big data application using a combination of open source technologies and AWS managed services.

9:00am-12:30pm (3h 30m) Data science and machine learning

Learning machine learning using astronomy datasets

Viviana Acquaviva (CUNY New York City College of Technology)

Using interesting, diverse publicly available datasets and actual problems in astronomy research, Viviana Acquaviva leads an intermediate tutorial on machine learning. You'll learn how to customize algorithms and evaluation metrics required by scientific applications and discover best practices for choosing, developing, and evaluating machine learning algorithms in "real-world" datasets.

1:30pm-5:00pm (3h 30m) Big data and data science in the cloud, Data engineering and architecture

Running multidisciplinary big data workloads in the cloud

Sudhanshu Arora (Cloudera), Stefan Salandy (Cloudera), Suraj Acharya (Cloudera), Brandon Freeman (Cloudera), Jason Wang (Cloudera), Shravan Pabba (Cloudera)

Attend this tutorial to learn how to successfully run a data analytics pipeline in the cloud and integrate data engineering and data analytic workflows. You'll also explore considerations and best practices for data analytics pipelines in the cloud and see how to share metadata across workloads in a big data PaaS.

9:00am-5:00pm (8h)

Hands-on data science with Python

Zachary Glassman (The Data Incubator)

Zachary Glassman leads a hands-on dive into building intelligent business applications using machine learning, walking you through all the steps of developing a machine learning pipeline. You'll explore data cleaning, feature engineering, model building and evaluation, and deployment and extend these models into two applications using a real-world dataset.

9:00am-12:30pm (3h 30m) Data science and machine learning Deep Learning, Recommendation Systems

Deep learning-based search and recommendation systems using TensorFlow

Vijay Agneeswaran (Walmart Labs), Abhishek Kumar (Publicis Sapient)

Abhishek Kumar and Vijay Srinivas Agneeswaran offer an introduction to deep learning-based recommendation and learning-to-rank systems using TensorFlow. You'll learn how to build a recommender system based on intent prediction using deep learning, drawing on a real-world implementation for an ecommerce client.

1:30pm-5:00pm (3h 30m) Data-driven business management, Strata Business Summit Machine Learning in the enterprise

From theory to data product: Applying data science methods to effect business change

Janet Forbes (T4G), Danielle Leighton (T4G), Lindsay Brin (T4G)

Janet Forbes, Danielle Leighton, and Lindsay Brin lead a primer on crafting well-conceived data science projects that uncover valuable business insights. Using case studies and hands-on skills development, Janet, Danielle, and Lindsay walk you through essential techniques for effecting real business change.

9:00am-5:00pm (8h)

Expand your data science and machine learning skills with Python, R, SQL, Spark, and TensorFlow

Ian Cook (Cloudera)

Advancing your career in data science requires learning new languages and frameworks—but learners face an overwhelming array of choices, each with different syntaxes, conventions, and terminology. Ian Cook simplifies the learning process by elucidating the abstractions common to these systems. Through hands-on exercises, you'll overcome obstacles to getting started using new tools.

9:00am-5:00pm (8h) Deep Learning

Machine learning from scratch in TensorFlow

Dylan Bargteil (The Data Incubator)

The TensorFlow library provides for the use of data flow graphs for numerical computations, with automatic parallelization across several CPUs or GPUs. This architecture makes it ideal for implementing neural networks and other machine learning algorithms. Dylan Bargteil introduces TensorFlow's capabilities through its Python interface.

9:00am-5:00pm (8h) Sponsored, Strata Business Summit

Minimum viable machine learning: The applied data science bootcamp (sponsored by DXC Technology)

Jerry Overton (DXC), Ashim Bose (DXC), Samir Sehovic (DXC)

Acquiring machine learning (ML) technology is relatively straightforward, but ML must be applied to be useful. In this one-day boot camp that is equal parts hackathon, presentation, and group participation, Jerry Overton, Ashim Bose, and Samir Sehovic teach you how to apply advanced analytics in ways that reshape the enterprise and improve outcomes.

9:00am-12:30pm (3h 30m) Data science and machine learning Model lifecycle management

Model serving and management at scale using open source tools

Dan Crankshaw (UC Berkeley RISELab)

Dan Crankshaw offers an overview of the current challenges in deploying machine learning applications into production and the current state of prediction serving infrastructure. He then leads a deep dive into the Clipper serving system and shows you how to get started.

1:30pm-5:00pm (3h 30m) Data engineering and architecture, Platform security and cybersecurity

Apache Metron: Open source cybersecurity at scale

Carolyn Duby (Cloudera)

Carolyn Duby shows you how to find the cybersecurity threat needle in your event haystack using Apache Metron: a real-time, horizontally scalable open source platform. After this interactive overview of the platform's major features, you'll be ready to analyze your own haystack back at the office.

5:00pm-6:30pm (1h 30m)

Opening Reception

Enjoy delicious snacks and beverages with fellow Strata attendees, speakers, and sponsors at the Opening Reception, happening immediately after tutorials on Tuesday.

12:30pm-1:30pm (1h)

Break: Lunch

10:30am-11:00am (30m)

Break: Morning Break

3:00pm-3:30pm (30m)

Break: Afternoon Break

Sponsorship Opportunities

For exhibition and sponsorship opportunities, email strataconf@oreilly.com

Partner Opportunities

For information on trade opportunities with O'Reilly conferences, email partners@oreilly.com

Contact Us

View a complete list of Strata Data Conference contacts

©2018, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • confreg@oreilly.com
