Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Making self-service data science a reality

Matt Brandwein (Cloudera), Tristan Zajonc (Cloudera)
1:50pm–2:30pm Thursday, March 16, 2017
Data science & advanced analytics
Location: 210 A/E
Level: Intermediate
Average rating: 3.33 (3 ratings)

Who is this presentation for?

  • Data scientists, data engineers, and enterprise architects

Prerequisite knowledge

  • Practical experience as a data scientist, preferably using Python or R, or as an architect charged with supporting data scientists on a Hadoop-based platform

What you'll learn

  • Discover common problems to look out for when scaling data science on Hadoop
  • Learn best practices and new tools to overcome those challenges

Description

There’s a lot of hype around data science in the enterprise. With Apache Hadoop and Apache Spark, it seems it should be easy for data scientists to use massive amounts of new data and compute to deliver better machine-learning models faster. But in reality, most data science still runs on a laptop, not on an enterprise data platform. The problem is the mismatch between typical enterprise requirements for a shared environment—security, governance, meeting job SLAs—and the practical needs of a data scientist, such as the ability to use popular R and Python packages, the freedom to customize the environment, and integration with versioning and scheduling tools.

As a result, enterprise data science teams and data platform teams often end up working apart, and both lose: models built on small data can still take months to deploy, while the resulting data silos increase both costs and security risks. Meeting this challenge is complex and requires a novel full-stack approach, one that can meet the needs of both idiosyncratic data scientists and the platform teams who support them.

Matt Brandwein and Tristan Zajonc explore the common, specific, real-world technical challenges facing both audiences and discuss relevant improvements coming to the Hadoop ecosystem. Along the way, they cover best practices for configuring a data science environment and introduce new tools designed to make self-service data science a reality.

Matt Brandwein

Cloudera

Matt Brandwein leads the machine learning product team at Cloudera, guiding the platform experience for data scientists and data engineers, including products like Cloudera Data Science Workbench. Previously, he led Cloudera’s product marketing team, with roles spanning product, solution, and partner marketing, and built enterprise search and data discovery products at Endeca/Oracle. Matt holds degrees in computer science and mathematics from the University of Massachusetts Amherst.

Tristan Zajonc

Cloudera

Tristan Zajonc is CTO for machine learning at Cloudera. Previously, Tristan led engineering for Cloudera Data Science Workbench and was the cofounder and CEO of enterprise data science platform Sense (acquired by Cloudera in 2016). He has over 15 years’ experience in applied data science, machine learning, and machine learning systems development across academia and industry and holds a PhD from Harvard University.