Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Making self-service data science a reality

Matt Brandwein (Cloudera), Tristan Zajonc (Cloudera)
1:50pm–2:30pm Thursday, March 16, 2017
Data science & advanced analytics
Location: 210 A/E
Level: Intermediate
Average rating: 3.33 (3 ratings)

Who is this presentation for?

  • Data scientists, data engineers, and enterprise architects

Prerequisite knowledge

  • Hands-on experience as a data scientist, preferably with Python or R, or as an architect responsible for supporting data scientists on a Hadoop-based platform

What you'll learn

  • Discover common problems to look out for when scaling data science on Hadoop
  • Learn best practices and new tools to overcome those challenges

Description

There’s a lot of hype around data science in the enterprise. With Apache Hadoop and Apache Spark, it seems it should be easy for data scientists to use massive amounts of new data and compute to deliver better machine-learning models faster. But in reality, most data science still runs on a laptop, not on an enterprise data platform. The problem is the mismatch between typical enterprise requirements for a shared environment—security, governance, meeting job SLAs—and the practical needs of a data scientist, such as the ability to use popular R and Python packages, the freedom to customize the environment, and integration with versioning and scheduling tools.

As a result, enterprise data science teams and data platform teams often end up working separately, and both lose: models built on small data can still take months to deploy, while the resulting data silos increase both costs and security risks. Meeting this challenge is complex and requires a novel full-stack approach, one that can meet the needs of both idiosyncratic data scientists and the platform teams who support them.

Matt Brandwein and Tristan Zajonc explore the common, specific, real-world technical challenges facing both audiences and discuss relevant improvements coming to the Hadoop ecosystem. Along the way, they cover best practices for configuring a data science environment and introduce new tools designed to make self-service data science a reality.

Matt Brandwein

Cloudera

Matt Brandwein is director of product management at Cloudera, driving the platform’s experience for data scientists and data engineers. Before that, Matt led Cloudera’s product marketing team, with roles spanning product, solution, and partner marketing. Previously, he built enterprise search and data discovery products at Endeca/Oracle. Matt holds degrees in computer science and mathematics from the University of Massachusetts Amherst.

Tristan Zajonc

Cloudera

Tristan Zajonc is a senior engineering manager at Cloudera. Previously, he was cofounder and CEO of Sense, a visiting fellow at Harvard’s Institute for Quantitative Social Science, and a consultant at the World Bank. Tristan holds a PhD in public policy and an MPA in international development from Harvard and a BA in economics from Pomona College.
