There’s a lot of hype around data science in the enterprise. With Apache Hadoop and Apache Spark, it seems it should be easy for data scientists to use massive amounts of new data and compute to deliver better machine-learning models faster. But in reality, most data science still runs on a laptop, not on an enterprise data platform. The problem is the mismatch between typical enterprise requirements for a shared environment—security, governance, meeting job SLAs—and the practical needs of a data scientist, such as the ability to use popular R and Python packages, the freedom to customize the environment, and integration with versioning and scheduling tools.
As a result, enterprise data science teams and data platform teams often end up working in isolation from each other, and both lose: models built on small data can still take months to deploy, while the resulting data silos increase both costs and security risks. Meeting this challenge is complex and requires a novel full-stack approach, one that can meet the needs of both idiosyncratic data scientists and the platform teams who support them.
Matt Brandwein and Tristan Zajonc explore the common, specific, real-world technical challenges facing both audiences and discuss relevant improvements coming to the Hadoop ecosystem. Along the way, they cover best practices for configuring a data science environment and introduce new tools designed to make self-service data science a reality.
Matt Brandwein leads the machine learning product team at Cloudera, guiding the platform experience for data scientists and data engineers, including products like Cloudera Data Science Workbench. Previously, he led Cloudera’s product marketing team, with roles spanning product, solution, and partner marketing, and built enterprise search and data discovery products at Endeca/Oracle. Matt holds degrees in computer science and mathematics from the University of Massachusetts Amherst.
Tristan Zajonc is CTO for machine learning at Cloudera. Previously, Tristan led engineering for Cloudera Data Science Workbench and was the cofounder and CEO of enterprise data science platform Sense (acquired by Cloudera in 2016). He has over 15 years’ experience in applied data science, machine learning, and machine learning systems development across academia and industry and holds a PhD from Harvard University.
©2017, O'Reilly Media, Inc. • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.