Presented By O'Reilly and Cloudera
Make Data Work
September 26–27, 2016: Training
September 27–29, 2016: Tutorials & Conference
New York, NY

Data science at scale: Using Spark and Hadoop (Day 2)

Bruce Martin (Cloudera)
9:00am–5:00pm Tuesday, 09/27/2016
Location: 1 C03
Average rating: 4.67 (3 ratings)

Prerequisite knowledge

This course is intended for two audiences: advanced analysts without distributed computing skills and engineers proficient in distributed computing without advanced analytical skills. Students should be comfortable with the Linux command line and have proficiency in a scripting language; Python is strongly preferred, but familiarity with Perl or Ruby is sufficient.


Data scientists build information platforms to provide deep insight and answer previously unimaginable questions. Spark and Hadoop are transforming how data scientists work by allowing interactive and iterative data analysis at scale. Learn how Spark and Hadoop enable data scientists to help companies reduce costs, increase profits, improve products, retain customers, and identify new opportunities.

The instructor explores what data scientists do, the problems they solve, and the tools and techniques they use. Through in-class simulations and exercises, attendees apply data science methods to real-world challenges in different industries, gaining both preparation for and hands-on experience of data scientist roles in the field.

Topics include:

  • How to identify potential business use cases where data science can provide impactful results
  • How to obtain, clean, and combine disparate data sources to create a coherent picture for analysis
  • What statistical methods to leverage for data exploration that will provide critical insight into your data
  • Where and when to leverage Hadoop streaming and Apache Spark for data science pipelines
  • Which machine-learning technique to use for a particular data science project
  • How to implement and manage recommenders using Spark’s MLlib and how to set up and evaluate data experiments
  • The pitfalls of deploying new analytics projects to production at scale

Bruce Martin


Bruce Martin is a senior instructor at Cloudera, where he teaches courses on data science, Apache Spark, Apache Hadoop, and data analysis. Previously, Bruce was principal architect and director of advanced concepts at SunGard Higher Education, where he developed the software architecture for SunGard’s Course Signals Early Intervention System, which uses machine learning algorithms to predict the success of students enrolled in university courses. His other roles have included senior staff engineer at Sun Microsystems and researcher at Hewlett-Packard Laboratories. Bruce has written many papers on data management and distributed system technologies and frequently presents his work at academic and industrial conferences. Bruce has authored patents on distributed object technologies. He holds a PhD and master’s degree in computer science from the University of California, San Diego, and a bachelor’s degree in computer science from the University of California, Berkeley.