Apache Spark continues to gain momentum as the new processing paradigm for Apache Hadoop, and for the data scientist, there is a lot to like: it is natively distributed, offers an interactive REPL and Python APIs in addition to its native Scala API, and includes a library of machine learning algorithms, MLlib.
Spark 1.2 includes an implementation of random decision forests, an important and popular ensemble classifier/regressor algorithm. This talk will introduce Spark, Scala, and random decision forests to the curious, and demonstrate the process of analyzing a real-world data set with them. The session will cover loading data and understanding the data set, and introduce ideas like training and test set evaluation, ensemble methods, feature types, and supporting concepts like impurity and entropy.
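As a rough illustration of the impurity and entropy concepts mentioned above, here is a minimal sketch in plain Python. It is not MLlib's implementation; the function names (`entropy`, `gini`, `information_gain`) and the toy labels are assumptions made for this example. A decision tree would typically prefer the candidate split that most reduces impurity.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def information_gain(parent, left, right, impurity=entropy):
    """Impurity of the parent minus the size-weighted impurity of its two children."""
    total = len(parent)
    weighted = ((len(left) / total) * impurity(left)
                + (len(right) / total) * impurity(right))
    return impurity(parent) - weighted


# Hypothetical toy data: ten examples, split into two groups of five.
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"]
right = ["yes"] + ["no"] * 4

print(entropy(parent))                        # impurity before the split
print(gini(parent))                           # same labels, Gini measure
print(information_gain(parent, left, right))  # reduction achieved by the split
```

An evenly mixed set of two classes has the highest possible impurity, while a set containing a single class has none, so a split that separates the classes well yields a positive gain.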
- Become familiar with Spark basics using its Scala API
- Understand the decision tree and random decision forest algorithms
- See a simple, narrated data science workflow in action on a real data set
Sean is director of data science for EMEA at Cloudera. Previously, Sean founded Myrrix Ltd, producing a real-time recommender and clustering product evolved from Mahout. Myrrix is now part of Cloudera. Sean was a primary author of recommender components in Apache Mahout, and has been a committer and PMC member for the project. He is co-author of Advanced Analytics with Spark and Mahout in Action. Sean was previously a senior engineer at Google.
©2015, O’Reilly UK Ltd • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • email@example.com
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with, and does not endorse or review, the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.