Ensemble machine-learning methods are often used when the true prediction function is not easily approximated by a single algorithm. Practitioners may prefer ensemble algorithms when model performance is valued above other factors such as model complexity and training time. Because they are flexible and can outperform individual models, ensembles have been used to win many Kaggle competitions.
Erin LeDell covers the basics of ensemble learning and offers an introduction to H2O, a scalable open source machine-learning library. Erin then gives a demonstration of the H2O Ensemble package, which reduces the computational burden of ensemble learning while retaining superior model performance.
The H2O Ensemble software implements the Super Learner, or stacking, ensemble algorithm, using distributed base learning algorithms from H2O. The Super Learner algorithm learns the optimal combination of the base learner fits. (The 2007 article “Super Learner” demonstrates why the Super Learner ensemble represents an asymptotically optimal system for learning.) Erin dives into these advanced topics and provides code demos for attendees to try out on their own.
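To make the stacking idea concrete, here is a minimal sketch in plain Python. It is not the H2O Ensemble API or its implementation: the three toy base learners, the synthetic data, and every function name are assumptions made for illustration. The metalearner is also a simplification, a coarse grid search over convex weights, where a real Super Learner would typically fit a regression-based metalearner. What the sketch does show is the core recipe: generate cross-validated predictions from each base learner, then learn the combination of those predictions that minimizes cross-validated error.

```python
from itertools import product
from statistics import mean

# --- Toy base learners (hypothetical): each returns a prediction function ---
def fit_mean(xs, ys):
    m = mean(ys)
    return lambda x: m

def fit_linear(xs, ys):
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)

def fit_nearest(xs, ys):
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

BASE_LEARNERS = [fit_mean, fit_linear, fit_nearest]

def mse(preds, ys):
    return mean((p - y) ** 2 for p, y in zip(preds, ys))

def cross_validated_predictions(xs, ys, folds=5):
    """Build the 'level-one' data: each row holds one held-out prediction
    per base learner for the corresponding observation."""
    n = len(xs)
    Z = [[0.0] * len(BASE_LEARNERS) for _ in range(n)]
    for k in range(folds):
        train = [i for i in range(n) if i % folds != k]
        held_out = [i for i in range(n) if i % folds == k]
        tx, ty = [xs[i] for i in train], [ys[i] for i in train]
        for j, fit in enumerate(BASE_LEARNERS):
            model = fit(tx, ty)
            for i in held_out:
                Z[i][j] = model(xs[i])
    return Z

def fit_metalearner(Z, ys, steps=10):
    """Simplified metalearner (an assumption of this sketch): grid search
    over non-negative weights that sum to 1, minimizing cross-validated MSE."""
    best_w, best_err = None, float("inf")
    for combo in product(range(steps + 1), repeat=len(Z[0])):
        if sum(combo) != steps:
            continue
        w = [c / steps for c in combo]
        preds = [sum(wj * zj for wj, zj in zip(w, row)) for row in Z]
        err = mse(preds, ys)
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err

def fit_super_learner(xs, ys, folds=5):
    Z = cross_validated_predictions(xs, ys, folds)
    weights, cv_err = fit_metalearner(Z, ys)
    # Refit every base learner on the full data, then combine with the weights.
    models = [fit(xs, ys) for fit in BASE_LEARNERS]
    predict = lambda x: sum(w * m(x) for w, m in zip(weights, models))
    return predict, weights, cv_err, Z

# Synthetic data, made up for this example.
xs = [i / 10 for i in range(40)]
ys = [x * x + (0.5 if i % 3 == 0 else -0.25) for i, x in enumerate(xs)]

predict, weights, cv_err, Z = fit_super_learner(xs, ys)
print("weights:", weights, "cross-validated MSE:", cv_err)
```

Because the weight grid includes the corners of the simplex (all weight on a single learner), the ensemble's cross-validated error in this sketch can never exceed that of the best individual base learner, which is a small-scale version of the guarantee the abstract alludes to.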
Erin LeDell is a statistician and machine-learning scientist at H2O.ai and the main author of H2O Ensemble. Before joining H2O, she was the principal data scientist at Wise.io and Marvin Mobile Security (acquired by Veracode in 2012) and the founder of DataScientific, Inc. Erin received her PhD in biostatistics from the University of California, Berkeley, with a designated emphasis in computational science and engineering. Her research focuses on ensemble machine learning, learning from imbalanced binary-outcome data, influence-curve-based variance estimation, and statistical computing. Erin also holds a BS and MA in mathematics.