Andy Konwinski introduces you to Apache Spark 2.0 core concepts with a focus on Spark’s machine-learning library, using text mining on real-world data as the primary end-to-end use case.
Join Andy to explore and wrangle data using Spark’s Dataset and DataFrame abstractions. You’ll use the Spark ML API to build an ML pipeline that transforms free text into useful features via Spark ML’s Transformer abstraction (e.g., one-hot encoding and term frequency counting), and you’ll learn about model selection, training/fitting, validation/inspection, and parameter tuning with grid search.
The class will consist of approximately 50% hands-on programming labs in Scala and 50% lecture and discussion.
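To give a flavor of the workflow the abstract describes, here is a minimal sketch of a text pipeline with grid-search tuning in Scala. It is not taken from the course materials. It assumes Spark 2.x with the `spark-mllib` module on the classpath. The object name `TextMiningSketch`, the toy sentences, the column names, and the grid values are all made up for illustration. `HashingTF` stands in for term frequency counting, and one-hot encoding is left out for brevity.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession

// Hypothetical example object; every name and value below is illustrative.
object TextMiningSketch {
  // Transformer: split free text into lowercase words.
  val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
  // Transformer: hash words into a term-frequency feature vector.
  val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
  // Estimator: a simple classifier trained on the hashed features.
  val lr = new LogisticRegression().setMaxIter(10)
  // The pipeline chains the two transformers and the estimator.
  val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TextMiningSketch")
      .master("local[*]") // assumption: local mode for experimentation
      .getOrCreate()

    // Made-up labeled documents: 1.0 = mentions Spark, 0.0 = does not.
    val training = spark.createDataFrame(Seq(
      (0L, "spark makes text mining simple", 1.0),
      (1L, "the weather is nice today", 0.0),
      (2L, "spark ml pipelines chain transformers", 1.0),
      (3L, "i had coffee this morning", 0.0),
      (4L, "dataframes in spark are convenient", 1.0),
      (5L, "the train was late again", 0.0),
      (6L, "tuning models with spark is fun", 1.0),
      (7L, "my cat sleeps all day", 0.0),
      (8L, "apache spark runs on clusters", 1.0),
      (9L, "dinner was pasta and salad", 0.0),
      (10L, "spark summit talks are online", 1.0),
      (11L, "the garden needs watering", 0.0)
    )).toDF("id", "text", "label")

    // Grid search: try every combination of these parameter values.
    val paramGrid = new ParamGridBuilder()
      .addGrid(hashingTF.numFeatures, Array(100, 1000))
      .addGrid(lr.regParam, Array(0.1, 0.01))
      .build()

    // Cross-validation fits the whole pipeline for each grid point
    // and keeps the best model according to the evaluator.
    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(2)

    val cvModel = cv.fit(training)

    // Inspect predictions on a few unseen, made-up documents.
    val unseen = spark.createDataFrame(Seq(
      (12L, "spark streaming example"),
      (13L, "lunch at noon")
    )).toDF("id", "text")
    cvModel.transform(unseen).select("text", "prediction").show(false)

    spark.stop()
  }
}
```

Because the tokenizer and feature hasher sit inside the pipeline, cross-validation refits them for every parameter combination, so feature settings such as `numFeatures` can be tuned alongside the classifier's own parameters.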
Andy Konwinski is a founder and VP at Databricks. He has been working on Spark since the early days of the project, starting during his PhD in the UC Berkeley AMPLab, and has contributed as a software engineer to Spark’s performance evaluation components, testing infrastructure, documentation, and more. He was also a creator of the Apache Mesos project, contributed to the Hadoop Job Scheduler, and led the creation of the UC Berkeley AMP Camps and the Spark Summits. Andy coauthored Learning Spark from O’Reilly.
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with, and does not endorse or review, the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.