Presented By O'Reilly and Cloudera
Make Data Work
September 26–27, 2016: Training
September 27–29, 2016: Tutorials & Conference
New York, NY

Tuning Spark machine-learning workloads

2:05pm–2:45pm Wednesday, 09/28/2016
Spark & beyond
Location: Hall 1B Level: Intermediate
Average rating: 3.75 (4 ratings)

Prerequisite knowledge

  • Familiarity with Spark, big data, and machine learning

What you'll learn

  • Understand how Spark's efficiency and performance can reduce the cost of big data workloads
  • Learn how Spark tunables provide additional opportunities to reduce total cost of ownership (TCO) for businesses by increasing performance
  • Explore a systematic, hybrid top-down methodology that can help reduce tuning iterations

Description

    Spark’s efficiency and speed can help big data administrators reduce the total cost of ownership (TCO) of their existing clusters, because Spark’s performance advantages let it complete processing in much shorter batch windows with higher performance per dollar. Raj Krishnamurthy offers a detailed walk-through of an alternating least squares-based matrix factorization workload and the methodology used to tune it. Using this methodology, Raj has been able to improve runtimes by a factor of 2.22.

    Because Spark has a large number of tunables, a bottom-up search for the optimal runtime that varies the number of Spark workers and the cores per worker can create an explosion of tuning runs for a given workload, since the number of possible configurations grows multiplicatively. The methodology discussed uses a hybrid top-down approach that searches the configuration space selectively and reduces this combinatorial explosion of tuning runs. It has also been applied successfully to complex Spark workflows consisting of Spark SQL and ML Pipelines, with substantial performance improvements, and to a variety of other cluster architectures.
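
    To make the multiplicative growth described above concrete, here is a minimal Python sketch. It is not the methodology from the talk, which the abstract does not spell out. The setting names are real Spark settings, but the candidate values, the `toy_runtime` cost model, and the one-knob-at-a-time search are assumptions chosen only to show how a more selective search cuts the number of tuning runs.

```python
# Hypothetical candidate values for four tunables. The names are real Spark
# settings; the values are invented for illustration.
SEARCH_SPACE = {
    "SPARK_WORKER_INSTANCES": [1, 2, 4, 8],
    "SPARK_WORKER_CORES": [2, 4, 8, 16],
    "spark.executor.memory": ["4g", "8g", "16g"],
    "spark.default.parallelism": [64, 128, 256, 512],
}


def exhaustive_runs(space):
    """Runs needed by a full grid search: the product of all list lengths."""
    total = 1
    for values in space.values():
        total *= len(values)
    return total


def one_at_a_time_runs(space):
    """Runs needed when varying one knob at a time: a baseline run plus
    one run for every non-baseline value of every knob."""
    return 1 + sum(len(values) - 1 for values in space.values())


def coordinate_search(space, measure):
    """Tune knobs one after another, keeping the best value seen so far.

    `measure` maps a configuration dict to a runtime (lower is better).
    Returns (best_config, best_runtime, number_of_runs).
    """
    config = {name: values[0] for name, values in space.items()}
    best = measure(config)
    runs = 1
    for name, values in space.items():
        for value in values[1:]:
            trial = {**config, name: value}
            runtime = measure(trial)
            runs += 1
            if runtime < best:
                best, config = runtime, trial
    return config, best, runs


def toy_runtime(cfg):
    """Made-up cost model, NOT a model of Spark: more total cores help,
    while each extra worker instance adds a fixed overhead."""
    instances = cfg["SPARK_WORKER_INSTANCES"]
    cores = instances * cfg["SPARK_WORKER_CORES"]
    return 1000.0 / cores + 2.0 * instances


if __name__ == "__main__":
    print("grid search runs:", exhaustive_runs(SEARCH_SPACE))  # 4 * 4 * 3 * 4
    print("one-at-a-time runs:", one_at_a_time_runs(SEARCH_SPACE))
    print(coordinate_search(SEARCH_SPACE, toy_runtime))
```

    The trade-off is that a one-knob-at-a-time search can miss interactions between tunables, which is presumably why a hybrid approach that combines top-level guidance with targeted low-level runs is attractive.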


    Raj Krishnamurthy


    Raj Krishnamurthy designs and develops system stacks consisting of software and hardware elements for emerging and contemporary data analytics workloads. He has been a technical staff member in the Enterprise Systems division at IBM since 2006. His work has impacted several platforms, software products, and roadmaps at IBM, on both mainframes and Power Systems. Raj holds 76+ patents (with 60+ still pending) and has written a number of external peer-reviewed publications. He holds a PhD in computer science and MS and BS degrees in electrical engineering.

    Comments on this page are now closed.


    09/28/2016 10:55am EDT

    Will you be sharing the slides from the session?