Spark’s efficiency and speed can help big data administrators reduce the total cost of ownership (TCO) of their existing clusters, because Spark’s performance advantages let it complete processing in drastically shorter batch windows with higher performance per dollar. Raj Krishnamurthy offers a detailed walk-through of tuning an alternating least squares (ALS)-based matrix factorization workload. Using the methodology he describes, Raj has been able to improve runtimes by a factor of 2.22.
Spark has a large number of tunables, and the number of possible configurations grows multiplicatively with each one. A purely bottom-up search for the optimal runtime, for example by varying the number of Spark workers and the cores per worker, can therefore cause an explosion of tuning runs for a given workload. The methodology discussed uses a hybrid top-down and bottom-up approach that searches the configuration space selectively and so reduces the combinatorial explosion of tuning runs. It has also been applied successfully to complex Spark workflows consisting of Spark SQL and ML Pipelines, where it achieved substantial performance improvements, and to a variety of other cluster architectures.
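To make the multiplicative growth concrete, the following is a minimal sketch that counts tuning runs for a small configuration space. The tunable names and candidate values are hypothetical and chosen only for illustration. The one-at-a-time sweep is shown purely as a point of contrast; it is not the methodology presented in the talk, whose details are not given here.

```python
from itertools import product
from math import prod

# Hypothetical tunables and candidate values (assumptions for illustration;
# real choices depend on the workload and the cluster).
tunables = {
    "workers": [1, 2, 4, 8],
    "cores_per_worker": [1, 2, 4, 8, 16],
    "executor_memory_gb": [4, 8, 16, 32],
    "shuffle_partitions": [100, 200, 400, 800],
}


def exhaustive_runs(space):
    """Number of tuning runs if every combination of values is tried."""
    return prod(len(values) for values in space.values())


def one_at_a_time_runs(space):
    """Runs needed when each tunable is swept alone from a fixed baseline.

    One baseline run, plus one run for every non-baseline value of
    every tunable.
    """
    return 1 + sum(len(values) - 1 for values in space.values())


# Enumerating the full grid confirms the count is the product of the sizes.
full_grid = list(product(*tunables.values()))

print("exhaustive grid runs:", exhaustive_runs(tunables))
print("one-at-a-time runs:  ", one_at_a_time_runs(tunables))
```

Adding a fifth tunable with four candidate values would multiply the exhaustive count by four but add only three runs to the one-at-a-time sweep, which is why any selective search strategy pays off quickly as the number of tunables grows.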
Raj Krishnamurthy designs and develops system stacks consisting of software and hardware elements for emerging and contemporary data analytics workloads. He has been a technical staff member in the Enterprise Systems division at IBM since 2006. His work has impacted several platforms, software products, and roadmaps at IBM, on both mainframes and Power Systems. Raj holds 76+ patents (with 60+ still pending) and has written a number of external peer-reviewed publications. Raj holds a PhD in computer science and MS and BS degrees in electrical engineering.