Many iterative machine-learning algorithms can only operate efficiently when a large matrix of training data fits in main memory. Running these algorithms over big data requires many machines with large amounts of RAM, which quickly becomes expensive. Compressing the matrices with general-purpose algorithms like gzip doesn’t improve performance, because decompression speed is on par with the speed of reading data from disk.
Frederick Reiss and Arvind Surve offer an overview of compressed linear algebra, a technique for compressing training data and performing key operations in the compressed domain that lets you build models over big data with small machines. Compressed linear algebra uses actionable compression to represent matrices of training data. Unlike general-purpose compression, actionable compression allows operations to proceed directly over the compressed data. Frederick and Arvind show that it is possible to implement critical linear algebra operations in the compressed domain, delivering performance that matches, and in some cases greatly exceeds, conventional numerical libraries operating over uncompressed data.
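To make the idea of operating directly over compressed data concrete, here is a minimal sketch using a simple run-length encoding (RLE) of matrix columns. It is only illustrative: SystemML's compressed linear algebra uses more sophisticated column-compression schemes (dictionary-based encodings such as OLE, RLE, and DDC), and the function names below are invented for this example.

```python
import numpy as np

def rle_compress_column(col):
    """Encode a column as (values, run_lengths) over consecutive repeats."""
    values, runs = [], []
    for x in col:
        if values and values[-1] == x:
            runs[-1] += 1
        else:
            values.append(x)
            runs.append(1)
    return np.array(values), np.array(runs)

def matvec_compressed(compressed_cols, v):
    """Compute X @ v directly on the RLE-compressed columns of X."""
    n_rows = int(compressed_cols[0][1].sum())
    out = np.zeros(n_rows)
    for j, (values, runs) in enumerate(compressed_cols):
        pos = 0
        for val, run in zip(values, runs):
            # One multiply per run instead of one per row --
            # the matrix is never decompressed.
            out[pos:pos + run] += val * v[j]
            pos += run
    return out

# Training matrices often have many repeated values per column,
# which is what makes this kind of compression effective.
X = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [1.0, 2.0],
              [3.0, 2.0]])
cols = [rle_compress_column(X[:, j]) for j in range(X.shape[1])]
v = np.array([2.0, 0.5])
print(matvec_compressed(cols, v))  # matches X @ v
```

When columns have long runs of repeated values, the compressed product touches far fewer values than the dense one, which is the intuition behind the speedups the talk reports.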
Frederick and Arvind then describe an end-to-end implementation of compressed linear algebra in Apache SystemML, a language and system for implementing scalable machine-learning algorithms on Apache Spark and Hadoop MapReduce. Incorporating compressed linear algebra into SystemML’s runtime and optimizer can achieve performance improvements of more than 25x with no changes to the algorithm code. Frederick and Arvind start with a brief description of Apache SystemML before using instructive examples in SystemML’s R-like domain-specific language to describe the problem of fitting large training sets into main memory. Frederick and Arvind conclude with detailed, end-to-end performance results involving key machine-learning algorithms and reference datasets.
Fred Reiss is chief architect and one of the founding employees of the IBM Spark Technology Center in San Francisco. Previously, Fred worked for IBM Research Almaden for nine years, where he worked on the SystemML and SystemT projects as well as on the research prototype of DB2 with BLU Acceleration. He has over 25 peer-reviewed publications and six patents. Fred holds a PhD from UC Berkeley.
Arvind Surve is a data scientist and architect in AI and ML in the IBM Analytics group. Arvind is a SystemML contributor and committer and has worked at IBM for more than 19 years. He has presented at the 2015 Data Engineering Conference in Tokyo, at the Strata Data Conference 2017 in San Jose, and to the Chicago Spark User Group. He holds an MS in digital electronics and communication systems and an MBA in finance and marketing.
©2017, O'Reilly Media, Inc. All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with, and does not endorse or review, the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.