Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Compressed linear algebra in Apache SystemML

4:20pm–5:00pm Thursday, March 16, 2017
Data science & advanced analytics
Location: 230 C
Level: Advanced
Secondary topics: Hardcore Data Science

Who is this presentation for?

  • Computer science researchers and advanced developers

Prerequisite knowledge

  • A solid grasp of computer programming concepts
  • Basic knowledge of computer architecture, statistics, and machine learning

What you'll learn

  • Explore compressed linear algebra and learn how it can be implemented in SystemML

Description

Many iterative machine-learning algorithms can operate efficiently only when a large matrix of training data fits in main memory. Running these algorithms over big data therefore requires large numbers of machines with large amounts of RAM, which quickly becomes very expensive. Compressing the matrices with general-purpose algorithms like gzip doesn't improve performance, because decompression speed is on par with the speed of reading data from disk.

Frederick Reiss and Arvind Surve offer an overview of compressed linear algebra, a technique for compressing training data and performing key operations in the compressed domain that lets you build models over big data with small machines. Compressed linear algebra uses actionable compression to represent matrices of training data. Unlike general-purpose compression, actionable compression allows operations to proceed directly over the compressed data. Frederick and Arvind show that it is possible to implement critical linear algebra operations in the compressed domain, delivering performance that matches, and in some cases greatly exceeds, that of conventional numerical libraries operating over uncompressed data.
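To make the idea concrete, here is a minimal sketch of actionable compression in plain Python with NumPy. It performs a matrix-vector product directly over run-length-encoded columns, one of the encodings used in compressed linear algebra; the function names and layout are illustrative only, not SystemML's actual internals.

    import numpy as np

    def rle_compress_column(col):
        """Compress one column into (value, start_row, run_length) runs."""
        runs, start = [], 0
        for i in range(1, len(col) + 1):
            if i == len(col) or col[i] != col[start]:
                runs.append((col[start], start, i - start))
                start = i
        return runs

    def compressed_matvec(compressed_cols, v, n_rows):
        """Compute y = X @ v directly over the encoded columns, without
        ever materializing the uncompressed matrix X."""
        y = np.zeros(n_rows)
        for j, runs in enumerate(compressed_cols):
            for value, start, length in runs:
                # Each run contributes value * v[j] to a contiguous row range.
                y[start:start + length] += value * v[j]
        return y

    # Training data with few distinct values per column compresses well.
    X = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 2.0], [3.0, 2.0]])
    cols = [rle_compress_column(X[:, j]) for j in range(X.shape[1])]
    v = np.array([0.5, 2.0])
    assert np.allclose(compressed_matvec(cols, v, X.shape[0]), X @ v)

Because each run contributes a single scaled update to a contiguous range of rows, fewer runs mean both less memory and less arithmetic, which is how operations over compressed data can outrun their uncompressed counterparts.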

Frederick and Arvind then describe an end-to-end implementation of compressed linear algebra in Apache SystemML, a language and system for implementing scalable machine-learning algorithms on Apache Spark and Hadoop MapReduce. Incorporating compressed linear algebra into SystemML's runtime and optimizer can yield performance improvements of more than 25x with no changes to the algorithm code. They begin with a brief description of Apache SystemML, then use instructive examples in SystemML's R-like domain-specific language to describe the problem of fitting large training sets into main memory, and conclude with detailed, end-to-end performance results for key machine-learning algorithms and reference datasets.
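Compression can stay invisible to algorithm code because iterative algorithms typically touch the training matrix only through a few operations, chiefly matrix-vector products. The sketch below (plain Python; the interface and names are illustrative assumptions, not SystemML's runtime) writes a gradient-descent loop against that narrow interface, so a compressed backend like the one sketched above could be swapped in underneath without touching the loop.

    import numpy as np

    def gradient_descent(matvec, rmatvec, y, n_cols, step=1e-4, iters=500):
        """Fit w minimizing ||X w - y||^2, touching X only via
        matvec (v -> X @ v) and rmatvec (u -> X.T @ u)."""
        w = np.zeros(n_cols)
        for _ in range(iters):
            grad = rmatvec(matvec(w) - y)  # gradient X^T (X w - y)
            w -= step * grad
        return w

    # Dense backend here; a compressed backend would supply the same two
    # closures, computed over encoded columns instead of a dense array.
    X = np.random.default_rng(0).integers(0, 3, size=(1000, 5)).astype(float)
    y = X @ np.ones(5)
    w = gradient_descent(lambda v: X @ v, lambda u: X.T @ u, y, X.shape[1])
    # w is now approximately the all-ones vector that generated y.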

Frederick Reiss

IBM

Fred Reiss is chief architect and one of the founding employees of the IBM Spark Technology Center in San Francisco. Previously, he spent nine years at IBM Research Almaden, where he worked on the SystemML and SystemT projects as well as on the research prototype of DB2 with BLU Acceleration. He has over 25 peer-reviewed publications and six patents. Fred holds a PhD from UC Berkeley.

Arvind Surve

IBM

Arvind Surve is a data scientist and architect in AI and ML in IBM's Analytics group and a SystemML contributor and committer. He has worked at IBM for more than 19 years and has presented at the 2015 Data Engineering Conference in Tokyo, at the Strata Data Conference 2017 in San Jose, and to the Chicago Spark User Group. He holds an MS in digital electronics and communication systems and an MBA in finance and marketing.