Presented By O'Reilly and Cloudera
Make Data Work
Dec 4–5, 2017: Training
Dec 5–7, 2017: Tutorials & Conference

A recommendation system for wide transactions

1:45pm–2:25pm Wednesday, December 6, 2017

Who is this presentation for?

  • Data scientists and product managers

Prerequisite knowledge

  • A general familiarity with recommendation systems

What you'll learn

  • Learn how to build and serve a real-time recommendation system for wide data in the cloud using Apache Spark and Elasticsearch


Many applications we use today are powered by the cloud and mobile. One of the critical components that drives engagement for cloud platforms is the recommendation engine. Recommendation systems are becoming pervasive, but as both users and the number of products offered on a platform scale, we are hit with two distinct challenges: engineering and machine learning.

Bargava Subramanian and Harjinder Mistry share data engineering and machine learning strategies for building an efficient real-time recommendation engine when the transaction data is both big and wide. They also outline a novel way of generating frequent patterns using collaborative filtering and matrix factorization on Apache Spark and serving them using Elasticsearch in the cloud. Bargava and Harjinder define wide data as data in which the number of transactions in a transaction basket is greater than 1,000. Some examples of big and wide data include the financial instruments traded by a portfolio manager in a day, the products shipped from a warehouse, and the software components in a cloud platform.

Standard approaches to wide data have been market basket analysis (frequent pattern mining), collaborative filtering (matrix factorization), and deep learning. Apache Spark lends itself nicely to building a data science pipeline, from ingestion to data processing and machine learning. But as the data becomes wider, model training performance takes a hit. Bargava and Harjinder explain how they used the alternating least squares (ALS) algorithm in Spark to generate frequent itemsets. The new approach was faster and scaled well for big and wide data.
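
To see why wider baskets make frequent pattern mining expensive, consider a minimal sketch of the market basket baseline mentioned above. This is not the speakers' ALS-based method. It is a plain-Python illustration of pair counting, and the function names (`frequent_pairs`, `candidate_pairs`) and the toy baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import comb


def frequent_pairs(baskets, min_support):
    """Return item pairs that co-occur in at least `min_support` baskets."""
    counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so each pair has one canonical ordering.
        items = sorted(set(basket))
        counts.update(combinations(items, 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}


def candidate_pairs(basket_width):
    """Number of 2-item candidates that one basket of this width produces."""
    return comb(basket_width, 2)


# Hypothetical toy baskets, for illustration only.
baskets = [
    ["spark", "elasticsearch", "kafka"],
    ["spark", "elasticsearch"],
    ["spark", "kafka"],
]
pairs = frequent_pairs(baskets, min_support=2)
```

A basket of 10 items yields `candidate_pairs(10) == 45` pairs, while a basket at the 1,000-item threshold yields `candidate_pairs(1000) == 499500`, and larger itemsets grow faster still. That combinatorial growth is one plausible reason exhaustive counting slows down as the data gets wider.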

Bargava Subramanian


Bargava Subramanian is a cofounder and deep learning engineer at Binaize in Bangalore, India. He has 15 years’ experience delivering business analytics and machine learning solutions to B2B companies. He mentors organizations in their data science journey. He holds a master’s degree from the University of Maryland, College Park. He’s an ardent NBA fan.

Harjindersingh Mistry


Harjinder Mistry is a principal research engineer at Ola, where he is building a cloud-native data science platform to solve challenging fleet management problems. Previously, he engineered data platforms for data science projects at Red Hat and for the Watson ML platform at IBM. Earlier, he spent several years on the DB2 SQL Query Optimizer team, building and fixing the mathematical model that decides the query execution plan. Harjinder holds an MTech from IIIT, Bangalore, India.