Many applications we use today are powered by the cloud and mobile. One of the critical components that drives engagement for cloud platforms is the recommendation engine. Recommendation systems are becoming pervasive, but as both the number of users and the number of products offered on a platform grow, two distinct kinds of challenges arise: engineering and machine learning.
Bargava Subramanian and Harjinder Mistry share data engineering and machine learning strategies for building an efficient real-time recommendation engine when the transaction data is both big and wide. They also outline a novel way of generating frequent patterns using collaborative filtering and matrix factorization on Apache Spark and serving them using Elasticsearch in the cloud. Bargava and Harjinder define wide data as data in which the number of items in a transaction basket is greater than 1,000. Some examples of big and wide data include the financial instruments traded by a portfolio manager in a day, the products shipped from a warehouse, and the software components in a cloud platform.
Standard approaches to wide data have been market basket analysis (frequent pattern mining), collaborative filtering (matrix factorization), and deep learning. Apache Spark lends itself nicely to building a data science pipeline, from ingestion to data processing and machine learning. But as the data becomes wider, model training performance takes a hit. Bargava and Harjinder explain how they used the Alternating Least Squares algorithm in Spark to generate frequent itemsets. The new approach was faster and scaled well for big and wide data.
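To make the scaling problem concrete, here is a minimal sketch of the baseline frequent-pattern approach (counting co-occurring item pairs), written in plain Python rather than Spark. It is not the speakers' ALS-based method, which the abstract does not detail; the function names, the toy baskets, and the support threshold are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return item pairs that co-occur in at least `min_support` baskets."""
    counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so each pair has one canonical ordering.
        items = sorted(set(basket))
        # A basket of n distinct items contributes n * (n - 1) / 2 pairs.
        counts.update(combinations(items, 2))
    return {pair: c for pair, c in counts.items() if c >= min_support}


def pairs_per_basket(n_items):
    """Number of candidate pairs generated by one basket of n distinct items."""
    return n_items * (n_items - 1) // 2


# Hypothetical toy baskets, for illustration only.
baskets = [
    ["spark", "kafka", "elasticsearch"],
    ["spark", "elasticsearch"],
    ["kafka", "postgres"],
    ["spark", "kafka", "elasticsearch", "postgres"],
]
result = frequent_pairs(baskets, min_support=3)
```

With these toy baskets, only the pair ("elasticsearch", "spark") reaches a support of 3. The helper pairs_per_basket shows why width is the problem: a single basket of 1,000 distinct items already yields 499,500 candidate pairs, and larger itemsets grow faster still, which is the kind of cost a matrix factorization approach tries to avoid.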
Bargava Subramanian is a cofounder and deep learning engineer at Binaize in Bangalore, India. He has 15 years’ experience delivering business analytics and machine learning solutions to B2B companies. He mentors organizations in their data science journey. He holds a master’s degree from the University of Maryland, College Park. He’s an ardent NBA fan.
Harjinder Mistry is a principal research engineer at Ola, where he is building a cloud-native data-science platform to solve challenging problems of fleet management. Previously, he engineered data platforms for a couple of interesting data-science projects: OpenShift.io at Red Hat and the Watson ML platform at IBM. Earlier, he spent several years in the DB2 SQL Query Optimizer team, building and fixing the mathematical model that decides the query execution plan. Harjinder holds an MTech from IIIT, Bangalore, India.
©2017, O'Reilly Media, Inc.