Presented By O'Reilly and Cloudera
December 5-6, 2016: Training
December 6–8, 2016: Tutorials & Conference

Crawling and tracking millions of ecommerce products at scale

Qiaoliang Xiang (ShopBack)
2:35pm–3:15pm Thursday, December 8, 2016
Hadoop use cases
Location: 328/329 Level: Beginner
Tags: ecommerce

Prerequisite Knowledge

  • A basic understanding of ecommerce product catalogs, Hadoop, and Python

What you'll learn

  • Understand how to deal with heterogeneous product data, how to implement flexible and robust systems, and how to scale systems using Hadoop-related tools


ShopBack is Southeast Asia’s largest and fastest-growing online loyalty platform, which aims to redefine the ecommerce market in Southeast Asia. It works closely with more than 500 online merchants (e.g., Taobao, Lazada, and Groupon) to promote their products and helps customers find the best deals across multiple merchants. Customers are then redirected to partners’ websites to make purchases and earn cash rebates.

To make it convenient for customers to search and compare products at ShopBack, the company must crawl and manage a huge number of products (estimated at about 25 million) from top merchants’ websites. Product structures are heterogeneous across merchants, differing in many aspects, such as category trees, attribute definitions, attribute values, and product descriptions. However, a smooth customer experience requires product information to be accurate, consistent, and up to date.

To address this data problem, ShopBack built an ecommerce product catalog management system. The system follows a modularized design to keep it flexible, scalable, and extensible: it is broken down into components such as crawling, parsing, storing, mapping, and tracking. Each component runs independently and communicates with the others via message queues (Apache Kafka).
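The component pipeline described above can be sketched as follows. This is a minimal illustration, using Python's in-process `queue.Queue` as a stand-in for Kafka topics; the component names, message shapes, and parsing logic are assumptions for illustration, not ShopBack's actual implementation:

```python
import json
import queue
import re

# Stand-ins for Kafka topics: in production, each stage would use a
# Kafka producer/consumer instead of an in-process queue.
crawled_pages = queue.Queue()
parsed_products = queue.Queue()

def crawler(url, html):
    """Crawling component: publish a fetched page downstream."""
    crawled_pages.put(json.dumps({"url": url, "html": html}))

def parser():
    """Parsing component: consume one page, publish a structured product."""
    msg = json.loads(crawled_pages.get())
    # Real parsing would extract title, price, and attributes from the page.
    title = re.search(r"<h1>(.*?)</h1>", msg["html"]).group(1)
    parsed_products.put(json.dumps({"url": msg["url"], "title": title}))

# Each stage runs independently; they share only the queue between them.
crawler("https://example.com/p/1", "<h1>Blue Shoes</h1>")
parser()
product = json.loads(parsed_products.get())
print(product["title"])
```

Because components communicate only through queues, each one can be scaled, restarted, or replaced without touching the others, which is the main payoff of the modularized design.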

In the real world, product information changes from time to time, so ShopBack’s database must be updated frequently to present timely information. A naive (brute-force) approach would recrawl all products regularly, but this is time and resource intensive. Based on the observations that not all products change daily and that not all products are equally important, ShopBack implemented a cost-effective tracking strategy.
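One way to sketch such a tracking strategy is adaptive scheduling: products that change often or matter more are recrawled sooner than the base interval. The scoring formula and weights below are illustrative assumptions, not the talk's actual strategy:

```python
from datetime import datetime, timedelta

def next_crawl_time(last_crawled, change_rate, importance,
                    base_interval=timedelta(days=7)):
    """Schedule the next recrawl for a product.

    change_rate: observed fraction of past crawls where the product
                 changed (0.0 to 1.0).
    importance:  relative weight, e.g. from sales or traffic (0.0 to 1.0).
    """
    # Shrink the interval as change rate and importance grow, so hot
    # products are revisited frequently and stable ones rarely.
    factor = 1.0 / (1.0 + 4.0 * change_rate + 4.0 * importance)
    return last_crawled + base_interval * factor

now = datetime(2016, 12, 8)
hot = next_crawl_time(now, change_rate=1.0, importance=1.0)
cold = next_crawl_time(now, change_rate=0.0, importance=0.0)
print(hot < cold)  # hot products are scheduled sooner
```

Compared with recrawling all 25 million products on the same cycle, this concentrates crawl budget where changes are most likely to be seen by customers.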

Qiaoliang Xiang walks you through how to crawl and update products, how to scale the process using big data tools, and how to design a modularized system.

Topics include:

  • How to crawl all known products and discover new products at scale
  • How to parse product pages efficiently
  • How to store heterogeneous products in a unified way
  • How to map heterogeneous product structures to a consistent structure
  • How to track product changes in a cost-effective way
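For the mapping step above, translating each merchant’s raw fields onto one consistent schema might look like the sketch below. The merchant names, field mappings, and schema are illustrative assumptions:

```python
# Per-merchant mapping from raw field names to a unified product schema.
FIELD_MAPS = {
    "merchant_a": {"item_name": "title", "cost": "price",
                   "cat": "category"},
    "merchant_b": {"product_title": "title", "sale_price": "price",
                   "category_path": "category"},
}

def to_unified(merchant, raw):
    """Map a merchant-specific product record onto the unified schema.

    Unmapped fields are preserved under 'extra' so no data is lost."""
    mapping = FIELD_MAPS[merchant]
    unified = {"merchant": merchant, "extra": {}}
    for key, value in raw.items():
        if key in mapping:
            unified[mapping[key]] = value
        else:
            unified["extra"][key] = value
    return unified

p = to_unified("merchant_a",
               {"item_name": "Blue Shoes", "cost": 29.9, "color": "blue"})
print(p["title"], p["price"], p["extra"]["color"])
```

Keeping the mappings as data rather than code means adding a new merchant is a configuration change, which supports the extensibility goal of the modularized design.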

Qiaoliang Xiang


Qiaoliang Xiang is currently the head of data science at ShopBack, where he focuses on setting up big data infrastructure to store and process data, building data pipelines to provide clean, accurate, and consistent data, creating self-service reporting tools to satisfy other teams’ data requests, and developing data science products to serve customers better. Previously, he was a data scientist at Lazada working on product attribute extraction, a data engineer at Visa analyzing financial transactions, and a research assistant at NUS and NTU focusing on information retrieval, machine learning, and natural language processing. Qiaoliang holds an MEng from Nanyang Technological University, Singapore.