ShopBack is Southeast Asia’s largest and fastest-growing online loyalty platform, and it aims to redefine the region’s ecommerce market. It works closely with more than 500 online merchants (e.g., Taobao, Lazada, and Groupon) to promote their products and helps customers find the best deals across multiple merchants. Customers are then redirected to partners’ websites to make purchases and earn cash rebates.
To make it convenient for customers to search for and compare products, ShopBack must crawl and manage a huge number of products (an estimated 25 million) from top merchants’ websites. Product structures are heterogeneous across merchants, differing in category trees, attribute definitions, attribute values, and product descriptions. A smooth customer experience, however, requires product information to be accurate, consistent, and up to date.
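To make the heterogeneity problem concrete, here is a minimal sketch of mapping records from two merchants into one unified schema. Everything in it is an assumption for illustration: the merchant names (`merchant_a`, `merchant_b`), the field names, the category labels, and the `normalize` helper are hypothetical and are not taken from ShopBack's system.

```python
# Hypothetical per-merchant field mappings: source field -> unified field.
FIELD_MAPPINGS = {
    "merchant_a": {"name": "title", "price_sgd": "price", "cat": "category"},
    "merchant_b": {"product_title": "title", "cost": "price", "category_path": "category"},
}

# Hypothetical mapping from (merchant, merchant category) to a unified category.
CATEGORY_MAPPINGS = {
    ("merchant_a", "Phones"): "Electronics > Mobile Phones",
    ("merchant_b", "electronics/mobile"): "Electronics > Mobile Phones",
}


def normalize(merchant, record):
    """Map one merchant-specific record onto the unified schema.

    Assumes the record carries the merchant's title, price, and category fields.
    """
    mapping = FIELD_MAPPINGS[merchant]
    # Rename known fields and drop anything the unified schema does not use.
    unified = {mapping[key]: value for key, value in record.items() if key in mapping}
    # Merchants may send prices as strings or numbers; store a float either way.
    unified["price"] = float(unified["price"])
    # Translate the merchant's category into the unified category tree.
    unified["category"] = CATEGORY_MAPPINGS.get(
        (merchant, unified["category"]), "Uncategorized"
    )
    unified["merchant"] = merchant
    return unified


record_a = normalize("merchant_a", {"name": "Phone X", "price_sgd": "499.00", "cat": "Phones"})
record_b = normalize(
    "merchant_b",
    {"product_title": "Phone X", "cost": 499, "category_path": "electronics/mobile"},
)
```

After normalization, both records share the same field names, price type, and category, so they can be compared side by side.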
To address this data problem, ShopBack built an ecommerce product catalog management system. To keep the system flexible, scalable, and extensible, ShopBack gave it a modular design, breaking it down into components for crawling, parsing, storing, mapping, and tracking. Each component runs independently and communicates with the others via message queues (Apache Kafka).
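The decoupling described above can be sketched in a few lines. This is only an illustration under stated assumptions: Python's in-process `queue.Queue` stands in for Kafka topics, fetching is faked, only three of the five stages are shown, and all function names and the message layout are hypothetical.

```python
import queue


def crawl(urls, raw_pages):
    """Crawling stage: publish one raw-page message per URL (fetching is faked)."""
    for url in urls:
        raw_pages.put({"url": url, "html": f"<title>Item at {url}</title>"})


def parse(raw_pages, parsed_products):
    """Parsing stage: consume raw pages and publish structured product records."""
    while not raw_pages.empty():
        page = raw_pages.get()
        # Strip the surrounding <title> tags from the fake page.
        title = page["html"][len("<title>"):-len("</title>")]
        parsed_products.put({"url": page["url"], "title": title})


def store(parsed_products, catalog):
    """Storing stage: consume product records and upsert them by URL."""
    while not parsed_products.empty():
        product = parsed_products.get()
        catalog[product["url"]] = product


def run_pipeline(urls):
    """Wire the stages together; each stage only knows about its queues."""
    raw_pages = queue.Queue()        # stands in for a "raw pages" topic
    parsed_products = queue.Queue()  # stands in for a "parsed products" topic
    catalog = {}
    crawl(urls, raw_pages)
    parse(raw_pages, parsed_products)
    store(parsed_products, catalog)
    return catalog


catalog = run_pipeline(["https://example.com/p/1", "https://example.com/p/2"])
```

Because each stage reads from one queue and writes to another, a stage can be replaced or scaled without touching its neighbors. In a real deployment the stages would run as separate processes consuming from a message broker.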
In the real world, product information changes over time, so ShopBack’s database must be updated frequently to stay current. A naive (brute-force) approach would recrawl every product at regular intervals, but this is time and resource intensive. Based on the observation that not all products change daily and not all products are equally important, ShopBack implemented a cost-effective tracking strategy.
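The abstract does not spell out the tracking strategy, so the following is only one plausible sketch of the idea: rank products by importance multiplied by the estimated chance that they have changed since the last crawl, then recrawl the top few within a fixed budget. The `Product` fields, the scoring formula, and the assumption of independent daily changes are all hypothetical.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Product:
    product_id: str
    importance: float         # hypothetical weight in [0, 1], e.g. share of clicks
    daily_change_prob: float  # hypothetical estimate from past crawls, in [0, 1]
    days_since_crawl: int


def recrawl_priority(product):
    """Importance times the probability of at least one change since the last crawl.

    Assumes changes on different days are independent (a simplification).
    """
    unchanged = (1.0 - product.daily_change_prob) ** product.days_since_crawl
    return product.importance * (1.0 - unchanged)


def select_for_recrawl(products, budget):
    """Return the `budget` highest-priority products, best first."""
    return heapq.nlargest(budget, products, key=recrawl_priority)


products = [
    Product("A", importance=0.9, daily_change_prob=0.5, days_since_crawl=1),
    Product("B", importance=0.1, daily_change_prob=0.5, days_since_crawl=1),
    Product("C", importance=0.9, daily_change_prob=0.0, days_since_crawl=10),
    Product("D", importance=0.5, daily_change_prob=0.2, days_since_crawl=3),
]
selected = select_for_recrawl(products, budget=2)
```

With a budget of two, this picks the important, frequently changing product A and the moderately important product D, and skips both the unimportant product B and the never-changing product C. A brute-force recrawl would spend the same effort on all four.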
Qiaoliang Xiang walks you through how to crawl and update products, how to scale the process using big data tools, and how to design a modular system.
Qiaoliang Xiang is the head of data science at ShopBack, where he focuses on setting up big data infrastructure to store and process data, building data pipelines to provide clean, accurate, and consistent data, creating self-service reporting tools to satisfy other teams’ data requests, and developing data science products to serve customers better. Previously, he was a data scientist at Lazada working on product attribute extraction, a data engineer at Visa analyzing financial transactions, and a research assistant at NUS and NTU focusing on information retrieval, machine learning, and natural language processing. Qiaoliang holds an MEng from Nanyang Technological University, Singapore.
©2016, O'Reilly Media, Inc. • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse, or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.