Presented By O'Reilly and Cloudera
Make Data Work
March 28–29, 2016: Training
March 29–31, 2016: Conference
San Jose, CA

How LinkedIn built a text analytics platform at scale

Chi-Yi Kuan (LinkedIn), Weidong Zhang (LinkedIn), Tiger Zhang (LinkedIn)
11:00am–11:40am Thursday, 03/31/2016
Average rating: 4.29 (24 ratings)

Today, businesses around the world are increasingly collecting tremendous amounts of unstructured data—in the form of text—from multiple channels such as product reviews, market research, customer-care conversations, and social media. While it is clear that text contains valuable information, how to best analyze this data at scale is often less so. For example, social media is incredibly text heavy, but it contains a lot of noise (i.e., information that is not relevant to businesses and products). Identifying and filtering this noise is a critical step before any further analysis can be performed.

To analyze unstructured text at scale, LinkedIn has built a “voice of member” platform that derives insights, such as customer value propositions, from the massive amount of data within its ecosystem. Chi-Yi Kuan, Weidong Zhang, and Yongzheng Zhang discuss how LinkedIn leverages its highly scalable big data system to ingest and process an enormous number of text documents from various internal and external data sources. This includes high-performance ETL in Hadoop (data standardization, member segmentation enrichment, title parsing, and content grouping), state-of-the-art natural language processing and machine learning with Spark, and heterogeneous data layout and data as a service (DaaS) with Elasticsearch.

Topics include:

  • Scalable architecture with ETL to integrate various data sources
  • Title parsing and content grouping to provide effective information
  • Near real-time data warehouse and BI reporting with Elasticsearch
  • Data-relevancy resolution via machine learning in Spark
  • Scalable natural language processing in Spark/Azkaban
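
As a rough illustration of the data-relevancy topic above, the sketch below separates product-related feedback from unrelated chatter with a tiny Naive Bayes classifier. It is not the speakers' implementation: the abstract says only that relevancy is resolved with machine learning in Spark, so this pure-Python stand-in, the `RelevanceFilter` class, the `relevant`/`noise` labels, and the training sentences are all hypothetical and chosen for readability.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


class RelevanceFilter:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def fit(self, labeled_docs):
        for text, label in labeled_docs:
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokenize(text):
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data: made-up sentences, not real member feedback.
TRAINING = [
    ("the job search page keeps crashing on my phone", "relevant"),
    ("i love the new profile editing feature", "relevant"),
    ("premium subscription billing charged me twice", "relevant"),
    ("good morning everyone have a great day", "noise"),
    ("check out my vacation photos from the beach", "noise"),
    ("happy birthday to my best friend", "noise"),
]

clf = RelevanceFilter().fit(TRAINING)
incoming = [
    "the profile page keeps crashing",
    "happy birthday have a great vacation",
]
# Keep only documents classified as relevant before further analysis.
kept = [doc for doc in incoming if clf.predict(doc) == "relevant"]
print(kept)  # ['the profile page keeps crashing']
```

At production scale the same idea would more plausibly run as a distributed job over far larger labeled sets; the point here is only the shape of the step: label a sample, train a classifier, and drop documents predicted as noise before downstream analysis.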

Chi-Yi Kuan


Chi-Yi Kuan is director of data science at LinkedIn. He has over 15 years of extensive experience applying big data analytics, business intelligence, risk and fraud management, data science, and marketing mix modeling across various business domains (social network, ecommerce, SaaS, and consulting) at both Fortune 500 firms and startups. Chi-Yi is dedicated to helping organizations become more data driven and profitable. He combines deep expertise in analytics and data science with business acumen and dynamic technology leadership.


Weidong Zhang


Weidong Zhang is an engineering manager on the Data Analytics Infrastructure team at LinkedIn and leads the marketing and customer-service data warehouse vertical. Weidong has a passion for analytics, research, and data-driven decision making. He has spent more than 10 years in the data warehouse ETL and BI reporting fields and combines his knowledge of business intelligence with Hadoop’s massive data-processing capability to address business needs. Weidong earned his PhD in computational fluid dynamics.


Tiger Zhang


Yongzheng (Tiger) Zhang is a senior manager of data mining at LinkedIn and an active researcher and practitioner of text mining and machine learning. He has developed many practical and scalable solutions for utilizing unstructured data for ecommerce and social networking applications, including search, merchandising, social commerce, and customer-service excellence. Yongzheng is a highly regarded expert in text mining and has published and presented many papers in top journals and at conferences. He also organizes tutorials and workshops on sentiment analysis at prestigious conferences. He holds a PhD in computer science from Dalhousie University in Canada.