Presented By O'Reilly and Cloudera
December 5-6, 2016: Training
December 6–8, 2016: Tutorials & Conference

Understanding the voice of members via text mining: How LinkedIn built a text analytics engine at scale

Chi-Yi Kuan (LinkedIn), Weidong Zhang (LinkedIn), Tiger Zhang (LinkedIn)
12:05pm–12:45pm Thursday, December 8, 2016
Data science and advanced analytics
Location: Summit 1 Level: Intermediate
Average rating: 4.50 (8 ratings)

Prerequisite Knowledge

  • Familiarity with big data and data mining

What you'll learn

  • How to design and build a high-performance, scalable text analytics platform


Today, businesses around the world are increasingly collecting tremendous amounts of unstructured data—in the form of text—from multiple channels such as product reviews, market research, customer-care conversations, and social media. While it is clear that text contains valuable information, how to best analyze this data at scale is often less so. For example, social media is incredibly text heavy, but it contains a lot of noise (i.e., information that is not relevant to businesses and products). Identifying and filtering this noise is a critical step before any further analysis can be performed.

To analyze unstructured text at scale, LinkedIn has built a “voice of member” platform that derives insights, such as customer value propositions, from the massive amount of data within its ecosystem. Chi-Yi Kuan, Weidong Zhang, and Yongzheng Zhang discuss how LinkedIn leverages its highly scalable big data system to ingest and process an enormous number of text documents from various internal and external data sources. The platform builds on big data infrastructure and tools such as Hadoop and Spark, integrates a huge volume of unstructured data from multiple channels (e.g., member profiles, user behaviors, and social networks), and mines knowledge and insights from unstructured text via advanced machine-learning and text-mining techniques.

Topics include:

  • LinkedIn’s big data infrastructure
  • Ingestion and integration of unstructured data from multiple channels
  • Near real-time text storage and analysis
  • High-performance machine-learning algorithms
  • Scalable natural language processing techniques
  • Personalized data visualization solutions

Chi-Yi Kuan


Chi-Yi Kuan is director of data science at LinkedIn. He has over 15 years of extensive experience applying big data analytics, business intelligence, risk and fraud management, data science, and marketing mix modeling across various business domains (social network, ecommerce, SaaS, and consulting) at both Fortune 500 firms and startups. Chi-Yi is dedicated to helping organizations become more data driven and profitable. He combines deep expertise in analytics and data science with business acumen and dynamic technology leadership.

Weidong Zhang


Weidong Zhang is an engineering manager on the Data Analytics Infrastructure team at LinkedIn and leads the marketing and customer-service data warehouse vertical. Weidong has a passion for analytics, research, and data-driven decision making. He spent 10+ years in the data warehouse ETL and BI reporting fields and combines his knowledge of business intelligence with Hadoop’s massive data-processing capability to address business needs. Weidong earned his PhD in computational fluid dynamics.

Tiger Zhang


Yongzheng (Tiger) Zhang is a senior manager of data mining at LinkedIn and an active researcher and practitioner of text mining and machine learning. He’s developed many practical and scalable solutions for utilizing unstructured data for ecommerce and social networking applications, including search, merchandising, social commerce, and customer-service excellence. Yongzheng is a highly regarded expert in text mining and has published and presented many papers in top journals and at conferences. He also organizes tutorials and workshops on sentiment analysis at prestigious conferences. He holds a PhD in computer science from Dalhousie University in Canada.