Today, businesses around the world are increasingly collecting tremendous amounts of unstructured text data from multiple channels such as product reviews, market research, customer-care conversations, and social media. While it is clear that text contains valuable information, how best to analyze this data at scale is often less clear. For example, social media is text-heavy, but it contains a lot of noise (i.e., information that is not relevant to businesses and products). Identifying and filtering this noise is a critical step before any further analysis can be performed.
To analyze unstructured text at scale, LinkedIn has built a “voice of member” platform that derives insights, such as customer value propositions, from the massive amount of data within its ecosystem. Chi-Yi Kuan, Weidong Zhang, and Yongzheng Zhang discuss how LinkedIn leverages its highly scalable big data system to ingest and process an enormous number of text documents from various internal and external data sources. This includes high-performance ETL in Hadoop (data standardization, member segmentation enrichment, title parsing, and content grouping), state-of-the-art natural language processing and machine learning with Spark, and heterogeneous data layout and data as a service (DaaS) with Elasticsearch.
Chi-Yi Kuan is director of data science at LinkedIn. He has over 15 years of extensive experience applying big data analytics, business intelligence, risk and fraud management, data science, and marketing mix modeling across various business domains (social network, ecommerce, SaaS, and consulting) at both Fortune 500 firms and startups. Chi-Yi is dedicated to helping organizations become more data driven and profitable. He combines deep expertise in analytics and data science with business acumen and dynamic technology leadership.
Weidong Zhang is an engineering manager on the Data Analytics Infrastructure team at LinkedIn and leads the marketing and customer-service data warehouse vertical. Weidong has a passion for analytics, research, and data-driven decision making. He spent 10+ years in the data warehouse ETL and BI reporting fields and leverages his knowledge of business intelligence and Hadoop’s massive data-processing capability to address business needs. Weidong earned his PhD in computational fluid dynamics.
Yongzheng Zhang is a senior manager of data mining at LinkedIn and an active researcher and practitioner of text mining and machine learning. He’s developed many practical and scalable solutions for utilizing unstructured data for ecommerce and social networking applications, including search, merchandising, social commerce, and customer-service excellence. Yongzheng is a highly regarded expert in text mining and has published and presented many papers in top journals and at conferences. He also organizes tutorials and workshops on sentiment analysis at prestigious conferences. He holds a PhD in computer science from Dalhousie University in Canada.
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse, or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.