Presented By O'Reilly and Cloudera
Make Data Work
5–7 May, 2015 • London, UK

Taming the firehose: Build analytics over 45 billion tweets using Elasticsearch and Spark

Anirudh Koul (Microsoft), Shashank Singh (Microsoft)
11:45–12:25 Thursday, 7/05/2015
Hadoop & Beyond
Location: Buckingham Room - Palace Suite
Average rating: 4.78 (9 ratings)

Prerequisite Knowledge

General technical background; the session builds up from the basics.

Description

Every day, over half a billion tweets are generated, and processing them for analytics can seem a Herculean task. We at Microsoft deal with such social data sets on a daily basis, and in this talk we share our experiences building a real-time search, analytics, and trends pipeline over social data, with the power of Elasticsearch, Azure, and Spark.

While Elasticsearch is highly scalable, fine-tuning the architecture to respond in under 900 milliseconds for 45 billion documents (while indexing) is still a tough task. We will discuss several aspects of getting the most out of your hardware, including the design of the search cluster, the experimentation setup for performance tuning, lessons learned from cloud services, fault tolerance, monitoring, customer-facing APIs, lowering costs, and other best practices.

Next, we talk about enabling analytics over this data using stream processing. We will discuss annotating tweets with natural language processing tools and text-based classifiers, doing temporal analytics, and eventually building applications like topical trend generation (for example, TV show trends for Xbox). This case study is a good example of bridging the gap between data science and data engineering.

Anirudh Koul

Microsoft

Anirudh Koul is a data scientist at Microsoft. He brings eight years of applied research experience on petabyte-scale social media datasets including Facebook, Twitter, Yahoo Answers, Quora, Foursquare, and Bing. He has worked on a variety of machine learning, natural language processing, and information retrieval projects at Yahoo, Microsoft, and Carnegie Mellon University. Rapidly prototyping ideas, he has won over two dozen innovation, programming, and 24-hour hackathon contests organized by companies including Facebook, Google, Microsoft, IBM, and Yahoo. Koul was also the keynote speaker at the SMX conference in Munich (March 2014), where he spoke about trends in applying machine learning to big data. You can read more about him here: http://linkedin.com/in/anirudhkoul

Shashank Singh

Microsoft

Shashank Singh is a software engineer at Microsoft. Wearing several hats over the past decade, he has been building production pipelines for large-scale data processing. Previously, he served as a project lead at HCL America.

Comments

Anirudh Koul
20/05/2015 19:09 BST

@Louis. Glad you enjoyed it. The slides should be up this weekend.

louis v
18/05/2015 13:50 BST

great session, are the slides available ?