Over half a billion tweets are generated every day, and processing them for analytics can seem a Herculean task. At Microsoft, we work with social data sets of this kind daily. In this talk, we share our experience building a real-time search, analytics, and trends pipeline over social data, using Elasticsearch, Azure, and Spark.
While Elasticsearch is highly scalable, fine-tuning the architecture to respond in under 900 milliseconds over 45 billion documents (while still indexing) remains a tough task. We will discuss the design of the search cluster, the experimentation setup for performance tuning, lessons learned from cloud services, fault tolerance, monitoring, customer-facing APIs, lowering costs, and other best practices for getting the most out of your hardware.
Next, we talk about enabling analytics over this data using stream processing. We will discuss annotating tweets with natural language processing tools and text-based classifiers, doing temporal analytics, and eventually building applications such as topical trend generation (for example, TV show trends for Xbox). This case study is a good example of bridging the gap between data science and data engineering.
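To make the idea of temporal analytics and topical trend generation concrete, here is a minimal sketch in plain Python. It is not the pipeline described in the talk (which uses Spark and Elasticsearch); the function names `window_counts` and `trending`, the tumbling-window size, and the smoothed-ratio score are all assumptions chosen for illustration. It counts topic mentions per fixed time window, then ranks topics by how much the current window exceeds their recent average.

```python
from collections import Counter


def window_counts(events, window_seconds):
    """Group (timestamp_seconds, topic) events into tumbling windows.

    Returns a dict mapping window index -> Counter of topic mentions.
    """
    counts = {}
    for ts, topic in events:
        window = ts // window_seconds
        counts.setdefault(window, Counter())[topic] += 1
    return counts


def trending(counts, current_window, history=3, smoothing=1.0):
    """Rank topics in the current window against their recent baseline.

    The baseline is the mean count over the `history` preceding windows
    (missing windows count as zero). The smoothing term is an assumed
    constant that keeps rare topics from producing a division by zero.
    """
    current = counts.get(current_window, Counter())
    scores = []
    for topic, now in current.items():
        past_total = sum(
            counts.get(w, Counter())[topic]
            for w in range(current_window - history, current_window)
        )
        baseline = past_total / history
        scores.append((topic, (now + smoothing) / (baseline + smoothing)))
    # Highest score first: topics growing fastest relative to their baseline.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

A topic that is mentioned steadily scores close to 1, while a topic that appears suddenly scores well above 1, so a simple threshold or top-k cut on the ranked list gives a rough list of trends. In a streaming setting the same counting step would typically be expressed as a windowed aggregation rather than an in-memory dictionary.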
Anirudh Koul is a data scientist at Microsoft. He brings eight years of applied research experience on petabyte-scale social media datasets including Facebook, Twitter, Yahoo Answers, Quora, Foursquare, and Bing. He has worked on a variety of machine learning, natural language processing, and information retrieval projects at Yahoo, Microsoft, and Carnegie Mellon University. Rapidly prototyping ideas, he has won over two dozen innovation, programming, and 24-hour hackathon contests organized by companies including Facebook, Google, Microsoft, IBM, and Yahoo. Koul was also the keynote speaker at the SMX conference in Munich (March 2014), where he spoke about trends in applying machine learning to big data. You can read more about him here: http://linkedin.com/in/anirudhkoul
Shashank is a software engineer at Microsoft. Wearing several hats over the past decade, he has built production pipelines for large-scale data processing. Previously, he served as a project lead at HCL America.