Presented By O'Reilly and Cloudera
Make Data Work
September 26–27, 2016: Training
September 27–29, 2016: Tutorials & Conference
New York, NY

Beyond Hadoop at Yahoo: Interactive analytics with Druid

Himanshu Gupta (Yahoo)
4:35pm–5:15pm Wednesday, 09/28/2016
Data innovations
Location: 1 E 07/1 E 08
Level: Beginner
Average rating: 3.33 (3 ratings)

Prerequisite knowledge

  • A general understanding of using big data systems to process and analyze various user and application events
  • Druid experience (useful but not required)

What you'll learn

  • Learn how to scale Druid to power analytics

Description

    Yahoo initially built Hadoop to address an acute need: storing and processing large volumes of data efficiently. Since Yahoo open sourced Hadoop, it has been widely adopted across the technology world. However, time has taught us that when a system becomes extremely popular for solving one class of problems, its limitations in solving other problems become more apparent. Himanshu Gupta explains why Yahoo has been increasingly investing in interactive analytics and how it leverages Druid to power a variety of internal- and external-facing data applications.

    Millions of users around the globe interact with Yahoo through their web browsers and mobile devices, and these interactions generate billions of events every day. As Yahoo’s data volumes have grown, the company has faced increasing demand to make the data more accessible, both to internal users and to its customers. Not all of Yahoo’s end users are backend analysts, and many have no prior experience with traditional analytic tools, so Yahoo wanted to build simple, interactive data applications that anyone could use to derive insights from data. To support these use cases, Yahoo elected to invest in the Druid open source project.

    Today, Yahoo runs multiple Druid clusters that support analytics for use cases including application performance, user activity, and ad metrics. Each requires that Yahoo’s data applications update in real time and handle interactive ad hoc queries at very high scale. Himanshu explores Yahoo’s use cases with Druid, shares lessons learned from scaling Druid deployments, monitoring clusters, and ingesting data, and offers strategies for accelerating queries with approximate sketch-based algorithms.

    Himanshu Gupta


    Himanshu Gupta is a software engineer at Yahoo and a Druid project committer. Himanshu has worked with Hadoop-based data pipelines and related platforms for the past few years and currently focuses on the use of Druid inside Yahoo. Outside of work, Himanshu has written a video game for mobile, published solutions to pretty much all the exercises in How to Prove It, and dabbled in AI and ML algorithms. He’s a computer science autodidact and holds an MS degree in physics from the Indian Institute of Technology, Kanpur.