Yahoo initially built Hadoop to address an acute need: storing and processing large volumes of data efficiently. Since Yahoo open sourced Hadoop, it has been widely adopted across the technology world. However, time has shown that when a system becomes extremely popular for solving one class of problems, its limitations in solving others become more apparent. Himanshu Gupta explains why Yahoo has been increasingly investing in interactive analytics and how it leverages Druid to power a variety of internal- and external-facing data applications.
Millions of users around the globe interact with Yahoo through their web browsers and mobile devices, and these interactions generate billions of events every day. As Yahoo’s data volumes have grown, the company has faced increasing demand to make the data more accessible, both to internal users and to its customers. Not all of Yahoo’s end users are backend analysts, and many have no prior experience with traditional analytic tools, so Yahoo wanted to build simple, interactive data applications that anyone could use to derive insights from data. To support these use cases, Yahoo elected to invest in the Druid open source project.
Today, Yahoo runs multiple Druid clusters to support analytics for a variety of use cases, including application performance, user activity, and ad metrics. Each requires that Yahoo’s data applications update in real time and handle interactive ad hoc querying at very high scale. Himanshu explores Yahoo’s use cases with Druid, shares lessons learned from scaling Druid deployments, monitoring clusters, and ingesting data, and offers strategies for accelerating queries with approximate sketch-based algorithms.
Himanshu Gupta is a software engineer at Yahoo and a Druid project committer. Himanshu has been working with Hadoop-based data pipelines and related platforms for the past few years and currently focuses on the use of Druid inside Yahoo. Outside of work, Himanshu has written a video game for mobile, published solutions to pretty much all the exercises in How to Prove It, and dabbled in AI and ML algorithms. He’s a computer science autodidact and holds an MS degree in physics from the Indian Institute of Technology, Kanpur.
©2016, O'Reilly Media, Inc.