Beyond Hadoop MapReduce: Interactive Advertising Insights with Shark @ Yahoo!

Nandu Jayakumar (Oracle), Tim Tully (Yahoo!)
Data in Action
Ballroom CD

Understanding user sentiment, improving user engagement, and maximizing ROI for the advertising dollar spent without harming user experiences are all crucial to Yahoo!’s business. In order to effectively perform these tasks, we ingest hundreds of TB of advertising data every day on Hadoop clusters of thousands of machines. Many of the algorithms we use to measure user engagement can be modeled as multiway self-join queries that are very expensive to compute on very large datasets.
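As a rough illustration of why such queries are expensive, the sketch below counts the users shared by each pair of ads, which is logically a self-join of an event log on the user id. The `events` data, the `(user_id, ad_id)` layout, and the `pairwise_overlap` helper are all hypothetical and are not taken from the Yahoo! platform.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical event log: (user_id, ad_id) pairs. Not real Yahoo! data.
events = [
    ("u1", "adA"), ("u1", "adB"),
    ("u2", "adA"), ("u2", "adB"), ("u2", "adC"),
    ("u3", "adB"), ("u3", "adC"),
]

def pairwise_overlap(events):
    """Count users shared by each pair of ads.

    Equivalent to joining the event log with itself on user_id and
    grouping by the (ad, ad) pair.
    """
    # Group the distinct ads seen by each user.
    ads_by_user = defaultdict(set)
    for user, ad in events:
        ads_by_user[user].add(ad)

    # Each user contributes one count to every pair of ads they saw.
    overlap = defaultdict(int)
    for ads in ads_by_user.values():
        for a, b in combinations(sorted(ads), 2):
            overlap[(a, b)] += 1
    return dict(overlap)

overlap = pairwise_overlap(events)
```

A user who has seen k ads contributes on the order of k² pairs, so the intermediate result of the exact join can grow much faster than the input. This is one reason approximate methods become attractive at this scale.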

The challenge we face is how to effectively query this vast amount of information and come up with interesting insights. Over the course of the last year, we have been developing a new data platform for user affinity analysis using Shark and Spark.

In this talk, we discuss our use cases, the advanced streaming algorithms we have implemented on top of these platforms, and the general architecture that provides interactive, real-time analytics to our data scientists. Deploying these new systems together with novel algorithms (min hashing, mod hashing, and other sketches) can reduce the runtime of such analytics from hours to seconds.
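To make the min hashing idea concrete, here is a minimal textbook-style sketch of estimating the overlap between two user sets from short signatures instead of an exact join. It is not the speakers' implementation: the function names, the use of `blake2b` as the hash family, and the signature length are all assumptions chosen for illustration, and mod hashing is not shown.

```python
import hashlib

def _hash(seed, item):
    """Deterministic 64-bit hash of item, varied by seed (illustrative choice)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(items, num_hashes=256):
    """Return the minimum hash value of a non-empty collection for each seed."""
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree, an estimate of |A ∩ B| / |A ∪ B|."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

# Hypothetical user-id sets for two ad campaigns; true Jaccard is 500 / 1500.
users_a = set(range(0, 1000))
users_b = set(range(500, 1500))

sig_a = minhash_signature(users_a)
sig_b = minhash_signature(users_b)
estimate = estimate_jaccard(sig_a, sig_b)
```

Each signature has a fixed size regardless of how large the underlying set is, so comparing two segments costs a pass over a few hundred integers rather than a join over the raw events. The estimate is approximate, and its error shrinks as the number of hash functions grows.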

Nandu Jayakumar

Vice President, Development, Oracle

Nandu Jayakumar is a software architect and engineering leader at Oracle. Before that he was responsible for the long-term architecture of data systems and was Senior Director of data platform development at Visa. Previously, as a senior leader of Yahoo’s well-regarded data team, Nandu built key pieces of Yahoo’s data processing tools and platforms over several iterations, which were used to improve user engagement on Yahoo websites and mobile apps. He also designed large-scale advertising systems and contributed code to Shark (SQL on Spark) during his time there. Nandu holds a bachelor’s degree in electronics engineering from Bangalore University and a master’s degree in computer science from Stanford University, where he focused on databases and distributed systems.

Tim Tully

Distinguished Architect, Yahoo!

Tim Tully is a Distinguished Architect at Yahoo! with extensive big data experience. At Yahoo!, he has designed the Yahoo! Data technology platform, including data warehousing, aggregation, visualization, instrumentation, ETL, and other analytics components. He currently leads the architecture of multi-petabyte solutions at Yahoo! on Hadoop and other big data ecosystems, and is responsible for bringing Spark and Shark to the company. He also won the prestigious Yahoo! Individual Superstar award for 2011.