Building Scalable Big Data Infrastructure Using Open Source Software

Hadoop in Practice Great America Ballroom K

With the need for Big Data comes the need for apt tools to work with that data. Data scientists and engineers need the best tools available to build data models and finely tuned algorithms efficiently. The right tools and infrastructure to collect and store data in a time- and space-efficient manner become indispensable. On top of all these data collection requirements, you still need to keep latency on the site minimal: any mechanism put in place to collect analytics data must not interfere with the performance of the site. It would also be great to have a platform that lets data scientists run A/B tests with minimal effort. And above all of these: how do you scale?

StumbleUpon, the leading personalized discovery engine on the web for the last decade, is in the midst of what one could call a data and information explosion. With over 50 GB of data produced each day, there is a real need to manage all this information and make it accessible to all relevant stakeholders in the company. This talk will focus on our analytics infrastructure platforms, which solve these problems and help our analysts and data scientists extract the most value out of our data.

At StumbleUpon we believe in free and open source software, and in this talk we'll demonstrate how state-of-the-art open source systems like Hadoop, HBase, Kafka and Redis are being used to build a world-class data platform. We will talk about how we have implemented Kafka to collect logs efficiently and how data gets organized into optimized partitioned tables in Hive on HDFS; Hive is the most favored tool among analysts for ad hoc querying. We will also discuss how we leverage HBase to collect millions of data points and metrics in near real time while adding minimal latency to the site. Additionally, we will talk about the adoption of Scala at StumbleUpon, which has gone a long way in helping us build complex back-end systems in record time. Akka's actor and remoting models have made developing concurrent systems easier and more robust than ever before.
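To make the idea of partitioned tables concrete, here is a minimal sketch of how log events might be bucketed into Hive-style partition directories, where each directory level is a `key=value` segment. Everything specific in it is an assumption made for illustration, not StumbleUpon's actual layout: the table root `/warehouse/events`, the partition keys `dt` and `hour`, and the `(epoch_seconds, payload)` event shape are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_path(table_root: str, epoch_seconds: int) -> str:
    """Return a Hive-style partition directory for an event time (UTC).

    The partition keys `dt` and `hour` are hypothetical names chosen for
    this sketch; the source does not say which keys are used.
    """
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return f"{table_root}/dt={ts:%Y-%m-%d}/hour={ts:%H}"


def group_by_partition(table_root: str, events):
    """Bucket (epoch_seconds, payload) pairs by their partition directory."""
    buckets = defaultdict(list)
    for epoch_seconds, payload in events:
        buckets[partition_path(table_root, epoch_seconds)].append(payload)
    return dict(buckets)


if __name__ == "__main__":
    # Hypothetical events: two in the first hour, one in the second.
    sample = [(0, "view"), (10, "click"), (3600, "view")]
    for path, payloads in sorted(group_by_partition("/warehouse/events", sample).items()):
        print(path, payloads)
```

With a layout like this, a query that filters on `dt` only needs to read the matching directories, which is the main reason partitioned tables suit ad hoc querying over large logs.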

This talk will elaborate on how these technologies are put to use and work together to build a big data infrastructure that is fast, scalable and, most importantly, user-friendly.


Sam William

StumbleUpon Inc

Sam William is an Analytics Engineer at StumbleUpon. Before that, he worked as a Software Engineer in the Content Platform group at Yahoo.

Comments on this page are now closed.


raghuram gururajan
02/28/2013 1:34am PST

Hi, can you please share the slides for the session?

