High Level Abstractions Make Big Data Useful for Real People

Melissa Santos (Big Cartel)
Business & Industry
Location: 120-121
Average rating: ****.
(4.14, 21 ratings)

Etsy’s big data stack is used by a large number of people across the company, from engineers and analysts to product managers. At least one hundred people have contributed to a Hadoop job in the last two years, and more than thirty committed to the repository in the month of May. This widespread usage is enabled by the upfront work the company has done to turn the logged information into conceptually useful concepts. For example, Events and Visits are first-class objects, and we can easily search for particular sequences of Events within Visits. Most people don’t need to know exactly how these objects are defined, and anyone can talk in plain English about what they are trying to measure and be likely to describe something that translates easily to code. Because these are objects and not just rows of data, we can and do implement business rules at this stage of the processing. This keeps our definitions of traffic source, conversion, bot identification, visit duration, and total order amount (among other things) consistent for everyone.
This talk will cover a few of the technical details of how we created these abstractions, but will focus more on how we work with them and how they can be used to enable employees across the whole company to actively take part in the collection and analysis of data. It is relevant to all skill levels, but will be most useful for intermediate or expert audience members who have the power to implement some of these ideas.
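To make the idea of implementing business rules on objects rather than on rows concrete, here is a minimal sketch. It is written in Python for brevity and is not Etsy's code; the stack described in this talk is built on Scala and Scalding. Every field name, method name, and rule below (for example, treating a user agent that contains "bot" as a bot) is a hypothetical placeholder chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Event:
    """A single log record. All field names here are hypothetical."""
    event_type: str
    timestamp: int            # seconds since the epoch
    user_agent: str = ""
    order_amount: float = 0.0


@dataclass
class Visit:
    """A string of Events from the same browser, assumed sorted by time."""
    browser_id: str
    events: List[Event] = field(default_factory=list)

    # Each business rule is defined once, on the object, so every job
    # that works with a Visit gets the same definition.

    def duration(self) -> int:
        """Seconds between the first and last event of the visit."""
        if not self.events:
            return 0
        return self.events[-1].timestamp - self.events[0].timestamp

    def is_bot(self) -> bool:
        """Placeholder rule; a real bot filter would be far more involved."""
        return any("bot" in e.user_agent.lower() for e in self.events)

    def converted(self) -> bool:
        """Placeholder rule: a visit converts if it contains a purchase."""
        return any(e.event_type == "purchase" for e in self.events)

    def total_order_amount(self) -> float:
        """Sum of the order amounts on purchase events."""
        return sum(e.order_amount for e in self.events
                   if e.event_type == "purchase")


visit = Visit("browser-1", [
    Event("view_listing", 100),
    Event("add_to_cart", 160),
    Event("purchase", 400, order_amount=25.0),
])
print(visit.duration(), visit.converted(), visit.total_order_amount())
```

Because the rules live on the Visit, a product manager's job and a data science job that both ask whether a visit converted cannot drift apart in how they define conversion.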

Extended Description:
1. Overview of the Etsy data stack: how the data flows through the system, who the users are
a. Logging from the webstack → HDFS → processed into Scala objects
b. Users include: analysts, everyone in product who works on an experiment, finance, marketing, merchandising
c. A lot of indirect use: Hadoop used to feed our data warehouse for further analytics, recommendations and similar items datasets pushed back to production.
2. Concepts we’ve created abstractions for
a. Event – a single log record
b. Visit – a string of Events from the same browser
c. MatchPredicate – a funnel-based tool for searching a Visit for a series of events. Very flexible, very useful.
3. Real uses of those abstractions in jobs
a. We’ll walk through a job written by a Product Manager to answer a real question
b. We’ll also look at part of a more complex job used by the Data Science team
4. Takeaways for other companies/lessons learned at Etsy
a. Communication about data is important! Better questions, better logging. Helps get people in the mindset of thinking about what questions they’ll want to answer about new code.
b. Your company will benefit from having the data in a format that lines up with the questions you care about answering
c. No more than 1 slide on the pros/cons of using Scala/Scalding (pro: a first-class language instead of a DSL; con: slow builds)
5. What we would like to do in the future
a. Get the entire company to some level of awareness of what data is available and that there are tools and teams to help them use it
b. Make SQL and Scala training available to anyone who is interested
c. Clear understanding by product teams of how and why they should include events in new code
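
The funnel-style search described under item 2 of the outline can be sketched as follows. This is an illustrative Python approximation, not the MatchPredicate API itself, which this description does not show. The names `group_into_visits`, `funnel_depth`, `matches`, and `of_type` are hypothetical. The sketch also assumes that all records from one browser form a single Visit, which is a simplification.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple, Sequence


class Event(NamedTuple):
    """A single log record. Field names are hypothetical."""
    browser_id: str
    timestamp: int
    event_type: str


# A funnel step is any predicate over an Event.
Step = Callable[[Event], bool]


def group_into_visits(records: Sequence[Event]) -> Dict[str, List[Event]]:
    """Group log records by browser, ordered by time within each group."""
    visits: Dict[str, List[Event]] = defaultdict(list)
    for record in sorted(records, key=lambda r: r.timestamp):
        visits[record.browser_id].append(record)
    return dict(visits)


def funnel_depth(events: Sequence[Event], steps: Sequence[Step]) -> int:
    """Return how many funnel steps were satisfied, in order.

    Steps must occur in sequence but need not be adjacent: other events
    may appear between two matching ones.
    """
    depth = 0
    for event in events:
        if depth < len(steps) and steps[depth](event):
            depth += 1
    return depth


def matches(events: Sequence[Event], steps: Sequence[Step]) -> bool:
    """True when the visit completes every step of the funnel."""
    return funnel_depth(events, steps) == len(steps)


def of_type(name: str) -> Step:
    """Build a step that matches events of the given type."""
    return lambda event: event.event_type == name


records = [
    Event("a", 1, "search"), Event("b", 2, "search"),
    Event("a", 3, "view_listing"), Event("b", 4, "view_listing"),
    Event("a", 5, "purchase"), Event("b", 6, "search"),
]
funnel = [of_type("search"), of_type("view_listing"), of_type("purchase")]

visits = group_into_visits(records)
completed = [bid for bid, events in visits.items() if matches(events, funnel)]
print(completed)
```

Expressing the funnel as a list of named steps keeps the code close to the plain-English question ("who searched, then viewed a listing, then purchased?"), and `funnel_depth` also shows where the remaining visits dropped off.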


Melissa Santos

Big Cartel

Melissa Santos has over a decade of experience with all parts of the data pipeline, from ETLs to modeling. At Etsy, her role includes teaching both engineers and non-technical people how to get the data they need. She has a PhD in Applied Math and runs the blog allowedtoapply.tumblr.com.


Comments

Melissa Santos
1-12-2014 20:25 CET

https://dl.dropboxusercontent.com/u/47882/Santos%20Strata%202014.pdf

Safrir Dagan
26-11-2014 13:38 CET

Hi, can you post the slides? thanks.