Etsy’s big data stack is used by a large number of people across the company, from engineers and analysts to product managers. At least one hundred people have contributed to a Hadoop job in the last two years, and more than thirty committed to the repository in the month of May. This widespread usage is enabled by the upfront work the company has done to turn raw log data into conceptually useful abstractions. For example, Events and Visits are first-class objects, and we can easily search for particular sequences of Events within Visits. Most people don’t need to know exactly how these objects are defined; anyone can describe in plain English what they are trying to measure, and the description is likely to translate easily to code. Because these are objects and not just rows of data, we can and do implement business rules at this stage of the processing. This keeps our definitions of traffic source, conversion, bot identification, visit duration, and total order amount (among other things) consistent for everyone.
This talk will cover a few of the technical details of how we created these abstractions, but will focus more on how we work with them and how they enable employees across the whole company to actively take part in the collection and analysis of data. It is relevant to all skill levels, but will be most useful for intermediate or expert audience members who have the power to implement some of these ideas.
1. Overview of the Etsy data stack: how the data flows through the system, who the users are
a. Logging from the web stack → HDFS → processed into Scala objects
b. Users include: analysts, everyone in product who works on an experiment, finance, marketing, merchandising
c. A lot of indirect use: Hadoop feeds our data warehouse for further analytics, and recommendation and similar-item datasets are pushed back to production.
2. Concepts we’ve created abstractions for
a. Event – a single log record
b. Visit – a string of Events from the same browser
c. MatchPredicate – a funnel-based tool for searching a Visit for a series of events. Very flexible, very useful.
3. Real uses of those abstractions in jobs
a. We’ll walk through a job written by a Product Manager to answer a real question
b. and part of a more complex job used by the Data Science team
4. Takeaways for other companies/lessons learned at Etsy
a. Communication about data is important! Better questions, better logging. Helps get people in the mindset of thinking about what questions they’ll want to answer about new code.
b. Your company will benefit from having the data in a format that lines up with the questions you care about answering
c. No more than 1 slide on the pros/cons of using Scala/Scalding (first class language instead of DSL/slow builds)
5. What we would like to do in the future
a. Get the entire company to some level of awareness of what data is available and that there are tools and teams to help them use it
b. Make SQL and Scala training available to anyone who is interested
c. Clear understanding by product teams of how and why they should include events in new code
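The abstractions listed in item 2 of the outline can be illustrated with a minimal Scala sketch. Everything here is an assumption made for illustration: the field names, the event type strings, and the `FunnelPredicate` class are hypothetical stand-ins and do not reproduce Etsy's actual Event, Visit, or MatchPredicate definitions. The sketch also simplifies a Visit to "all events from one browser", with no session timeout.

```scala
// Hypothetical sketch: names and fields are illustrative assumptions,
// not Etsy's actual definitions.

// An Event is a single log record.
final case class Event(browserId: String, eventType: String, epochMs: Long)

// A Visit is a string of Events from the same browser, in time order.
final case class Visit(browserId: String, events: Vector[Event]) {
  // A business rule defined once on the object, so every job shares it.
  def durationMs: Long =
    if (events.isEmpty) 0L else events.last.epochMs - events.head.epochMs
}

object Visit {
  // Group raw events by browser and sort each group by timestamp.
  def fromEvents(events: Seq[Event]): Seq[Visit] =
    events.groupBy(_.browserId).toSeq.map { case (id, es) =>
      Visit(id, es.sortBy(_.epochMs).toVector)
    }
}

// A funnel-style predicate: do the given event types occur in this order
// (not necessarily adjacent) within the visit?
final case class FunnelPredicate(steps: List[String]) {
  def matches(visit: Visit): Boolean = {
    // Walk the events once, dropping a step each time it is matched.
    val remaining = visit.events.foldLeft(steps) {
      case (next :: rest, e) if e.eventType == next => rest
      case (pending, _)                             => pending
    }
    remaining.isEmpty
  }
}

object Demo {
  def main(args: Array[String]): Unit = {
    val log = Seq(
      Event("b1", "search", 1000L),
      Event("b1", "view_listing", 2000L),
      Event("b2", "view_listing", 1500L),
      Event("b1", "add_to_cart", 5000L)
    )
    val funnel  = FunnelPredicate(List("search", "add_to_cart"))
    val matched = Visit.fromEvents(log).filter(funnel.matches)
    matched.foreach(v => println(s"${v.browserId} matched, duration ${v.durationMs} ms"))
  }
}
```

Putting rules such as `durationMs` on the object rather than in each job is what keeps a definition the same for everyone who uses it. A question phrased as "visits where someone searched and later added to cart" maps directly onto the list of steps.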
Melissa Santos has over a decade of experience with all parts of the data pipeline, from ETLs to modeling. At Etsy, her role includes teaching both engineers and non-technical people how to get the data they need. She has a PhD in Applied Math and runs the blog allowedtoapply.tumblr.com.