March 16–17, 2015: Training
March 17–19, 2015: Conference
Boston, MA

Architecting for Data Science

1:15pm–2:45pm Thursday, 03/19/2015
Big Data, Reactive and its variants
Location: 304
Average rating: 4.20 (10 ratings)

Prerequisite Knowledge

If you have only limited experience with data science, this talk will serve as a useful survey and will help you get started on the right track. If you have worked with data science, machine learning, or recommendation systems before, then you may appreciate more fully the tensions that influence our design decisions.

Description

We first deployed data science and machine learning for real-time recommendation engines at if(we) with great enthusiasm, but soon got a nasty surprise: it felt as though we had hit a speed bump in the middle of the highway. Having built a web infrastructure capable of delivering complex functionality with rapid iterative cycles, we were stunned by how long it took to try out new ideas.

Thanks to the platform we designed, today things are different. We enforce a rigorous approach to data modeling and, most importantly, all data access occurs through event history. Our framework processes events in reactive style, so the code runs the same way during production real-time streaming as it does during development and model evaluation.
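As a minimal illustration of the reactive style described above, the following Python sketch routes all data access through an event history and uses the same event handler for offline replay and for live streaming. It is not the if(we) framework or its Scala DSL; the `Event` schema, the `LikeRate` feature, and the `replay` and `stream` helpers are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator


@dataclass(frozen=True)
class Event:
    """A single entry in the event history (hypothetical schema)."""
    timestamp: int
    user_id: str
    kind: str  # e.g. "view" or "like"


class LikeRate:
    """Reactive feature: fraction of a user's views that led to a like."""

    def __init__(self) -> None:
        self.views: Dict[str, int] = {}
        self.likes: Dict[str, int] = {}

    def on_event(self, event: Event) -> None:
        # State changes only in response to events, never via ad hoc queries.
        if event.kind == "view":
            self.views[event.user_id] = self.views.get(event.user_id, 0) + 1
        elif event.kind == "like":
            self.likes[event.user_id] = self.likes.get(event.user_id, 0) + 1

    def value(self, user_id: str) -> float:
        views = self.views.get(user_id, 0)
        return self.likes.get(user_id, 0) / views if views else 0.0


def replay(history: Iterable[Event], feature: LikeRate) -> LikeRate:
    """Offline path: feed a stored history through the feature in time order."""
    for event in sorted(history, key=lambda e: e.timestamp):
        feature.on_event(event)
    return feature


def stream(source: Iterator[Event], feature: LikeRate) -> Iterator[float]:
    """Online path: the same on_event call, one event at a time as it arrives."""
    for event in source:
        feature.on_event(event)
        yield feature.value(event.user_id)


HISTORY = [
    Event(1, "alice", "view"),
    Event(2, "alice", "like"),
    Event(3, "alice", "view"),
    Event(4, "bob", "view"),
]

offline = replay(HISTORY, LikeRate())                  # development / evaluation
online = LikeRate()
live_values = list(stream(iter(HISTORY), online))      # simulated production feed
```

Because both paths call the same `on_event` method, the feature value computed during model evaluation matches the one computed on the live stream, which is the property the paragraph above relies on.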

Our prediction platform software is key to enabling rapid iterative cycles and responsiveness to changing business needs. Initially, we needed a great deal of custom code to meet production scalability and performance requirements for each new recommendation algorithm. Today, we just drop in the same reactive feature definitions used in development. Whereas it once took three to six months to try a new idea, it’s now easy to go from idea to production validation in just a few days.

We will share some of the difficult tradeoffs that we managed. Putting expressiveness as a top priority forced a number of design decisions and influenced the development of a Scala-based feature DSL. We chose to focus on performance ahead of scalability, meaning we operated without the benefit of Hadoop ecosystem technology, preferring a custom in-memory data representation instead. We also emphasize creating interesting features as machine learning inputs, an approach that leads to good results even with the most common machine learning algorithms.
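To give a rough sense of what expressive feature definitions over an in-memory event list can look like, here is a small Python sketch built from composable functions. It is only an analogy: the actual DSL is Scala-based, and the `count` and `ratio` combinators, the event fields, and the feature names below are assumptions made for this example rather than anything taken from the if(we) code.

```python
from typing import Callable, Dict, List

Event = Dict[str, str]                           # hypothetical event record
Feature = Callable[[List[Event], str], float]    # (events, user) -> value


def count(kind: str) -> Feature:
    """Feature: number of events of the given kind recorded for a user."""
    def feature(events: List[Event], user: str) -> float:
        return float(sum(1 for e in events
                         if e["user"] == user and e["kind"] == kind))
    return feature


def ratio(numerator: Feature, denominator: Feature) -> Feature:
    """Combinator: numerator / denominator, or 0.0 when the denominator is 0."""
    def feature(events: List[Event], user: str) -> float:
        d = denominator(events, user)
        return numerator(events, user) / d if d else 0.0
    return feature


def feature_vector(features: Dict[str, Feature],
                   events: List[Event], user: str) -> Dict[str, float]:
    """Evaluate every named feature for one user."""
    return {name: f(events, user) for name, f in features.items()}


# New features are declared by composing existing ones, not by custom plumbing.
FEATURES = {
    "messages_sent": count("message"),
    "reply_rate": ratio(count("reply"), count("message")),
}

EVENTS = [
    {"user": "alice", "kind": "message"},
    {"user": "alice", "kind": "message"},
    {"user": "alice", "kind": "reply"},
    {"user": "bob", "kind": "message"},
]

alice = feature_vector(FEATURES, EVENTS, "alice")
bob = feature_vector(FEATURES, EVENTS, "bob")
```

The resulting dictionaries could be passed as input rows to any common learning algorithm; the point of the sketch is that adding a derived feature such as `reply_rate` takes one line.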

In this talk we’ll take a deep dive through the evolution of the if(we) data science architecture. In addition to telling the story, we’ll demonstrate our approach by working through examples with data from Kaggle competitions. We’d like you to take this home and try it yourself, so we will make the source code available (https://github.com/ifwe/antelope).

Johann Schleier-Smith

if(we)

Johann leads if(we) with his partner, co-founder, and long-time friend, Greg Tseng. Under Johann’s leadership, if(we) conceived, developed, and refined Tagged, a social networking product supporting 300 million users in over 200 countries. With balanced interests in software development, data science, product design, and building businesses, Johann works closely with the team to meet the trends of 21st-century social life, always keen to adapt cutting-edge technology to internet-size and internet-speed applications.

Johann holds an A.B. in Physics and Mathematics from Harvard University and pursued a Ph.D. in Physics at Stanford for several years, before leaving to fully focus on his entrepreneurial career. He is also an advisor to the Immunity Project, a non-profit initiative dedicated to developing a free vaccine for HIV/AIDS. Outside of the office, Johann can be found riding waves while kitesurfing in summer, and riding snow in winter.