Presented By O’Reilly and Cloudera
Make Data Work
September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

Tracking data lineage at Stitch Fix

Neelesh Salian (Stitch Fix)
4:35pm–5:15pm Wednesday, 09/12/2018
Data engineering and architecture
Location: 1A 23/24
Level: Intermediate
Secondary topics: Data Integration and Data Pipelines, Data preparation, governance and privacy

Who is this presentation for?

  • Security engineers, data engineers, software engineers, engineering managers, and architects

Prerequisite knowledge

  • A basic understanding of data management, AWS, Apache Spark, Apache Hadoop, and Apache Hive

What you'll learn

  • Understand how Stitch Fix manages its data warehouse and how it built a service that helps track and maintain information that is constantly in flux

Description

Personalization allows Stitch Fix to style its clients and provide recommendations to help them find what they love. To do this, the company gathers information about a client’s preferences up front when they sign up for the service and learns more about them as they become longer-term customers. This information is important for making recommendations, but it must also be protected and managed with care.

The data science team at Stitch Fix is the primary owner of the recommendation systems. Backing them up is the data platform team, which maintains the data infrastructure, data warehouse, and supporting tools and services. Several different systems read from and write to this data warehouse. These include a logging pipeline for events, every Spark-based ETL, and daily snapshots of structured data from Stitch Fix applications.

Neelesh Srinivas Salian explains Stitch Fix’s process for better understanding the movement and evolution of data within its data warehouse, from the initial ingestion from outside sources through all of its ETLs. Neelesh also details how Stitch Fix built a service that exposes the lineage information associated with each table in the data warehouse, so the company can trace the source, parentage, and journey of all data in the warehouse. Although Stitch Fix makes sure to anonymize and filter out sensitive information from this data, the company needs a more flexible long-term solution as the business expands.
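
To make the idea of per-table lineage concrete, here is a minimal Python sketch of how parentage could be recorded and queried. It is an illustration only, not Stitch Fix's implementation: the `LineageGraph` class, its methods, and the table names are all hypothetical, and a real service would persist this information rather than hold it in memory.

```python
from collections import defaultdict


class LineageGraph:
    """Toy in-memory lineage store (hypothetical): table -> direct parent tables."""

    def __init__(self):
        self._parents = defaultdict(set)

    def record_write(self, target, sources):
        """Record that a job wrote `target` after reading from `sources`."""
        self._parents[target].update(sources)

    def parents(self, table):
        """Direct parents of `table`, sorted for stable output."""
        return sorted(self._parents.get(table, ()))

    def ancestors(self, table):
        """Every upstream table, found with an iterative depth-first walk."""
        seen = set()
        stack = list(self._parents.get(table, ()))
        while stack:
            current = stack.pop()
            if current in seen:
                continue  # already visited; also guards against cycles
            seen.add(current)
            stack.extend(self._parents.get(current, ()))
        return sorted(seen)


# Hypothetical table names, loosely mirroring the sources named in the abstract.
graph = LineageGraph()
graph.record_write("staging.events", ["raw.event_log"])
graph.record_write("staging.clients", ["raw.app_snapshot"])
graph.record_write("marts.client_activity", ["staging.events", "staging.clients"])
```

In this sketch, `graph.parents("marts.client_activity")` returns the two staging tables, while `graph.ancestors("marts.client_activity")` also includes the raw sources they were built from, which is the kind of source-and-parentage question the abstract describes.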

Neelesh Salian

Stitch Fix

Neelesh Srinivas Salian is a software engineer on the data platform team at Stitch Fix, where he works on the compute infrastructure used by the company’s data scientists. Previously, he was at Cloudera, where he worked with Apache projects like YARN, Spark, and Kafka.