Presented By O’Reilly and Cloudera
Make Data Work
21–22 May 2018: Training
22–24 May 2018: Tutorials & Conference
London, UK

Improving ad hoc and production workflows at Stitch Fix

Neelesh Salian (Stitch Fix)
11:15–11:55 Thursday, 24 May 2018
Secondary topics: Data Platforms, E-commerce and Retail
Average rating: 1.00 (1 rating)

Who is this presentation for?

  • Managers, developers, data engineers, software engineers, and big data engineers

Prerequisite knowledge

  • Familiarity with big data, Spark, and cloud environments

What you'll learn

  • Explore Stitch Fix's compute infrastructure

Description

Stitch Fix aspires to help you find the style that you will love. Data, the backbone of the business, is used to help with styling recommendations, demand modeling, user acquisition, and merchandise planning and also to influence business decisions throughout the organization. These decisions are backed by algorithms and data collected and interpreted based on client preferences. Neelesh Srinivas Salian offers an overview of the compute infrastructure used by the data science team at Stitch Fix, covering the architecture, tools within the larger ecosystem, and the challenges that the team overcame along the way.

Apache Spark plays an important role in Stitch Fix’s data platform: the company’s data scientists use Spark for their ETL and Presto for their ad hoc queries. The goal for the team running the compute infrastructure is to understand the data scientists’ needs and make their lives easier, particularly in terms of the usability of Spark, by building tools that expedite getting started with Spark and transitioning from an ad hoc to a production workflow. The compute infrastructure is a part of the data platform that is responsible for all the needs of data scientists at Stitch Fix.

Neelesh shares Stitch Fix’s journey, exploring its ad hoc and production infrastructure and detailing its in-house tools and how they work together with open source frameworks in a cloud environment. Neelesh also discusses the additional improvements to the infrastructure that help persist information for future use and optimization and explains how the implementation of Amazon’s EMR File System (EMRFS) has made it easier to read from the S3 source.

Neelesh Salian

Stitch Fix

Neelesh Srinivas Salian is a software engineer on the data platform team at Stitch Fix, where he works on the compute infrastructure used by the company’s data scientists. Previously, he was at Cloudera, where he worked with Apache projects like YARN, Spark, and Kafka.