Presented By O’Reilly and Cloudera
Make Data Work
21–22 May 2018: Training
22–24 May 2018: Tutorials & Conference
London, UK

A Compute Infrastructure for Data Scientists

17:25–18:05 Wednesday, 23 May 2018

Who is this presentation for?

Manager, Developer, Data Engineer, Software Engineer, Big Data Engineer

Prerequisite knowledge

Big Data, Spark, Cloud Environment

What you'll learn

Learn from Stitch Fix's experience building a compute infrastructure, including the challenges the team faced and the lessons it learned, so that you can improve the tools at your own organization and build tooling infrastructure that is robust and user-friendly for engineers and data scientists.

Description

Stitch Fix is a data science company that aspires to help you find the style that you will love. Data, the backbone of the business, is used to help with styling recommendations, demand modeling, user acquisition, and merchandise planning and also to influence business decisions throughout the organization. These decisions are backed by algorithms and data collected and interpreted based on client preferences.

Neelesh offers an overview of the compute infrastructure used by data scientists at Stitch Fix. Apache Spark plays an important role in Stitch Fix’s data platform, and the company’s data scientists use Spark for their ETL and Presto for their ad hoc queries. The goal for the team running the compute infrastructure is to understand the data scientists’ needs and make their lives easier, particularly in terms of Spark usability, by building tools that make it easier to get started with Spark and to move to a daily workflow. The compute infrastructure is a part of the data platform that is responsible for all the needs of data scientists at Stitch Fix.

Neelesh focuses on Stitch Fix’s journey, exploring its Spark setup and offering an overview of its in-house tools and how they work together with open source frameworks in a cloud environment. Neelesh also covers the additional improvements to the infrastructure that help persist information for future use and optimization and explains how adopting Amazon’s EMRFS (EMR File System) has made it easier to read data from S3.

Neelesh Srinivas Salian

Stitch Fix

Neelesh Srinivas Salian is a software engineer on the data platform team at Stitch Fix, where he works on the compute infrastructure used by data scientists, focusing in particular on the Apache Spark ecosystem. Previously, he was at Cloudera, where he worked with Apache projects like YARN, Spark, and Kafka. Neelesh holds a master’s degree in computer science with a focus on cloud computing from North Carolina State University and a bachelor’s degree in computer engineering from the University of Mumbai, India.
