A productive data science platform: Beyond a hosted-notebooks solution at LinkedIn

Swasti Kakker (LinkedIn), Manu Ram Pandit (LinkedIn), Vidya Ravivarma (LinkedIn)

1:15pm–1:55pm Wednesday, September 25, 2019

Location: 1A 21/22

Data Engineering and Architecture

Secondary topics: Data, Analytics, and AI Architecture, Media and Advertising

Average rating: 3.55 (11 ratings)

Who is this presentation for?

Data science developers

Level

Intermediate

Description

Many individual tools exist for ad hoc analysis of big data stored in distributed databases and file systems that integrate with Apache Hadoop and Apache Spark. Most of these tools are either enterprise products or are derived from open source projects such as Presto, Apache Pig, and Apache Hive. Developers spend a large portion of their time on ad hoc analysis and the develop-test-productionize cycle.

Data engineers and scientists encounter several challenges: discovering and reusing existing algorithms, solutions, or models that peers have published or tested; running an experiment on multiple clusters or datasets; optimizing jobs using Dr. Elephant reports; setting up custom libraries and environments for data experiments, along with a customized authoring experience; visualizing the results of a job run on the cluster seamlessly, which currently requires explicitly ingesting the data into another application or process; enforcing developer best practices and peer review for queries or code before they are productionized; versioning work and reverting incompatible changes; productionizing the work; scheduling data analysis; and supporting polyglot authoring, with productive authoring for R Shiny apps, TensorFlow, or PySpark jobs.

LinkedIn is a data-driven company. Every team consumes and produces data that improves the user experience on LinkedIn. Swasti Kakker, Manu Ram Pandit, and Vidya Ravivarma explore the scalable, extensible, unified platform that LinkedIn is building with JupyterHub, Jupyter Notebook, Docker, Kubernetes, MySQL, Git, and Rest.li, a platform that promotes productive data science and improves the development experience.

Prerequisite knowledge

A basic understanding of computer science and big data

What you'll learn

Learn to use Jupyter notebooks in your company ecosystem, host Jupyter notebooks in Kubernetes, and modify the Jupyter Notebook interface to suit your use cases
Understand microservice APIs for scaling Jupyter notebooks and custom Docker images that provide heterogeneous data analytics support

Swasti Kakker

Swasti Kakker is a software development engineer on the data team at LinkedIn, where she worked on the design and implementation of hosted notebooks, specifically a hosted solution for Jupyter notebooks. She works closely with stakeholders to understand the expectations and requirements of the platform that would improve developer productivity. Previously, she worked with the Spark team, discussing how Dr. Elephant can improve Spark History Server to make it more scalable to cater to traffic. She also contributed Spark heuristics to Dr. Elephant after understanding the needs of the stakeholders (mainly Spark developers), which gave her good knowledge of Spark infrastructure, Spark parameters, and how to tune them efficiently. Her passion lies in improving developer productivity by designing and implementing scalable platforms.

Manu Ram Pandit

Manu Ram Pandit is a senior software engineer on the data analytics and infrastructure team at LinkedIn, where he has influenced the design and implementation of hosted notebooks, providing a seamless experience to end users. He works closely with customers, engineers, and product teams to understand and define the requirements and design of the system. He has extensive experience in building complex and scalable applications. Previously, he was with Paytm, Amadeus, and Samsung, where he built scalable applications for various domains.

Vidya Ravivarma

Vidya Ravivarma is a senior software engineer on the data analytics and infrastructure team at LinkedIn. She focuses on designing and implementing a platform that improves developer productivity via hosted notebooks. She contributed to the design and development of a dynamic, unified ACL management system for GDPR enforcement on datasets produced via LinkedIn’s metrics platform. She works closely with data analysts, scientists, engineers, and stakeholders to understand their requirements and build scalable, flexible solutions and platforms that enhance their productivity. Previously, she was at Yahoo, working mainly in data science and engineering and in web development. This experience gave her the insights to develop a scalable, productive data science platform.