Sep 23–26, 2019

Productive Data Science Platform - Beyond Hosted notebooks solution at LinkedIn

Swasti Kakker (LinkedIn), Manu Ram Pandit (LinkedIn), Vidya Ravivarma (LinkedIn)
1:15pm–1:55pm Wednesday, September 25, 2019
Location: 1A 21/22
Secondary topics: Data, Analytics, and AI Architecture; Media and Advertising

Who is this presentation for?

Areas: Data science development, Jupyter Notebooks, Kubernetes

Level

Intermediate

Prerequisite knowledge

Basic computer science and big data knowledge.

What you'll learn

  1. Using Jupyter notebooks in your company's ecosystem
  2. Hosting Jupyter notebooks on Kubernetes
  3. Modifying the Jupyter notebook interface to suit your use cases
  4. Microservice APIs for scaling Jupyter notebooks (see the sketch below)
  5. Custom Docker images providing heterogeneous data analytics support
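As an illustration of the microservice-API point above, the following is a minimal sketch of starting a user's notebook server programmatically through JupyterHub's public REST API. The hub URL, token, and user name are placeholders; this is not LinkedIn's internal API.

    # start_server.py -- hedged sketch using JupyterHub's REST API
    import requests

    HUB_URL = "https://hub.example.com/hub/api"    # placeholder hub endpoint
    API_TOKEN = "replace-with-a-jupyterhub-token"  # placeholder API token
    USER = "alice"                                 # placeholder user name

    headers = {"Authorization": f"token {API_TOKEN}"}

    # Ask the hub to spawn a single-user notebook server for USER.
    resp = requests.post(f"{HUB_URL}/users/{USER}/server", headers=headers)

    # 201 = server started; 202 = spawn accepted and still pending.
    print(resp.status_code)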

Description

There are many individual tools for ad hoc analysis of big data stored in distributed databases and file systems that integrate with Apache Hadoop and Apache Spark. Most of these tools are either enterprise products or derived from open source projects such as Presto, Apache Pig, and Apache Hive. Developers currently spend a large portion of their time on ad hoc analysis and the develop-test-productionize cycle.
Outlined below are some of the challenges data engineers and data scientists encounter:

  1. Discovering and leveraging existing algorithms, solutions, or models that their peers have published and tested.
  2. Running an experiment on multiple clusters or datasets.
  3. Optimizing jobs using Dr. Elephant reports.
  4. Using custom libraries and environments for data experiments, with a customized authoring experience.
  5. Performing visualizations seamlessly on the results of a job run on the cluster; today the user must explicitly ingest the data into another application or process.
  6. Enforcing developer best practices and peer review of the queries and code that will be executed, before productionizing them.
  7. Versioning their work, with the ability to revert incompatible changes.
  8. Productionizing the work.
  9. Scheduling data analysis.
  10. Supporting polyglot authoring: productive authoring of R Shiny apps, TensorFlow jobs, or PySpark jobs.

LinkedIn is a data-driven company: every team consumes and produces data that improves the user experience on LinkedIn. We are building a scalable, extensible, unified platform that leverages JupyterHub, Jupyter Notebook, Docker, Kubernetes, MySQL, Git, and Rest.li to enforce productive data science practices and improve the development experience.
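To picture how the pieces above fit together, here is a small, hypothetical configuration sketch: JupyterHub spawning per-user notebook pods on Kubernetes via KubeSpawner, with custom Docker images offering different analytics environments. The image names and resource limits are invented placeholders, not LinkedIn's actual setup.

    # jupyterhub_config.py -- sketch of JupyterHub on Kubernetes via KubeSpawner
    c.JupyterHub.spawner_class = "kubespawner.KubeSpawner"

    # Each user gets an isolated pod; the image bundles the desired kernels and libraries.
    c.KubeSpawner.image = "registry.example.com/hosted-notebooks/base:latest"  # placeholder image
    c.KubeSpawner.cpu_limit = 2
    c.KubeSpawner.mem_limit = "4G"

    # Offer heterogeneous environments (e.g., PySpark vs. TensorFlow) as selectable profiles.
    c.KubeSpawner.profile_list = [
        {
            "display_name": "PySpark environment",
            "kubespawner_override": {"image": "registry.example.com/hosted-notebooks/pyspark:latest"},
        },
        {
            "display_name": "TensorFlow environment",
            "kubespawner_override": {"image": "registry.example.com/hosted-notebooks/tensorflow:latest"},
        },
    ]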

Swasti Kakker

LinkedIn

Swasti is a software development engineer on the LinkedIn data team. Her passion lies in improving developer productivity by designing and implementing scalable platforms. In her two-year tenure at LinkedIn, she has worked on the design and implementation of Hosted Notebooks at LinkedIn, which provides a hosted solution for Jupyter Notebooks, and has worked closely with stakeholders to understand the expectations and requirements of a platform that would improve developer productivity. Prior to this, she worked closely with the Spark team on making the Spark History Server more scalable so it could handle traffic from Dr. Elephant, and she contributed Spark heuristics to Dr. Elephant after understanding the needs of its stakeholders (mainly Spark developers), which gave her a solid understanding of Spark infrastructure, Spark parameters, and how to tune them efficiently.

Manu Ram Pandit

LinkedIn

Manu is a senior software engineer at LinkedIn, where he has worked with the data analytics and infrastructure team for the last one and a half years. He has extensive experience building complex, scalable applications. During his tenure at LinkedIn, he has influenced the design and implementation of Hosted Notebooks at LinkedIn, providing a seamless experience to end users. He works closely with customers, engineers, and product managers to understand and define the requirements and design of the system. Prior to joining LinkedIn, he worked at Paytm, Amadeus, and Samsung, building scalable applications for various domains.

Vidya Ravivarma

LinkedIn

Vidya Ravivarma is a senior software engineer at LinkedIn, where she has worked with the data analytics and infrastructure team for the last one and a half years. She focuses on the design and implementation of a platform that improves developer productivity via Hosted Notebooks. Before this, she contributed to the design and development of a dynamic, unified ACL management system for GDPR enforcement on datasets produced by LinkedIn's metrics platform. She interacts closely with data analysts, scientists, engineers, and stakeholders to understand their requirements and build scalable, flexible platforms that enhance their productivity. Prior to LinkedIn, she worked at Yahoo for three years, mainly in data science, data engineering, and web development, which gives her insight into building a scalable, productive data science platform.
