A productive data science platform: Beyond a hosted-notebooks solution at LinkedIn
Who is this presentation for?
- Data science developers
Many individual tools exist for ad hoc analysis of big data stored in distributed databases and file systems that integrate with Apache Hadoop and Apache Spark. Most of these tools are either enterprise products or are derived from open source projects such as Presto, Apache Pig, and Apache Hive. Developers spend a large portion of their time on ad hoc analysis and the develop-test-productionize cycle.
Some of the challenges data engineers and scientists encounter are:
- Discovering and leveraging existing algorithms, solutions, or models that peers have published or tested
- Running an experiment on multiple clusters or datasets
- Optimizing jobs through Dr. Elephant reports
- Setting up custom libraries and environments for running data experiments, along with a customized authoring experience
- Visualizing the results of a job run on the cluster seamlessly, which currently requires explicitly ingesting the data into another application or process
- Enforcing developer best practices and peer review for queries or code before they are productionized
- Versioning work, with the ability to revert in case of incompatible changes
- Productionizing the work
- Scheduling data analysis
- Supporting polyglot authoring, with productive authoring for R Shiny apps, TensorFlow, or PySpark jobs
LinkedIn is a data-driven company. Every team consumes and produces data that improves the user experience on LinkedIn. Swasti Kakker, Manu Ram Pandit, and Vidya Ravivarma explore the scalable, extensible unified platform LinkedIn is building on JupyterHub, Jupyter Notebook, Docker and Kubernetes, MySQL, Git, and Rest.li, a platform that enforces productive data science and improves the development experience.
Prerequisite knowledge
- A basic understanding of computer science and big data
What you'll learn
- Learn to use Jupyter notebooks in your company ecosystem, host Jupyter notebooks in Kubernetes, and modify the Jupyter notebook interface to suit your use cases
- Understand microservice APIs for scaling Jupyter notebooks and custom Docker images that provide heterogeneous data analytics support
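As a rough illustration of how custom Docker images can provide heterogeneous analytics support in a Kubernetes-hosted notebook service, the sketch below builds a list in the shape that KubeSpawner's `profile_list` option accepts, with one entry per environment. The image names and the `build_profiles` helper are hypothetical and chosen only for this example; the abstract does not describe the images or configuration LinkedIn actually uses.

```python
# Hypothetical image names; the actual images are not named in the abstract.
IMAGES = {
    "PySpark": "example.registry/notebooks/pyspark:latest",
    "TensorFlow": "example.registry/notebooks/tensorflow:latest",
    "R Shiny": "example.registry/notebooks/r-shiny:latest",
}


def build_profiles(images):
    """Return a list shaped like KubeSpawner's `profile_list` option."""
    profiles = []
    for name, image in images.items():
        profiles.append({
            "display_name": name,
            "slug": name.lower().replace(" ", "-"),
            # Each profile overrides only the container image for the user pod.
            "kubespawner_override": {"image": image},
        })
    # Mark the first entry as the default choice on the spawn page.
    if profiles:
        profiles[0]["default"] = True
    return profiles


PROFILES = build_profiles(IMAGES)

# In a jupyterhub_config.py, where JupyterHub injects the config object `c`,
# the list could be wired in like this (assumes the kubespawner package):
#   c.JupyterHub.spawner_class = "kubespawner.KubeSpawner"
#   c.KubeSpawner.profile_list = PROFILES
```

With a setup along these lines, each user picks an environment when starting a server, and the hub launches a pod from the matching image.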
Swasti Kakker
Swasti Kakker is a software development engineer on the data team at LinkedIn. Her passion lies in improving developer productivity by designing and implementing scalable platforms. In her two-year tenure at LinkedIn, she’s worked on the design and implementation of hosted notebooks, which provides a hosted Jupyter notebooks solution. She’s worked closely with stakeholders to understand the expectations and requirements of a platform that would improve developer productivity. Previously, she worked with the Spark team, discussing how Spark History Server could be made more scalable to handle the traffic from Dr. Elephant. She’s also contributed Spark heuristics to Dr. Elephant after understanding the needs of the stakeholders (mainly Spark developers), which gave her good knowledge of Spark infrastructure, Spark parameters, and how to tune them efficiently.
Manu Ram Pandit
Manu Ram Pandit is a senior software engineer on the data analytics and infrastructure team at LinkedIn. He has extensive experience in building complex and scalable applications. During his tenure at LinkedIn, he’s influenced the design and implementation of hosted notebooks, providing a seamless experience to end users. He works closely with customers, engineers, and product teams to understand and define the requirements and design of the system. Previously, he was with Paytm, Amadeus, and Samsung, where he built scalable applications for various domains.
Vidya Ravivarma
Vidya Ravivarma is a senior software engineer on the data analytics and infrastructure team at LinkedIn. She focuses on designing and implementing a platform that improves developer productivity via hosted notebooks. She contributed to the design and development of a dynamic, unified ACL management system for GDPR enforcement on datasets produced via LinkedIn’s metrics platform. She interacts closely with data analysts, scientists, engineers, and stakeholders to understand their requirements and build scalable, flexible solutions and platforms that enhance their productivity. Previously, she was at Yahoo, mainly in data science and engineering and web development, which gave her the insights to develop a scalable, productive data science platform.