Peng Du and Randy Wei offer an overview of Uber’s data science workbench, a central platform where data scientists perform interactive data analysis through notebooks such as Jupyter and RStudio, share and collaborate on scripts, and publish results to dashboards. The workbench is integrated with other Uber services, providing convenient features such as task scheduling, model publishing, and job monitoring.
Uber’s data science workbench provides clients with a scalable compute environment through dedicated Docker containers, spawned in response to requests for notebook instances, and a YARN/Mesos-managed cluster for compute engines such as Spark, Hive, and Presto. The workbench also supports social features: clients can share, comment on, and collaborate on notebook scripts with appropriate access control. All files, including scripts and results, are kept under version control so that people can track progress and compare results.
To improve the productivity of data scientists, the workbench is also integrated with multiple services at Uber. A mature script can be scheduled as a periodic task in Uber’s job scheduling service, and people can publish their results through dashboard services like Shiny and their models through Uber’s machine-learning platform. Finally, for complicated tasks that involve long-running jobs in Spark, Hive, or Presto, the workbench registers the jobs in Uber’s monitoring service so that people can check their progress and debugging information.
Peng Du is a senior software engineer at Uber. He holds a PhD in computer science and an MA in applied mathematics, both from the University of California, San Diego.
Randy Wei is a software engineer at Uber. He holds a bachelor’s degree in computer science from the University of California, Berkeley.
©2017, O'Reilly Media, Inc.