Kevin McCormick tells the story of two approaches used internally at AWS to accelerate new ML algorithm development and to easily package Jupyter notebooks for scheduled execution: custom Jupyter kernels that automatically build Docker containers and dispatch them to either a distributed training service or a job execution environment.
Missing data plagues nearly every data science problem. Often, people just drop or ignore missing data. However, this usually leads to bad results. Matt Brems explains how bad dropping or ignoring missing data can be and teaches you how to handle missing data the right way by leveraging Jupyter notebooks to properly reweight or impute your data.
Will Farr shares examples of Jupyter use within the LIGO and Virgo Scientific Collaborations and offers lessons about the (many) advantages and (few) disadvantages of Jupyter for large, global scientific collaborations. Along the way, Will speculates on Jupyter's future role in gravitational wave astronomy.
Netflix is reimagining what a Jupyter notebook is, who works with it, and what you can do with it. Michelle Ufford shares how Netflix leverages notebooks today and briefly outlines a vision for the future.
Moderated by: Kevin Bates
Data science and analytics departments are now commonplace in enterprises determined to maximize their operations. While Jupyter Notebooks have significantly decreased the cost of admission into this space, enterprises are finding that data science at scale is difficult within the current framework. Jupyter Enterprise Gateway is designed to address these scalability issues for the enterprise.
Over the past 18 months, Ian Allison and James Colliander have deployed Jupyter to more than 8,000 users at universities across Canada. Ian and James offer an overview of the Syzygy platform and explain how they plan to scale and deliver the service nationally and how they intend to make Jupyter integral to the working experience of students, researchers, and faculty members.
Dave Stuart explains how Jupyter was used inside the US Department of Defense and the greater intelligence community to empower thousands of "citizen data scientists" to build and share analytics in order to meet the community’s dynamic challenges.
Michelle Gill explains how data science methodologies and tools can be used to link information from different scientific fields and accelerate discovery in a variety of areas, including the biological sciences.
Laura Noren offers an overview of a research project on the various infrastructure models supporting data science in research settings in terms of funding, educational uses, and research utilization. Laura then shares some of the findings, comparing the national federation model currently established in Canada to the more grassroots efforts in many US universities.
We are generating vast amounts of data, but it's not the data itself that is valuable—it's the information and knowledge that can come from this data. Tracy Teal explains how to bring people to data and empower them to address their questions, reach their potential, and solve issues that are important in science, scholarship, and society.
Kerim Kalafala and Nicholai L'Esperance share their experiences using Jupyter notebooks as a critical aid in designing the next generation of IBM Power and Z processors, focusing on analytics on graphs consisting of hundreds of millions of nodes. Along the way, Kerim and Nicholai explain how they leverage Jupyter notebooks as part of their overall design system.
Scott Sanderson explores how interactivity can and should influence the design of software libraries, details how the needs of interactive users differ from the needs of application developers, and shares techniques for improving the usability of libraries in interactive environments without sacrificing robustness in noninteractive environments.
Massive collections of data on the Earth's changing environment, collected by satellite sensors and generated by Earth system models, are being exposed via web APIs by multiple providers. Tyler Erickson highlights the use of JupyterLab and Jupyter widgets in analyzing complex high-dimensional datasets, providing insights into how our Earth is changing and what the future might look like.
Joel Grus (Allen Institute for Artificial Intelligence)
I have been using and teaching Python for many years. I wrote a best-selling book about learning data science. And here's my confession: I don't like notebooks. (There are dozens of us!) I'll explain why I find notebooks difficult, show how they frustrate my preferred pedagogy, demonstrate how I prefer to work, and discuss what Jupyter could do to win me over.
In an effort to broaden graduates' mathematical toolkit and address gender equity in STEM education, Rob Newton has led the implementation of Python projects across all of his school's ninth-grade math courses. Now every student in the ninth grade completes three Python projects that introduce programming and integrate it with the ideas developed in class.
For the last four years, Douglas Blank has used nothing but Jupyter in the classroom—from a first-year writing course to a course on assembly language, from biology to computer science, from lectures to homework. Join in to learn how Douglas has leveraged Jupyter and discover the successes and failures he experienced along the way. Nicole Petrozzo then offers a student's perspective.
IBM has leveraged the Jupyter stack in many of its products to offer industry-leading and business-critical services to its clients. Luciano Resende explores some of the open source initiatives that IBM is leading in the Jupyter ecosystem to address enterprise requirements in the community.
Gerald Rouselle reviews some of the trends in modern data and analytics ecosystems for large enterprises and shares some of the key challenges and opportunities for Jupyter adoption. He also details some recent examples and experiments in incorporating Jupyter in commercial products and platforms.
David Schaaf explains how data science and data engineering can work together in cross-functional teams—with Jupyter notebooks at the center of collaboration and the analytic workflow—to more effectively and more quickly deliver results to decision makers.
Jupyter is built on a set of extensible, reusable building blocks, expressed through various open protocols, APIs, and standards. For many use cases, these are combined to provide an extensible software architecture for interactive computing with data. Paco Nathan shares a few somewhat unexpected things that emerged in 2018.
Jupyter's straightforward, out-of-the-box experience has been important for its success in widespread adoption. But good defaults only go so far. Join Afshin Darian, M Pacer, Min Ragan-Kelley, and Matthias Bussonnier to go beyond the defaults and make Jupyter your own.
The Data Science Modules program at UC Berkeley creates short explorations into data science using notebooks to allow students to work hands-on with a dataset relevant to their course. Mariah Rogers, Ronald Walker, and Julian Kudszus explain the logistics behind such a program and the indispensable features of JupyterHub that enable such a unique learning experience.
Keynote by Dan Romuald Mbanga
Christopher Cho demonstrates how Kubernetes can be easily leveraged to build a complete deep learning pipeline, including data ingestion and aggregation, preprocessing, ML training, and serving with the mighty Kubernetes APIs.
The nbinteract package converts Jupyter notebooks with widgets into interactive, standalone HTML pages. Its built-in support for function-driven plotting makes authoring interactive pages simpler by allowing users to focus on data, not callbacks. Sam Lau and Caleb Siu offer an overview of nbinteract and walk you through the steps to publish an interactive web page from a Jupyter notebook.
Climate science is being flooded with petabytes of data, overwhelming traditional modes of data analysis. The Pangeo project is building a platform to take big data climate science into the cloud using SciPy and large-scale interactive computing tools. Join Ryan Abernathey and Yuvi Panda to find out what the Pangeo team is building and why and learn how to use it.
Hundreds of PayPal's data scientists, analysts, and developers use Jupyter to access data spread across filesystem, relational, document, and key-value stores, enabling complex analytics and an easy way to build, train, and deploy machine learning models. Romit Mehta and Praveen Kanamarlapudi explain how PayPal built its Jupyter infrastructure and powerful extensions.
The key to successful threat detection in cybersecurity is fast response. George Williams, Harini Kannan, and Alex Comerford offer an overview of specialized extensions they have built for data scientists working in cybersecurity that can be used and deployed via JupyterHub.
Reproducible data is essential for notebooks that work across time, across contributors, and across machines. Jackson Brown and Aneesh Karve demonstrate how to use an open source data registry to create reproducible data dependencies for Jupyter and share a case study in open science over terabyte-size image datasets.
In silico prediction of chemical properties has seen vast improvements in both veracity and volume of data but is currently hamstrung by a lack of transparent, reproducible workflows coupled with environments for visualization and analysis. Chris Harris offers an overview of a platform that uses Jupyter notebooks to enable an end-to-end workflow from simulation setup to visualizing the results.
Sandra Savchenko-de Jong offers an overview of Renku, a highly scalable and secure open software platform designed to make (data) science reproducible, foster collaboration between scientists, and share resources in a federated environment.
Using papermill, an nteract project, Matthew Seal walks you through how Netflix uses notebooks to track user jobs and provide a simple interface for work submission. You’ll get an inside peek at how Netflix is tackling the scheduling problem for a range of users who want easily managed workflows.
Fernando Pérez (UC Berkeley and Lawrence Berkeley National Laboratory)
In 2018, UC Berkeley launched a new major in data science, anchored by two core courses that are the fastest-growing in the history of the university. Fernando Pérez discusses the program and explains how the core courses, which now reach roughly 40% of the campus population, are extending data science into specific domains that cover virtually all disciplinary areas of the campus.
Vijay Reddy walks you through the process of building machine learning models with TensorFlow. You'll learn about data exploration, feature engineering, model creation, training, evaluation, deployment, and more.
Bo Peng (The University of Texas, MD Anderson Cancer Center)
Bo Peng offers an overview of Script of Scripts (SoS), a Python 3-based workflow engine with a Jupyter frontend that allows the use of multiple kernels in one notebook. This unique combination enables users to analyze data using multiple scripting languages in one notebook and, if needed, convert scripts to workflows in situ to analyze large amounts of data on remote systems.
David Koop (University of Massachusetts Dartmouth)
Dataflow notebooks build on the Jupyter Notebook environment by adding constructs to make dependencies between cells explicit and clear. David Koop offers an overview of the Dataflow kernel, shows how it can be used to robustly link cells as a notebook is developed, and demonstrates how that notebook can be reused and extended without impacting its reproducibility.
New challenges are emerging for Jupyter, open information, and investing in the future. You, the innovators of this growing knowledge commons, will determine how we meet these challenges and sustain the ecosystem. Carol Willing shows how you can start.
SWAN, CERN’s service for web-based analysis, leverages the power of Jupyter to provide the high energy physics community access to state-of-the-art infrastructure and services through a web service. Diogo Castro offers an overview of SWAN and explains how researchers and students are using it in their work.
Stephanie Stattel and Paul Ivanov walk you through a series of extensions that demonstrate the power and flexibility of JupyterLab’s architecture, from targeted functionality modifications to more extreme atmospheric changes that require extensive decoupling and flexibility within JupyterLab.
John Miller offers an overview of the Emacs IPython Notebook (EIN), a full-featured client for the Jupyter Notebook in Emacs, and shares a brief history of its development.
Drawing on his experience with the Pangeo project, Ryan Abernathey makes the case for the large-scale migration of scientific data and research to the cloud. The cloud offers a way to make the largest datasets instantly accessible to the most sophisticated computational techniques. A global scientific data commons could usher in a golden age of data-driven discovery.
Mark Hansen (Columbia Journalism School | The Brown Institute for Media Innovation)
Without question, data, code, and algorithms are forming systems of power in our society, well beyond Twitter, Facebook, and similar networks. Mark Hansen explains why it is crucial that journalists, the explainers of last resort, be able to interrogate these systems, holding power to account.
In Capital One's recent exploration of "notebook" offerings, JupyterHub emerged as a top contender that could serve as a potential platform for analytics even in highly regulated industries like financial services. David Schaaf and Shivraj Ramanan discuss Capital One's journey and explain how Jupyter has become a part of the company's ever-growing analytics toolkit.
Satellite imagery can be a critical resource during disasters and humanitarian crises. While the community has improved data sharing, we still struggle to create reusable data science to solve problems on the ground. Sean Gorman offers an overview of GBDX Notebooks, a step toward creating an open data science community built around Jupyter to stream imagery and share analysis at scale.
MapD Core is an open source analytical SQL engine that has been designed from the ground up to harness the parallelism inherent in GPUs. This enables queries on billions of rows of data in milliseconds. Randy Zwitch offers an overview of the MapD kernel extension for the Jupyter Notebook and explains how to use it in a typical machine learning workflow.
Nicolas Fernandez offers an overview of Clustergrammer-Widget, an interactive heatmap Jupyter widget that enables users to easily explore high-dimensional data within a Jupyter notebook and share their interactive visualizations using nbviewer.
Chakri Cherukuri offers an overview of the interactive widget ecosystem available in the Jupyter notebook and illustrates how Jupyter widgets can be used to build rich visualizations of machine learning models. Along the way, Chakri walks you through algorithms like regression, clustering, and optimization and shares a wizard for building and training deep learning models with diagnostic plots.
Julia Meinwald outlines a few effective ways Two Sigma has identified to support the unseen labor maintaining a healthy open source ecosystem and details how the company’s thinking on this topic has evolved.