Over the last five years, there’s been a loud drumbeat announcing that big data is changing everything. But to the normal folks, the people who don’t have data as their primary product, the technologies that make up the traditional big data suite look so incomprehensibly different that they seem nearly alien. These folks need something to bridge the technological gap into big data, something that feels like normal enterprise data and ETL tools but that can, if needed, scale, interact with, or be pushed out to the cloud. That bridge can be built from a very unexpected tool: the Jupyter Notebook.
A few months ago, Netflix started publishing blog posts about what appeared to be the misuse of a familiar tool: Jupyter Notebook, the CS equivalent of a printing calculator. Instead of thinking of Jupyter only as an interactive programming tool, what if you also took finished notebooks and had a tool that let you run them noninteractively with parameterized inputs? That upside-down use transforms the notebook from an interactive programming environment into a self-documenting ETL tool. Netflix further pointed out that if you have cloud-based glue and scheduling systems (something the company has built internally but hasn’t publicly released), you can scale the system as well.
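As a rough illustration of running a finished notebook noninteractively with parameterized inputs, the sketch below builds a command line for papermill, the tool named later in this abstract. It is a minimal sketch, not Netflix's or Samtec's actual setup: the notebook file names and the `run_date` and `batch_size` parameters are hypothetical, and it assumes the notebook has a cell tagged `parameters` that holds default values.

```python
import shlex
import subprocess


def papermill_command(input_nb, output_nb, parameters):
    """Build the argv for one noninteractive, parameterized notebook run."""
    argv = ["papermill", input_nb, output_nb]
    for name, value in parameters.items():
        # "-p NAME VALUE" overrides a default from the cell tagged "parameters".
        argv += ["-p", name, str(value)]
    return argv


def run_notebook(input_nb, output_nb, parameters):
    """Execute the notebook; assumes papermill is installed and on PATH."""
    subprocess.run(papermill_command(input_nb, output_nb, parameters), check=True)


# Hypothetical example values, for illustration only.
cmd = papermill_command(
    "daily_etl.ipynb",                  # the finished notebook (template)
    "runs/daily_etl_2019-07-01.ipynb",  # executed copy, kept as the run record
    {"run_date": "2019-07-01", "batch_size": 500},
)
print(shlex.join(cmd))
```

The executed copy contains the injected parameters alongside every cell's output, which is what makes the run self-documenting: the output notebook doubles as the log for that run.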
Mike Lutz explains how Samtec (a midsize manufacturing company) read this and was thrilled: it was a way to move its Python-ETL-writing developers directly into the cloud. There was one problem, though: Netflix didn’t explain how a small company would do the glue and scheduling. Mike details the open source infrastructure Samtec assembled to fill the gaps in the Netflix Jupyter system and make it work for small groups, using Jupyter/JupyterHub, nteract (Netflix) papermill, Apache Airflow, Docker (optionally Kubernetes), a cloud data service (S3), and cloud compute and VPN services (AWS EC2 and VPN).
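To make the glue between these pieces more concrete, here is a hedged sketch of what a single scheduled task might compute before handing work to papermill: a template notebook read from S3 and a dated location for the executed copy. The bucket name, the `notebooks/` and `runs/` prefixes, and the `run_date` parameter are all assumptions made for illustration, not Samtec's actual layout. It also assumes papermill is installed with its S3 support so that `s3://` paths are accepted.

```python
from datetime import date


def output_uri(bucket, notebook_name, run_date):
    """Return a dated S3 URI for the executed copy of a notebook."""
    stem = notebook_name.rsplit(".", 1)[0]
    # One folder per day keeps every run's output notebook as a record.
    return f"s3://{bucket}/runs/{run_date:%Y/%m/%d}/{stem}.ipynb"


def scheduled_task_command(bucket, notebook_name, run_date):
    """Build the papermill argv a scheduler task could run for one date."""
    source = f"s3://{bucket}/notebooks/{notebook_name}"
    return [
        "papermill",
        source,
        output_uri(bucket, notebook_name, run_date),
        "-p", "run_date", run_date.isoformat(),
    ]


# Hypothetical bucket and notebook names, for illustration only.
task_cmd = scheduled_task_command("example-etl-bucket", "daily_etl.ipynb", date(2019, 7, 1))
print(" ".join(task_cmd))
```

A scheduler such as Apache Airflow would then run a command like this once per schedule interval, typically inside a Docker container that has the notebook's dependencies installed.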
Mike Lutz is an infrastructure lead at Samtec. Traditionally living in the data communications world, he stumbled into data (and big data) as a way to manage the floods of information that were being generated in his many telemetry and internet of things adventures.
View a complete list of OSCON contacts
©2019, O'Reilly Media, Inc.
Links for topics covered in talk:
If you have any questions about the session, this is a good place to ask.
If you would like to get some extra background in the technologies I’m going to talk about, here are a few other sessions I see on the schedule that look like they might help: