Fueling innovative software
July 15-18, 2019
Portland, OR

Big data for the small fry

Mike Lutz (Samtec)
5:05pm–5:45pm Thursday, July 18, 2019
The Next Architecture
Location: E143/144
Average rating: 4.50 (2 ratings)

Who is this presentation for?

  • IT and data people in smaller non-data-focused companies

Level

Beginner

Description

Over the last five years, there’s been a loud drumbeat announcing that big data is changing everything. But to the normal folks, the people who don’t have data as their primary product, the technologies that make up the traditional big data suite look so incomprehensibly different that they seem nearly alien in nature. The normal folks needed something to bridge the technological gap: something that felt like familiar enterprise data and ETL tools but that could, if needed, scale, interact with the cloud, and be pushed out to it. That bridge can be built from a very unexpected tool: the Jupyter Notebook.

A few months ago, Netflix began publishing blog posts about what appeared to be the misuse of a familiar tool: the Jupyter Notebook, the CS equivalent of a printing calculator. Instead of thinking of Jupyter only as an interactive programming tool, what if you took finished notebooks and had a tool that could run them noninteractively while providing parameterized inputs? That upside-down use transforms a notebook from an interactive programming environment into a self-documenting ETL tool. Netflix further pointed out that if you have cloud-based glue and scheduling systems (something the company built internally but hasn’t publicly released), you can then scale the system as well.

Mike Lutz explains how Samtec (a midsize manufacturing company) read this and was thrilled with the solution: it was a way the company could move its Python-ETL-writing developers directly into the cloud. Except for one problem: Netflix didn’t explain how a small company would do the glue and scheduling. Mike details the open source infrastructure Samtec assembled to fill the gaps in the Netflix Jupyter system and make it work for small groups, using Jupyter/JupyterHub, papermill (from Netflix’s nteract project), Apache Airflow, Docker (optionally Kubernetes), a cloud data service (Amazon S3), and cloud compute and VPN services (AWS EC2 plus a VPN).
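One way the "glue and scheduling" gap can be filled is an Airflow DAG that invokes papermill on a schedule. The sketch below is a hypothetical scheduling-config fragment, not Samtec's actual setup: the DAG id, paths, schedule, and parameter names are all assumptions, and it uses the Airflow 1.10-era import paths current when this talk was given.

```python
# Hypothetical sketch: an Airflow 1.10-era DAG that runs a parameterized
# notebook nightly via papermill. DAG id, paths, and schedule are assumptions.
from datetime import datetime

import papermill as pm
from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def run_notebook(**context):
    # Each run writes a dated, fully rendered notebook: a built-in audit log.
    pm.execute_notebook(
        "/notebooks/etl_template.ipynb",
        "/notebooks/runs/etl_{}.ipynb".format(context["ds"]),
        parameters={"run_date": context["ds"]},
    )


with DAG(
    dag_id="notebook_etl",
    start_date=datetime(2019, 7, 1),
    schedule_interval="@daily",
) as dag:
    PythonOperator(
        task_id="run_etl_notebook",
        python_callable=run_notebook,
        provide_context=True,  # Airflow 1.x needs this to pass context kwargs
    )
```

Running the Airflow workers in Docker (or on Kubernetes) and pointing the notebooks at S3 for storage is what lets the same pattern scale from a single box to the cloud.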

Prerequisite knowledge

  • A basic understanding of your company's data
  • Experience with scripting languages (e.g., Python)

What you'll learn

  • Understand how a company of any size can start leveraging big data and AI using open source tools, and how even small, non-data-centric groups can get into big data via Jupyter today
  • Learn how Jupyter + papermill + Airflow + cloud compute and storage help bridge the gap to big data

Mike Lutz

Samtec

Mike Lutz is an infrastructure lead at Samtec. Traditionally living in the data communications world, he stumbled into data (and big data) as a way to manage the floods of information that were being generated in his many telemetry and internet of things adventures.


Comments

Mike Lutz | Infrastructure Lead
07/18/2019 8:36am PDT

Links for topics covered in talk:

Mike Lutz | Infrastructure Lead
06/06/2019 11:15pm PDT

If you have any questions about the session, this is a good place to ask.

If you would like to get some extra background in the technologies I’m going to talk about, here are a few other sessions I see on the schedule that look like they might help: