Making Open Work
May 8–9, 2017: Training & Tutorials
May 10–11, 2017: Conference
Austin, TX

Instant and repeatable data platforms

Heather Nelson (Silicon Valley Data Science), Gary Dusbabek (Silicon Valley Data Science)
11:50am–12:30pm Thursday, May 11, 2017
Data, Big and Small
Location: Ballroom F
Level: Intermediate
Average rating: 4.00 (5 ratings)

Who is this presentation for?

  • Data engineers, DevOps professionals, and infrastructure engineers

Prerequisite knowledge

  • Familiarity with cluster products, such as Hadoop and Kafka, as well as AWS, Azure, Terraform, Vagrant, and Ansible
  • A working knowledge of Linux, shell scripting, Python, and JSON

What you'll learn

  • Explore use cases for an instant and repeatable data platform, such as bringing up the same cluster repeatedly or disaster recovery, as well as the development and release process, including integration testing
  • Learn how to parameterize your cloud environment, create a data lab with all the tools data scientists need for exploration, and model costs in real time to weigh price against desired performance

Description

Configuring a data platform and data science environment can be a tedious, error-prone process that is repeated for development, continuous integration, QA, staging, and production, often from scratch. Heather Nelson and Gary Dusbabek explain how to create a cloud-agnostic environment, combining cloud platforms such as AWS or Azure with Terraform and Ansible, that spins up quickly and is easy to configure as required. They discuss their “push button” infrastructure tool, which they will be open sourcing, and demonstrate how you can use it in your own projects.

Topics include:

  • Use cases, such as bringing up the same cluster repeatedly or disaster recovery
  • How to parameterize your cloud environment
  • Creating a data lab with all the tools data scientists need for exploration
  • The development and release process, including integration testing
  • How to model costs in real time to weigh price against desired performance
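
As a rough illustration of two of these topics, parameterizing an environment and modeling its cost, here is a minimal Python sketch. It is not the speakers' tool: the ClusterSpec fields, the "example.large" instance type, and the hourly prices are all hypothetical placeholders. It assumes only that the parameters could be saved as a JSON variables file (for example, a .tfvars.json file that Terraform can read, provided matching variables are declared in the configuration).

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ClusterSpec:
    """Hypothetical parameters for one environment (not the speakers' schema)."""
    environment: str    # e.g. "dev", "staging", "prod"
    cloud: str          # e.g. "aws" or "azure"
    instance_type: str  # placeholder name, not a real instance type
    node_count: int


# Placeholder hourly prices in USD, invented for illustration only.
HOURLY_PRICE_USD = {
    ("aws", "example.large"): 0.10,
    ("azure", "example.large"): 0.12,
}


def to_tfvars_json(spec: ClusterSpec) -> str:
    """Serialize the spec as JSON suitable for saving as a variables file."""
    return json.dumps(asdict(spec), indent=2, sort_keys=True)


def hourly_cost(spec: ClusterSpec) -> float:
    """Estimate hourly cost as node count times the per-node unit price."""
    return spec.node_count * HOURLY_PRICE_USD[(spec.cloud, spec.instance_type)]


# The same definition, stamped out for two environments of different size.
dev = ClusterSpec("dev", "aws", "example.large", 3)
prod = ClusterSpec("prod", "aws", "example.large", 12)

for spec in (dev, prod):
    print(f"{spec.environment}: ${hourly_cost(spec):.2f}/hour")
    print(to_tfvars_json(spec))
```

Keeping every environment's differences in one small data structure is what makes a cluster repeatable in this sketch: the same spec always produces the same variables file, and changing node_count or instance_type shows the effect on the cost estimate before anything is provisioned.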

Heather Nelson

Silicon Valley Data Science

Heather Nelson is a senior solution architect at Silicon Valley Data Science, where she draws from her diverse background in business and technology consulting to find the best solutions for her clients’ toughest data problems. A problem solver by nature, Heather is passionate about helping organizations leverage data to drive competitive advantage.

Gary Dusbabek

Silicon Valley Data Science

An Apache Cassandra committer and PMC member, Gary Dusbabek specializes in building distributed systems. His recent experience includes creating an open source high-volume metrics processing pipeline and building out several geographically distributed API services in the cloud.