Presented By O'Reilly and Cloudera
Make Data Work
31 May–1 June 2016: Training
1 June–3 June 2016: Conference
London, UK

Floating elephants: Developing data wrangling systems on Docker

Chad Metcalf (Docker), Seshadri Mahalingam (Trifacta)
17:25–18:05 Thursday, 2/06/2016
Hadoop internals & development
Location: Capital Suite 15/16 Level: Intermediate
Average rating: 3.67 out of 5 (6 ratings)

Prerequisite knowledge

Attendees should have Docker and Hadoop experience.


If you are building applications for the big data world, you’ll need access to a variety of big data platform configurations during development and continuous integration. These deployments need to scale down to an individual laptop or scale up to a realistic test cluster, and they need to be interchangeable with quick setup and teardown. Chad Metcalf and Seshadri Mahalingam explore several strategies for meeting this challenge by containerizing and deploying Hadoop services in Docker. Chad and Seshadri share lessons learned from solutions that they built, some of which have been open sourced, and dig into topics including how to manage the image lifecycle, configuration, persistent data, multihost networking with Docker Engine and Swarm, and creating different deployment environments with Docker Machine.

Chad Metcalf


Chad Metcalf is a solutions engineering manager for Docker. Previously, Chad worked at Puppet Labs and was an infrastructure engineer at WibiData and Cloudera.

Seshadri Mahalingam


Seshadri Mahalingam is a software engineer at Trifacta, where, in addition to building out Wrangle, Trifacta’s domain-specific language for expressing data transformation, he develops the low-latency compute framework that powers Trifacta’s fluid and immersive data wrangling experience. Seshadri holds a BS in EECS from UC Berkeley, where he cotaught a class on open source software.