Data, Data Everywhere and Only Map-Reduce to Drink

Cindy Lamm (comSysto GmbH), Michael Hausenblas (AWS)
Hadoop & Beyond
Location: 120-121

We will describe our experiences implementing a full-scale application on a large anonymised dataset from the mobile operator Telefonica. In the course of building this application, we faced and solved classic ETL and aggregation problems in a Map-Reduce setting. More importantly, we also developed methods for integrating diverse tools, including MongoDB, the statistical system R, Hadoop Map-Reduce, and a project-optimized database known as jumboDB. Throughout, we used primitive operations based on Map-Reduce as the core element of computation.

The lessons learned have broad implications beyond this single project, which was unusual both in the breadth of techniques used and in the diversity of its goals. We will describe which forms of Map-Reduce worked well and which did not, and which sorts of problems were appropriate for Map-Reduce and which were not. Having a single programming model was useful for us, but it was also an impediment in some respects. We will offer our perspective based on this experience and examine how upcoming technologies would have affected our efforts.

Cindy Lamm

comSysto GmbH

Cindy Lamm works as a Data Scientist at comSysto GmbH in Munich, Germany, where she focuses on combining data analysis (mostly done with R) with software development (Python or Java) in an agile environment. She holds an M.Sc. in Statistics from HU Berlin and a Diplôme Statisticien Economiste from ENSAE Paris.

Michael Hausenblas


Big Data geek, developer and advocate

Michael works at MapR Technologies in the role of Chief Data Engineer EMEA, where he helps people to tap the potential of Big Data. He has a background in large-scale data integration, the Internet of Things, and Web applications and is experienced in advocacy and standardisation. Michael has been using NoSQL datastores and Hadoop in a number of use cases and he shares his experiences with polyglot persistence at public events and via blogs. Michael contributes to Apache Drill, a distributed system for interactive, ad-hoc analysis and query of large-scale datasets.