We will describe our experiences implementing a full-scale application on a large anonymised dataset from the mobile operator Telefonica. In the course of building this application, we faced and solved classic ETL and aggregation problems in a Map-Reduce setting. More importantly, we also developed methods for integrating diverse tools, including MongoDB, the statistical system R, Hadoop Map-Reduce and a project-optimized database known as jumboDB. Throughout, we used primitive operations based on Map-Reduce as the core element of computation.
The lessons learned in this project have broad implications beyond it. Our project was unusual in the breadth of techniques used and in the diversity of its goals. We will describe which forms of Map-Reduce worked well and which did not, and which sorts of problems were appropriate for Map-Reduce and which were not. Having a single programming model was useful for us in this project, but it was also an impediment in some respects. We will offer our perspective based on this project and examine how upcoming technologies would have affected our efforts.
Cindy Lamm works as a Data Scientist at comSysto GmbH in Munich, Germany, where she focuses on combining data analysis (mostly done with R) with software development (Python or Java) in an agile environment. She holds an M.Sc. in Statistics from HU Berlin and a Diplôme Statisticien Economiste from ENSAE Paris.
Big Data geek, developer and advocate
Michael works at MapR Technologies as Chief Data Engineer EMEA, where he helps people tap the potential of Big Data. He has a background in large-scale data integration, the Internet of Things, and Web applications, and is experienced in advocacy and standardisation. Michael has used NoSQL datastores and Hadoop in a number of use cases, and he shares his experiences with polyglot persistence at public events and via blogs. Michael contributes to Apache Drill, a distributed system for interactive, ad-hoc analysis and query of large-scale datasets.