Standalone MapReduce addressed a lot of early needs in large-scale data processing, but its batch-oriented, offline nature now suits a narrow set of use cases. Companies also need to process real-time data streams, and plain MapReduce won’t cut it.
One solution is the lambda architecture. Popularized by Nathan Marz in his book Big Data, the lambda architecture builds on plain MapReduce to support scalable, fault-tolerant, real-time computation across streaming data. This permits us to have our cake and eat it, too, as it leverages our existing Hadoop cluster for both batch and real-time work.
In this talk, Flip Kromer and Q Ethan McCallum will explore what, how, and (perhaps most importantly) why to adopt the lambda architecture to address your data needs. They will use a live-updating recommendation engine as the supporting example.
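The core idea of the lambda architecture — a batch layer that periodically recomputes views over the full dataset, a speed layer that absorbs events in real time, and a serving layer that merges the two at query time — can be sketched in a few lines. This is a minimal illustration of the pattern only, not code from the talk; all names (`batch_layer`, `SpeedLayer`, `query`) are hypothetical, and a real deployment would use Hadoop for the batch layer and Storm/Kafka for the speed layer.

```python
# Minimal sketch of the lambda architecture, assuming a simple
# count-per-key metric. All names here are illustrative.
from collections import Counter

def batch_layer(master_dataset):
    """Recompute the batch view from the immutable master dataset
    (the role Hadoop/MapReduce plays in the real architecture)."""
    return Counter(event["key"] for event in master_dataset)

class SpeedLayer:
    """Incrementally absorb events that arrived after the last
    batch run (the role Storm plays in the real architecture)."""
    def __init__(self):
        self.view = Counter()

    def absorb(self, event):
        self.view[event["key"]] += 1

def query(batch_view, speed_view, key):
    """Serving layer: merge batch and real-time views at read time."""
    return batch_view[key] + speed_view[key]

# Usage: two clicks in the master dataset, one more arriving live.
master = [{"key": "clicks"}, {"key": "clicks"}, {"key": "views"}]
bv = batch_layer(master)
speed = SpeedLayer()
speed.absorb({"key": "clicks"})  # not yet in any batch run
print(query(bv, speed.view, "clicks"))  # → 3
```

The batch view is cheap to rebuild from scratch (fault tolerance), while the speed layer only ever holds the small slice of data newer than the last batch run — which is how the architecture gets both accuracy and low latency from one Hadoop cluster.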
I’m a Distinguished Engineer at CSC and co-founder, CTO, and chief architect of Infochimps, a CSC Big Data Business and the leading big data platform in the cloud. At Infochimps we built a scalable architecture that allows app programmers and statisticians to quickly and confidently manipulate data streams at arbitrary scale: terabytes in size, thousands of events per second, dozens of disparate data sources. We use a mixture of Hadoop, Elasticsearch, Storm/Kafka, Goliath, and other industrial-strength solutions.
As part of this work, I’ve authored several successful open-source projects, including Wukong (the most-used framework for Ruby streaming in Hadoop), Ironfan (cloud orchestration capable of spinning up clusters large or small at the push of a button), and Configliere (Ruby configuration made easy). I am also a core committer to Goliath (a liquid-fast concurrent web framework) and Storm (an open-source streaming-analytics platform emerging as a core piece of the Big Data stack).
I am the author of “Big Data for Chimps”, a book on data science in practice for O’Reilly (j.mp/bigdata4cbook). I have spoken at South by Southwest, Hadoop World, Strata, NIST, and CloudCon, and contributed a case study chapter to “Hadoop: The Definitive Guide”.
Q Ethan McCallum works as a professional-services consultant, speaker, and writer with a focus on strategic matters around data and technology. He is especially interested in helping companies build and shape their internal analytics practice.
Q’s speaking engagements include conferences, meetups, and training events. His published work includes Business Models for the Data Economy and Bad Data Handbook: Mapping the World of Data Problems. He is currently working on a new book on time series analysis, and another on building analytics shops (Making Analytics Work). He is also engaged in a number of projects, ranging from open-government/civic-data collaborations to open-source software tools.