In this session, we’ll follow the flow of data through an end-to-end system built to handle tens of terabytes an hour of event-oriented data, providing real-time streaming, in-memory, SQL, and batch access to this data. We’ll go into detail on how open source systems such as Hadoop, Kafka, Solr, and Impala/Hive can be stitched together to form the base platform; describe how and where to perform data transformation and aggregation; provide a simple and pragmatic way of managing event metadata; and talk about how applications built on top of this platform get access to data and extend its functionality.
Attendees will leave this session knowing not just which open source projects go into a system such as this, but how they work together, what tradeoffs and decisions need to be addressed, and how to present a single general-purpose data platform to multiple applications. This session is intended for data infrastructure engineers and Ops engineers planning, building, or maintaining similar systems, or those looking to centralize and correlate user activity, quality of service, operational, and other forms of data.
Eric Sammer is the CTO and co-founder of ScalingData. Prior to ScalingData, he was an engineering manager at Cloudera. His background is in the development and operations of distributed, highly concurrent data ingest and processing systems. He’s been involved in the open source community and has contributed to a large number of projects over the last decade. Eric is the author of Hadoop Operations (O’Reilly).
Learn more about Hadoop Operations: http://oreil.ly/1I0ddf6