Presented By O'Reilly and Cloudera
Make Data Work
22–23 May 2017: Training
23–25 May 2017: Tutorials & Conference
London, UK

Near-real-time ingest with Apache Flume and Apache Kafka at 1 million-events-per-second scale

Tristan Stevens (Cloudera)
14:5515:35 Wednesday, 24 May 2017
Level: Intermediate
Average rating: ***..
(3.67, 3 ratings)

Who is this presentation for?

  • Solution architects, SIEM specialists, and developers

Prerequisite knowledge

  • A basic understanding of Flume and Kafka

What you'll learn

  • Learn how Vodafone architected, built, and tuned a near-real-time ingest pipeline for SIEM or IoT applications at million(s)-of-events-per-second throughput


There are two large obstacles to collecting metadata from a network as large as Vodafone’s (the UK’s second-largest telecoms provider): transporting the sheer volume of data (cumulative bandwidth) and processing it before the data no longer accurately reflects the state of the network (cumulative delay). Fortunately, combining Apache Flume and Apache Kafka using the Flafka pattern provides a means to move data into the EDH (Hadoop cluster) and readily scale the pipeline to address both transient and persistent spikes in data volume.

Flume and Kafka are both capable of high-performance, low-latency event processing; however, careful tuning is required in order to achieve performance at this scale. Vodafone has deployed Flume and Kafka across the UK network in a geographically distributed architecture that achieves scale and resilience, having been tuned from around 10,000 events per second on initial deployment to 1,000,000 events per second using a three-node Kafka cluster. Tristan Stevens discusses the architecture, deployment, and performance-tuning techniques that enable the system to perform at IoT-scale on modest hardware and at a very low cost.

Topics include:

  • Achieving a high throughput into HDFS and Solr for analysis
  • Configuring Flume to consume events from Kafka at an appropriate rate
  • Configuring Flume to publish to Kafka at a rate and in a fashion that allows highly parallel consumption
  • Impact on network configuration in the Kafka cluster
  • Achieving a balanced configuration such that the producing and consuming side are able to operate at the same rate
  • Automating the performance testing process and approaches therein
Photo of Tristan Stevens

Tristan Stevens


Tristan Stevens is a senior solutions architect at Cloudera, where he helps clients across EMEA with their Hadoop implementations. Tristan’s background is in the UK defence sector. He has also worked on large-scale, highly available, business-critical analytics platforms, with more recent experience in gaming, telecoms, and financial services.

Comments on this page are now closed.


Binyamin Bazomnik | C4I OFFICER
25/05/2017 11:06 BST

Hello Tristan, Presentaion was really cool, can we please get the slides? – thank you!