Today, Metamarkets processes over 300 billion events per day—over 100 TB going through a single pipeline built entirely on open source technologies such as Druid, Kafka, and Samza. Working at such a scale presents engineering challenges on many levels, not just in terms of design but also in terms of operations, especially when downtime is not an option.
Xavier Léauté explores how Metamarkets used Kafka and Samza to build a multitenant pipeline to perform streaming joins and transformations of varying degrees of complexity and then push data into Druid to make it available for immediate, interactive analysis at a rate of several hundreds of concurrent queries per second. But as data grew an order of magnitude in the span of a few months, all systems involved started to show their limits. Xavier describes the challenges around scaling this stack and explains how the team overcame them, using extensive metric collection to manage both performance and costs and how they handle very heterogeneous processing workloads while keeping down operational complexity.
Xavier Léauté is a software engineer at Confluent as well as a founding Druid committer and PMC member. Prior to his current role he headed the backend engineering team at Metamarkets.
©2016, O’Reilly UK Ltd • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • firstname.lastname@example.org
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse, or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.