Presented By O'Reilly and Cloudera
December 5–6, 2016: Training
December 6–8, 2016: Tutorials & Conference

Architecting a next-generation data platform for real-time ETL, data analytics, and data warehousing

Mark Grover (Lyft), Ted Malaska (Capital One), Jonathan Seidman (Cloudera)
9:00am–12:30pm Tuesday, December 6, 2016
Hadoop use cases
Location: 310/311 Level: Intermediate
Average rating: 4.75 (4 ratings)

Prerequisite Knowledge

  • An understanding of Hadoop concepts and components in the Hadoop ecosystem
  • Familiarity with traditional data management systems (e.g., relational databases)
  • Knowledge of programming languages and concepts

What you'll learn

  • Understand how new and existing tools in the Hadoop ecosystem can be integrated to implement new types of data processing and analysis
  • Learn considerations and best practices for implementing these applications


Apache Hadoop is rapidly moving from its batch processing roots to a more flexible platform supporting both batch and real-time workloads. Rapid advancements in the Hadoop ecosystem are causing a dramatic evolution in both the storage and processing capabilities of the Hadoop platform. These advancements include projects like:

  • Apache Kudu (incubating), a modern columnar data store that complements HDFS and Apache HBase by offering both efficient analytical scans and fast inserts and updates.
  • Apache Kafka, which provides a high-throughput and highly reliable distributed message transport.
  • Apache Impala (incubating), a highly concurrent, massively parallel processing query engine for Hadoop.
  • Apache Spark, which is rapidly replacing frameworks such as MapReduce for processing data on Hadoop due to its efficient design and optimized use of memory. Spark components such as Spark Streaming and Spark SQL provide powerful near real-time processing, enabling new applications using the Hadoop platform.

While these advancements to the Hadoop platform are exciting, they also expand the array of tools that architects and developers need to understand when building solutions with Hadoop. Mark Grover, Ted Malaska, and Jonathan Seidman explain how to architect a modern, real-time big data platform that leverages recent advancements in the open source software world. They discuss how to use components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics, walking you through an example architecture that can:

  • Accelerate data processing tasks such as ETL and change data capture by building near real-time data pipelines using Kafka, Spark Streaming, and Kudu
  • Build a reliable, efficient data pipeline using Kafka and tools in the Kafka ecosystem, such as Kafka Connect and Kafka Streams, along with Spark Streaming
  • Provide users with fast analytics on data with Impala and Kudu
  • Illustrate how these components complement the batch processing capabilities of Hadoop
  • Leverage these capabilities along with other tools such as Spark MLlib and Spark SQL to provide sophisticated machine learning and analytical capabilities for users
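As a minimal, framework-free sketch of the upsert semantics the change-data-capture pipeline above relies on (Kudu can apply inserts and updates from a stream in place, rather than rewriting immutable files as HDFS requires), consider the following. The event format, keys, and table layout here are hypothetical illustrations, not Kudu's or Kafka's actual APIs:

```python
# Hypothetical sketch: applying a change-data-capture (CDC) stream of
# insert/update/delete events to a keyed table -- the same upsert semantics
# a Kafka -> Spark Streaming -> Kudu pipeline provides. The event format
# and table layout are illustrative only.

def apply_cdc_events(table, events):
    """Apply CDC events, in order, to a dict keyed by primary key."""
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            # Kudu supports an UPSERT-style operation, so inserts and
            # updates can be handled uniformly, merging new column values
            # into the existing row without a full rewrite.
            row = table.setdefault(key, {})
            row.update(event["columns"])
        elif op == "delete":
            table.pop(key, None)
    return table

# A simulated micro-batch of events, of the kind Spark Streaming might
# pull from a Kafka topic before writing results to Kudu.
events = [
    {"op": "insert", "key": 1, "columns": {"name": "alice", "city": "SF"}},
    {"op": "insert", "key": 2, "columns": {"name": "bob", "city": "NYC"}},
    {"op": "update", "key": 1, "columns": {"city": "Chicago"}},
    {"op": "delete", "key": 2},
]

table = apply_cdc_events({}, events)
print(table)  # {1: {'name': 'alice', 'city': 'Chicago'}}
```

In the real pipeline, Kafka provides the ordered, replayable event log, Spark Streaming processes events in micro-batches, and Kudu absorbs the resulting upserts while remaining immediately queryable by Impala.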

Along the way, Mark, Ted, and Jonathan discuss considerations and best practices for using these components to implement solutions, cover common challenges and how to address them, and offer practical advice for building your own modern, real-time big data architectures.


Mark Grover

Lyft


Mark Grover is a product manager at Lyft. Mark’s a committer on Apache Bigtop, a committer and PPMC member on Apache Spot (incubating), and a committer and PMC member on Apache Sentry. He’s also contributed to a number of open source projects, including Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume. He’s a coauthor of Hadoop Application Architectures and wrote a section in Programming Hive. Mark is a sought-after speaker on topics related to big data. He occasionally blogs on topics related to technology.


Ted Malaska

Capital One

Ted Malaska is a director of enterprise architecture at Capital One. Previously, he was the director of engineering in the Global Insight Department at Blizzard; principal solutions architect at Cloudera, helping clients find success with the Hadoop ecosystem; and a lead architect at the Financial Industry Regulatory Authority (FINRA). He has contributed code to Apache Flume, Apache Avro, Apache YARN, Apache HDFS, Apache Spark, Apache Sqoop, and many more. Ted is a coauthor of Hadoop Application Architectures, a frequent speaker at many conferences, and a frequent blogger on data architectures.


Jonathan Seidman

Cloudera


Jonathan Seidman is a software engineer on the cloud team at Cloudera. Previously, he was a lead engineer on the big data team at Orbitz, helping to build out the Hadoop clusters supporting the data storage and analysis needs of one of the most heavily trafficked sites on the internet. Jonathan is a cofounder of the Chicago Hadoop User Group and the Chicago Big Data Meetup and a frequent speaker on Hadoop and big data at industry conferences such as Hadoop World, Strata, and OSCON. Jonathan is the coauthor of Hadoop Application Architectures from O’Reilly.