Presented By O'Reilly and Cloudera
Make Data Work
September 26–27, 2016: Training
September 27–29, 2016: Tutorials & Conference
New York, NY

Hadoop application architectures: Architecting a next-generation data platform for real-time ETL, data analytics, and data warehousing

Jonathan Seidman (Cloudera), Mark Grover (Lyft), Ted Malaska (Capital One)
1:30pm–5:00pm Tuesday, 09/27/2016
Hadoop use cases
Location: Hall 1C Level: Intermediate
Average rating: ★★★★ (4.08, 13 ratings)

Prerequisite knowledge

  • An understanding of Hadoop concepts and components in the Hadoop ecosystem
  • Familiarity with traditional data management systems (e.g., relational databases)
  • Knowledge of programming languages and concepts

What you'll learn

  • Understand how new and existing tools in the Hadoop ecosystem can be integrated to implement new types of data processing and analysis
  • Learn considerations and best practices for implementing these applications

Description

    Apache Hadoop is rapidly moving from its batch processing roots to a more flexible platform supporting both batch and real-time workloads. Rapid advancements in the Hadoop ecosystem are causing a dramatic evolution in both the storage and processing capabilities of the Hadoop platform. These advancements include projects like:

    • Apache Kudu (incubating)—a modern columnar data store that complements HDFS and Apache HBase by offering efficient analytical capabilities along with fast inserts and updates on Hadoop
    • Apache Kafka—which provides a high-throughput and highly reliable distributed message transport
    • Apache Impala (incubating)—a highly concurrent, massively parallel processing query engine for Hadoop
    • Apache Spark—which is rapidly replacing frameworks such as MapReduce for processing data on Hadoop due to its efficient design and optimized use of memory (Spark components such as Spark Streaming and Spark SQL provide powerful near-real-time processing, enabling new applications using the Hadoop platform.)

    While these advancements to the Hadoop platform are exciting, they add a new array of tools that architects and developers need to understand when architecting solutions with Hadoop. Jonathan Seidman, Gwen Shapira, Mark Grover, and Ted Malaska explain how to leverage components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics, such as real-time ETL, change data capture, and machine learning. They walk attendees through an example architecture that provides the following capabilities:

    • Accelerating data processing tasks such as ETL and change data capture by building near-real-time data pipelines using Kafka, Spark Streaming, and Kudu
    • Building a reliable, efficient data pipeline using Kafka and tools in the Kafka ecosystem, such as Kafka Connect and Kafka Streams, along with Spark Streaming
    • Providing users with fast analytics on data with Impala and Kudu
    • Illustrating how these components complement the batch processing capabilities of Hadoop
    • Leveraging these capabilities along with other tools such as Spark MLlib and Spark SQL to provide sophisticated machine learning and analytical capabilities for users

    Along the way, Jonathan, Gwen, Mark, and Ted discuss considerations and best practices for utilizing these components to implement solutions, cover common challenges and how to address them, and provide practical advice for building your own modern, real-time big data architectures.
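    The change-data-capture pattern described above can be sketched in miniature. The following is a conceptual, illustrative example only: a plain Python dict keyed by primary key stands in for a Kudu table, and a list of change events stands in for a Kafka topic consumed by a Spark Streaming job. All names (`ChangeEvent`, `apply_changes`) are hypothetical and not part of any of these projects' APIs.

```python
from dataclasses import dataclass
from typing import Dict, Iterable

@dataclass
class ChangeEvent:
    """One change record, as it might arrive on a Kafka topic."""
    op: str    # "upsert" or "delete"
    key: str   # primary key of the affected row
    row: dict  # new column values (empty for deletes)

def apply_changes(table: Dict[str, dict],
                  events: Iterable[ChangeEvent]) -> Dict[str, dict]:
    """Apply a stream of change events to a keyed table, newest-wins.

    This mimics what a streaming job does when it upserts into a store
    like Kudu: rows are addressed by primary key, and later events for
    the same key overwrite earlier ones.
    """
    for ev in events:
        if ev.op == "upsert":
            # Merge so a partial update keeps previously seen columns.
            table[ev.key] = {**table.get(ev.key, {}), **ev.row}
        elif ev.op == "delete":
            table.pop(ev.key, None)
    return table

# Example stream: two inserts, a partial update, and a delete.
events = [
    ChangeEvent("upsert", "u1", {"name": "Ada", "city": "NYC"}),
    ChangeEvent("upsert", "u2", {"name": "Grace", "city": "DC"}),
    ChangeEvent("upsert", "u1", {"city": "Chicago"}),  # partial update
    ChangeEvent("delete", "u2", {}),
]
table = apply_changes({}, events)
# table now holds a single row for "u1" with its city updated.
```

    A real pipeline replaces the list with a Kafka consumer, the loop with Spark Streaming micro-batches, and the dict with Kudu upsert/delete operations, but the keyed, newest-wins semantics are the same.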


    Jonathan Seidman


    Jonathan Seidman is a software engineer on the cloud team at Cloudera. Previously, he was a lead engineer on the big data team at Orbitz, helping to build out the Hadoop clusters supporting the data storage and analysis needs of one of the most heavily trafficked sites on the internet. Jonathan is a cofounder of the Chicago Hadoop User Group and the Chicago Big Data Meetup and a frequent speaker on Hadoop and big data at industry conferences such as Hadoop World, Strata, and OSCON. Jonathan is the coauthor of Hadoop Application Architectures from O’Reilly.


    Mark Grover


    Mark Grover is a product manager at Lyft. Mark’s a committer on Apache Bigtop, a committer and PPMC member on Apache Spot (incubating), and a committer and PMC member on Apache Sentry. He’s also contributed to a number of open source projects, including Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume. He’s a coauthor of Hadoop Application Architectures and wrote a section in Programming Hive. Mark is a sought-after speaker on topics related to big data. He occasionally blogs on topics related to technology.


    Ted Malaska

    Capital One

    Ted Malaska is a director of enterprise architecture at Capital One. Previously, he was the director of engineering in the Global Insight Department at Blizzard; principal solutions architect at Cloudera, helping clients find success with the Hadoop ecosystem; and a lead architect at the Financial Industry Regulatory Authority (FINRA). He has contributed code to Apache Flume, Apache Avro, Apache YARN, Apache HDFS, Apache Spark, Apache Sqoop, and many more. Ted is a coauthor of Hadoop Application Architectures, a frequent speaker at many conferences, and a frequent blogger on data architectures.

    Comments on this page are now closed.


    Priyanka Kumar
    11/02/2016 1:07am EDT

    Could you please explain the main components of a Hadoop application?

    Mark Grover
    09/27/2016 12:36pm EDT

    Hi Thai, slides are at

    Thai Truong
    09/27/2016 9:52am EDT

    Could you please share the presentation?

    Mark Grover
    09/27/2016 6:37am EDT

    Hi Mark,
    We will have a demo and the code for the demo is posted at

    We will walk through code samples and show the end result as well.

    However, we strongly believe that our time and the time of our audience is best spent discussing various architectural considerations and choices, and how you go about choosing among them based on your use case; hands-on exercises slow that conversation down. So we'll be focusing on higher-level architectures while still giving the audience tangible recommendations that they can take away and apply to their projects right away.

    Slides are at

    09/27/2016 5:47am EDT

    Are there any hands on labs in this tutorial?