Presented By O'Reilly and Cloudera
Make Data Work
September 25–26, 2017: Training
September 26–28, 2017: Tutorials & Conference
New York, NY

Architecting a next-generation data platform

Jonathan Seidman (Cloudera), Gwen Shapira (Confluent), Mark Grover (Lyft)
1:30pm–5:00pm Tuesday, September 26, 2017
Secondary topics: Architecture
Average rating: 4.11 (9 ratings)

Who is this presentation for?

  • Developers, technical leads, software architects, and project leads

Prerequisite knowledge

  • An understanding of Hadoop concepts and the Hadoop ecosystem, traditional data management systems (e.g., relational databases), and programming languages and concepts

What you'll learn

  • Discover how new and existing tools in the Hadoop ecosystem can be integrated to implement new types of data processing and analysis
  • Learn considerations and best practices for implementing these applications


Rapid advancements are driving a dramatic evolution in both the storage and processing capabilities of the open source big data software ecosystem. These advancements include projects like:

  • Apache Kudu, a modern columnar data store that complements HDFS and Apache HBase by offering efficient analytical capabilities and fast inserts and updates with Hadoop;
  • Apache Kafka, which provides a high-throughput and highly reliable distributed message transport;
  • Apache Impala (incubating), a highly concurrent, massively parallel processing query engine for Hadoop;
  • Apache Spark, which is rapidly replacing frameworks such as MapReduce for processing data on Hadoop due to its efficient design and optimized use of memory. Spark components such as Spark Streaming and Spark SQL provide powerful near real-time processing.

Along with the Apache Hadoop platform, these storage and processing systems provide a powerful foundation for implementing data processing applications on batch and streaming data. While these advancements are exciting, they also add a new array of tools that architects and developers need to understand when designing solutions with Hadoop.

Using Customer 360 and the IoT as examples, Jonathan Seidman, Mark Grover, and Gwen Shapira explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics. Along the way, they discuss considerations and best practices for utilizing these components to implement solutions, cover common challenges and how to address them, and provide practical advice for building your own modern, real-time big data architectures.

Topics include:

  • Accelerating data processing tasks such as ETL and data analytics by building near real-time data pipelines using tools like Kafka, Spark Streaming, and Kudu
  • Building a reliable, efficient data pipeline using Kafka and tools in the Kafka ecosystem such as Kafka Connect and Kafka Streams along with Spark Streaming
  • Providing users with fast analytics on data with Impala and Kudu
  • Illustrating how these components complement the batch processing capabilities of Hadoop
  • Leveraging these capabilities along with other tools such as Spark MLlib and Spark SQL to provide sophisticated machine learning and analytical capabilities for users

Jonathan Seidman


Jonathan Seidman is a software engineer on the cloud team at Cloudera. Previously, he was a lead engineer on the big data team at Orbitz, helping to build out the Hadoop clusters supporting the data storage and analysis needs of one of the most heavily trafficked sites on the internet. Jonathan is a cofounder of the Chicago Hadoop User Group and the Chicago Big Data Meetup and a frequent speaker on Hadoop and big data at industry conferences such as Hadoop World, Strata, and OSCON. Jonathan is the coauthor of Hadoop Application Architectures from O’Reilly.

Gwen Shapira


Gwen Shapira is a system architect at Confluent, where she helps customers achieve success with their Apache Kafka implementations. She has 15 years of experience working with code and customers to build scalable data architectures, integrating relational and big data technologies. Gwen currently specializes in building reliable, real-time data processing pipelines using Apache Kafka. Gwen is an Oracle ACE Director, the coauthor of Hadoop Application Architectures, and a frequent presenter at industry conferences. She is also a committer on Apache Kafka and Apache Sqoop. When Gwen isn’t coding or building data pipelines, you can find her pedaling her bike, exploring the roads and trails of California and beyond.

Mark Grover


Mark Grover is a product manager at Lyft. Mark’s a committer on Apache Bigtop, a committer and PPMC member on Apache Spot (incubating), and a committer and PMC member on Apache Sentry. He’s also contributed to a number of open source projects, including Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume. He’s a coauthor of Hadoop Application Architectures and wrote a section in Programming Hive. Mark is a sought-after speaker on topics related to big data. He occasionally blogs on topics related to technology.

Comments on this page are now closed.


Mark Grover
09/29/2017 8:26am EDT

Thanks everyone for attending. We had a great time talking to you, sharing our thoughts and learning from you. We hope you did too.

The slides for the tutorial are now available for download at the top of this page.

Happy hadooping! We’d love to stay in touch!

Gabriel Mochnacs de Arruda | BIG DATA MANAGER
09/26/2017 1:33pm EDT

Hi, could you share the ppt link?

Jonathan Seidman | SOFTWARE ENGINEER
09/26/2017 8:01am EDT

Nara – the session will be recorded and made available as part of the conference video collection.

Mark Grover
09/26/2017 7:27am EDT

Hi all, Jonathan, Gwen and I are super excited to see you all today at the tutorial. We are going to be using this link for asking questions:

You can like other people’s questions and overall, it’s a more democratic way of asking questions.

Mark Grover
09/26/2017 7:26am EDT

Giovanna, actually Mark Donsky, who is presenting a tutorial today as well, is a metadata expert, and he will be a very good person to talk to/listen to on that topic.

Mark Grover
09/26/2017 7:20am EDT

Duly noted about metadata, Giovanna. It’s not a focus area of the tutorial but we will try to talk to it. Feel free to ask about it in our Ask Us Anything session at

Nara, yes, recordings are posted by the organizers after the conference.

Shankar Neelakrishnan | DATA & ANALYTICS LEAD
09/26/2017 5:56am EDT

Do we get a recording of this session?

09/25/2017 2:03pm EDT

I am interested in capturing metadata outside and inside my Cloudera ecosystem.