Presented By O’Reilly and Cloudera
Make Data Work
21–22 May 2018: Training
22–24 May 2018: Tutorials & Conference
London, UK

Architecting a next generation data platform

Ted Malaska (Blizzard Entertainment), Jonathan Seidman (Cloudera)
13:30–17:00 Tuesday, 22 May 2018
Data engineering and architecture
Location: Capital Suite 13
Level: Advanced

Who is this presentation for?

Software engineers, software architects, technical leads, project managers, and data engineers

Prerequisite knowledge

You will need an understanding of modern data processing systems such as Hadoop and Cassandra, an understanding of traditional data management systems (e.g., relational databases), and knowledge of programming languages and concepts.

Materials or downloads needed in advance

None; this session is presentation only.

What you'll learn

This tutorial gives attendees a deeper understanding of how new and existing tools in the open-source big data ecosystem can be integrated to implement new types of data processing and analysis, along with considerations and best practices for implementing these applications.


Rapid advancements are causing a dramatic evolution in both the storage and processing capabilities of the open-source big data software ecosystem. These advancements include projects like:

  • Apache Kudu, a modern columnar data store that complements HDFS and Apache HBase by offering efficient analytical capabilities and fast inserts and updates with Hadoop.
  • Apache Kafka, which provides a high-throughput and highly reliable distributed message transport.
  • Apache Spark, which is rapidly replacing parallel processing frameworks such as MapReduce due to its efficient design and optimized use of memory. Spark components such as Spark Streaming and Spark SQL provide powerful near real-time processing.
  • Distributed storage systems such as HDFS and Cassandra.
  • Parallel query engines such as Apache Impala and CockroachDB, which provide capabilities for highly parallel and concurrent analysis of data sets.

These storage and processing systems provide a powerful platform to implement data processing applications on batch and streaming data. While these advancements are exciting, they also add a new array of tools that architects and developers need to understand when architecting modern data processing solutions.

Using an example based on Customer 360 and the Internet of Things, we’ll explain how to architect a modern, real-time big data platform leveraging components to reliably integrate multiple data sources, perform real-time and batch data processing, reliably store massive volumes of data, and efficiently query and process large data sets. Along the way, we’ll discuss considerations and best practices for utilizing these components to implement solutions, cover common challenges and how to address them, and provide practical advice for building your own modern, real-time data architectures.

Topics include:

  • Accelerating data processing tasks such as ETL and data analytics by building near real-time data pipelines using modern open-source data integration and processing components.
  • Building reliable and efficient data pipelines, starting with source data and ending with fully processed data sets.
  • Providing users with fast analytics on data using modern storage and query engines.
  • Leveraging these capabilities along with other tools to provide sophisticated machine-learning and analytical capabilities for users.

Ted Malaska

Blizzard Entertainment

Ted Malaska is a group technical architect on the team at Blizzard, helping support great titles like World of Warcraft, Overwatch, and Hearthstone. Previously, Ted was a principal solutions architect at Cloudera, helping clients find success with the Hadoop ecosystem, and a lead architect at the Financial Industry Regulatory Authority (FINRA). He has also contributed code to Apache Flume, Apache Avro, Apache YARN, Apache HDFS, Apache Spark, Apache Sqoop, and many more. Ted is a coauthor of Hadoop Application Architectures, a frequent speaker at many conferences, and a frequent blogger on data architectures.

Jonathan Seidman

Cloudera

Jonathan Seidman is a software engineer on the partner engineering team at Cloudera. Previously, he was a lead engineer on the big data team at Orbitz Worldwide, helping to build out the Hadoop clusters supporting the data storage and analysis needs of one of the most heavily trafficked sites on the internet. Jonathan is a cofounder of the Chicago Hadoop User Group and the Chicago Big Data Meetup and a frequent speaker on Hadoop and big data at industry conferences such as Hadoop World, Strata, and OSCON. Jonathan is the coauthor of Hadoop Application Architectures from O’Reilly.
