Presented By O'Reilly and Cloudera
Make Data Work
22–23 May 2017: Training
23–25 May 2017: Tutorials & Conference
London, UK

Architecting a next-generation data platform

Jonathan Seidman (Cloudera), Mark Grover (Cloudera), Ted Malaska (Blizzard)
13:30–17:00 Tuesday, 23 May 2017
Level: Advanced

Who is this presentation for?

  • Software architects, software engineers, data engineers, and project leads

Prerequisite knowledge

  • An understanding of Hadoop concepts and the Hadoop ecosystem, traditional data management systems (e.g., relational databases), and programming languages and concepts

What you'll learn

  • Understand how new and existing tools in the Hadoop ecosystem can be integrated to implement new types of data processing and analysis
  • Learn considerations and best practices for implementing these applications

Description

Apache Hadoop is rapidly moving from its batch processing roots to a more flexible platform supporting both batch and streaming workloads. Rapid advancements in the Hadoop ecosystem are causing a dramatic evolution in both the storage and processing capabilities of the Hadoop platform. These advancements include projects like:

  • Apache Kudu, a modern columnar data store that complements HDFS and Apache HBase by offering efficient analytical capabilities and fast inserts and updates with Hadoop.
  • Apache Kafka, which provides a high-throughput and highly reliable distributed message transport.
  • Apache Impala (incubating), a highly concurrent, massively parallel processing query engine for Hadoop.
  • Apache Spark, which is rapidly replacing frameworks such as MapReduce for processing data on Hadoop due to its efficient design and optimized use of memory. Spark components such as Spark Streaming and Spark SQL provide powerful near real-time processing, enabling new applications using the Hadoop platform.

While these advancements to the Hadoop platform are exciting, they also add a new array of tools that architects and developers need to understand when designing solutions with Hadoop.

Using Entity 360 as an example, Jonathan Seidman, Ted Malaska, and Mark Grover explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics. Along the way, they discuss considerations and best practices for utilizing these components to implement solutions, cover common challenges and how to address them, and provide practical advice for building your own modern, real-time big data architectures.

Topics include:

  • Accelerating data processing tasks such as ETL and data analytics by building near real-time data pipelines using tools like Kafka, Spark Streaming, and Kudu
  • Building a reliable, efficient data pipeline using Kafka and tools in the Kafka ecosystem along with Spark Streaming
  • Providing users with fast analytics on data with Impala and Kudu
  • Illustrating how these components complement the batch processing capabilities of Hadoop
  • Leveraging these capabilities along with other tools such as Spark MLlib and Spark SQL to provide sophisticated machine-learning and analytical capabilities for users

Jonathan Seidman

Cloudera

Jonathan Seidman is a software engineer on the Partner Engineering team at Cloudera. Previously, he was a lead engineer on the Big Data team at Orbitz Worldwide, helping to build out the Hadoop clusters supporting the data storage and analysis needs of one of the most heavily trafficked sites on the internet. Jonathan is a cofounder of the Chicago Hadoop User Group and the Chicago Big Data meetup and a frequent speaker on Hadoop and big data at industry conferences such as Hadoop World, Strata, and OSCON. Jonathan is the coauthor of Hadoop Application Architectures from O’Reilly.

Mark Grover

Cloudera

Mark Grover is a software engineer working on Apache Spark at Cloudera. Mark is a committer on Apache Bigtop, a committer and PPMC member on Apache Spot (incubating), and a committer and PMC member on Apache Sentry. He has contributed to a number of open source projects, including Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume. He is a coauthor of Hadoop Application Architectures and also wrote a section in Programming Hive. Mark is a sought-after speaker on topics related to big data at various national and international conferences. He occasionally blogs on topics related to technology.

Ted Malaska

Blizzard

Ted Malaska is a senior solution architect at Blizzard. Previously, he was a principal solutions architect at Cloudera. Ted has 18 years of professional experience working for startups, the US government, some of the world’s largest banks, commercial firms, bio firms, retail firms, hardware appliance firms, and the largest nonprofit financial regulator in the US. He has worked on close to one hundred clusters for over two dozen clients, covering hundreds of use cases. He has architecture experience across topics including Hadoop, Web 2.0, mobile, SOA (ESB, BPM), and big data. Ted is a regular contributor to the Hadoop, HBase, and Spark projects, a regular committer to Flume, Avro, Pig, and YARN, and the coauthor of Hadoop Application Architectures.
