Presented By O’Reilly and Cloudera

Make Data Work

September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

Architecting a next-generation data platform

Ted Malaska (Capital One), Jonathan Seidman (Cloudera)

1:30pm–5:00pm Tuesday, 09/11/2018

Data engineering and architecture
Location: 1A 06/07
Level: Advanced

Secondary topics: Data Platforms

Average rating: 3.12 (8 ratings)

Who is this presentation for?

Software architects, developers, data engineers, project managers, product managers, and software managers

Prerequisite knowledge

Familiarity with concepts and components related to the open source enterprise data management ecosystem, such as Hadoop and related projects, HBase, and Cassandra, as well as traditional data management systems (e.g., relational databases) and programming languages

What you'll learn

Understand how new and existing tools in the open source big data ecosystem can be integrated to implement new types of data processing and analysis
Learn considerations and best practices for implementing these applications

Description

Rapid advancements are driving a dramatic evolution in both the storage and processing capabilities of the open source enterprise data software ecosystem. These advancements include projects like:

Apache Kudu, a modern columnar data store that complements HDFS and Apache HBase by offering efficient analytical capabilities and fast inserts and updates with Hadoop;
Apache Kafka, which provides a high-throughput and highly reliable distributed message transport;
Apache Spark, which is rapidly replacing parallel processing frameworks such as MapReduce thanks to its efficient design and optimized use of memory, and whose components, such as Spark Streaming and Spark SQL, provide powerful near real-time processing;
Distributed storage systems, such as HDFS and Cassandra;
Parallel query engines such as Apache Impala and CockroachDB, which provide capabilities for highly parallel and concurrent analysis of datasets.

These storage and processing systems provide a powerful platform to implement data processing applications on batch and streaming data. While these advancements are exciting, they also add a new array of tools that architects and developers need to understand when architecting modern data processing solutions.

Using Customer 360 and the internet of things as examples, Jonathan Seidman and Ted Malaska explain how to architect a modern, real-time big data platform leveraging these components to reliably integrate multiple data sources, perform real-time and batch data processing, reliably store massive volumes of data, and efficiently query and process large datasets. Along the way, they discuss considerations and best practices for utilizing these components to implement solutions, cover common challenges and how to address them, and provide practical advice for building your own modern, real-time data architectures.
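
As a rough illustration of what integrating multiple data sources can mean for a Customer 360 use case, the following Python sketch merges records from two sources into a single per-customer view. The source names (`crm_records`, `web_events`), the field names, and the `build_customer_360` helper are all hypothetical and are not taken from the session; a real platform would read from systems like those listed above rather than from in-memory lists.

```python
from collections import defaultdict

# Hypothetical sample data standing in for two upstream sources.
crm_records = [
    {"customer_id": "c1", "name": "Ada", "segment": "premium"},
    {"customer_id": "c2", "name": "Bo", "segment": "basic"},
]
web_events = [
    {"customer_id": "c1", "page": "/home"},
    {"customer_id": "c1", "page": "/pricing"},
    {"customer_id": "c3", "page": "/home"},
]


def build_customer_360(crm, events):
    """Combine profile data and activity into one record per customer."""
    view = defaultdict(lambda: {"profile": None, "pages_viewed": []})
    for rec in crm:
        view[rec["customer_id"]]["profile"] = {
            "name": rec["name"],
            "segment": rec["segment"],
        }
    for ev in events:
        # Customers seen only in the event stream keep profile=None.
        view[ev["customer_id"]]["pages_viewed"].append(ev["page"])
    return dict(view)


customer_360 = build_customer_360(crm_records, web_events)
```

Keying every source on a shared customer identifier is the assumption that makes the merge simple here; in practice, matching identities across sources is often the harder part.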

Topics include:

Accelerating data processing tasks such as ETL and data analytics by building near real-time data pipelines using modern open source data integration and processing components
Building reliable and efficient data pipelines, starting with source data and ending with fully processed datasets
Providing users with fast analytics on data using modern storage and query engines
Leveraging these capabilities along with other tools to provide sophisticated machine learning and analytical capabilities for users
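
To make the near real-time pipeline idea in the topics above more concrete, here is a minimal Python sketch of the general pattern: events arrive on a queue, are processed in small batches, and are upserted into a keyed store that can be queried at any time. Everything in it is an assumption for illustration. `UpsertStore` and `process_micro_batch` are hypothetical in-memory stand-ins, not the APIs of Kafka, Spark Streaming, or Kudu, and the device readings are made-up sample data for an internet of things scenario.

```python
from collections import deque


class UpsertStore:
    """In-memory stand-in for a store with fast inserts and updates."""

    def __init__(self):
        self.rows = {}

    def upsert(self, key, value):
        # Insert the key if it is new, otherwise overwrite its value.
        self.rows[key] = value

    def query_total(self):
        # A trivial "analytical query" over the current state.
        return sum(self.rows.values())


def process_micro_batch(queue, store, batch_size):
    """Drain up to batch_size events and upsert running totals per device."""
    processed = 0
    while queue and processed < batch_size:
        event = queue.popleft()
        current = store.rows.get(event["device"], 0)
        store.upsert(event["device"], current + event["reading"])
        processed += 1
    return processed


# Hypothetical sensor events waiting on the queue.
events = deque([
    {"device": "d1", "reading": 3},
    {"device": "d2", "reading": 5},
    {"device": "d1", "reading": 4},
])
store = UpsertStore()
first = process_micro_batch(events, store, batch_size=2)
second = process_micro_batch(events, store, batch_size=2)
```

Because results are written after each small batch, a query against the store sees fresh data without waiting for a full batch job to finish, which is the property the pipeline topics emphasize.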

Ted Malaska

Capital One

Ted Malaska is a director of enterprise architecture at Capital One. Previously, he was the director of engineering in the Global Insight Department at Blizzard; principal solutions architect at Cloudera, helping clients find success with the Hadoop ecosystem; and a lead architect at the Financial Industry Regulatory Authority (FINRA). He has contributed code to Apache Flume, Apache Avro, Apache YARN, Apache HDFS, Apache Spark, Apache Sqoop, and many more. Ted is a coauthor of Hadoop Application Architectures, a frequent speaker at many conferences, and a frequent blogger on data architectures.

Jonathan Seidman

Cloudera

Jonathan Seidman is a software engineer on the cloud team at Cloudera. Previously, he was a lead engineer on the big data team at Orbitz, helping to build out the Hadoop clusters supporting the data storage and analysis needs of one of the most heavily trafficked sites on the internet. Jonathan is a cofounder of the Chicago Hadoop User Group and the Chicago Big Data Meetup and a frequent speaker on Hadoop and big data at industry conferences such as Hadoop World, Strata, and OSCON. Jonathan is a coauthor of Hadoop Application Architectures from O’Reilly.

Comments

Ron DeFreitas | PLATFORM ARCHITECT

09/11/2018 11:11am EDT

As someone posted in Slido: https://www.slideshare.net/jseidman/architecting-a-next-gen-data-platform-strata-new-york-2018

pani manchella | LEAD DATA ARCHITECT

09/11/2018 10:02am EDT

Can you please stop the presentation and provide a link to the slides? It is very hard to follow.

Abrar Ahmed | SENIOR BI DEVELOPER

09/11/2018 10:00am EDT

It would be good if you could share the slides during the break so that we can avoid taking notes/photos and concentrate on the talk.

Shruti Modi | SENIOR MANAGER DATA PLATFORM

09/11/2018 9:55am EDT

I am not able to access the presentation. How can I get access to it?

Sunil Razdan | PRINCIPAL INFRASTRUCTURE ARCHITECT

09/11/2018 9:46am EDT

The link for slides tiny.cloudera.. shared at the beginning requires login to a Cloudera account and doesn’t work. Can you share it here, please?

Ron DeFreitas | PLATFORM ARCHITECT

09/11/2018 9:44am EDT

The slide link you showed at the beginning of the talk requires login to a Cloudera account and doesn’t seem to work. Can you share it here instead?

Samuel Xie | SOFTWARE ENGINEER

09/11/2018 9:41am EDT

Where are the slides and GitHub link, please?

Krishna a Chaitanya | SENIOR ENGINEERING MANAGER

09/11/2018 9:31am EDT

Can you share the slides, please?

©2018, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • confreg@oreilly.com