Presented by O’Reilly and Cloudera
Make Data Work
September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

Real-time analytics and BI with data lakes and data warehouses using Kudu, HBase, Spark, and Kafka: Lessons learned

2:55pm–3:35pm Wednesday, 09/12/2018
Data engineering and architecture
Location: 1A 23/24 Level: Intermediate
Secondary topics: Data Integration and Data Pipelines
Average rating: 2.67 (3 ratings)

Who is this presentation for?

  • Data architects and engineers

Prerequisite knowledge

  • A basic understanding of star schema data warehouse concepts, key-value and analytic data stores, and streaming frameworks

What you'll learn

  • Explore a blueprint and tips from the trenches for creating a streaming data warehouse and data lake

Description

Mauricio Aristizabal shares lessons learned from migrating Impact’s traditional ETL platform to a real-time platform on Hadoop (leveraging the full Cloudera EDH stack). Mauricio also discusses the company’s data lake in HBase, its Spark Streaming jobs (with Spark SQL), its use of Kudu for “fast data” BI queries, and its use of Kafka as a data bus for loose coupling between components.
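
As the description below notes, the HBase lake retains every OLTP change, even successive changes to the same record and column. A minimal sketch of one way to realize that, using HBase cell versions; table, family, and row-key names are invented, the column family is assumed to have been created with a high VERSIONS setting, and the speaker’s actual layout may differ:

import org.apache.hadoop.hbase.{CellUtil, HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

object HBaseChangeLog {
  def main(args: Array[String]): Unit = {
    val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("lake:orders"))

    // Two updates to the same row and column, written under their own
    // timestamps; HBase keeps both as separate cell versions.
    table.put(new Put(Bytes.toBytes("order#42"))
      .addColumn(Bytes.toBytes("d"), Bytes.toBytes("status"), 1536700000000L, Bytes.toBytes("PENDING")))
    table.put(new Put(Bytes.toBytes("order#42"))
      .addColumn(Bytes.toBytes("d"), Bytes.toBytes("status"), 1536700030000L, Bytes.toBytes("SHIPPED")))

    // Read back the change history for that column, newest first.
    val history = table.get(new Get(Bytes.toBytes("order#42")).setMaxVersions(10))
    for (cell <- history.getColumnCells(Bytes.toBytes("d"), Bytes.toBytes("status")).asScala)
      println(s"${cell.getTimestamp}: ${Bytes.toString(CellUtil.cloneValue(cell))}")

    table.close()
    conn.close()
  }
}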

The new platform satisfies several requirements. It uses the same SQL, JDBC, and star schema already in use by thousands of reports and apps. Every event is reflected in every store within 30 seconds (with a path to single-digit seconds). It offers multiple stores for performant access across many different use cases. It’s scalable, available, and secure, all automatically, simply by using the chosen stack. Engineers and data scientists can interface with it using multiple languages and frameworks. And it’s code-based, so it’s easier to test, debug, diff, maintain, profile, and reuse than graphical drag-and-drop tools.
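
To illustrate the first requirement: a report that once queried MySQL can keep its star-schema SQL and swap only the connection string, here pointed at Impala over Kudu. This is a hypothetical example, not from the talk; the host, port, schema, table and column names, and the use of Cloudera’s Impala JDBC driver are all assumptions:

import java.sql.DriverManager

object SameSqlReport {
  def main(args: Array[String]): Unit = {
    // Cloudera's Impala JDBC driver; host, port, and schema are placeholders.
    Class.forName("com.cloudera.impala.jdbc41.Driver")
    val conn = DriverManager.getConnection("jdbc:impala://impala-host:21050/dw")
    val rs = conn.createStatement().executeQuery(
      """SELECT d.campaign_name, SUM(f.amount) AS revenue
        |FROM order_facts f
        |JOIN campaign_dim d ON f.campaign_key = d.campaign_key
        |GROUP BY d.campaign_name""".stripMargin)
    while (rs.next()) println(s"${rs.getString(1)}\t${rs.getBigDecimal(2)}")
    conn.close()
  }
}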

The platform uses change data capture (CDC) agents to load every change in the company’s OLTP MySQL DBs into Kafka. A data lake in HBase stores every one of those OLTP changes (even successive changes to the same record and column). The platform enables streaming dimension, fact, and aggregation processing with Spark and Spark SQL and includes a “fast” star schema data warehouse in Kudu. Streaming Kudu writers update facts and aggregates in real time. It also includes authorization and a data dictionary with Sentry and Navigator.
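
A minimal sketch of that core loop, assuming invented topic, table, column, and host names (this is not the speaker’s code): Spark Streaming reads CDC events from Kafka, Spark SQL shapes them, and a streaming writer upserts them into a Kudu fact table. Upserts make Kafka’s at-least-once replays idempotent, which is one path to the effectively-exactly-once behavior mentioned in the topics below.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.kudu.spark.kudu.KuduContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object CdcToKudu {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("cdc-to-kudu").getOrCreate()
    val ssc   = new StreamingContext(spark.sparkContext, Seconds(5))
    val kudu  = new KuduContext("kudu-master:7051", spark.sparkContext)

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "kafka:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "cdc-to-kudu",
      "auto.offset.reset"  -> "latest")

    // One Kafka topic per CDC-captured source table (a placeholder name).
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("oltp.orders.changes"), kafkaParams))

    stream.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        import spark.implicits._
        // Parse JSON change events with Spark SQL, then upsert: replays of
        // the same event land on the same Kudu row, keeping the load idempotent.
        val events = spark.read.json(rdd.map(_.value()).toDS())
        val facts  = events.selectExpr("order_id", "customer_id", "amount", "updated_at")
        kudu.upsertRows(facts, "impala::dw.order_facts")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}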

Topics include:

  • The relevance of type 2 slowly changing dimensions in today’s BI
  • Why HBase is a perfect fit for true data lakes
  • Kafka as a message bus for decoupling components and easily plugging in new ones
  • Why exactly-once semantics are hard but doable
  • Kudu performance and availability tips
  • Kafka Avro schemas, and why you should err on the side of easy evolution (see the sketch after this list)
  • Capturing record-processing insights and metrics with Swoop’s Spark Records library
  • Overcoming issues with wide records (300+ columns)
  • Topic versus store schema parity
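
On the schema-evolution point, a small illustration with invented record and field names: v2 adds an optional field with a default, so readers on either version can handle data written with the other, and Avro’s own validator can enforce this before a new version ships. This is a generic Avro example, not material from the talk:

import org.apache.avro.{Schema, SchemaValidatorBuilder}
import scala.collection.JavaConverters._

object SchemaEvolutionCheck {
  // v1 of a click event (record and field names are invented).
  val v1 = new Schema.Parser().parse(
    """{"type":"record","name":"Click","fields":[
      |  {"name":"click_id","type":"string"},
      |  {"name":"ts","type":"long"}
      |]}""".stripMargin)

  // v2 adds an optional field with a default, which is what keeps old and
  // new readers compatible with each other's data.
  val v2 = new Schema.Parser().parse(
    """{"type":"record","name":"Click","fields":[
      |  {"name":"click_id","type":"string"},
      |  {"name":"ts","type":"long"},
      |  {"name":"referrer","type":["null","string"],"default":null}
      |]}""".stripMargin)

  def main(args: Array[String]): Unit = {
    // Throws SchemaValidationException if v2 cannot read data written with v1.
    new SchemaValidatorBuilder().canReadStrategy().validateAll()
      .validate(v2, Seq(v1).asJava)
    println("v2 can read v1 data")
  }
}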

Mauricio Aristizabal

Impact

Mauricio Aristizabal is the data pipeline architect at Impact (formerly Impact Radius), a marketing technology company that helps brands grow by optimizing their paid marketing and media spend. Mauricio is responsible for massively scaling and modernizing the company’s analytics capabilities, selecting data stores and processing platforms, and designing many of the jobs that process internally and externally captured data and make it available to report and dashboard users, analytic applications, and machine learning jobs. He also assists the operations team with maintaining and tuning its Hadoop and Kafka clusters.