Presented By O'Reilly and Cloudera
Make Data Work
September 25–26, 2017: Training
September 26–28, 2017: Tutorials & Conference
New York, NY

Cloud data lakes: Analytic data warehouses in the cloud

1:15pm–1:55pm Wednesday, September 27, 2017
Big data and the Cloud, Data Engineering & Architecture
Location: 1A 21/22 Level: Advanced
Secondary topics:  Financial services, Platform
Average rating: 4.57 (7 ratings)

Who is this presentation for?

  • Architects and managers

Prerequisite knowledge

  • A basic understanding of big data query tools, data warehouse concepts, and cloud concepts

What you'll learn

  • Understand the benefits of a cloud-based data lake separating storage and compute
  • Learn how FINRA implemented a data lake in the AWS cloud


The Financial Industry Regulatory Authority (FINRA) is a private sector regulator responsible for analyzing over 90% of the equities and 65% of the option activity in the US to look for fraud, market manipulation, insider trading, and abuse. John Hitchingham shares insights into the design and operation of FINRA’s data lake in the AWS cloud, which provides storage, query, and catalog capability using S3, EMR, and a FINRA-developed data catalog and management system. Users can query across petabytes of data in seconds on AWS S3 using Presto and Spark—all while maintaining security and data lineage. FINRA implemented the cloud data warehouse to consolidate a series of data silos as part of a two-and-a-half-year all-in migration of FINRA’s Market Regulation systems to the cloud. It provides increased operational resiliency in response to market events such as Brexit while giving analysts and data scientists within FINRA increased insight into data.

Using S3 provides a resilient, scalable, cost-effective storage layer for the cloud data warehouse. Data is stored in text format for archival queries and in ORC format for performant queries. The herd data catalog provides a platform-independent way to track data: it supports data versioning, storage of business and technical metadata, and schema information that can be used to query registered data. AWS EMR provides a scalable and secure compute platform for running ETL, batch analytics, and interactive analytics against data stored on S3. Keeping data on S3 provides increased durability, along with the ability to rapidly scale compute up and down to match demand.
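To make the catalog idea concrete, here is a minimal Python sketch of a versioned data catalog whose stored schema can be turned into a query-engine table definition. It is an illustration under stated assumptions only: it does not use or reproduce herd's actual API, and the class names, the dataset name `trades`, the columns, and the `s3://example-bucket/...` paths are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Tuple

# Hypothetical mapping from a catalog format label to a Hive-style STORED AS clause.
HIVE_FORMATS = {"text": "TEXTFILE", "orc": "ORC"}


@dataclass(frozen=True)
class DataVersion:
    """One registered version of a dataset partition (illustrative model)."""
    version: int
    storage_path: str                      # e.g. an S3 prefix; made up here
    file_format: str                       # "text" or "orc"
    schema: Tuple[Tuple[str, str], ...]    # (column name, column type) pairs


@dataclass
class Catalog:
    """Toy in-memory catalog keyed by (dataset name, partition value)."""
    _entries: Dict[Tuple[str, str], List[DataVersion]] = field(default_factory=dict)

    def register(self, dataset: str, partition: str, storage_path: str,
                 file_format: str, schema: Sequence[Tuple[str, str]]) -> DataVersion:
        # Each registration appends a new version; older versions are kept.
        versions = self._entries.setdefault((dataset, partition), [])
        entry = DataVersion(len(versions), storage_path, file_format, tuple(schema))
        versions.append(entry)
        return entry

    def latest(self, dataset: str, partition: str) -> DataVersion:
        return self._entries[(dataset, partition)][-1]

    def create_table_ddl(self, dataset: str, partition: str) -> str:
        # Build a Hive-style external table statement from the stored schema.
        v = self.latest(dataset, partition)
        columns = ", ".join(f"{name} {ctype}" for name, ctype in v.schema)
        return (f"CREATE EXTERNAL TABLE {dataset} ({columns}) "
                f"STORED AS {HIVE_FORMATS[v.file_format]} "
                f"LOCATION '{v.storage_path}'")


# Usage: register two versions of the same partition, then query the newest.
catalog = Catalog()
schema = [("symbol", "string"), ("price", "double")]
catalog.register("trades", "2017-09-27",
                 "s3://example-bucket/trades/2017-09-27/v0/", "orc", schema)
catalog.register("trades", "2017-09-27",
                 "s3://example-bucket/trades/2017-09-27/v1/", "orc", schema)
ddl = catalog.create_table_ddl("trades", "2017-09-27")
print(ddl)
```

Because the catalog holds only metadata while the data itself stays at the storage location, any compute cluster could in principle generate the same table definition and query the same files, which is one way to read the separation of storage and compute described above.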


John Hitchingham


John Hitchingham is director of performance engineering at FINRA, where he is responsible for driving technical innovation and efficiency across a cloud application portfolio that processes over 75 billion market events per day to detect fraud, market manipulation, insider trading, and abuse. Previously, John worked at both large and boutique consulting firms providing technical design and consulting services to startup, media, and telecommunications clients. John holds a BS in electrical engineering from Rutgers University.

Comments on this page are now closed.


10/11/2017 10:14am EDT

Could you share the presentation deck please?