Presented By O'Reilly and Cloudera
Make Data Work
Sept 29–Oct 1, 2015 • New York, NY

Goldman Sachs data lake

Billy Newport (Goldman Sachs)
11:20am–12:00pm Wednesday, 09/30/2015
Data-driven Business
Location: 1 E10 / 1 E11
Level: Intermediate
Average rating: 3.85 (34 ratings)

Goldman Sachs is a leading global investment banking, securities, and investment management firm that provides a wide range of financial services. Goldman executes hundreds of millions of financial transactions per day, across nearly every market in the world. In this presentation, we will describe how we manage this scale of data as an enterprise asset.

The Business Problem

Large organizations, such as Goldman Sachs, create very large amounts of data. Traditionally, most of this data is managed in functional silos, making it difficult and expensive to discover, query, and analyze data across these silos. Overlapping and sometimes redundant repositories, differences in the meaning of data, and a lack of transparency on ownership create challenges, inefficiencies, and inconsistencies. In recent years it has become clear that data must be managed as a corporate strategic asset. How do we know what data is available? How trustworthy is it? How do we deal with the different meanings that result from traditional silo-based data management?

In this session we will share how Goldman Sachs is tackling this problem by developing an enterprise data lake platform to unify and manage data across the firm, enabling data to be discovered and used consistently and reliably for authorized users and use cases.

What is the Goldman Sachs Data Lake?

The Data Lake is a data-centric ecosystem that will allow transactional, operational, and reference data to be:

  • Registered, ingested, validated, stored and archived in its native form
  • Secured and entitled to authorized users
  • Modeled and made queryable using a unified query service
  • Cleansed, enriched, transformed, and analyzed through hosted compute engines
  • Managed as a first-class asset with transparency on ownership, lineage, and provenance

The GS Data Lake is being built to store all of the firm's data in one accessible place, where it can be rapidly analyzed for business purposes using a hosted query service. The Data Lake will be used for near-real-time OLAP, and GS is investigating its use for OLTP and streaming workloads.

Who will use the Goldman Sachs Data Lake?

There are three main actors in the data lake.

  • Producers – responsible for registering and publishing their data to the data lake and ensuring it meets data validation standards and SLAs
  • Refiners – will cleanse, enrich, and transform the data and re-publish the curated version to the data lake
  • Consumers – will browse available data and run reports, queries, and analytics
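
The three roles can be illustrated with a small in-memory sketch. Everything in it is hypothetical: the `Catalog` class, its method names, the sample records, and the validation rule are illustrative assumptions, not the Goldman Sachs implementation, which this abstract does not describe at that level of detail.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Dataset:
    name: str
    owner: str                           # transparency on ownership
    records: list = field(default_factory=list)
    derived_from: Optional[str] = None   # lineage: name of the source dataset


class Catalog:
    """Hypothetical in-memory stand-in for the data lake's registry."""

    def __init__(self):
        self._datasets = {}

    # Producer: register a dataset, then publish validated records to it.
    def register(self, name, owner):
        self._datasets[name] = Dataset(name=name, owner=owner)

    def publish(self, name, records, is_valid):
        bad = [r for r in records if not is_valid(r)]
        if bad:
            raise ValueError(f"{len(bad)} record(s) failed validation")
        self._datasets[name].records.extend(records)

    # Refiner: transform a dataset and re-publish the curated version,
    # recording which dataset it was derived from.
    def refine(self, source, target, owner, transform):
        curated = [transform(r) for r in self._datasets[source].records]
        self._datasets[target] = Dataset(target, owner, curated, derived_from=source)

    # Consumer: browse what is available and query it.
    def browse(self):
        return sorted(self._datasets)

    def query(self, name, predicate):
        return [r for r in self._datasets[name].records if predicate(r)]

    def lineage(self, name):
        chain = []
        current = name
        while current is not None:
            chain.append(current)
            current = self._datasets[current].derived_from
        return chain


catalog = Catalog()

# Producer
catalog.register("trades_raw", owner="equities-desk")
catalog.publish(
    "trades_raw",
    [{"id": 1, "qty": 100, "ccy": "usd"}, {"id": 2, "qty": 250, "ccy": "eur"}],
    is_valid=lambda r: r["qty"] > 0,
)

# Refiner
catalog.refine(
    "trades_raw",
    "trades_curated",
    owner="data-refinery",
    transform=lambda r: {**r, "ccy": r["ccy"].upper()},
)

# Consumer
available = catalog.browse()
usd_trades = catalog.query("trades_curated", lambda r: r["ccy"] == "USD")
history = catalog.lineage("trades_curated")
```

The refined dataset is stored under a new name instead of overwriting the raw one, so the producer's original data stays available in its native form and the `derived_from` link preserves lineage.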

What technology are we using?

GS is building this infrastructure using open source components such as Hadoop, Spark, and Hive as well as commercial offerings and custom-developed software.

In this talk we will describe the technical architecture of the solution in detail. We will cover the different services provided by the Data Lake that enable the management of data and metadata through a lifecycle of data being published, refined, enriched, and ultimately consumed. We will also cover some of the design patterns that help us scale the platform, namely using Spark for the ingestion pipeline and separating storage from compute.
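
As a rough illustration of separating storage from compute, the sketch below keeps all data behind a small storage interface and writes the ingestion step and the analytics as stateless functions that only read from or write to it. It uses plain Python so that it stays self-contained; it is not Spark, and the names `Storage`, `InMemoryStorage`, `ingest`, `total_quantity`, and `count_by_currency` are assumptions made for this example only.

```python
from typing import Protocol


class Storage(Protocol):
    """Minimal storage interface; compute code depends only on this."""

    def write(self, key: str, rows: list) -> None: ...
    def read(self, key: str) -> list: ...


class InMemoryStorage:
    """Hypothetical storage backend; a real one might be a distributed file system."""

    def __init__(self):
        self._blobs = {}

    def write(self, key, rows):
        self._blobs[key] = list(rows)

    def read(self, key):
        return list(self._blobs[key])


def ingest(storage, key, rows, is_valid):
    """Ingestion step: validate, then store accepted records unchanged."""
    accepted = [r for r in rows if is_valid(r)]
    rejected = [r for r in rows if not is_valid(r)]
    storage.write(key, accepted)
    return len(accepted), len(rejected)


# Two independent "compute engines". Neither holds data of its own,
# so either can be added, removed, or scaled without moving the data.
def total_quantity(storage, key):
    return sum(r["qty"] for r in storage.read(key))


def count_by_currency(storage, key):
    counts = {}
    for r in storage.read(key):
        counts[r["ccy"]] = counts.get(r["ccy"], 0) + 1
    return counts


store = InMemoryStorage()
ingest_result = ingest(
    store,
    "trades",
    [
        {"id": 1, "qty": 100, "ccy": "USD"},
        {"id": 2, "qty": 250, "ccy": "EUR"},
        {"id": 3, "qty": -5, "ccy": "USD"},   # fails validation
    ],
    is_valid=lambda r: r["qty"] > 0,
)
total = total_quantity(store, "trades")
by_ccy = count_by_currency(store, "trades")
```

Because the functions take the storage object as an argument, swapping `InMemoryStorage` for another backend with the same `read` and `write` methods would not require changing the compute code.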

Billy Newport

Goldman Sachs

Billy Newport has been at Goldman Sachs as a Technology Fellow since 2011, working on big data and graph problems at the firm. Prior to that, he was a Distinguished Engineer at IBM for 10 years, where he worked primarily on distributed systems and high availability for the WebSphere platform. He graduated from Waterford Institute of Technology with a first-class honors degree in industrial computing in 1989.


Comments

Marlene Holm
10/16/2015 8:02am EDT

It was an excellent presentation – any chance you will share the slides? I am particularly interested in the ecosystem slide where you showed the different components of your infrastructure.

Nahum Rosinsky
10/05/2015 2:18pm EDT

that was a great talk Billy! do you think you can share the slides of your presentation?

Robert Cohen
09/25/2015 11:36am EDT

What do you do to make the data lake secure? Anything more than virtualization or creating a software defined data center?