Goldman Sachs is a leading global investment banking, securities, and investment management firm that provides a wide range of financial services. Goldman executes hundreds of millions of financial transactions per day, across nearly every market in the world. In this presentation, we will describe how we manage this scale of data as an enterprise asset.
The Business Problem
Large organizations such as Goldman Sachs generate enormous volumes of data. Traditionally, most of this data has been managed in functional silos, making it difficult and expensive to discover, query, and analyze data across those silos. Overlapping and sometimes redundant repositories, divergent definitions of the same data, and unclear ownership create inefficiencies and inconsistencies. In recent years it has become clear that data must be managed as a corporate strategic asset. How do we know what data is available? How trustworthy is it? How do we reconcile the different meanings that result from silo-based data management?
In this session we will share how Goldman Sachs is tackling this problem by developing an enterprise data lake platform to unify and manage data across the firm, enabling data to be discovered and used consistently and reliably for authorized users and use cases.
What is the Goldman Sachs Data Lake?
The Data Lake is a data-centric ecosystem that allows transactional, operational, and reference data to be published, refined, enriched, and consumed across the firm.
The GS Data Lake is being built to store all of the firm's data in one accessible place, where it can be rapidly analyzed for business purposes using a hosted query service. The Data Lake will initially be used for near-real-time OLAP, and GS is investigating its use for OLTP and streaming workloads.
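As an illustration only (not Goldman's actual stack or schema), the kind of OLAP-style aggregation a hosted query service might run can be sketched with SQLite standing in for the query engine; the `trades` table and its columns are hypothetical:

```python
import sqlite3

# In-memory SQLite as a stand-in for the Data Lake's hosted query service.
# The `trades` table and its columns are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (desk TEXT, symbol TEXT, notional REAL)")
conn.executemany(
    "INSERT INTO trades VALUES (?, ?, ?)",
    [("equities", "ACME", 1_000_000.0),
     ("equities", "ACME", 250_000.0),
     ("rates", "UST10Y", 5_000_000.0)],
)

# A typical OLAP-style aggregation: total notional per desk.
rows = conn.execute(
    "SELECT desk, SUM(notional) FROM trades GROUP BY desk ORDER BY desk"
).fetchall()
print(rows)  # → [('equities', 1250000.0), ('rates', 5000000.0)]
```

At Data Lake scale the same query shape would run against a distributed engine such as Hive or Spark SQL rather than a single-node database.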
Who will use the Goldman Sachs Data Lake?
There are three main actors in the Data Lake, mirroring the data lifecycle: producers who publish data, refiners who cleanse and enrich it, and consumers who query and analyze it.
What technology are we using?
GS is building this infrastructure using open source components such as Hadoop, Spark, and Hive as well as commercial offerings and custom-developed software.
In this talk we will describe the technical architecture of the solution in detail. We will cover the different services provided by the Data Lake that enable the management of data and metadata through a lifecycle in which data is published, refined, enriched, and ultimately consumed. We will also cover some of the design patterns that help us scale the platform, namely the use of Spark for the ingestion pipeline and the separation of storage from compute.
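The publish, refine, enrich, and consume lifecycle described above can be sketched in plain Python; this is a minimal stand-in for illustration (the real ingestion pipeline uses Spark), and the record fields and reference data are hypothetical:

```python
# Plain-Python sketch of the Data Lake lifecycle: publish -> refine ->
# enrich -> consume. The record fields and reference data are hypothetical.

def publish(raw_lines):
    """Producers land raw records in the lake (here: parse CSV lines)."""
    return [line.split(",") for line in raw_lines]

def refine(records):
    """Refinement: enforce schema and types, drop malformed rows."""
    refined = []
    for rec in records:
        if len(rec) == 2:
            symbol, notional = rec
            refined.append({"symbol": symbol.strip(), "notional": float(notional)})
    return refined

def enrich(records, reference_data):
    """Enrichment: join each record against firmwide reference data."""
    return [{**rec, "sector": reference_data.get(rec["symbol"], "unknown")}
            for rec in records]

def consume(records):
    """Consumption: an aggregate an authorized analyst might run."""
    totals = {}
    for rec in records:
        totals[rec["sector"]] = totals.get(rec["sector"], 0.0) + rec["notional"]
    return totals

raw = ["ACME, 1000000", "ACME, 250000", "bad-row", "UST10Y, 5000000"]
reference = {"ACME": "equities", "UST10Y": "rates"}
result = consume(enrich(refine(publish(raw)), reference))
print(result)  # → {'equities': 1250000.0, 'rates': 5000000.0}
```

In a Spark-based pipeline each stage would operate on distributed datasets, with the refined and enriched data written back to shared storage so that compute and storage scale independently.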
Billy Newport has been at Goldman Sachs as a Technology Fellow since 2011, working on big data and graph problems at the firm. Prior to that he was a Distinguished Engineer at IBM for 10 years, where he worked primarily on distributed systems and high availability for the WebSphere platform. He graduated from Waterford Institute of Technology with a first-class honours degree in industrial computing in 1989.
©2015, O'Reilly Media, Inc.