Presented By O'Reilly and Cloudera
Make Data Work
September 26–27, 2016: Training
September 27–29, 2016: Tutorials & Conference
New York, NY

How a Spark-based feature store can accelerate big data adoption in financial services

Kaushik Deka (Novantas), Phil Jarymiszyn (Novantas)
2:55pm–3:35pm Wednesday, 09/28/2016
Hadoop use cases
Location: 3D 08
Level: Beginner

Prerequisite knowledge

  • A general knowledge of Hadoop and data lakes
  • Familiarity with Spark
  • A basic knowledge of core banking systems and the regulatory landscape (useful but not required)

What you'll learn

  • Understand the basics of feature stores and how you can unlock value in a data lake
  • Learn how to build a feature store on Hadoop (and some design patterns in Spark)
  • Discover how to overcome some of the implementation and roll-out challenges

Description

    One of the ways to drive enterprise adoption of big data in financial services is to have a central standardized, reusable, transparent, and well-governed library of features (or metrics) that will empower data scientists and business analysts across a range of business problems. This is the central idea behind a feature store—a library of documented features for various analyses based on a shared data model that spans a wide variety of data sources resident within a bank’s data lake.
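    At its simplest, a feature store is a documented, reusable library of named feature definitions over a shared data model. A minimal sketch of that idea in plain Python (the class, feature name, and fields below are our own illustrations, not from the talk):

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# A hypothetical, minimal feature-store registry: each feature is a named,
# documented function over records of a shared data model, so analysts
# reuse one vetted definition instead of re-deriving it per project.
@dataclass
class FeatureStore:
    _features: Dict[str, Callable] = field(default_factory=dict)
    _docs: Dict[str, str] = field(default_factory=dict)

    def register(self, name: str, doc: str):
        def wrap(fn):
            self._features[name] = fn
            self._docs[name] = doc
            return fn
        return wrap

    def compute(self, name: str, record: dict):
        return self._features[name](record)

store = FeatureStore()

@store.register("avg_monthly_balance",
                "Mean of monthly balances over the trailing period")
def avg_monthly_balance(record):
    balances = record["monthly_balances"]
    return sum(balances) / len(balances)

customer = {"monthly_balances": [1200.0, 1100.0, 1300.0]}
print(store.compute("avg_monthly_balance", customer))  # 1200.0
```

    The documentation string is registered alongside the function, which is what makes the library transparent rather than a pile of ad hoc scripts.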

    Kaushik Deka and Phil Jarymiszyn discuss the benefits of a Spark-based feature store, outline three challenges they faced—semantic data integration within a data lake, high-performance feature engineering, and metadata governance—and explain how they overcame them.

    The first challenge of building such a feature store is to project the data in a data lake into a common conceptual data model and then generate features from that model. The combination of data variety, formal analytical models, and long project cycles in financial services suggests that applying data modeling to data lakes yields significant advantages, both as a shared understanding of the domain-specific semantic ontology and as an extensible data integration framework. In the discussed use case, the feature store was powered by one such semantically integrated data model for retail banking.
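    The projection step amounts to per-source mapping functions that emit records in one canonical shape. A rough sketch, with entirely illustrative source systems and field names:

```python
# Hypothetical sketch: projecting two differently shaped source systems
# into one canonical customer-account record. Field names are illustrative,
# not the actual retail-banking model from the talk.
CANONICAL_FIELDS = ("customer_id", "account_id", "balance")

def from_core_banking(row: dict) -> dict:
    # The core banking export uses its own column names.
    return {"customer_id": row["CUST_NO"],
            "account_id": row["ACCT_NO"],
            "balance": float(row["CUR_BAL"])}

def from_card_system(row: dict) -> dict:
    # The card system nests identifiers differently; outstanding card
    # debt is represented here as a negative balance.
    return {"customer_id": row["holder"]["id"],
            "account_id": row["card_id"],
            "balance": -float(row["outstanding"])}

sources = [
    (from_core_banking, [{"CUST_NO": "C1", "ACCT_NO": "A1", "CUR_BAL": "500.00"}]),
    (from_card_system, [{"holder": {"id": "C1"}, "card_id": "K9", "outstanding": "120.00"}]),
]

canonical = [mapper(row) for mapper, rows in sources for row in rows]
```

    Every downstream feature is then written once against `CANONICAL_FIELDS`, and adding a new source system means adding one mapper, not rewriting features.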

    The second challenge is to enable high-performance feature engineering at the customer level on top of the conceptual data model. There is significant benefit to partitioning data at the customer level so that calculations don't incur cross-node chatter on the network. Kaushik and Phil also had to give data scientists an API onto the data model for creating parameterized features. To accomplish these objectives, they developed an ETL pipeline in Spark that stored the instance data in Hadoop as a distributed collection of partitioned structured objects per customer and then provided a parallelizable Spark API to access these structured customer objects.
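    The talk's pipeline does this in Spark; a rough single-machine analogue of the pattern (pure Python, names illustrative) shows why the layout helps: all of a customer's records live in one structured object, so a parameterized feature reads only that object and never reaches across partitions.

```python
from collections import defaultdict

# Single-machine analogue of per-customer partitioning: group every
# customer's transactions into one structured object up front.
def build_customer_objects(transactions):
    customers = defaultdict(list)
    for txn in transactions:
        customers[txn["customer_id"]].append(txn)
    return dict(customers)

# A parameterized feature: spend over the trailing `n_days`, computed
# entirely from one customer's partition (no cross-customer access).
def spend_last_n_days(txns, n_days, today):
    return sum(t["amount"] for t in txns if today - t["day"] < n_days)

txns = [
    {"customer_id": "C1", "day": 1, "amount": 50.0},
    {"customer_id": "C1", "day": 9, "amount": 20.0},
    {"customer_id": "C2", "day": 9, "amount": 70.0},
]
by_customer = build_customer_objects(txns)
print(spend_last_n_days(by_customer["C1"], n_days=7, today=10))  # 20.0
```

    In the distributed setting the same shape applies: hash-partition by customer ID once during ETL, and every feature computation afterward is embarrassingly parallel.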

    The third challenge is enforcing business metadata governance on the feature store. The agility and data democratization that a high-performing feature store can unleash have to be balanced with sound metadata governance to prevent analytical anarchy, and regulatory pressures make this a necessity. In particular, data lineage, audits, and version control of source code have to be baked into the feature development workflows within the feature store.
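    One hedged sketch of what "baked in" can mean: every feature registration records its declared inputs and a content hash of its definition, so any computed value can be traced back to an exact definition version (the registry shape and feature below are our own illustration):

```python
import hashlib

# Hypothetical governance wrapper: registering a feature also captures
# its lineage (declared input fields) and a version id derived from a
# hash of its human-readable definition.
REGISTRY = {}

def governed_feature(name, inputs, definition):
    def wrap(fn):
        REGISTRY[name] = {
            "inputs": inputs,                    # lineage: which fields it reads
            "definition": definition,            # auditable spec text
            "version": hashlib.sha256(definition.encode()).hexdigest()[:12],
            "fn": fn,
        }
        return fn
    return wrap

@governed_feature("tenure_months",
                  inputs=["open_date_day", "as_of_day"],
                  definition="floor((as_of_day - open_date_day) / 30)")
def tenure_months(record):
    return (record["as_of_day"] - record["open_date_day"]) // 30

rec = {"open_date_day": 0, "as_of_day": 365}
print(tenure_months(rec))  # 12
```

    Changing the definition changes the version id, which is what lets audits tie a historical model input to the exact feature logic that produced it.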


    Kaushik Deka


    Kaushik Deka is a partner and CTO at Novantas, where he is responsible for technology strategy and R&D roadmap of a number of cloud-based platforms. He has more than 15 years’ experience leading large engineering teams to develop scalable, high-performance analytics platforms. Kaushik holds an MS in computer science from the University of Missouri, an MS in engineering from the University of Pennsylvania, and an MS in computational finance from Carnegie Mellon University.


    Phil Jarymiszyn


    Phil Jarymiszyn is the director of big data integration services at Novantas. Phil has over 28 years of experience building enterprise and application data stores for banks and brokers. He has banking-data domain expertise across all categories of bank operational systems, data-requirements expertise in both analytical and operational use cases, and deep BI experience with analytical and data democratization initiatives. Phil holds a BA in economics from Harvard University.
