One way to drive enterprise adoption of big data in financial services is to maintain a centralized, standardized, reusable, transparent, and well-governed library of features (or metrics) that empowers data scientists and business analysts across a range of business problems. This is the central idea behind a feature store—a library of documented features for various analyses, based on a shared data model that spans the wide variety of data sources resident within a bank’s data lake.
Kaushik Deka and Phil Jarymiszyn discuss the benefits of a Spark-based feature store, outline three challenges they faced—semantic data integration within a data lake, high-performance feature engineering, and metadata governance—and explain how they overcame them.
The first challenge of building such a feature store is to project the data in a data lake into a common conceptual data model and then generate features from that model. The combination of data variety, formal analytical models, and long project cycles in financial services suggests that applying data modeling to data lakes should yield significant advantages, both as a shared understanding of the domain-specific semantic ontology and as an extensible data integration framework. In the discussed use case, the feature store was powered by one such semantically integrated data model for retail banking.
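To make the idea concrete, a shared conceptual model might be expressed as a small set of typed entities, with features defined once against those entities rather than against each raw source. The sketch below is illustrative only—the entity and field names are assumptions, not the actual Novantas retail-banking model:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical, simplified slice of a retail-banking conceptual model.
# Entity and field names are illustrative, not the model from the talk.

@dataclass
class Transaction:
    txn_id: str
    amount: float      # positive = deposit, negative = withdrawal
    channel: str       # e.g., "branch", "atm", "online"

@dataclass
class Account:
    account_id: str
    product_type: str  # e.g., "checking", "savings"
    transactions: List[Transaction] = field(default_factory=list)

@dataclass
class Customer:
    customer_id: str
    accounts: List[Account] = field(default_factory=list)

def total_deposits(customer: Customer) -> float:
    """A feature defined once against the shared model, reusable by any analysis."""
    return sum(t.amount
               for a in customer.accounts
               for t in a.transactions
               if t.amount > 0)
```

Because every source system is mapped into the same entities, a feature such as `total_deposits` is written once and works regardless of which operational system the underlying transactions came from.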
The second challenge is to enable high-performance feature engineering at the customer level on top of the conceptual data model. There’s significant benefit to partitioning data at the customer level so that calculations don’t incur cross-node chatter on the network. Kaushik and Phil also had to give data scientists an API to the data model for creating parameterized features. To accomplish these objectives, they developed an ETL pipeline in Spark that stored the instance data in Hadoop as a distributed collection of partitioned structured objects per customer. They then provided a parallelizable Spark API to access these structured customer objects.
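The following Spark-free sketch illustrates the pattern under stated assumptions: records are grouped by customer so each customer's data can be processed in one place, and features are parameterized functions applied per customer. In the actual pipeline this would run as a Spark job over customer-partitioned objects (e.g., via a partitioner keyed on customer ID); the function and field names here are hypothetical:

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Illustrative sketch of customer-level partitioning plus a parameterized
# feature API. In a Spark pipeline, records would be keyed and partitioned
# by customer_id so feature computation never shuffles data across nodes.

def partition_by_customer(records: List[dict]) -> Dict[str, List[dict]]:
    """Group raw records by customer, mimicking a customer-keyed partitioner."""
    parts: Dict[str, List[dict]] = defaultdict(list)
    for r in records:
        parts[r["customer_id"]].append(r)
    return dict(parts)

def make_threshold_feature(min_amount: float) -> Callable[[List[dict]], float]:
    """A parameterized feature: total of transactions at or above a threshold."""
    def feature(txns: List[dict]) -> float:
        return sum(t["amount"] for t in txns if t["amount"] >= min_amount)
    return feature

records = [
    {"customer_id": "c1", "amount": 120.0},
    {"customer_id": "c1", "amount": 5.0},
    {"customer_id": "c2", "amount": 80.0},
]
large_deposits = make_threshold_feature(min_amount=50.0)
features = {cid: large_deposits(txns)
            for cid, txns in partition_by_customer(records).items()}
# features == {"c1": 120.0, "c2": 80.0}
```

Since all of a customer's records sit in one partition, each feature evaluation is a purely local computation, which is what makes the approach parallelize cleanly across customers.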
The third challenge is enforcing business metadata governance on the feature store. The agility of analytics and data democratization that a high-performing feature store can unleash has to be balanced by sound metadata governance to prevent complete analytical anarchy. Regulatory pressures make this a necessity. In particular, data lineage, audits, and version control of source code have to be baked into the feature development workflows within the feature store.
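A minimal sketch of what such governance metadata might look like: each registered feature carries a version, its upstream lineage, and an auditable hash of its source code, so any published result can be traced back to the exact definition that produced it. The registry shape and field names below are assumptions for illustration, not the talk's actual implementation:

```python
import hashlib

# Hypothetical sketch of the metadata a governed feature store might
# attach to each registered feature version (field names are illustrative).

def register_feature(registry: dict, name: str, version: int,
                     source_tables: list, code: str) -> dict:
    """Record lineage and an auditable hash of the feature's source code."""
    entry = {
        "name": name,
        "version": version,
        "lineage": list(source_tables),  # upstream data sources
        "code_sha256": hashlib.sha256(code.encode()).hexdigest(),
    }
    # Keying by (name, version) keeps every historical definition auditable.
    registry[(name, version)] = entry
    return entry
```

Because old versions are never overwritten, an auditor can compare the code hash of any feature version against the source repository, satisfying the lineage and version-control requirements the regulators impose.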
Kaushik Deka is a partner and CTO at Novantas, where he is responsible for technology strategy and R&D roadmap of a number of cloud-based platforms. He has more than 15 years’ experience leading large engineering teams to develop scalable, high-performance analytics platforms. Kaushik holds an MS in computer science from the University of Missouri, an MS in engineering from the University of Pennsylvania, and an MS in computational finance from Carnegie Mellon University.
Phil Jarymiszyn is the director of big data integration services at Novantas. Phil has over 28 years of experience building enterprise and application data stores for banks and brokers. He has domain expertise across all categories of bank operational systems, data requirements expertise in both analytical and operational use cases, and BI expertise in analytics and data democratization initiatives. Phil holds a BA in economics from Harvard University.
©2016, O'Reilly Media, Inc.