Automated feature engineering for the modern enterprise using Dask and featuretools
Who is this presentation for?
Data engineers, data architects, and developers
While the features generated by the algorithm delivered compelling results in terms of value, the approach introduced both challenges and opportunities. Challenges encountered while using featuretools included scaling the feature generation process to very large datasets and coping with the very large number of features generated, which forced the remaining steps of the machine learning pipeline to be scalable in all aspects.
The opportunities that the algorithm and featuretools package brings are multifold, including the ability to look at features not necessarily driven by human intuition, retaining domain expertise thinking while taking a methodical approach to auto-feature generation, and taking a codified approach to aspects like feature selection, risk, ethics, and paths to production.
You’ll learn why Dask was consciously adopted as the distributed framework to address these challenges and opportunities. To generate features automatically at scale, a Dask- and Prefect-based framework was developed. While Dask laid the foundations for a Python-first distributed ecosystem, Prefect helped orchestrate workflows for a variety of use cases. Each step of the end-to-end auto-feature engineering workflow was developed as a Prefect task. The framework caters for both local runs and a truly scaled-out cluster approach to meet varying compute cost optimization objectives.
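The shape of such a workflow can be sketched with plain `dask.delayed` (the step names and logic here are hypothetical stand-ins; in the framework described, each step is wrapped as a Prefect task and scheduled on a Dask cluster):

```python
import dask
import pandas as pd

# Each step below stands in for a Prefect task in the real framework.
@dask.delayed
def load_raw():
    return pd.DataFrame({"customer_id": [1, 2], "amount": [25.0, 10.0]})

@dask.delayed
def generate_features(df):
    # Stand-in for the deep feature synthesis step.
    return df.assign(amount_sq=df["amount"] ** 2)

@dask.delayed
def select_features(df):
    # Stand-in for the distributed feature selection step.
    return df[["customer_id", "amount_sq"]]

# Compose the steps into a lazy task graph, then execute it.
pipeline = select_features(generate_features(load_raw()))
result = pipeline.compute()  # local scheduler here; a cluster in production
```

Swapping the local scheduler for a distributed one is what lets the same codified workflow serve both the local and scaled-out cost profiles mentioned above.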
To truly integrate the output of featuretools for enterprise use cases, you need to unpack the semantics of how each feature was harvested from the raw datasets. This helps in use cases like understanding lineage and performing impact analysis. Ananth explains how to use a graph database to map out the lineage and how algorithms running on top of the graph database support impact assessment and interpretability tooling.
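The lineage idea can be pictured as a directed graph from raw columns to derived features. A minimal in-memory sketch (the feature names are hypothetical, and a real deployment would use a graph database rather than a dict):

```python
# Edges point from a source column or feature to everything derived from it.
lineage = {
    "transactions.amount": ["SUM(transactions.amount)",
                            "MEAN(transactions.amount)"],
    "SUM(transactions.amount)": [
        "SUM(transactions.amount) / COUNT(transactions)"],
}

def impacted(node, graph):
    """Impact analysis: every downstream feature reachable from `node`."""
    seen, stack = set(), [node]
    while stack:
        for child in graph.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen
```

Querying `impacted("transactions.amount", lineage)` reveals every autogenerated feature that a change to the raw column would touch.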
The true challenge of the auto-feature engineering process lay in the feature selection phase. As the post-auto-feature engineering dataset grew its disk footprint multifold, it became necessary to implement the feature selection process using distributed patterns as well. The framework now supports manual, information theory-based, and AutoML-based approaches to feature selection. To aid the business user, a collection of visualization widgets was also provided so that autogenerated features gain greater adoption in the enterprise. Surprisingly, communicating why a particular feature should be used is a hard problem to crack in codified approaches.
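As one example of the information theory-based route, candidate features can be ranked by mutual information with the label. A serial sketch (in the framework this runs as a distributed pattern; the helper names are hypothetical):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def select_top_k(features, label, k):
    """Keep the k autogenerated features most informative about the label."""
    ranked = sorted(features,
                    key=lambda f: mutual_information(features[f], label),
                    reverse=True)
    return ranked[:k]
```

A feature that perfectly tracks the label scores highest, while pure noise scores near zero, which is the basis for pruning the autogenerated set.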
The last piece of the puzzle for enterprise adoption was tooling to help address regulatory concerns. By further developing the lineage and semantic analysis tooling and building on the configuration aspects of featuretools, the framework suppresses features that aren’t allowed from a regulatory or ethical standpoint.
Dask helped implement the thinking of “bring your own cluster,” as Dask supports both Kubernetes and YARN clusters in its ecosystem. Because the entire workflow is codified, it’s possible to provide constructs for observability: aspects like identifying which features were deteriorating in value compared to the metrics captured in the feature selection phase became transparent because they were codified, monitored, and analyzable. You’ll see a pictorial representation of a typical auto-feature engineering workflow.
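The observability idea, comparing a feature’s live metric to the baseline captured at selection time, reduces to a simple check (the metric values and tolerance here are hypothetical):

```python
def deteriorated(baseline, current, tolerance=0.1):
    """Flag features whose tracked metric (e.g., an information score) fell
    more than `tolerance` below the value recorded at feature selection
    time."""
    return [f for f, base in baseline.items()
            if current.get(f, 0.0) < base * (1 - tolerance)]
```

Running such a check on a schedule is what makes feature deterioration transparent rather than discovered after model quality drops.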
Prerequisite knowledge
- A basic understanding of the machine learning model lifecycle and data pipelines
What you'll learn
- Gain an introduction to Dask, Prefect, featuretools package, and the deep feature synthesis algorithm
- Learn about challenges related to scaling the approach, opportunities that automated feature engineering brings to the data pipelines ecosystem, and lineage and semantic analysis tooling on the autogenerated features
- See feature selection methodologies and patterns using the Dask ecosystem and feature selection patterns using data visualization techniques for risk- and ethics-based evaluation
- Understand how to package autogenerated features for streaming and batch pipelines and how to build observability aspects once features are deployed into production
- Discover an example end-to-end pipeline using featuretools and the Dask ecosystem
Ananth Kalyan Chakravarthy Gundabattula
Commonwealth Bank of Australia
Ananth Kalyan Chakravarthy Gundabattula is a senior application architect on the decisioning and advanced analytics engineering team at the Commonwealth Bank of Australia (CBA). Previously, he was an architect at ThreatMetrix, where he was a member of the core team that scaled the ThreatMetrix architecture to 100 million transactions per day (running at very low latencies using Cassandra, ZooKeeper, and Kafka) and migrated the ThreatMetrix data warehouse to a next-generation architecture based on Hadoop and Impala. Before that, he was at IBM Software Labs and IBM CIO Labs, enabling some of the first IBM CIO projects onboarding the HBase, Hadoop, and Mahout stack. Ananth is a committer for Apache Apex and is working on the next-generation architectures for the CBA fraud platform and the Advanced Analytics Omnia platform. He has presented at a number of conferences, including YOW! Data and the Dataworks Summit in Australia. Ananth holds a PhD in computer science in the field of security. He’s interested in all things data, including low-latency distributed processing systems, machine learning, and data engineering. He holds three patents and has one application pending.