Before tackling any project, it’s prudent to take inventory of what’s available, so you can plan and execute the project quickly and efficiently. A commonly cited estimate is that data scientists and analysts spend as much as 80% of their time looking for the data they need for an analytics project. Imagine a data analyst at a life sciences or healthcare company working to build an analytic model to improve patient outcomes. There are thousands of possible datasets across the enterprise, ranging from patient clinical data and electronic medical records (EMR) to genomics, claims, billing, patient forums, call detail records, HL7 data, and much more. Where do you even begin?
John Haddad explains how a data catalog can help you find data that you need and can trust for analytics and data governance projects. A data catalog that uses AI/ML can find and recommend the data that data scientists and analysts need, and it facilitates collaboration among analytics teams, who help curate the data so that it improves in quality and value over time. Just like a powerful space telescope that scans the universe, a data catalog scans and collects metadata from enterprise systems, including many types of databases, applications, and tools. It then automatically builds a metadata and relationship graph, exposed via REST APIs, so end users and developers can query metadata for other applications or integrations.
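To make the scan-and-build step concrete, here is a minimal sketch in Python of a metadata and relationship graph populated by scanning source systems. It is a toy in-memory model, not any vendor's API: the `MetadataGraph` class, the `scan` helper, and the system, table, and column names are all hypothetical, and a real catalog would expose this kind of lookup through its REST API rather than a local object.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class MetadataGraph:
    """Toy in-memory metadata and relationship graph (hypothetical)."""

    # asset id -> descriptive attributes
    nodes: dict = field(default_factory=dict)
    # asset id -> set of (relation, target asset id) pairs
    edges: dict = field(default_factory=lambda: defaultdict(set))

    def add_asset(self, asset_id, **attrs):
        self.nodes[asset_id] = attrs

    def relate(self, source, relation, target):
        self.edges[source].add((relation, target))

    def related(self, asset_id, relation):
        """Return the targets linked to asset_id by relation, sorted by id."""
        return sorted(t for r, t in self.edges[asset_id] if r == relation)


def scan(graph, system, tables):
    """Register the tables and columns of one source system in the graph."""
    for table, columns in tables.items():
        table_id = f"{system}.{table}"
        graph.add_asset(table_id, kind="table", system=system)
        for column in columns:
            column_id = f"{table_id}.{column}"
            graph.add_asset(column_id, kind="column", system=system)
            graph.relate(table_id, "has_column", column_id)


# Assumed example systems; real scanners would read these from the sources.
graph = MetadataGraph()
scan(graph, "emr", {"patients": ["patient_id", "birth_date"]})
scan(graph, "claims", {"claims": ["claim_id", "patient_id", "amount"]})

print(graph.related("claims.claims", "has_column"))
```

The design point is that tables and columns become nodes and their relationships become labeled edges, so the same structure can later carry lineage, ownership, or classification links.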
A data catalog provides detailed lineage down to the attribute and column level, so analysts can explore the provenance of data and judge whether it can be trusted. Using AI/ML, a data catalog discovers and classifies data, giving users an intuitive search experience that even recognizes synonyms. You can search on business keywords and filter on out-of-the-box or custom facets to find just the data you’re looking for.
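As a rough illustration of these two ideas, the sketch below walks column-level lineage upstream and runs a synonym-aware keyword search with facet filters. Everything in it is assumed for the example: the `UPSTREAM`, `SYNONYMS`, and `ASSETS` data, the column names, and the `provenance` and `search` functions. The synonym table here is hand-written, whereas a catalog using AI/ML would presumably infer such relationships.

```python
# Hypothetical column-level lineage: each column maps to the columns it derives from.
UPSTREAM = {
    "mart.outcomes.readmit_rate": [
        "warehouse.visits.readmitted",
        "warehouse.visits.visit_id",
    ],
    "warehouse.visits.readmitted": ["emr.encounters.discharge_code"],
    "warehouse.visits.visit_id": ["emr.encounters.encounter_id"],
}

# Hand-written synonyms and assets, for illustration only.
SYNONYMS = {"patient": {"patient", "member", "subscriber"}}

ASSETS = [
    {"name": "emr.patients", "description": "patient demographics",
     "facets": {"domain": "clinical"}},
    {"name": "claims.members", "description": "member enrollment and claims",
     "facets": {"domain": "billing"}},
    {"name": "crm.calls", "description": "call detail records",
     "facets": {"domain": "support"}},
]


def provenance(column):
    """Return every upstream column, breadth-first, without repeats."""
    seen, queue = [], list(UPSTREAM.get(column, []))
    while queue:
        current = queue.pop(0)
        if current not in seen:
            seen.append(current)
            queue.extend(UPSTREAM.get(current, []))
    return seen


def search(keyword, **facets):
    """Match the keyword or its synonyms in descriptions, then filter on facets."""
    terms = SYNONYMS.get(keyword.lower(), {keyword.lower()})
    hits = []
    for asset in ASSETS:
        words = set(asset["description"].lower().split())
        matches_facets = all(asset["facets"].get(k) == v for k, v in facets.items())
        if terms & words and matches_facets:
            hits.append(asset["name"])
    return hits


print(provenance("mart.outcomes.readmit_rate"))
print(search("patient"))
print(search("patient", domain="billing"))
```

Searching for "patient" also returns the claims table described in terms of "member", and adding the `domain="billing"` facet narrows the result to that table alone.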
John Haddad is vice president at Informatica, where he runs product and technical marketing for the Big Data, Enterprise Data Catalog and Cloud/Hybrid data management product portfolios. He has over 25 years’ experience developing and marketing enterprise software, focusing on enterprise cloud data management over the last 10 years. Previously, John held various positions in product marketing, R&D, and management at Oracle and Right Hemisphere (acquired by SAP). John holds an AB in applied mathematics from UC Berkeley.
©2019, O'Reilly Media, Inc. • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.