Data lineage is critical to answering a wide range of questions about how data is being used within an organization. Which datasets and table columns are driving key performance indicators? How is certain privacy-sensitive data being used? Where do errors or outliers arise, and how do they propagate forward? Where are inefficient or unnecessary processing steps being taken? Tracking data lineage is also critical in real-world use cases such as regulatory reporting and compliance.
Sean Kandel presents novel interactive visualizations for exploring data lineage across multiple levels of detail. From high-level overviews of input-output relationships to fine-grained column dependency tracking, Sean explains how analysts can rapidly navigate lineage data and formulate provenance queries to gain insight into how data is being processed and transformed. By incorporating summary statistics, distributions, and data quality metrics, these visualizations augment lineage views so that analysts can inspect schema-level metadata alongside the results of large-scale data processing.
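To make the idea of column dependency tracking and provenance queries concrete, here is a minimal sketch in Python. It is not the tooling presented in the session; the `LineageGraph` class, its methods, and the table and column names are all hypothetical, chosen only to illustrate how upstream and downstream queries over column-level lineage might work, and how the same edges can be rolled up into a table-level overview.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Hypothetical column-level lineage store (illustrative only)."""

    def __init__(self):
        # Map each column to the columns it is derived from / feeds into.
        self._parents = defaultdict(set)
        self._children = defaultdict(set)

    def add_dependency(self, source, target):
        """Record that `target` is computed from `source` ("table.column")."""
        self._parents[target].add(source)
        self._children[source].add(target)

    @staticmethod
    def _reach(start, edges):
        # Breadth-first traversal collecting every column reachable from start.
        seen = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def upstream(self, column):
        """Provenance query: which columns contribute to `column`?"""
        return self._reach(column, self._parents)

    def downstream(self, column):
        """Impact query: which columns are affected by `column`?"""
        return self._reach(column, self._children)

    def table_level(self):
        """Roll column edges up into table-to-table edges (the overview)."""
        edges = set()
        for source, targets in self._children.items():
            for target in targets:
                src_table = source.split(".", 1)[0]
                tgt_table = target.split(".", 1)[0]
                if src_table != tgt_table:
                    edges.add((src_table, tgt_table))
        return edges


# Made-up example data: a KPI derived from raw order amounts,
# and a privacy-sensitive column copied into an enriched table.
graph = LineageGraph()
graph.add_dependency("orders.amount", "daily_sales.revenue")
graph.add_dependency("orders.order_date", "daily_sales.day")
graph.add_dependency("daily_sales.revenue", "kpi.weekly_revenue")
graph.add_dependency("customers.email", "orders_enriched.contact")

kpi_sources = graph.upstream("kpi.weekly_revenue")
email_uses = graph.downstream("customers.email")
overview = graph.table_level()
```

In this sketch, `upstream` answers the question of which columns drive a KPI, `downstream` answers where a privacy-sensitive column ends up, and `table_level` gives the coarse input-output view that a user might start from before drilling into columns.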
Sean Kandel is the founder and chief technical officer at Trifacta. Sean holds a PhD from Stanford University, where his research focused on new interactive tools for data transformation and discovery, such as Data Wrangler. Prior to Stanford, Sean worked as a data analyst at Citadel Investment Group.
©2016, O'Reilly Media, Inc.