A commonly cited figure is that data analysts spend 80% of their time preparing data and only 20% analyzing it. To change that ratio, organizations must build tools specifically designed for working with ad hoc (semistructured) data. Sean Kandel and Karthik Sethuraman discuss the development of a new technique that uses machine learning to discover and profile the inherent structure in ad hoc datasets.
Because data varies from use case to use case, the initial process of understanding a file's structure has to happen without human intervention. Sean and Karthik explain how they accomplish this by combining a lexer/tokenizer with unsupervised learning techniques to detect the structure within a wide range of data sources. The Pattern Profiler takes a sequence of records and groups them into an optimal set of clusters. It then tokenizes the records using a combination of primitive and domain-specific tokens and represents each group of records as a compact but still information-rich pattern. Neighboring clusters are then combined, and new parent (or super) patterns are discovered. Finally, this pattern tree is presented back to the user with compact pattern representations and example records at each level. The user can then search and profile the latent structure of the data at multiple levels, choosing the desired trade-off between compactness and descriptiveness.
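To make the tokenize, group, and merge steps concrete, here is a minimal Python sketch. It is not the Pattern Profiler itself: the token classes, the function names (`tokenize`, `group_by_pattern`, `generalize`, `build_pattern_tree`), and the sample records are all hypothetical, and grouping records by identical token pattern stands in for the unsupervised clustering the session describes. Parent patterns are formed here by a simple rule that merges digit and letter tokens into one broader class.

```python
import re
from collections import defaultdict

# Hypothetical primitive token classes; the first matching class wins.
TOKEN_SPEC = [
    ("DIGITS", r"\d+"),
    ("ALPHA", r"[A-Za-z]+"),
    ("SPACE", r"\s+"),
    ("PUNCT", r"."),  # any other single character, kept as a literal
]
LEXER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))


def tokenize(record):
    """Turn a raw record into a tuple of token classes and literals."""
    tokens = []
    for match in LEXER.finditer(record):
        kind = match.lastgroup
        if kind == "PUNCT":
            tokens.append(match.group())  # keep delimiters such as '=' as-is
        else:
            tokens.append(f"<{kind.lower()}>")
    return tuple(tokens)


def group_by_pattern(records):
    """Group records that share an identical token pattern.

    Assumption: exact-match grouping replaces real clustering here.
    """
    groups = defaultdict(list)
    for record in records:
        groups[tokenize(record)].append(record)
    return dict(groups)


def generalize(pattern):
    """Derive a coarser parent pattern: digits and letters become <alnum>."""
    parent = []
    for token in pattern:
        if token in ("<digits>", "<alpha>"):
            token = "<alnum>"
        if token == "<alnum>" and parent and parent[-1] == "<alnum>":
            continue  # collapse runs such as <alpha><digits>
        parent.append(token)
    return tuple(parent)


def build_pattern_tree(records):
    """Return {parent pattern: {child pattern: [example records]}}."""
    tree = defaultdict(dict)
    for pattern, members in group_by_pattern(records).items():
        tree[generalize(pattern)][pattern] = members
    return dict(tree)


if __name__ == "__main__":
    sample = ["id=42", "id=7", "id=abc", "ERROR 500", "ERROR 404"]
    for parent, children in build_pattern_tree(sample).items():
        print("".join(parent))
        for child, members in children.items():
            print("  " + "".join(child), "e.g.", members[0])
```

In this sketch, the parent level is the more compact view and the child level the more descriptive one, which mirrors the trade-off the pattern tree is meant to expose.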
Sean and Karthik share specific examples of this approach using common data formats and real-world customer use cases and discuss future plans for how this approach will evolve over time to handle a wider range of formats and provide more automated structuring suggestions for common tasks.
Sean Kandel is the founder and chief technical officer at Trifacta. Sean holds a PhD from Stanford University, where his research focused on new interactive tools for data transformation and discovery, such as Data Wrangler. Prior to Stanford, Sean worked as a data analyst at Citadel Investment Group.
Karthik Sethuraman is a senior software engineer at Trifacta, where, in addition to working on performance, Trifacta’s wrangle language, and core user experience, he helps build the inference layer that powers Trifacta’s predictive interaction. Previously, Karthik worked at Palantir and did research in computational biology.
©2017, O'Reilly Media, Inc. • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.