Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Intelligent pattern profiling on semistructured data with machine learning

Sean Kandel (Trifacta), Karthik Sethuraman (Trifacta)
11:00am–11:40am Wednesday, March 15, 2017
Data science & advanced analytics
Location: 230 C
Level: Intermediate
Average rating: 3.60 (5 ratings)

Who is this presentation for?

  • Engineers

Prerequisite knowledge

  • An understanding of machine learning, algorithms, graph theory, and automated structuring

What you'll learn

  • Explore a new technique leveraging machine learning to discover and profile the inherent structure in ad hoc datasets

Description

It’s often said that data analysts spend 80% of their time preparing data and only 20% analyzing it. To change that ratio, organizations must build tools specifically designed for working with ad hoc (semistructured) data. Sean Kandel and Karthik Sethuraman discuss the development of a new technique that uses machine learning to discover and profile the inherent structure in ad hoc datasets.

Because data varies from use case to use case, the initial process of understanding the structure of a file has to happen without human intervention. Sean and Karthik explain how they accomplish this by combining a lexer/tokenizer with unsupervised learning techniques to detect the structure within a wide range of data sources. The Pattern Profiler takes a sequence of records and groups them into an optimal set of clusters. It then tokenizes the records using a combination of primitive and domain-specific tokens and represents each group of records as a compact but still information-rich pattern. Neighboring clusters are then combined, and new parent (or super) patterns are discovered. Finally, this pattern tree is presented back to the user with compact pattern representations and example records at each level. This supports searching and profiling over the latent structure of the data at multiple levels, so the data analyst can choose the trade-off between compactness and descriptiveness that she wants.
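As a rough illustration of the idea described above (not the Pattern Profiler itself, whose implementation is not given here), the following Python sketch tokenizes records into patterns, groups records that share a pattern, and then generalizes same-length patterns into a parent pattern. Everything in it is an assumption made for illustration: the token names, the single domain-specific token (an IPv4 address), the function names, and above all the grouping step, which uses exact pattern matching as a simple stand-in for the unsupervised clustering mentioned in the abstract.

```python
import re
from collections import defaultdict

# Hypothetical token set. The domain-specific token (IPv4) is tried
# before the primitive tokens so that it wins when both could match.
TOKEN_SPECS = [
    ("ip", r"\d{1,3}(?:\.\d{1,3}){3}"),  # domain-specific token
    ("digit", r"\d+"),                   # primitive tokens
    ("alpha", r"[A-Za-z]+"),
    ("space", r"\s+"),
    ("punct", r"."),                     # any other single character
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPECS))


def tokenize(record):
    """Turn one record into a pattern: a tuple of token labels."""
    pattern = []
    for match in TOKEN_RE.finditer(record):
        kind = match.lastgroup
        if kind == "punct":
            pattern.append(match.group())  # keep punctuation literally
        elif kind == "space":
            pattern.append(" ")            # collapse whitespace runs
        else:
            pattern.append("{" + kind + "}")
    return tuple(pattern)


def profile(records):
    """Group records by pattern (a stand-in for real clustering)."""
    groups = defaultdict(list)
    for record in records:
        groups[tokenize(record)].append(record)
    return dict(groups)


def merge(p, q):
    """Generalize two same-length patterns; differing slots become {any}."""
    if len(p) != len(q):
        return None
    return tuple(a if a == b else "{any}" for a, b in zip(p, q))


def parent_patterns(leaf_patterns):
    """Map each parent pattern to the leaf patterns it covers."""
    by_length = defaultdict(list)
    for pattern in leaf_patterns:
        by_length[len(pattern)].append(pattern)
    tree = {}
    for patterns in by_length.values():
        parent = patterns[0]
        for pattern in patterns[1:]:
            parent = merge(parent, pattern)
        tree[parent] = patterns
    return tree


def render(pattern):
    """Compact string form of a pattern."""
    return "".join(pattern)


records = ["10.0.0.1 GET 200", "192.168.1.7 POST 404", "10.0.0.1 GET ok"]
leaves = profile(records)
tree = parent_patterns(list(leaves))
for parent, children in tree.items():
    print(render(parent))
    for child in children:
        print("  ", render(child), leaves[child][:2])
```

In this toy version, the leaf patterns are the descriptive end of the trade-off and the parent pattern is the compact end. A real system would presumably need a similarity measure between records or patterns to decide which clusters count as neighbors, which this sketch sidesteps by merging only patterns of equal length.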

Sean and Karthik share specific examples of this approach using common data formats and real-world customer use cases and discuss future plans for how this approach will evolve over time to handle a wider range of formats and provide more automated structuring suggestions for common tasks.

Sean Kandel

Trifacta

Sean Kandel is the founder and chief technical officer at Trifacta. Sean holds a PhD from Stanford University, where his research focused on new interactive tools for data transformation and discovery, such as Data Wrangler. Prior to Stanford, Sean worked as a data analyst at Citadel Investment Group.

Karthik Sethuraman

Trifacta

Karthik Sethuraman is a senior software engineer at Trifacta, where, in addition to working on performance, Trifacta’s wrangle language, and core user experience, he helps build the inference layer that powers Trifacta’s predictive interaction. Previously, Karthik worked at Palantir and did research in computational biology.