Presented By O'Reilly and Cloudera
Make Data Work
September 26–27, 2016: Training
September 27–29, 2016: Tutorials & Conference
New York, NY

Creating and evaluating a distance measure

Melissa Santos (Big Cartel)
11:20am–12:00pm Wednesday, 09/28/2016
Data-driven business
Location: 1 E 10/1 E11
Level: Intermediate
Average rating: *****
(5.00, 4 ratings)

Prerequisite knowledge

  • The ability and patience to work with subject-matter experts over time, carefully validating that the math is in agreement with their view of the data
  • Enough math to follow Euclidean (Cartesian) distance and cosine similarity
  • A basic familiarity with object-oriented code concepts
What you'll learn

  • Learn a practical approach to creating a distance metric and validating with business owners that it provides value
Description

    Whether you’re dealing with spam emails, merging records, or investigating clusters, a measure of how alike things are often makes them easier to work with. You may also have unstructured or vague data that isn’t incorporated into your data models (e.g., information from subject-matter experts who have a sense of whether something is good or bad, similar or different). Melissa Santos offers a practical approach to creating a distance metric and validating with business owners that it provides value, giving you the tools to turn that expert information into numbers you can compare and use to quickly see structure in the data.

    Melissa walks you through setting expectations for a distance, creating distance metrics, iterating with experts to check those expectations, validating the distance on a large chunk of the dataset, and then circling back to add more complexity. She also shares real-world examples, such as measuring how far an email is from the usual emails sent by a domain, scoring the quality of geographic data, and merging person records when they are sufficiently similar.
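
    As a rough illustration of the kind of measures the session builds on, the sketch below shows Euclidean (Cartesian) distance, cosine similarity, and a very basic distance between person records. It is not taken from the talk: the function names, field names, weights, and the 0.25 merge threshold are all hypothetical choices made for the example.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def normalize(value):
    """Hypothetical cleanup step: ignore case and surrounding whitespace."""
    return value.strip().lower()


def record_distance(rec_a, rec_b, weights):
    """Weighted share of fields that disagree: 0.0 = all match, 1.0 = none match.

    `weights` maps field name -> importance. The weights are an assumption;
    in practice they would come from talking to subject-matter experts.
    """
    total = sum(weights.values())
    mismatch = sum(
        weight
        for field, weight in weights.items()
        if normalize(rec_a.get(field, "")) != normalize(rec_b.get(field, ""))
    )
    return mismatch / total


def should_merge(rec_a, rec_b, weights, threshold=0.25):
    """Merge two records when their distance is at or below the threshold."""
    return record_distance(rec_a, rec_b, weights) <= threshold


# Made-up example records and weights.
weights = {"name": 1, "email": 2, "city": 1}
ann_1 = {"name": "Ann Lee", "email": "ann@example.com", "city": "Tucson"}
ann_2 = {"name": "ann lee ", "email": "ann@example.com", "city": "Phoenix"}

print(record_distance(ann_1, ann_2, weights))  # only "city" differs
print(should_merge(ann_1, ann_2, weights))
```

    A model this simple is a starting point for discussion with experts rather than a finished measure; the weights and the threshold are exactly the parts they would be asked to challenge.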

    Topics include:

    • What is a distance?
    • Turning expert opinion into training data
    • Making a very basic model
    • Why your model is wrong
    • Making it better
    • Working with experts and stakeholders to validate usefulness
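
    One way to picture the "why your model is wrong" and validation steps is to compare a distance against pairs that experts have already labeled. The sketch below is an assumption about how that check could look, not the method presented in the session; the helper names and the labeled pairs are invented for illustration.

```python
def agreement(scored_pairs, threshold):
    """Share of pairs where 'distance <= threshold' matches the expert label.

    `scored_pairs` is a list of (distance, expert_says_same) tuples.
    """
    hits = sum((dist <= threshold) == is_same for dist, is_same in scored_pairs)
    return hits / len(scored_pairs)


def best_threshold(scored_pairs):
    """Try each observed distance as a cutoff and keep the best-agreeing one.

    Ties go to the smallest cutoff, because candidates are tried in
    ascending order and max() keeps the first best value it sees.
    """
    candidates = sorted({dist for dist, _ in scored_pairs})
    return max(candidates, key=lambda t: agreement(scored_pairs, t))


def disagreements(scored_pairs, threshold):
    """Pairs where the model and the expert disagree; worth reviewing together."""
    return [
        (dist, is_same)
        for dist, is_same in scored_pairs
        if (dist <= threshold) != is_same
    ]


# Made-up data: (model distance, did the expert call the pair "the same"?)
labeled = [
    (0.05, True),
    (0.10, True),
    (0.30, True),
    (0.25, False),
    (0.60, False),
    (0.90, False),
]

cutoff = best_threshold(labeled)
print(cutoff, agreement(labeled, cutoff))
print(disagreements(labeled, cutoff))
```

    The pairs returned by `disagreements` are the useful output here: each one is either a labeling mistake or a sign that the distance is missing something the expert sees.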

    Melissa Santos

    Big Cartel

    Melissa Santos has over a decade of experience with all parts of the data pipeline, from ETLs to modeling. Her role as a data scientist at Big Cartel involves teaching both engineers and nontechnical people how to get the data they need. Melissa holds a PhD in applied math.