Whether you're filtering spam emails, merging records, or investigating clusters, a measure of how alike things are often makes data easier to work with. You may have unstructured or vague information that isn't incorporated into your data models (e.g., input from subject-matter experts who have a sense of whether something is good or bad, similar or different). Melissa Santos offers a practical approach to creating a distance metric and validating with business owners that it provides value, giving you the tools to turn that expert information into numbers you can compare and use to quickly see structures in the data.
Melissa walks you through setting expectations for a distance, creating distance metrics, iterating with experts to check expectations, validating the distance on a large chunk of the dataset, and then circling back to add more complexity. She also shares some real-world examples, such as distance from usual emails from a domain, quality scores for geographic data, and merging person records if they are sufficiently similar.
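To make the idea of merging person records concrete, here is a minimal Python sketch. It is not taken from the talk. The field names, the weights, the merge threshold, and the use of the standard library's difflib for string similarity are all assumptions chosen for illustration; in practice the weights and threshold would come out of the iteration with experts described above.

```python
from difflib import SequenceMatcher

# Hypothetical field weights (an assumption, not from the talk).
# Experts would typically tell you which fields matter most.
WEIGHTS = {"name": 0.5, "email": 0.3, "city": 0.2}


def field_distance(a: str, b: str) -> float:
    """Distance in [0, 1]: 0.0 for identical strings, 1.0 for nothing in common."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a and not b:
        # Two missing values are treated as identical here (a design assumption).
        return 0.0
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def record_distance(r1: dict, r2: dict, weights: dict = WEIGHTS) -> float:
    """Weighted average of per-field distances, also in [0, 1]."""
    total = sum(weights.values())
    weighted = sum(
        w * field_distance(r1.get(field, ""), r2.get(field, ""))
        for field, w in weights.items()
    )
    return weighted / total


def should_merge(r1: dict, r2: dict, threshold: float = 0.15) -> bool:
    """Merge when the records are closer than a hypothetical threshold."""
    return record_distance(r1, r2) <= threshold


if __name__ == "__main__":
    a = {"name": "Jane Doe", "email": "jane@example.com", "city": "Seattle"}
    b = {"name": "Jane Do", "email": "jane@example.com", "city": "Seattle"}
    print(record_distance(a, b), should_merge(a, b))
```

A sketch like this gives experts something to react to: show them pairs of records ranked by distance, ask whether the ordering matches their intuition, and adjust the weights or the per-field distance before validating on a larger sample.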
Melissa Santos has over a decade of experience with all parts of the data pipeline, from ETLs to modeling. Her role as a data scientist at Big Cartel involves teaching both engineers and nontechnical people how to get the data they need. Melissa holds a PhD in applied math.
©2016, O'Reilly Media, Inc. • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse, or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.