Sep 23–26, 2019

Sketching data and other magic tricks

Sophie Watson (Red Hat), William Benton (Red Hat)
1:30pm–5:00pm Tuesday, September 24, 2019
Location: 1E 11
Average rating: 4.87 (15 ratings)

Who is this presentation for?

  • Machine learning practitioners, data scientists, data engineers, and developers




Sophie Watson and William Benton explore a way to answer interesting queries about truly massive datasets almost instantly and with a fixed amount of space.

It sounds like magic, but you’ll go hands-on with the sketching data structures that work this magic and learn the key trick that makes them possible. Sophie and William introduce truly scalable techniques for several fundamental problems: set membership, set and document similarity, counting kinds of events, and counting distinct elements. You’ll learn how and when to use these structures as well as how they work, and you’ll see how the same techniques apply to parallel, distributed, and stream processing at scale. You’ll leave able to put these techniques to work in real data engineering and machine learning applications like join processing, document classification, and content personalization.
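As an illustration of the set-membership case, here is a minimal Bloom filter sketch in Python. It is not taken from the tutorial materials: the class name, the default sizes, and the choice of salted SHA-256 as the hash family are all assumptions made for this example. It shows the two properties the abstract points to, a fixed amount of space and the ability to combine partial results from separate partitions or streams.

```python
import hashlib


class BloomFilter:
    """Fixed-size approximate set (illustrative; names and defaults are assumptions).

    Lookups may return false positives but never false negatives.
    """

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit vector

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def merge(self, other):
        # Filters built with the same parameters combine with a bitwise OR,
        # so each partition can build its own filter independently.
        if (self.num_bits, self.num_hashes) != (other.num_bits, other.num_hashes):
            raise ValueError("filters must share parameters to be merged")
        merged = BloomFilter(self.num_bits, self.num_hashes)
        merged.bits = self.bits | other.bits
        return merged


# Two partitions each see part of the data, then their filters are combined.
left, right = BloomFilter(), BloomFilter()
left.add("alice")
right.add("bob")
both = left.merge(right)
print("alice" in both, "bob" in both)
```

Items that were added are always reported as present; an item that was never added is usually, but not always, reported as absent, and that error rate depends on the bit-vector size and the number of hash functions.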

Prerequisite knowledge

  • A working knowledge of Python

Materials or downloads needed in advance

  • A WiFi-enabled laptop

What you'll learn

  • How data sketches summarize large amounts of data in constant space and time
  • Which problems different data sketches are suited to, including the Bloom filter, Count-Min Sketch, HyperLogLog counter, and MinHash signatures
  • How to use these techniques for distributed computing and stream processing
  • How data sketching applies to database and ML workloads
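To make the "counting kinds of events" case concrete, the following is a small Count-Min Sketch in Python. It is a sketch under stated assumptions rather than the presenters' implementation: the class name, the `width` and `depth` defaults, and the salted SHA-256 hashing are choices made for this example only.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency table in fixed space (illustrative; parameters are assumptions).

    Estimates never undercount; hash collisions can only inflate them.
    """

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One counter per row, chosen by hashing the item with the row index as a salt.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # The smallest counter is the one least affected by collisions.
        return min(self.table[row][col] for row, col in self._cells(item))

    def merge(self, other):
        # Sketches with the same shape combine by element-wise addition.
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must share dimensions to be merged")
        merged = CountMinSketch(self.width, self.depth)
        for r in range(self.depth):
            merged.table[r] = [a + b for a, b in zip(self.table[r], other.table[r])]
        return merged


sketch = CountMinSketch()
for event in ["login"] * 5 + ["logout"] * 2:
    sketch.add(event)
print(sketch.estimate("login"), sketch.estimate("logout"))
```

The table size is fixed at construction time, so memory use does not grow with the number of events, and the `merge` method is what lets independent workers or stream windows be combined afterward.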

Sophie Watson

Red Hat

Sophie Watson is a senior data scientist at Red Hat, where she helps customers use machine learning to solve business problems in the hybrid cloud. She’s a frequent public speaker on topics including machine learning workflows on Kubernetes, recommendation engines, and machine learning for search. Sophie earned her PhD in Bayesian statistics.


William Benton

Red Hat

William Benton is an engineering manager and senior principal software engineer at Red Hat, where he leads a team of data scientists and engineers. He’s applied machine learning to problems ranging from forecasting cloud infrastructure costs to designing better cycling workouts. His focus is investigating the best ways to build and deploy intelligent applications in cloud native environments, but he’s also conducted research and development in the areas of static program analysis, managed language runtimes, logic databases, cluster configuration management, and music technology.



Advait Trivedi | Senior Data Scientist
10/07/2019 9:40am EDT

Great tutorial!
