Sep 23–26, 2019

Sketching data and other magic tricks

Sophie Watson (Red Hat), William Benton (Red Hat)
1:30pm–5:00pm Tuesday, September 24, 2019
Location: 1E 08
Secondary topics: Streaming and IoT, Temporal data and time-series analytics

Who is this presentation for?

Machine learning practitioners, data scientists, data engineers, and developers

Level

Intermediate

Description

What if we told you that you could answer interesting queries about truly massive data sets almost instantly and with a fixed amount of space? You might say that it sounds like magic. In this hands-on workshop, we’ll introduce several sketching data structures that work this magic and show you the key trick that makes them possible. We’ll introduce truly scalable techniques for several fundamental problems like set membership, set and document similarity, counting kinds of events, and counting distinct elements. You’ll learn how and when to use these structures as well as how they work. You’ll see how the same techniques work for parallel, distributed, and stream processing at scale. Finally, you’ll see how you can put these techniques to work in real data engineering and machine learning applications like join processing, document classification, and content personalization.
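As an illustration of the set-membership idea described above, here is a minimal Bloom filter sketch in Python. It is not taken from the workshop materials: the class name `BloomFilter`, the parameters `num_bits` and `num_hashes`, and the choice of salted SHA-256 for hashing are all assumptions made for this example. It shows the two properties the description highlights: the structure uses a fixed amount of space no matter how many items are added, and two filters built in parallel can be combined with a bitwise OR.

```python
import hashlib


class BloomFilter:
    """Fixed-size approximate set: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Hypothetical default sizes chosen only for illustration.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit vector

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # True only if every position for this item is set.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def merge(self, other):
        # Union of two filters with identical parameters is a bitwise OR.
        assert (self.num_bits, self.num_hashes) == (other.num_bits, other.num_hashes)
        merged = BloomFilter(self.num_bits, self.num_hashes)
        merged.bits = self.bits | other.bits
        return merged


# Two filters built independently (for example, on separate partitions).
left, right = BloomFilter(), BloomFilter()
left.add("alice")
right.add("bob")
combined = left.merge(right)
print("alice" in combined, "bob" in combined)  # True True (no false negatives)
```

A lookup for an item that was never added usually returns `False`, but it can return `True` when other items happen to set the same bits; increasing `num_bits` lowers that false-positive rate.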

Prerequisite knowledge

Reading knowledge of Python

Materials or downloads needed in advance

Attendees will need only a laptop with Wi-Fi; we'll run everything from notebooks in the cloud.

What you'll learn

Attendees will learn:
- how data sketches summarize large amounts of data in constant space and time
- which problems different data sketches (the Bloom filter, count-min sketch, HyperLogLog counter, and MinHash signatures) are useful for
- how to use these techniques for distributed computing and stream processing
- how data sketching applies to database and ML workloads
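To make the "counting kinds of events" use case concrete, the following is a minimal count-min sketch in Python. It is a sketch under stated assumptions rather than the presenters' code: the class name `CountMinSketch`, the `width` and `depth` defaults, and the salted SHA-256 hashing are hypothetical choices for this example. The table size is fixed up front, estimates can only overcount, and sketches built on separate streams merge by element-wise addition.

```python
import hashlib


class CountMinSketch:
    """Approximate event counter using a fixed depth x width table."""

    def __init__(self, width=256, depth=4):
        # Hypothetical default sizes chosen only for illustration.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # One column per row, derived by salting a single hash function.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._columns(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only inflate a cell, so the minimum is the best estimate.
        return min(self.table[row][col] for row, col in self._columns(item))

    def merge(self, other):
        # Sketches with identical parameters combine by adding cells.
        assert (self.width, self.depth) == (other.width, other.depth)
        merged = CountMinSketch(self.width, self.depth)
        for r in range(self.depth):
            merged.table[r] = [a + b for a, b in zip(self.table[r], other.table[r])]
        return merged


# Two sketches built over separate streams, then merged.
morning, evening = CountMinSketch(), CountMinSketch()
for event in ["click", "view", "click"]:
    morning.add(event)
for event in ["click", "purchase"]:
    evening.add(event)
day = morning.merge(evening)
print(day.estimate("click"))  # at least 3; exact unless hash collisions occur
```

A wider table reduces the chance of collisions, and more rows reduce the chance that every row collides for the same item.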

Sophie Watson

Red Hat

Sophie Watson is a software engineer in an Emerging Technology Group at Red Hat, where she applies her data science and statistics skills to solving business problems and informing next-generation infrastructure for intelligent application development. She has a background in mathematics and holds a PhD in Bayesian statistics, in which she developed algorithms to estimate intractable quantities quickly and accurately.

William Benton

Red Hat

William Benton leads a team of data scientists and engineers at Red Hat, where he has also applied machine learning to problems ranging from forecasting cloud infrastructure costs to designing better cycling workouts. His current focus is investigating the best ways to build and deploy intelligent applications in cloud-native environments, but he has also conducted research and development in the areas of static program analysis, managed language runtimes, logic databases, cluster configuration management, and music technology.
