Presented by O'Reilly and Cloudera • Make Data Work • Sept 29–Oct 1, 2015 • New York, NY

Sketching big data with Spark: Randomized algorithms for large-scale data analytics

Reynold Xin (Databricks)
4:30pm–5:00pm Tuesday, 09/29/2015
Hardcore Data Science
Location: 1 E10/1 E11 Level: Intermediate
Average rating: 4.00 (4 ratings)

Many common data analysis methods are expensive to scale to big datasets. In this talk, we introduce a recent effort in Spark to employ randomized algorithms for a number of these common, expensive methods: membership testing, cardinality estimation, stratified sampling, frequent items, and quantile estimation. We will discuss the algorithms of choice and their implementation in Spark.
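As a rough illustration of how such approximate methods surface in Spark's DataFrame API, here is a minimal Scala sketch. It is not the code from the talk: the dataset, thresholds, and session setup are invented, and some of these helpers (approxQuantile, bloomFilter, approx_count_distinct) only shipped in Spark 2.x, after this session.

// Minimal sketch of approximate/randomized helpers in Spark's DataFrame API.
// Data and parameters are illustrative; several calls require Spark 2.x.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.approx_count_distinct

object SketchingDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sketching-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = (1 to 100000).map(i => (i % 3, s"user_${i % 5000}")).toDF("stratum", "user")

    // Cardinality estimation: HyperLogLog-based distinct count, 5% relative standard deviation.
    df.select(approx_count_distinct($"user", 0.05)).show()

    // Frequent items: values of `stratum` appearing in more than 25% of rows.
    df.stat.freqItems(Seq("stratum"), 0.25).show(truncate = false)

    // Quantile estimation: approximate median of `stratum` within 1% relative error.
    val Array(median) = df.stat.approxQuantile("stratum", Array(0.5), 0.01)
    println(s"approximate median stratum = $median")

    // Stratified sampling: a different sampling fraction per stratum.
    df.stat.sampleBy("stratum", Map(0 -> 0.1, 1 -> 0.5, 2 -> 0.9), 42L)
      .groupBy("stratum").count().show()

    // Membership testing: Bloom filter over the user column (expected items, false-positive rate).
    val bf = df.stat.bloomFilter("user", 5000L, 0.01)
    println(s"might contain user_42? ${bf.mightContain("user_42")}")

    spark.stop()
  }
}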


Reynold Xin

Databricks

Reynold Xin is a cofounder and chief architect at Databricks, as well as an Apache Spark PMC member and the release manager for Spark 2.0. Prior to Databricks, he was pursuing a PhD at the UC Berkeley AMPLab, where he worked on large-scale data processing.


Comments

Ashwin Kumar
09/29/2015 1:00pm EDT

On distributed sampling – this post gives a very efficient way to do distributed weighted reservoir sampling: http://gregable.com/2007/10/reservoir-sampling.html
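The linked post describes the Efraimidis–Spirakis keying trick: each item draws u uniform in (0, 1), gets the key u^(1/weight), and the k items with the largest keys form the sample. Because the keys are independent per item, the top-k can be computed per partition and then merged, which is what makes the scheme easy to distribute. Below is a minimal RDD sketch of that idea with made-up data; it is not code from the talk or the post.

// Sketch of distributed weighted reservoir sampling via the u^(1/w) key trick:
// each item gets key = random^(1/weight); the k largest keys form the sample.
// RDD.top(k) computes per-partition top-k sets and merges them on the driver,
// so no global shuffle is needed.
import org.apache.spark.sql.SparkSession
import scala.util.Random

object WeightedReservoirSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("weighted-reservoir").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // (item, weight) pairs; heavier items should appear in the sample more often.
    val weighted = sc.parallelize(Seq(("a", 1.0), ("b", 5.0), ("c", 10.0), ("d", 0.5)), 2)
    val k = 2

    val sample = weighted
      .map { case (item, w) => (math.pow(Random.nextDouble(), 1.0 / w), item) }
      .top(k)          // largest keys win; computed per partition, then merged
      .map(_._2)

    println(sample.mkString(", "))
    spark.stop()
  }
}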