Presented By O'Reilly and Cloudera
Make Data Work
Sept 29–Oct 1, 2015 • New York, NY

Sketching big data with Spark: Randomized algorithms for large-scale data analytics

Reynold Xin (Databricks)
4:30pm–5:00pm Tuesday, 09/29/2015
Hardcore Data Science
Location: 1 E10/1 E11
Level: Intermediate
Average rating: 4.00 (4 ratings)

Many common data analysis methods are expensive to scale to big datasets. In this talk, we introduce a recent effort in Spark to employ randomized algorithms for a number of common but expensive operations: membership testing, cardinality estimation, stratified sampling, frequent items, and quantile estimation. We will discuss the algorithms of choice and their implementation in Spark.
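To give a flavor of the kind of randomized algorithm the abstract refers to, below is a minimal pure-Python sketch (not Spark's implementation) of the Misra–Gries algorithm, a standard one-pass technique for the frequent-items problem. The function name and the choice of Misra–Gries as the illustrative algorithm are this example's assumptions, not details from the talk.

```python
def misra_gries(stream, k):
    """One-pass frequent-items sketch using at most k-1 counters.

    Guarantee: any item occurring more than len(stream)/k times
    is certain to appear among the returned candidates (the sketch
    may also return false positives, which a second pass can prune).
    """
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # No free counter: decrement every counter and
            # drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return set(counters)

# "a" (50/88) and "b" (30/88) both exceed n/k = 22, so they must survive.
stream = ["a"] * 50 + ["b"] * 30 + list("cdefghij")
print(misra_gries(stream, 4))  # → {'a', 'b'}
```

The appeal for a system like Spark is that the sketch uses constant memory per partition and partial sketches can be merged, so it fits naturally into a distributed aggregation.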

Reynold Xin

Reynold Xin is a cofounder and chief architect at Databricks as well as an Apache Spark PMC member and release manager for Spark’s 2.0 release. Prior to Databricks, Reynold was pursuing a PhD at the UC Berkeley AMPLab, where he worked on large-scale data processing.

Comments on this page are now closed.


Ashwin Kumar
09/29/2015 1:00pm EDT

On distributed sampling – this post gives a very efficient way to do distributed weighted reservoir sampling
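The post the commenter links to is not preserved here, but a well-known scheme of this kind is the Efraimidis–Spirakis "A-Res" algorithm; whether it is the one the post describes is an assumption. Each item with weight w draws u ~ Uniform(0, 1) and receives the key u**(1/w); keeping the k largest keys yields a weighted sample without replacement. Because keys are drawn independently, per-partition top-k heaps can simply be merged, which is what makes the scheme distribute well. A minimal single-machine sketch:

```python
import heapq
import random


def weighted_reservoir(stream, k, rng=random):
    """Efraimidis–Spirakis A-Res weighted reservoir sampling.

    stream yields (item, weight) pairs with weight > 0; returns a
    weighted random sample of k items without replacement.
    """
    heap = []  # min-heap of (key, item); the root is the weakest survivor
    for item, weight in stream:
        key = rng.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            # This item's key beats the current weakest survivor.
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]


# Heavily weighted items are far more likely to appear in the sample.
data = [("heavy", 1000.0), ("light", 0.001)] + [(i, 1.0) for i in range(100)]
sample = weighted_reservoir(data, 5)
print(len(sample))  # → 5
```

In a distributed setting each partition would run the same loop independently and the driver would merge the per-partition heaps by keeping the k largest keys overall; the `weighted_reservoir` name and the merge step described here are illustrative, not a specific library API.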