Presented By O'Reilly and Cloudera
Make Data Work
December 1–3, 2015 • Singapore

Sketching big data with Spark: Randomized algorithms for large-scale data analytics

Reynold Xin (Databricks)
11:50am–12:30pm Thursday, 12/03/2015
IoT and Real-time
Location: 324

Many common data analysis methods are expensive to scale to large datasets. In this talk, we introduce a recent effort in Spark to employ randomized algorithms for a number of common but expensive operations: membership testing, cardinality estimation, stratified sampling, frequent-item detection, and quantile estimation. We will discuss the algorithms chosen and their implementation in Spark.
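
To make the idea of a randomized "sketch" concrete, here is a minimal Bloom filter for approximate membership testing, written in plain Python. It is an illustrative assumption and not the implementation discussed in the talk or shipped in Spark: the class name `BloomFilter`, its parameters `expected_items` and `fpp`, and the SHA-256 double-hashing scheme are all chosen here for the example. The structure trades a small, tunable false-positive rate for a fixed memory footprint, and it never reports a false negative.

```python
import hashlib
import math


class BloomFilter:
    """Approximate set membership: no false negatives, tunable false positives.

    Hypothetical illustration only; not Spark's implementation.
    """

    def __init__(self, expected_items: int, fpp: float = 0.03):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(
            1, math.ceil(-expected_items * math.log(fpp) / (math.log(2) ** 2))
        )
        self.num_hashes = max(
            1, round(self.num_bits / expected_items * math.log(2))
        )
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two 64-bit slices of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so strides vary
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


bf = BloomFilter(expected_items=1000, fpp=0.01)
for i in range(1000):
    bf.add(f"user-{i}")

print(bf.might_contain("user-42"))  # True: an added item is always reported
```

Because the bit array has a fixed size and two filters built with the same parameters can be combined with a bitwise OR, structures of this kind suit distributed settings where each partition builds a partial summary that is merged afterwards.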

Reynold Xin


Reynold Xin is a cofounder and chief architect at Databricks as well as an Apache Spark PMC member and release manager for Spark’s 2.0 release. Prior to Databricks, Reynold was pursuing a PhD at the UC Berkeley AMPLab, where he worked on large-scale data processing.