Presented By O'Reilly and Cloudera
Make Data Work
Feb 17–20, 2015 • San Jose, CA

Using Multiple Persistence Layers in Spark to Build a Scalable Prediction Engine

Richard Williamson (Silicon Valley Data Science)
1:30pm–2:10pm Thursday, 02/19/2015
Hadoop & Beyond
Location: 210 C/G

This talk will examine the benefits of using multiple persistence strategies to build an end-to-end predictive engine. Spark Streaming backed by a Cassandra persistence layer allows the rapid lookups and inserts needed for real-time model scoring. Spark backed by Parquet files stored in HDFS allows high-throughput model training and tuning with Spark MLlib. Both persistence layers can also be queried ad hoc via Spark SQL, which makes it easy to analyze model sensitivity and accuracy. Storing the data this way also makes it possible to use existing tools: CQL for operational queries on the data stored in Cassandra, and Impala for larger analytical queries on the data stored in HDFS, further maximizing the benefits of the flexible architecture.
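
The split between the two persistence layers can be illustrated with a minimal plain-Python sketch. It does not use Spark, Cassandra, or Parquet: `KeyedStore` and `ColumnarStore` are hypothetical in-memory stand-ins, and the column names and values are made up. The only point is to contrast the two access patterns the abstract describes, keyed lookups and upserts for real-time scoring versus column scans for model training.

```python
from collections import defaultdict


class KeyedStore:
    """Hypothetical stand-in for a Cassandra table: rows addressed by key."""

    def __init__(self):
        self._rows = {}

    def upsert(self, key, **columns):
        # Writing to an existing key merges the new columns into the row.
        self._rows.setdefault(key, {}).update(columns)

    def get(self, key):
        # Point lookup by key; returns None if the key was never written.
        return self._rows.get(key)


class ColumnarStore:
    """Hypothetical stand-in for Parquet files: values grouped by column."""

    def __init__(self):
        self._columns = defaultdict(list)

    def append(self, rows):
        # Assumes every row carries the same set of columns.
        for row in rows:
            for name, value in row.items():
                self._columns[name].append(value)

    def scan(self, *names):
        # Read only the requested columns, as a training job would.
        return list(zip(*(self._columns[n] for n in names)))


# Serving side: small keyed reads and writes.
serving = KeyedStore()
serving.upsert("user-1", clicks=3)
serving.upsert("user-1", score=0.72)

# Training side: bulk appends followed by column scans.
history = ColumnarStore()
history.append([
    {"user": "user-1", "clicks": 3, "converted": 1},
    {"user": "user-2", "clicks": 0, "converted": 0},
])
training_pairs = history.scan("clicks", "converted")
```

Here `serving.get("user-1")` returns the merged row for a single key, while `training_pairs` holds just the two columns a training job asked for, without touching the `user` column.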


  • The general model building/scoring use case
  • Benefits of Spark Persistence to Cassandra
  • Benefits of Spark Persistence to Parquet files in HDFS
  • Using Spark MLlib to perform predictive modeling
  • Spark Streaming workflow
  • Source Data Feed – collect event streams
  • Apply predictions – score event streams with given models, update predictions in Cassandra
  • Aggregate data to create model training/tuning datasets
  • Store aggregated data in Parquet files in HDFS
  • Models are tuned based on new data
  • Ad-hoc queries against historical data in Spark SQL and/or Impala
  • Operational queries against event data in CQL
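
The workflow steps in the outline above can be sketched end to end in plain Python. This is not Spark Streaming or MLlib code, and it is not the speaker's implementation: the micro-batches are ordinary lists, `predictions` is a dict standing in for the Cassandra table, `training_rows` is a list standing in for Parquet files in HDFS, and the "model" is a hypothetical per-key running mean chosen only to keep the example short. All names and values are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical event micro-batches: (key, observed value) pairs.
batches = [
    [("sensor-a", 10.0), ("sensor-b", 4.0)],
    [("sensor-a", 14.0), ("sensor-b", 6.0)],
]

predictions = {}    # stand-in for the Cassandra prediction table
training_rows = []  # stand-in for Parquet files in HDFS
model = {}          # key -> predicted value (empty until first training)


def score(model, key, default=0.0):
    """Predict a value for a key; fall back to a default for unseen keys."""
    return model.get(key, default)


def train(rows):
    """Fit the toy model: the mean observed value per key."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for key, value in rows:
        sums[key] += value
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


for batch in batches:
    # Apply predictions: score each event with the current model and
    # update the keyed store.
    for key, _observed in batch:
        predictions[key] = score(model, key)

    # Aggregate and store: append the batch to the training dataset.
    training_rows.extend(batch)

    # Tune: refit the model on all data collected so far.
    model = train(training_rows)
```

Each batch is scored with the model trained on earlier batches only, so the predictions written for the second batch reflect the first batch's data, and the model is then refit to include the new observations.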

Richard Williamson

Silicon Valley Data Science

Richard has been at the cutting edge of big data since its inception, leading multiple efforts to build multi-petabyte Hadoop platforms and maximizing business value by combining data science with big data. He has extensive experience creating advanced analytic systems using data warehousing and data mining technologies.