Start Small Before Going Big

Hadoop: Case Studies, Gramercy Suite (NY Hilton)

The availability of Hadoop and other big data technologies has made it possible to build models with more data than statisticians of even a decade ago would have thought possible. However, the best practices for effectively using massive amounts of data in the construction and evaluation of statistical models are still being invented. As with most difficult, complex problems, the rule applies: “If you’re not failing, you’re not trying hard enough.” The majority of ideas tried do not work. Best practices should include keeping failures small and inexpensive, quickly eliminating approaches that are not likely to work out, and keeping track of these failures so they won’t be repeated. Every development environment should encourage trying multiple approaches to problem solving.

This talk presents a case study of statistical modeling in the insurance industry and examines the trade-offs between two approaches. The first is working with all of the data in a Hadoop cluster, which involves complex programming, significant set-up times, and a batch-like programming mentality. The second is rapidly iterating through models on smaller data sets in a dynamic R environment, at the possible expense of model accuracy. We will examine the benefits and shortcomings of both approaches, with model accuracy, job execution time, and overall project time among the performance measures. Technologies examined will include programming a Hadoop cluster from R using the RHadoop interface and the RevoScaleR package from Revolution Analytics.

Steve Yun


Steve Yun is a Principal Predictive Modeler at the Allstate Research and Planning Center. He works on developing statistical models for a variety of insurance applications.

Joseph Rickert

Revolution Analytics

I am a marketing manager at Revolution Analytics with a passion for analyzing data. I have worked at a number of successful Silicon Valley start-ups including Sytek, Alantec, Parallan Computer and Scotts-Valley Instruments. I have graduate degrees in both the Humanities and Statistics. I taught statistics briefly at SJSU and I blog at

