The economics of data aggregation and analysis are being disrupted by
falling costs for storage and CPU power, the continuing shift of
business processes online, and the deluge of data that is being
generated as a consequence.
Innovative technologies have emerged to cope with the storage and retrieval of Big Data, yet analysis tools have received less emphasis. Many emerging data sets do not fit within existing software paradigms: either their size overwhelms traditional desktop tools such as Excel, or their range of data types (geocodes, for example) prevents them from being pipelined into more powerful but narrowly designed tools. Most importantly, closed-source tools cannot keep pace with the leading edge of innovation in statistical and machine-learning algorithms.
Enter the open source programming language R. R has been dubbed the
lingua franca for statistical computing and graphical analysis, with a
pedigree tracing back several decades to Bell Labs. Though its
million-plus users are concentrated within academia, R is gaining
currency within several high-profile quantitative analysis groups,
including Google’s Customer Insights team and Barclays Global
Investors. In addition, R’s extensibility via user-contributed
packages has spawned an active developer community.
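As a flavor of that extensibility, pulling in a contributed package typically takes two lines; `ggplot2` below is just one example of the many packages on CRAN, not one singled out in this talk:

```r
# Sketch: extending R with a user-contributed package from CRAN
# (ggplot2 is only an illustrative choice among thousands of packages)
install.packages("ggplot2")  # download and install from a CRAN mirror
library(ggplot2)             # load the package into the current session
```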
In this session, I will focus on applying R’s powerful visualization
and analysis capabilities to the kinds of large, multidimensional data
sets that increasingly confront developers. Along the way, I will
highlight R’s functional programming features, its compact syntax for
statistical modeling, and its ease of connectivity with persistent data stores.
In particular, I will present the following two case studies applying R to large, freely available data sets:
- an analysis of NASA’s Landsat imagery of Brazil’s center-west
agricultural regions to detect correlates for soybean harvest yields,
and a derived predictor of the Brazilian soybean market based in part
on these correlates.
- a validation of Bill James’ sabermetrics approach to batting
performance using 30 years of Major League Baseball statistics, and a
derived predictor for batters’ salaries.
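To give a taste of the compact modeling syntax the talk highlights, a regression like the batting-salary predictor can be expressed in a single formula. The data frame and variable names below are invented for illustration; they are not the talk's actual code or data:

```r
# Illustrative sketch only: regress salary on a sabermetric measure.
# The numbers here are made up; the real study uses 30 years of MLB stats.
batting <- data.frame(
  salary = c(400, 750, 1200, 300, 950),          # salary in $1000s (invented)
  obp    = c(0.310, 0.360, 0.395, 0.290, 0.370)  # on-base percentage (invented)
)
fit <- lm(salary ~ obp, data = batting)  # one-line linear model via formula syntax
summary(fit)                             # coefficients, R^2, significance
predict(fit, data.frame(obp = 0.350))    # predicted salary for a new batter
```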
For all of its strengths, R has an admittedly steep learning curve.
While source code for these examples will be provided, this talk will
emphasize techniques and approach over detail. This session seeks to
give developers the courage to learn R, the confidence to include it
in their OSS arsenal, and the wisdom to recognize opportunities for applying it.
Michael E. Driscoll is a Principal at Dataspora, Inc., a business analytics consultancy in San Francisco. He has eight years of experience developing large-scale databases and inference algorithms across academia and industry, with applications ranging from metal-breathing microbes to municipal real estate. He also founded and until 2008 served on the board of CustomInk.com, an Inc. 500 e-commerce firm.
He is the co-chair of the Bay Area R Users Group, and has used R extensively for the visualization and analysis of genome data, GIS data, and macroeconomic data sets.
Michael has a Ph.D. in Bioinformatics and Systems Biology from Boston University, where he was a DOE Computational Science Graduate Fellow, and an A.B. from Harvard College.