The R programming language has become a standard environment for statistical computing, but out of the box R is restricted to analyzing data sets that fit in memory. Hadoop has become a popular platform for storing and analyzing data sets that are too large to fit on a single machine. Not surprisingly, there’s significant interest in bringing these two platforms together to perform sophisticated analysis on data that’s too large to fit in memory on a single machine. Although the R community is developing several systems to support this, such as Ricardo and RHIPE, as well as newer interfaces such as Segue and Hadoop InteractiVE, there’s still considerable confusion about how to use the two platforms together effectively. This talk will survey the available R/Hadoop interfaces and compare them using an example use case. We’ll also discuss which problems are a good fit for distributed analysis with R, and which are not.
Jonathan has spent more than 15 years as a software developer, with a focus in the last few years on processing large data sets using tools such as Hadoop. Currently, Jonathan is a Lead Engineer on the Business Intelligence/Big Data team at Orbitz Worldwide. Jonathan is also a co-founder and organizer of the Chicago Hadoop User Group and founder of the Chicago Big Data User Group.
Ramesh is a member of the Operations and Engineering Team at Orbitz Worldwide with a focus on analysis of distributed, high-availability systems in the travel data domain. His passion is the fusion of distributed, multi-active datacenter infrastructure, parallel processing, and analysis platforms like R for Big Data. He believes that the next generation of data science and support infrastructure, combined with many machine learning methodologies, is a must for the smooth operation of enterprise data centers and private clouds. He received a Ph.D. in Ocean Engineering from Texas A&M University with a focus on instrumentation and sensor platforms.