R is the most popular language in the data science community, with more than 2 million users and more than 4,000 packages. R’s adoption has grown thanks to its easy-to-use statistical language, graphics, packages, tools, and active community. In this session we will introduce Distributed R, a new open source technology that addresses the scalability and performance limitations of standard R. Because R is single-threaded and does not scale to large datasets, we built Distributed R as a distributed system that extends R and addresses many of these limitations. Distributed R efficiently shares sparse structured data, leverages multiple cores, and dynamically partitions data to mitigate load imbalance. Our results show the promise of this approach: many important machine learning and graph algorithms can be expressed in a single framework and run substantially faster.
By combining Distributed R with columnar analytic database technology, data scientists can now build complex machine learning models on terabytes of data without sampling.
This session is sponsored by HP.
Sunil Venkayala is a Senior Technical Product Manager at HP Vertica in Cambridge, Mass. He leads the Distributed R open-source technology initiative and the advanced analytics features of the HP Vertica platform. Prior to joining HP, he was a product manager and architect of the Oracle Fusion Sales Configurator Application. Before that, he was an expert group member for the Java Data Mining (JDM) standards and led development of many modules of Oracle’s Data Mining platform.
Sunil is a co-author of the book “Java Data Mining” and has published several articles.
Indrajit Roy is a principal researcher at HP Labs and part of the HP Vertica engineering team. He builds distributed systems for machine learning and graph analytics. Indrajit has multiple publications in systems research and a best paper award at Middleware 2013. Indrajit received his PhD from the University of Texas at Austin.