Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Using R for scalable data analytics: From single machines to Hadoop Spark clusters

Vanja Paunic (Microsoft), Robert Horton (Microsoft), Hang Zhang (Microsoft), Srini Kumar (LevaData, Inc.), Mengyue Zhao (Microsoft), John-Mark Agosta (Microsoft), Mario Inchiosa (Microsoft), Debraj GuhaThakurta (Microsoft)
9:00am–12:30pm Tuesday, March 14, 2017
Data science & advanced analytics
Location: LL21 C/D
Level: Intermediate
Secondary topics: R
Average rating: 2.50 (4 ratings)

Who is this presentation for?

  • Data scientists, machine-learning scientists, and statisticians

Prerequisite knowledge

  • Programming experience in R
  • Familiarity with machine-learning algorithms

Materials or downloads needed in advance

  • A WiFi-enabled laptop with an SSH client that supports port forwarding (On macOS or Linux, run the ssh command in a terminal window. On Windows, download and install plink.exe.)

What you'll learn

  • How to perform scalable data science in R using appropriate compute infrastructure, distributed algorithms, and out-of-memory computational techniques, and how to access code and worked-out samples from public repositories and adopt them in practice


R is one of the most widely used languages in the data science, statistics, and machine-learning (ML) communities. Open source R has a rich set of packages and functions for statistics and ML, but when it comes to scalable data science, many CRAN R users are hindered by two things: available functions that cannot handle big data efficiently, and a lack of knowledge about which computing environments are appropriate for scaling R scripts from a single node to elastic, distributed cloud services, including Spark 2.0 integrations.

Vanja Paunic, Robert Horton, Hang Zhang, Srini Kumar, Mengyue Zhao, John-Mark Agosta, Mario Inchiosa, and Debraj GuhaThakurta walk you through creating end-to-end data science solutions in R on Spark clusters and consuming them in production.

The tutorial materials and the scripts that are used to create the Spark clusters will be published to a public GitHub repository, so you’ll be able to create Spark clusters identical to the ones you use in the tutorial by running the scripts even after the tutorial session completes.

Vanja Paunic


Vanja Paunic is a data scientist in the Algorithms and Data Science Group at Microsoft London. She works with external companies to build machine-learning solutions on Microsoft’s AI Cloud Platform. She holds a PhD in computer science with a focus on data mining in the biomedical domain from the University of Minnesota.

Robert Horton


Bob Horton is a senior data scientist on the user understanding team at Bing. Bob holds an adjunct faculty appointment in health informatics at the University of San Francisco, where he gives occasional lectures and advises students on data analysis and simulation projects. Previously, he was on the professional services team at Revolution Analytics. Long before becoming a data scientist, he was a regular scientist (with a PhD in biomedical science and molecular biology from the Mayo Clinic). Some time after that, he got an MS in computer science from California State University, Sacramento.

Hang Zhang


Hang Zhang is a senior data science manager on the Algorithm and Data Science team in the Data group at Microsoft, where his major focus is on team data science processes and the Cortana Intelligence Competition Platform. Previously, Hang was a staff data scientist at WalmartLabs in charge of internal business intelligence tools and a senior data scientist at Opera Solutions. He is a senior member of the IEEE. Hang holds a PhD in industrial and systems engineering and an MS in statistics from Rutgers University.

Srini Kumar

LevaData, Inc.

Srini Kumar is the vice president of product management and data science at LevaData, Inc. Previously, he was a director of data science in the Algorithms and Data Science group at Microsoft, where he worked with strategic customers in the areas of Cortana Analytics and Microsoft R Server; headed product management for the information management (EIM) product suite at SAP; originated and architected a product on HANA to analyze human genome variants, which led to a discovery relating diabetes to a person’s origin and resulted in two patent applications related to modeling genomic variants and one related to enterprise information management; and helped turn around and sell a startup in the area of on-demand supply chain management software. Srini holds a master’s degree in industrial engineering from the University of Wisconsin-Madison and a bachelor’s degree in mechanical engineering from the Indian Institute of Technology, Madras.

Mengyue Zhao


Mengyue Zhao is a data scientist at Microsoft, where she develops end-to-end machine-learning solutions for various use cases in cloud computing and distributed platforms (e.g., Azure, Hadoop, and Spark). Mengyue focuses on scalable analysis, including data processing, feature engineering, feature selection, predictive modeling, and web services development. Previously, she was a data analyst at GE Digital, mainly focusing on solving machine-learning problems in the manufacturing domain. Mengyue has broad interests in machine learning, deep learning, and data mining and is passionate about harnessing the power of big data to answer interesting questions and drive business decisions. Mengyue holds a master’s degree in analytics from the University of San Francisco.

John-Mark Agosta


John-Mark Agosta is a principal data scientist in IMML at Microsoft. Previously, he worked with startups and labs in the Bay Area, including the original Knowledge Industries, and was a researcher at Intel Labs, where he was awarded a Santa Fe Institute Business Fellowship in 2007, and at SRI International after receiving his PhD from Stanford. He has participated in the annual Uncertainty in AI conference since its inception in 1985, proving his dedication to probability and its applications. When feeling low, he recharges his spirits by singing Russian music with Slavyanka, the Bay Area’s Slavic music chorus.

Mario Inchiosa


Mario Inchiosa is a principal software engineer at Microsoft, where he focuses on scalable machine learning and AI. Previously, Mario served as Revolution Analytics’s chief scientist; analytics architect in IBM’s Big Data organization, where he worked on advanced analytics in Hadoop, Teradata, and R; US chief scientist in Netezza Labs, bringing advanced analytics and R integration to Netezza’s SQL-based data warehouse appliances; US chief science officer at NuTech Solutions, a computer science consultancy specializing in simulation, optimization, and data mining; and senior scientist at BiosGroup, a complexity science spin-off of the Santa Fe Institute. Mario holds bachelor’s, master’s, and PhD degrees in physics from Harvard University. He has been awarded four patents and has published over 30 research papers, earning publication of the year and open literature publication excellence awards.

Debraj GuhaThakurta


Debraj GuhaThakurta is a senior data scientist lead for AI and research, the Cloud Data Platform, algorithms, and data science at Microsoft, where he focuses on developing the team data science process and the use of different Microsoft data platforms and toolkits (Spark, SQL Server, ADL, Hadoop, DL toolkits, etc.) for creating scalable and operationalized analytical processes. He has many years of experience using data science and machine learning applications, particularly in biomedical and forecasting domains, and has published more than 25 peer-reviewed papers, book chapters, and patents. Debraj holds a PhD in chemistry and biophysics.