Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Using R and Python for scalable data science, machine learning, and AI

Mario Inchiosa (Microsoft), Vanja Paunic (Microsoft), Robert Horton (Microsoft), Debraj GuhaThakurta (Microsoft), Ali-Kazim Zaidi (Microsoft), Tomas Singliar (Microsoft), John-Mark Agosta (Microsoft)
9:00am–12:30pm Tuesday, March 6, 2018
Average rating: 4.00 (4 ratings)

Who is this presentation for?

  • Data scientists and machine learning engineers

Prerequisite knowledge

  • Programming experience in R or Python
  • Familiarity with machine learning concepts

Materials or downloads needed in advance

  • A laptop with WiFi and an SSH client with port-forwarding capability (On macOS or Linux, simply run the SSH command in a terminal window. On Windows, run plink.exe.)

What you'll learn

  • How to perform scalable data science in R and Python using appropriate libraries and compute infrastructure
  • How to quickly and easily apply deep learning to custom use cases with limited labeled data
  • How to access code and worked-out samples from public repositories and adopt them in practice


R and Python top the list of languages used in data science and machine learning, and data scientists and engineers fluent in one of these languages are increasingly marketable. Pretrained deep learning models and transfer learning accessed via R and Python APIs are making custom image classification with large or small amounts of labeled data easily accessible to data scientists and application developers.

Mario Inchiosa, Vanja Paunić, Robert Horton, Debraj GuhaThakurta, Ali Zaidi, Tomas Singliar, and John-Mark Agosta walk you through creating end-to-end data science solutions in R and Python on virtual machines, in Spark environments, and on cloud-based infrastructure and take you through consuming them in production. Along the way, they cover strategies and best practices for porting and interoperating between R and Python and share a novel deep learning use case for image classification.

The tutorial materials and the scripts that are used to create the virtual machines configured as single-node Spark clusters will be published to a public GitHub repository, so you’ll be able to create environments identical to the ones you use in the tutorial by running the scripts even after the tutorial session completes.

Topics include:

  • Limitations on the scalability of R and Python scripts
  • Functions and techniques to overcome those limits
  • A hands-on, end-to-end deep learning-based image classification example in R and Python using functions that scale from single nodes to distributed computing clusters
  • Data exploration and wrangling
  • Featurization and modeling
  • Deployment and consumption
  • Scaling with distributed computing
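
As a rough illustration of the first two topics (this is not taken from the tutorial materials): a script that loads an entire dataset into memory stops scaling once the data outgrows RAM. One common workaround is to process the data in fixed-size chunks and keep only running totals. The sketch below uses only the Python standard library; `read_in_chunks`, `chunked_mean`, and the sample data are hypothetical names and values chosen for this example.

```python
import csv
import io
from typing import Iterable, Iterator, List


def read_in_chunks(rows: Iterable, chunk_size: int) -> Iterator[List]:
    """Yield successive lists of at most chunk_size rows from any row iterator."""
    chunk = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # emit the final, possibly shorter, chunk
        yield chunk


def chunked_mean(rows: Iterable, column: int, chunk_size: int = 1000) -> float:
    """Compute the mean of one column while holding only one chunk in memory."""
    total = 0.0
    count = 0
    for chunk in read_in_chunks(rows, chunk_size):
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("no rows to average")
    return total / count


# Hypothetical in-memory CSV standing in for a file too large to load at once.
sample = io.StringIO("1.0\n2.0\n3.0\n4.0\n5.0\n")
mean_value = chunked_mean(csv.reader(sample), column=0, chunk_size=2)
```

The same pattern (partial results per chunk, combined at the end) is what lets a computation move from a single node to a cluster, since each chunk can be handled by a different worker.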

Mario Inchiosa


Mario Inchiosa is a principal software engineer at Microsoft, where he focuses on scalable machine learning and AI. Previously, Mario served as Revolution Analytics’s chief scientist; analytics architect in IBM’s Big Data organization, where he worked on advanced analytics in Hadoop, Teradata, and R; US chief scientist in Netezza Labs, bringing advanced analytics and R integration to Netezza’s SQL-based data warehouse appliances; US chief science officer at NuTech Solutions, a computer science consultancy specializing in simulation, optimization, and data mining; and senior scientist at BiosGroup, a complexity science spin-off of the Santa Fe Institute. Mario holds bachelor’s, master’s, and PhD degrees in physics from Harvard University. He has been awarded four patents and has published over 30 research papers, earning publication of the year and open literature publication excellence awards.

Vanja Paunic


Vanja Paunic is a data scientist in the Algorithms and Data Science Group at Microsoft London. She works on building machine learning solutions with external companies utilizing Microsoft’s AI Cloud Platform. She holds a PhD in computer science with a focus on data mining in the biomedical domain from the University of Minnesota.

Robert Horton


Bob Horton is a senior data scientist on the user understanding team at Bing. Bob holds an adjunct faculty appointment in health informatics at the University of San Francisco, where he gives occasional lectures and advises students on data analysis and simulation projects. Previously, he was on the professional services team at Revolution Analytics. Long before becoming a data scientist, he was a regular scientist (with a PhD in biomedical science and molecular biology from the Mayo Clinic). Some time after that, he got an MS in computer science from California State University, Sacramento.

Debraj GuhaThakurta


Debraj GuhaThakurta is a senior data scientist lead for AI and research, the Cloud Data Platform, algorithms, and data science at Microsoft, where he focuses on developing the team data science process and the use of different Microsoft data platforms and toolkits (Spark, SQL Server, ADL, Hadoop, DL toolkits, etc.) for creating scalable and operationalized analytical processes. He has many years of experience using data science and machine learning applications, particularly in biomedical and forecasting domains, and has published more than 25 peer-reviewed papers, book chapters, and patents. Debraj holds a PhD in chemistry and biophysics.

Ali-Kazim Zaidi


Ali Zaidi is a data scientist in Microsoft’s AI and Research Group, where he spends his day trying to make distributed computing and machine learning in the cloud easier, more efficient, and more enjoyable for data scientists and developers alike. Previously, Ali was a research associate at NERA (National Economic Research Associates), providing statistical expertise on financial risk, securities valuation, and asset pricing. He studied statistics at the University of Toronto and computer science at Stanford University.

Tomas Singliar


Tomas Singliar is a data scientist in Microsoft’s AI and Research Group. Tomas’s favorite hammer is probabilistic and Bayesian modeling, which he applies analytically and predictively to business data. He has published a dozen papers in, and serves as a reviewer for, several top-tier AI conferences, including AAAI and UAI, and holds four patents in intent recognition through inverse reinforcement learning. Tomas studied machine learning at the University of Pittsburgh.

John-Mark Agosta


John Mark Agosta is a principal data scientist at Microsoft, where he leads a team that is expanding the machine learning and artificial intelligence capabilities of Azure. Previously, John worked with startups and labs in the Bay Area, including “The Connected Car 2025” at Toyota ITC, peer-to-peer malware detection at Intel, and automated planning at SRI. His dedication to probability and AI led him to found an annual applications workshop for the Uncertainty in AI conference. When feeling low, he recharges his spirits by singing Russian music with Slavyanka, the Bay Area’s Slavic music chorus.

Comments on this page are now closed.


Tomas Singliar | SR DATA SCIENTIST
03/12/2018 12:03am PDT

The tutorial has some Microsoft paid software technology (the ML Server) used for featurization. A lengthy licensing process is not required, however. Instead, one can provision the Data Science VMs, which are slightly more expensive than regular VMs but have the ML Server preinstalled.

Other than featurization, most of the tutorial uses examples in open R.

steve miller | EDITOR
02/15/2018 7:18am PST

How much of the focus will be on Microsoft commercial vs. freely available open source solutions in this presentation?