R is one of the most widely used languages in the data science, statistics, and machine-learning (ML) communities. Open source R has a rich set of packages and functions for statistics and ML, but many CRAN-R users who need scalable data science are hindered in two ways: the available functions cannot handle big data efficiently, and users often lack knowledge of the appropriate computing environments for scaling R scripts from a single node to elastic, distributed cloud services, including Spark 2.0 integrations.
Vanja Paunić, Robert Horton, Hang Zhang, Srini Kumar, Mengyue Zhao, John Mark Agosta, Mario Inchiosa, and Debraj GuhaThakurta walk you through creating end-to-end data science solutions in R on Spark clusters and consuming them in production.
The tutorial materials and the scripts that are used to create the Spark clusters will be published to a public GitHub repository, so you’ll be able to create Spark clusters identical to the ones you use in the tutorial by running the scripts even after the tutorial session completes.
Vanja Paunić is a data scientist on the Azure Machine Learning team at Microsoft. Previously, Vanja was a research scientist in the field of bioinformatics, where she published on uncertainty in genetic data, genetic admixture, and prediction of genes. She holds a PhD in computer science with a focus on data mining from the University of Minnesota.
Bob Horton is a senior data scientist on the deep partner engagement team within Microsoft’s AI and Research Group, where he helps independent software vendors build and deploy machine learning solutions for their customers. Previously, he worked on the professional services team at Revolution Analytics. Long before becoming a data scientist, he was a regular scientist (with a PhD in biomedical science and molecular biology from the Mayo Clinic). Some time after that, he got an MS in computer science from California State University, Sacramento. Bob currently holds an adjunct faculty appointment in health informatics at the University of San Francisco, where he gives occasional lectures and advises students on data analysis and simulation projects.
Hang Zhang is a senior data science manager on the Algorithm and Data Science team in the Data group at Microsoft, where his major focus is on team data science processes and the Cortana Intelligence Competition Platform. Previously, Hang was a staff data scientist at WalmartLabs in charge of internal business intelligence tools and a senior data scientist at Opera Solutions. He is a senior member of the IEEE. Hang holds a PhD in industrial and systems engineering and an MS in statistics from Rutgers University.
Srini Kumar is the vice president of product management and data science at LevaData, Inc. Previously, he was a director of data science in the Algorithms and Data Science group at Microsoft, where he worked with strategic customers in the areas of Cortana Analytics and Microsoft R Server; headed product management for the information management (EIM) product suite at SAP; originated and architected a product on HANA to analyze human genome variants, which led to a discovery relating diabetes to a person’s origin and resulted in two patent applications related to modeling genomic variants and one related to enterprise information management; and helped turn around and sell a startup in the area of on-demand supply chain management software. Srini holds a master’s degree in industrial engineering from the University of Wisconsin-Madison and a bachelor’s degree in mechanical engineering from the Indian Institute of Technology, Madras.
Mengyue Zhao is a data scientist at Microsoft, where she develops end-to-end machine-learning solutions for various use cases in cloud computing and distributed platforms (e.g., Azure, Hadoop, and Spark). Mengyue focuses on scalable analysis, including data processing, feature engineering, feature selection, predictive modeling, and web services development. Previously, she was a data analyst at GE Digital, mainly focusing on solving machine-learning problems in the manufacturing domain. Mengyue has broad interests in machine learning, deep learning, and data mining and is passionate about harnessing the power of big data to answer interesting questions and drive business decisions. Mengyue holds a master’s degree in analytics from the University of San Francisco.
John Mark Agosta is a principal data scientist in IMML at Microsoft. Over his career, he has worked with startups and labs in the Bay Area, including the original Knowledge Industries, and was a researcher at Intel Labs, where he was awarded a Santa Fe Institute Business Fellowship in 2007, and at SRI International after receiving his PhD from Stanford. He has participated in the annual Uncertainty in AI conference since its inception in 1985, proving his dedication to probability and its applications. When feeling low, he recharges his spirits by singing Russian music with Slavyanka, the Bay Area’s Slavic music chorus.
Mario Inchiosa is a principal software engineer at Microsoft, where he focuses on delivering parallelized, scalable advanced analytics integrated with the R language. Previously, Mario served as Revolution Analytics’s chief scientist; analytics architect in IBM’s Big Data organization, where he worked on advanced analytics in Hadoop, Teradata, and R; US chief scientist in Netezza Labs, bringing advanced analytics and R integration to Netezza’s SQL-based data warehouse appliances; US chief science officer at NuTech Solutions, a computer science consultancy specializing in simulation, optimization, and data mining; and senior scientist at BiosGroup, a complexity science spin-off of the Santa Fe Institute. Mario holds bachelor’s, master’s, and PhD degrees in physics from Harvard University. He has been awarded four patents and has published over 30 research papers, earning publication of the year and open literature publication excellence awards.
Debraj GuhaThakurta is a senior data scientist lead for AI and research, the Cloud Data Platform, algorithms, and data science at Microsoft, where he focuses on developing the team data science process and the use of different Microsoft data platforms and toolkits (Spark, SQL Server, ADL, Hadoop, DL toolkits, etc.) for creating scalable and operationalized analytical processes. He has many years of experience using data science and machine learning applications, particularly in biomedical and forecasting domains, and has published more than 25 peer-reviewed papers, book chapters, and patents. Debraj holds a PhD in chemistry and biophysics.
©2017, O'Reilly Media, Inc. • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.