R and Python top the list of languages used in data science and machine learning, and data scientists and engineers fluent in one of these languages are increasingly marketable. Pretrained deep learning models and transfer learning accessed via R and Python APIs are making custom image classification with large or small amounts of labeled data easily accessible to data scientists and application developers.
Mario Inchiosa, Vanja Paunić, Robert Horton, Debraj GuhaThakurta, Ali Zaidi, Tomas Singliar, and John Mark Agosta walk you through creating end-to-end data science solutions in R and Python on virtual machines, in Spark environments, and on cloud-based infrastructure and take you through consuming them in production. Along the way, they cover strategies and best practices for porting and interoperating between R and Python and share a novel deep learning use case for image classification.
The tutorial materials, along with the scripts used to create the virtual machines configured as single-node Spark clusters, will be published to a public GitHub repository, so you can recreate environments identical to the tutorial’s by running the scripts, even after the session ends.
Mario Inchiosa is a principal software engineer at Microsoft, where he focuses on scalable machine learning and AI. Previously, Mario served as Revolution Analytics’s chief scientist; analytics architect in IBM’s Big Data organization, where he worked on advanced analytics in Hadoop, Teradata, and R; US chief scientist in Netezza Labs, bringing advanced analytics and R integration to Netezza’s SQL-based data warehouse appliances; US chief science officer at NuTech Solutions, a computer science consultancy specializing in simulation, optimization, and data mining; and senior scientist at BiosGroup, a complexity science spin-off of the Santa Fe Institute. Mario holds bachelor’s, master’s, and PhD degrees in physics from Harvard University. He has been awarded four patents and has published over 30 research papers, earning publication of the year and open literature publication excellence awards.
Vanja Paunić is a data scientist in the Algorithms and Data Science Group at Microsoft London. She works on building machine learning solutions with external companies utilizing Microsoft’s AI Cloud Platform. She holds a PhD in computer science with a focus on data mining in the biomedical domain from the University of Minnesota.
Bob Horton is a senior data scientist on the user understanding team at Bing. Bob holds an adjunct faculty appointment in health informatics at the University of San Francisco, where he gives occasional lectures and advises students on data analysis and simulation projects. Previously, he was on the professional services team at Revolution Analytics. Long before becoming a data scientist, he was a regular scientist (with a PhD in biomedical science and molecular biology from the Mayo Clinic). Some time after that, he got an MS in computer science from California State University, Sacramento.
Debraj GuhaThakurta is a senior data scientist lead for AI and research, the Cloud Data Platform, algorithms, and data science at Microsoft, where he focuses on developing the team data science process and the use of different Microsoft data platforms and toolkits (Spark, SQL Server, ADL, Hadoop, DL toolkits, etc.) for creating scalable and operationalized analytical processes. He has many years of experience using data science and machine learning applications, particularly in biomedical and forecasting domains, and has published more than 25 peer-reviewed papers, book chapters, and patents. Debraj holds a PhD in chemistry and biophysics.
Ali Zaidi is a data scientist in Microsoft’s AI and Research Group, where he spends his day trying to make distributed computing and machine learning in the cloud easier, more efficient, and more enjoyable for data scientists and developers alike. Previously, Ali was a research associate at NERA (National Economic Research Associates), providing statistical expertise on financial risk, securities valuation, and asset pricing. He studied statistics at the University of Toronto and computer science at Stanford University.
Tomas Singliar is a data scientist in Microsoft’s AI and Research Group. Tomas’s favorite hammer is probabilistic and Bayesian modeling, which he applies analytically and predictively to business data. He has published a dozen papers in several top-tier AI conferences, including AAAI and UAI, and serves as a reviewer for them. He also holds four patents in intent recognition through inverse reinforcement learning. Tomas studied machine learning at the University of Pittsburgh.
John Mark Agosta is a principal data scientist at Microsoft, where he leads a team that is expanding the machine learning and artificial intelligence capabilities of Azure. Previously, John worked with startups and labs in the Bay Area, including “The Connected Car 2025” at Toyota ITC, peer-to-peer malware detection at Intel, and automated planning at SRI. His dedication to probability and AI led him to found an annual applications workshop for the Uncertainty in AI conference. When feeling low, he recharges his spirits by singing Russian music with Slavyanka, the Bay Area’s Slavic music chorus.
©2018, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • email@example.com