Presented By O'Reilly and Cloudera
Make Data Work
Feb 17–20, 2015 • San Jose, CA

Large-scale Machine Learning Day

Shawn Scully (Dato), Carlos Guestrin (Apple | University of Washington), Alice Zheng (Amazon), Chris DuBois (Dato), Yucheng Low (Dato)
9:00am–5:00pm Wednesday, 02/18/2015
Data Science
Location: LL20 D
Average rating: 4.43 (7 ratings)



This all-day, hands-on training program provides a quick start to building and deploying predictive applications at scale. You will learn simple and effective ways of building powerful machine learning models and deploying them. We will walk you through all the steps of prototyping and production: data cleaning, feature engineering, model building and evaluation, and deployment. Using GraphLab Create, PySpark, and other open source tools, the same code works for prototyping and production, whether on your personal laptop, on the cloud, or on a Hadoop cluster.

More specifically, the program focuses on high-value applications such as personalized recommendations, image analysis using deep learning, and unstructured text analysis and data matching. With a few lines of Python, learn to visualize your data, perform feature engineering at scale, and build state-of-the-art machine learning models. Finally, to round it off, practice building pipelines of data analysis jobs, deploying and monitoring predictive applications, all from your laptop.

Tentative topics:

  • Overview of large-scale machine learning tools
  • Data engineering with GraphLab Create and PySpark
  • Deep dive into personalized recommenders
  • Deep learning made easy: image analysis with deep features
  • Data matching
  • Deploying machine learning models in production


Attendees must be familiar with Python, and have an understanding of basic machine learning and data analysis concepts.

Please go to the setup page and follow instructions to set up your machine for this training program.


Shawn Scully


Shawn is the VP of Customer Success & Applications at Dato where he helps make it easy to build cool experiences with data. He is data geeky and loves inspired technologies, businesses, and gadgets. His technical background spans recommendation systems and business analytics, physics simulations, and energy. He holds a PhD in Materials Science from Stanford University and a BA in Physics from Cornell University.


Carlos Guestrin

Apple | University of Washington

Carlos Guestrin is the director of machine learning at Apple and the Amazon Professor of Machine Learning in Computer Science and Engineering at the University of Washington. Carlos was the cofounder and CEO of Turi (formerly Dato and GraphLab), a machine-learning company acquired by Apple. A world-recognized leader in the field of machine learning, Carlos was named one of the 2008 Brilliant 10 by Popular Science. He received the 2009 IJCAI Computers and Thought Award for his contributions to artificial intelligence and a Presidential Early Career Award for Scientists and Engineers (PECASE).


Alice Zheng


Alice Zheng is a senior manager of applied science on the machine learning optimization team on Amazon’s advertising platform. She specializes in research and development of machine learning methods, tools, and applications. She’s the author of Feature Engineering for Machine Learning. Previously, Alice worked at GraphLab, Dato, and Turi, where she led the machine learning toolkits team and spearheaded user outreach, and was a researcher in the Machine Learning Group at Microsoft Research – Redmond. Alice holds PhD and BA degrees in computer science and a BA in mathematics, all from UC Berkeley.


Chris DuBois


Chris DuBois is a data scientist focused on building tools for other data scientists. At Dato, he has helped design and implement tools for creating recommendation systems as well as large-scale text analysis. His current work makes it simpler to train models that generalize well. After studying Applied Mathematics at Pomona College, he obtained a Ph.D. in Statistics from the University of California, Irvine, where he researched latent variable models for social network data occurring over time.


Yucheng Low


Yucheng Low is a co-founder and Chief Architect of GraphLab Inc. He led the development of the SFrames and SGraphs scalable data structures underpinning the GraphLab Create product. He completed his PhD in Machine Learning at Carnegie Mellon University in 2013, advised by Prof. Carlos Guestrin, where he studied parallel and distributed systems for large-scale machine learning. As part of his thesis work, he also co-developed the open source PowerGraph system for distributed machine learning, which has achieved state-of-the-art performance on a variety of benchmarks.

Comments on this page are now closed.


Alice Zheng
02/16/2015 7:34am PST

All the set up info is here:

We’ll be posting the slides over the next couple of days.

Please let us know if you run into any problems!

Sophia DeMartini
02/16/2015 5:19am PST

Hi Sebastian,

You should be receiving an email with instructions soon. We’ll also be posting them to this page shortly.

Thank you,

Sebastian Castro
02/16/2015 5:19am PST

Will a link to the material (VM, slides, other) be available soon? We are two days away from the tutorial, and some people will be traveling for this.

Alice Zheng
02/10/2015 12:58pm PST

Cygwin won’t be enough. We’ll provide a VMware setup for Windows. We’ll also provide a public snapshot of the whole training setup (including software and datasets) on, which you’ll be able to run from any browser.

Instructions for download and installation will be available in the next couple of days. If you are already registered, you’ll get an email. The course descriptions will also be updated with more info. Please stay tuned!

Rachel B Warren
02/10/2015 10:27am PST

Is Cygwin, which creates a Unix-like environment, good enough, or should I set up a full Linux dual boot?

Alice Zheng
01/27/2015 8:44am PST

Yes, you’ll need your own computer. Mac or Linux is best. We’ll provide a VM for Windows. We’ll also provide a web interface, but that will depend on internet connectivity on site, so we recommend it only as a backup.

Sean Harrington
01/25/2015 11:13pm PST

Are there any special computing requirements for the tutorial (like Mac, Linux, Windows OS)? Will we be using our own computers?

Alice Zheng
01/12/2015 2:56am PST

The target audience is beginning data scientists and anyone who wants to build an intelligent data app. It would be helpful to have rudimentary knowledge of machine learning and familiarity with Python. Please let us know if you have any other questions.

Naveen Maram
01/09/2015 8:37am PST

Who is the target audience for this? What kind of background does one need to benefit from this?