Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Learning from incomplete, imperfect data with probabilistic programming

Mike Lee Williams (Cloudera Fast Forward Labs)
11:50am–12:30pm Thursday, March 16, 2017
Data science & advanced analytics
Location: 230 A
Level: Beginner
Secondary topics:  AI, Hardcore Data Science
Average rating: 3.80 out of 5 (5 ratings)

Who is this presentation for?

  • Data scientists, statisticians, product managers, and people who work in businesses where risk is fundamental or data is limited

Prerequisite knowledge

  • Familiarity with the concepts of machine learning

What you'll learn

  • Understand Bayesian inference, where it's useful, and why it's hard to do yourself
  • Learn how probabilistic programming makes it easier
  • Explore modern probabilistic programming languages, including PyMC3 and Stan


Real-world data is incomplete and imperfect. The right way to handle it is with Bayesian inference. Michael Williams demonstrates how probabilistic programming languages hide the gory details of this elegant but potentially tricky approach, making a powerful statistical method easy and enabling rapid iteration and new kinds of data-driven products.

Michael begins by introducing Bayesian inference and using the approach to solve a famous problem (the German Tank Problem) in three lines of code. (The code we’ll write is so simple you won’t need to be a programmer or a mathematician to understand it.) Michael then offers an overview of two working Fast Forward Labs product prototypes that crucially depend on Bayesian inference—one that supports decisions about consumer loans and one that models the future of the NYC real estate market—to highlight the advantages and use cases of the Bayesian approach, which include domains where data is scarce, where prior institutional knowledge is important, and where quantifying risk is crucial.
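The naive approach described above can be sketched in a few lines of Python. This is an illustrative sketch, not the speaker's code: it assumes a uniform prior on the number of tanks between 1 and a hypothetical `prior_max`, assumes serial numbers are observed without replacement, and uses plain rejection sampling. The function name, the example serial numbers, and all parameter values are made up for illustration. The core really is three steps: guess a tank count from the prior, simulate the data that guess would produce, and keep the guess only if the simulation matches what was observed.

```python
import random


def posterior_samples(observed, prior_max=500, n_draws=200_000, seed=0):
    """Rejection-sample the number of tanks N given observed serial numbers.

    Assumptions (hypothetical, for illustration only):
    - prior on N is uniform over 1..prior_max
    - serials are drawn without replacement from 1..N
    - a simulation "matches" when its largest serial equals the observed
      largest serial (the maximum carries all the information about N here)
    """
    rng = random.Random(seed)
    k = len(observed)
    target = max(observed)
    accepted = []
    for _ in range(n_draws):
        n = rng.randint(1, prior_max)               # 1. guess N from the prior
        if n < k:
            continue                                # cannot observe k distinct serials
        simulated = rng.sample(range(1, n + 1), k)  # 2. simulate the data
        if max(simulated) == target:                # 3. keep guesses that match
            accepted.append(n)
    return accepted
```

Calling `posterior_samples([19, 40, 42, 60])` returns a list of plausible tank counts; a histogram of that list approximates the posterior. Only a small fraction of guesses survive the match test, which is the inefficiency that makes this approach impractical for anything but tiny problems.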

But as you’ll see, this naive approach to implementing Bayesian inference has serious limitations and is only useful for tiny problems. Michael explores the challenges involved in speeding it up and shares solutions ranging from classics like Metropolis-Hastings and Markov chain Monte Carlo (MCMC) to modern industrial-strength algorithms like NUTS (the No-U-Turn Sampler) and ADVI (automatic differentiation variational inference). These algorithms are complicated, and implementing them so they give the right answer quickly is difficult.
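To give a flavor of the classic end of that range, here is a minimal Metropolis-Hastings sketch for the German Tank Problem. It is not taken from the talk. It assumes a uniform prior on the number of tanks N up to a hypothetical `prior_max`, and it uses the fact that k distinct serials with largest value m have likelihood 1/C(N, k) for N >= m. The proposal is a symmetric random walk, so the acceptance ratio reduces to a ratio of unnormalized posteriors. All names and parameter values are illustrative.

```python
import math
import random


def unnormalized_posterior(n, k, m, prior_max):
    """Posterior of N up to a constant: uniform prior times likelihood 1/C(n, k)."""
    if n < m or n > prior_max:
        return 0.0  # outside the support: impossible under prior or data
    return 1.0 / math.comb(n, k)


def metropolis_hastings(k, m, prior_max=500, n_steps=50_000, step=20, seed=0):
    """Random-walk Metropolis-Hastings over the integer tank count N.

    k: number of observed serials, m: largest observed serial.
    prior_max, n_steps and step are hypothetical tuning choices.
    """
    rng = random.Random(seed)
    current = m  # start at the smallest N consistent with the data
    p_current = unnormalized_posterior(current, k, m, prior_max)
    chain = []
    for _ in range(n_steps):
        # Symmetric proposal: move by a uniform integer in [-step, step].
        proposal = current + rng.randint(-step, step)
        p_proposal = unnormalized_posterior(proposal, k, m, prior_max)
        # Accept with probability min(1, p_proposal / p_current).
        if rng.random() < p_proposal / p_current:
            current, p_current = proposal, p_proposal
        chain.append(current)  # a rejected move repeats the current state
    return chain
```

Calling `metropolis_hastings(k=4, m=60)` yields a chain whose values, after discarding an initial burn-in, approximate the posterior. Unlike rejection sampling, every step contributes a sample, but the step size, burn-in length, and convergence checks all need care, which is part of why hand-rolled implementations are hard to get right.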

Which brings us to the real subject of this talk: probabilistic programming languages, a family of languages that treat fundamental probabilistic ideas such as random variables and probability distributions as primitive objects, which makes code short, simple, and declarative. They also have expert-written, blazing-fast implementations of the latest and greatest inference algorithms built right in. Michael examines a handful of probabilistic programming languages, taking a particularly close look at Stan and PyMC3, comparing their performance and deployment trade-offs and showing how the German Tank Problem and our consumer loan and NYC real estate problems could be solved using them.

Mike Lee Williams

Cloudera Fast Forward Labs

Mike Lee Williams is a research engineer at Cloudera Fast Forward Labs, where he builds prototypes that bring the latest ideas in machine learning and AI to life and helps Cloudera’s customers understand how to make use of these new technologies. Mike holds a PhD in astrophysics from Oxford.