Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Distributed clinical models: Inference without sharing patient data

Balasubramanian Narasimhan (Stanford University), John-Mark Agosta (Microsoft), Philip Lavori (Stanford University)
5:10pm–5:50pm Wednesday, March 7, 2018

Who is this presentation for?

  • Cloud architects, privacy experts, statisticians, and clinical researchers

Prerequisite knowledge

  • A basic understanding of statistical computation, cloud architecture, and network authentication methods

What you'll learn

  • Learn an alternative approach to creating statistical models: a network of distributed cloud applications that communicate only aggregate data rather than sharing private data

Description

Previously, medical researchers who wanted to run a large, multi-institution study needed to create a central registry of subjects’ personal data, collected from different institutions. Despite strict HIPAA compliance by cloud offerings such as Azure, such aggregated datasets are few and far between due to institutional barriers to sharing sensitive personal data.

But statistical learning models need not have all their data exposed in one place. Equivalent models can be learned by passing messages among distributed iterative algorithms that communicate only aggregate values. Balasubramanian Narasimhan, John-Mark Agosta, and Philip Lavori outline a privacy-preserving alternative that produces statistical models equivalent to one fit on the entire dataset, implemented as a set of remote cloud applications that communicate with a master application to build a common model. The remote clouds form a star-shaped network that exchanges partial results asynchronously with the master until convergence.
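
To make the idea of exchanging only aggregate values concrete, here is a minimal sketch of one way such a scheme could work. It is not the presenters' implementation: it assumes a one-predictor logistic regression fit by Newton's method, simulates the sites as in-process lists instead of remote cloud applications, and runs synchronously rather than asynchronously. The function names and the sample data are hypothetical. Each site returns only its summed score vector and information matrix; the master adds them up and updates the shared coefficients.

```python
import math

def site_summary(rows, beta):
    """Runs at one site: returns only aggregate sums, never patient rows."""
    score = [0.0, 0.0]                   # gradient of the log-likelihood
    info = [[0.0, 0.0], [0.0, 0.0]]      # information matrix (negative Hessian)
    for x, y in rows:
        p = 1.0 / (1.0 + math.exp(-(beta[0] + beta[1] * x)))
        resid, w = y - p, p * (1.0 - p)
        score[0] += resid
        score[1] += resid * x
        info[0][0] += w
        info[0][1] += w * x
        info[1][0] += w * x
        info[1][1] += w * x * x
    return score, info

def fit_distributed(sites, tol=1e-10, max_iter=50):
    """Runs at the master: sums site aggregates and takes Newton steps."""
    beta = [0.0, 0.0]                    # [intercept, slope]
    for _ in range(max_iter):
        score = [0.0, 0.0]
        info = [[0.0, 0.0], [0.0, 0.0]]
        for rows in sites:               # stands in for one round of messages
            s, m = site_summary(rows, beta)
            for i in range(2):
                score[i] += s[i]
                for j in range(2):
                    info[i][j] += m[i][j]
        # Solve the 2x2 system info * step = score in closed form.
        det = info[0][0] * info[1][1] - info[0][1] * info[1][0]
        step = [(info[1][1] * score[0] - info[0][1] * score[1]) / det,
                (info[0][0] * score[1] - info[1][0] * score[0]) / det]
        beta = [beta[0] + step[0], beta[1] + step[1]]
        if max(abs(d) for d in step) < tol:
            break
    return beta

# Hypothetical (x, y) records held at three separate institutions.
site_a = [(0.5, 0), (1.0, 0), (1.5, 1), (2.0, 0), (2.5, 1)]
site_b = [(1.0, 0), (2.0, 1), (3.0, 1), (3.5, 0), (4.0, 1)]
site_c = [(0.0, 0), (1.2, 1), (2.8, 0), (3.3, 1), (4.5, 1)]

distributed = fit_distributed([site_a, site_b, site_c])
pooled = fit_distributed([site_a + site_b + site_c])   # all data in one place
print(distributed, pooled)
```

Because the score and information matrix are sums over individual records, adding the per-site totals gives the same quantities as computing them on the pooled data, so the two fits agree up to floating-point rounding. In this sketch that additivity is what makes the distributed model equivalent to the centralized one.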

The distributed cloud application provides rapid assembly of collaborative computational projects that wrap flexible and extensible R statistical software. It works across a heterogeneous collection of database environments, where the data can be stored either in local instances of the cloud or left on-premises. The implementation in Azure has full transparency to allow local officials concerned with privacy protections to validate the safety of the method. Security between remote and master sites builds on OAuth-style distributed authentication so that each site runs under local control, as a separate tenant. Using Azure as a development framework, a single installer can spin up the set of cloud resources for the application instance and handle security and network configuration details as well.

Balasubramanian Narasimhan

Stanford University

Balasubramanian Narasimhan is a senior research scientist in the Department of Statistics and the Department of Biomedical Data Sciences at Stanford University and the director of the Data Coordinating Center within the Department of Biomedical Data Sciences. His research areas include statistical computing, distributed computing, clinical trial design, and reproducible research. Balasubramanian coteaches a computing for data science course with John Chambers, an inventor of the S language.

John-Mark Agosta

Microsoft

John-Mark Agosta is a principal data scientist at Microsoft, where he leads a team that is expanding the machine learning and artificial intelligence capabilities of Azure. Previously, John-Mark worked with startups and labs in the Bay Area, including “The Connected Car 2025” at Toyota ITC, peer-to-peer malware detection at Intel, and automated planning at SRI. His dedication to probability and AI led him to found an annual applications workshop for the Uncertainty in AI conference. When feeling low, he recharges his spirits by singing Russian music with Slavyanka, the Bay Area’s Slavic music chorus.

Philip Lavori

Stanford University

Philip Lavori earned a PhD in mathematics from Cornell in 1974 and has been on the faculty at MIT, Harvard, Brown, and Stanford. He specializes in adaptive designs for clinical trials, the use of observational data to estimate causal effects of treatment, and trials embedded in clinical practice.