Previously, medical researchers who wanted to run a large, multi-institution study needed to create a central registry of subjects’ personal data, collected from different institutions. Although cloud offerings such as Azure comply strictly with HIPAA, such aggregated datasets are few and far between because of institutional barriers to sharing sensitive personal data.
But statistical learning models need not have all their data exposed in one place. Equivalent models can be learned by iterative algorithms that run at each site and pass only aggregate values as messages. Balasubramanian Narasimhan, John Mark Agosta, and Philip Lavori outline a privacy-preserving alternative that creates statistical models equivalent to one fit on the entire pooled dataset, implemented with a set of remote cloud applications that communicate with a master application to build a common model. The remote clouds form a star-shaped network that exchanges partial results asynchronously with the master until convergence.
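To make the idea concrete, here is a minimal sketch of aggregate-only model fitting. It is an illustration under stated assumptions, not the speakers' implementation (which wraps R and runs asynchronously across cloud tenants): the model is a one-predictor logistic regression, the function names `site_summary` and `fit` and the toy data in `SITES` are hypothetical, and the sites are called synchronously in a loop. Each site returns only its summed gradient and Hessian terms; the master adds them and takes a Newton step, which yields the same coefficients as fitting on the pooled rows.

```python
import math

def site_summary(rows, beta):
    """Runs at a remote site. Returns only aggregate gradient and Hessian
    terms for the current coefficients; no individual row leaves the site."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in rows:
        p = 1.0 / (1.0 + math.exp(-(beta[0] + beta[1] * x)))
        w = p * (1.0 - p)
        g0 += y - p
        g1 += (y - p) * x
        h00 += w
        h01 += w * x
        h11 += w * x * x
    return (g0, g1), (h00, h01, h11)

def fit(sites, tol=1e-10, max_iter=50):
    """Runs at the master. Sums the per-site aggregates and applies
    Newton-Raphson updates until the step size falls below tol."""
    beta = [0.0, 0.0]
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for rows in sites:  # in a real deployment this would be a remote call
            (a, b), (c, d, e) = site_summary(rows, beta)
            g0 += a
            g1 += b
            h00 += c
            h01 += d
            h11 += e
        # Solve the 2x2 system H * step = g by hand.
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        beta = [beta[0] + step0, beta[1] + step1]
        if max(abs(step0), abs(step1)) < tol:
            break
    return beta

# Hypothetical toy data: (predictor, binary outcome) rows held at three sites.
SITES = [
    [(0.5, 0), (1.0, 0), (1.5, 1), (2.0, 0), (2.5, 1)],
    [(1.0, 1), (2.0, 1), (3.0, 0), (3.5, 1), (0.2, 0)],
    [(0.8, 0), (1.7, 1), (2.9, 1), (1.2, 0)],
]

distributed = fit(SITES)             # master sees only per-site aggregates
pooled = fit([sum(SITES, [])])       # reference fit with all rows in one place
print(distributed, pooled)
```

The two fits agree up to floating-point rounding because the gradient and Hessian of the log-likelihood are sums over rows, so summing per-site totals loses nothing. Models whose fitting steps do not decompose into such sums need a different decomposition.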
The distributed cloud application enables rapid assembly of collaborative computational projects that wrap flexible and extensible R statistical software. It works across a heterogeneous collection of database environments, where the data can be stored either in local instances of the cloud or left on-premises. The Azure implementation is fully transparent, so local officials concerned with privacy protections can validate the safety of the method. Security between remote and master sites builds on OAuth-style distributed authentication so that each site runs under local control, as a separate tenant. With Azure as the development framework, a single installer can spin up the set of cloud resources for the application instance and handle the security and network configuration details as well.
Balasubramanian Narasimhan is a senior research scientist in the Department of Statistics and the Department of Biomedical Data Sciences at Stanford University and the director of the Data Coordinating Center within the Department of Biomedical Data Sciences. His research areas include statistical computing, distributed computing, clinical trial design, and reproducible research. Balasubramanian coteaches a computing for data science course with John Chambers, an inventor of the S language.
John Mark Agosta is a principal data scientist at Microsoft, where he leads a team that is expanding the machine learning and artificial intelligence capabilities of Azure. Previously, John worked with startups and labs in the Bay Area, including “The Connected Car 2025” at Toyota ITC, peer-to-peer malware detection at Intel, and automated planning at SRI. His dedication to probability and AI led him to found an annual applications workshop for the Uncertainty in AI conference. When feeling low, he recharges his spirits by singing Russian music with Slavyanka, the Bay Area’s Slavic music chorus.
Philip Lavori earned a PhD in mathematics from Cornell in 1974 and has been on the faculty at MIT, Harvard, Brown, and Stanford. He specializes in adaptive designs for clinical trials, the use of observational data to estimate causal effects of treatment, and trials embedded in clinical practice.
©2018, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • confreg@oreilly.com