As the risk and reward trade-offs grow for products based on AI, along with the pressures of compliance and accountability, at what point is it no longer acceptable for any one commercial entity to hold responsibility for so much shared risk? Can we incentivize corporations, government agencies, independent watchdog groups, and other relevant parties to combine their data in cases where there are large shared risks?
ML models have become ubiquitous, embedded in products and services used throughout our daily lives. Generally, those models are deployed by large commercial interests that train them on proprietary datasets. However, failures involving ethics, privacy, safety, bias, and related concerns can have serious impacts on individuals.
For example, Google builds large training datasets from the sensors in its self-driving cars. Meanwhile, in an almost adversarial stance, regulators on multiple continents scrutinize the failure cases related to those sensors and their associated ML models. Edge cases in test datasets prove disproportionately valuable, and potentially form the basis for economic incentives. Instead of entrusting each manufacturer to build "near perfect" training datasets while bearing large risks alone, we should incentivize manufacturers to combine their data. Rewards for contributing parties could then derive from a combination of training data and testing edge cases, as identified by regulators and other watchdog parties.
Paco Nathan explains how decentralized data markets provide a means to resolve difficult problems in training machine learning models, especially for use cases with large shared risks. Built from components based on blockchain technologies—smart contracts, token-curated registries, DApps, voting mechanisms, etc.—decentralized data markets allow multiple parties to curate ML training datasets in ways that are transparent, auditable, and secure, while allowing equitable payouts that take social values into account.
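To make the curation mechanism above concrete, the following is a minimal, illustrative sketch of a token-curated registry (TCR) in Python. It is not the Computable.io API or any on-chain implementation—all names and the simplified staking/voting logic are assumptions for exposition. Token holders stake a deposit to list a dataset entry, a challenger can stake against it, and a token-weighted vote decides whether the entry stays; the winner collects both stakes.

```python
from dataclasses import dataclass


@dataclass
class Listing:
    """A dataset entry in the registry, plus any active challenge state."""
    owner: str
    deposit: int
    challenger: str = ""
    votes_keep: int = 0
    votes_remove: int = 0


class TokenCuratedRegistry:
    """Illustrative TCR: deposits gate listings, token-weighted votes
    settle challenges. (Real TCRs use commit/reveal voting and refund
    voter stakes; this sketch burns vote weight for brevity.)"""

    def __init__(self, min_deposit: int):
        self.min_deposit = min_deposit
        self.listings: dict[str, Listing] = {}
        self.balances: dict[str, int] = {}

    def fund(self, who: str, tokens: int) -> None:
        self.balances[who] = self.balances.get(who, 0) + tokens

    def _stake(self, who: str, amount: int) -> None:
        if self.balances.get(who, 0) < amount:
            raise ValueError(f"{who} lacks {amount} tokens")
        self.balances[who] -= amount

    def apply_listing(self, who: str, name: str) -> None:
        self._stake(who, self.min_deposit)
        self.listings[name] = Listing(owner=who, deposit=self.min_deposit)

    def challenge(self, who: str, name: str) -> None:
        self._stake(who, self.min_deposit)
        self.listings[name].challenger = who

    def vote(self, who: str, name: str, keep: bool, weight: int) -> None:
        # Vote weight is proportional to tokens staked on the outcome.
        self._stake(who, weight)
        listing = self.listings[name]
        if keep:
            listing.votes_keep += weight
        else:
            listing.votes_remove += weight

    def resolve(self, name: str) -> bool:
        """Settle a challenge; winner takes both deposits.
        Returns True if the listing remains in the registry."""
        listing = self.listings[name]
        kept = listing.votes_keep >= listing.votes_remove
        pot = listing.deposit + self.min_deposit
        winner = listing.owner if kept else listing.challenger
        self.balances[winner] = self.balances.get(winner, 0) + pot
        if kept:
            listing.challenger = ""
            listing.votes_keep = listing.votes_remove = 0
        else:
            del self.listings[name]
        return kept


# Usage: a manufacturer lists an edge-case dataset, a watchdog
# challenges it, and a token-weighted vote keeps it in the registry.
tcr = TokenCuratedRegistry(min_deposit=10)
tcr.fund("manufacturer_a", 50)
tcr.fund("watchdog", 30)
tcr.fund("voter", 20)
tcr.apply_listing("manufacturer_a", "lidar-edge-cases")
tcr.challenge("watchdog", "lidar-edge-cases")
tcr.vote("voter", "lidar-edge-cases", keep=True, weight=5)
kept = tcr.resolve("lidar-edge-cases")
```

The economic design point this sketches is the one in the talk: staking makes curation decisions costly to game, and payouts flow to whichever parties—contributors or challengers—the token-weighted vote sides with.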
Paco explores open source libraries from Computable.io, based on Ethereum, which are being used to develop data markets. These enable users to adjust the trade-offs between decentralized and centralized characteristics as needed for specific business use cases and as indicated by ethical concerns. The same approach addresses other areas of machine learning risk, such as genomics, medical research, and financial credit scores, where proprietary interests and social needs often come into conflict.
Paco Nathan is known as a "player/coach" with core expertise in data science, natural language processing, machine learning, and cloud computing. He has 35+ years of experience in the tech industry, at companies ranging from Bell Labs to early-stage startups. His recent roles include director of the Learning Group at O'Reilly and director of community evangelism for Apache Spark at Databricks. Paco is the cochair of the Rev conference and an advisor for Amplify Partners, Deep Learning Analytics, Recognai, and Primer. He was named one of the "top 30 people in big data and analytics" in 2015 by Innovation Enterprise.
©2018, O'Reilly Media, Inc. All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.