A contemporary theme in artificial intelligence work is designing human-in-the-loop systems: while largely automated, these systems allow people to examine, adjust, and improve what the machines accomplish. Semisupervised learning is difficult: people can curate training sets for ML systems, but that curation becomes expensive at scale. Adding more unlabeled data does not remove the need for human guidance and oversight of automated systems. Moreover, it’s quite difficult to anticipate the edge cases that will be encountered at scale, especially when live data comes from a large, diverse audience.
On the one hand, how do people manage AI systems by interacting with them? On the other hand, how do we manage people who are managing AI systems? If machine learning pipelines running at scale write to log files, then troubleshooting issues in those pipelines can become a machine learning/big data problem itself. Peter Norvig recently described this issue from Google’s perspective at the 2016 Artificial Intelligence Conference: to paraphrase, building reliable and robust software is hard even in deterministic domains, but when we move to uncertain domains (e.g., machine learning), robustness becomes even harder as we encounter operations costs, tech debt, etc.
Paco Nathan reviews use cases where Jupyter provides a frontend to AI as the means for keeping humans in the loop (and shares the code used). Jupyter gets used in two ways. First, people responsible for managing ML pipelines use notebooks to set the necessary hyperparameters. In that sense, the notebooks serve in place of configuration scripts. Second, the ML pipelines update those notebooks with telemetry, summary analytics, etc. in lieu of merely sending that data out to log files. The analysis stays in context, which makes it simple for a person to review. This process enhances the feedback loop between people and machines. Humans in the loop use Jupyter notebooks to inspect ML pipelines remotely, adjusting them at any point and inserting additional analysis, data visualization, and their notes into the notebooks. The machine component is mostly automated but remains available interactively for troubleshooting and adjustment.
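As a rough illustration of the two uses described above, here is a minimal Python sketch; it is not the code shared in the session. It treats a notebook as plain JSON in the nbformat 4 layout and uses only the standard library. The "parameters" and "telemetry" cell tags, the helper names, and the hyperparameter and metric values are all assumptions made up for this example.

```python
import ast
import json


def new_cell(cell_type, source, tags=()):
    """Build a minimal nbformat-4 style cell as a plain dict."""
    cell = {"cell_type": cell_type, "metadata": {"tags": list(tags)}, "source": source}
    if cell_type == "code":
        cell["outputs"] = []
        cell["execution_count"] = None
    return cell


def read_hyperparameters(notebook):
    """Collect simple `name = literal` assignments from cells tagged 'parameters'.

    The tag name is an assumed convention, not something the session specifies.
    """
    params = {}
    for cell in notebook["cells"]:
        tags = cell["metadata"].get("tags", [])
        if cell["cell_type"] != "code" or "parameters" not in tags:
            continue
        source = cell["source"]
        if isinstance(source, list):  # notebook sources may be stored as lists of lines
            source = "".join(source)
        for node in ast.parse(source).body:
            if (isinstance(node, ast.Assign) and len(node.targets) == 1
                    and isinstance(node.targets[0], ast.Name)):
                # literal_eval only accepts literals, so no notebook code is executed
                params[node.targets[0].id] = ast.literal_eval(node.value)
    return params


def append_telemetry(notebook, metrics):
    """Append a markdown cell summarizing pipeline metrics for a human reviewer."""
    lines = ["### Pipeline telemetry", ""]
    lines += [f"- {name}: {value}" for name, value in sorted(metrics.items())]
    notebook["cells"].append(new_cell("markdown", "\n".join(lines), tags=["telemetry"]))


# A person sets hyperparameters in a tagged cell (hypothetical values).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [new_cell("code", "learning_rate = 0.01\nbatch_size = 64", tags=["parameters"])],
}

# The pipeline reads them in place of a configuration script...
params = read_hyperparameters(notebook)

# ...does its work (a stand-in computation here), then writes results back.
metrics = {"examples_seen": params["batch_size"] * 10,
           "learning_rate": params["learning_rate"]}
append_telemetry(notebook, metrics)

serialized = json.dumps(notebook, indent=1)  # what would be saved as an .ipynb file
```

In this sketch the pipeline never executes notebook code; it only parses literal assignments, so a reviewer can edit the tagged cell freely and the telemetry appears as ordinary cells next to any notes they add.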
The end result is that a smaller group of people can handle a wider range of responsibilities for building and maintaining a complex system of automation. (An analogy is how products such as New Relic address the need for DevOps practices at scale for web apps, except here Jupyter is the frontend for ML pipelines at scale.) This work anticipates collaborative features for Jupyter notebooks, where multiple parties can edit or update the same live notebook. In this case, the multiple parties would include both the ML pipelines and the humans in the loop, working on the same notebook.
Paco Nathan leads the Learning group at O’Reilly Media. Known as a “player/coach” data scientist, Paco led innovative data teams building ML apps at scale for several years and more recently was an evangelist for Apache Spark, Apache Mesos, and Cascading. Paco has expertise in machine learning, distributed systems, functional programming, and cloud computing, with 30+ years of tech-industry experience ranging from Bell Labs to early-stage startups. Paco is an advisor for Amplify Partners and was cited in 2015 as one of the top 30 people in big data and analytics by Innovation Enterprise. He is the author of Just Enough Math, Intro to Apache Spark, and Enterprise Data Workflows with Cascading.