Applying deep learning in nontraditional data domains, such as electronic health record (EHR) data and medical imagery, presents a variety of challenges, including the friction practitioners experience when transitioning from traditional data science tasks and tools to training complex neural network architectures. The areas in which deep learning reigns supreme, such as computer vision and natural language processing, often require less data exploration and rely on well-known preprocessing techniques (e.g., whitening or one-hot encoding). The jump from raw data to model development is short, and practitioners can easily bridge the gap using mature, off-the-shelf tools. This makes it straightforward to build a reusable experimental pipeline that feeds into, e.g., a distributed environment designed to optimize performance over a large number of possible model architectures with little or no manual intervention.
In contrast, in data domains like healthcare, there is a wide gulf between initial data exploration and downstream model development. Health data requires significantly more upfront analysis to determine properties like data types and distributions and to identify outliers and missing values. It likewise requires more complex and typically ad hoc preprocessing pipelines. This work is best performed in an interactive environment in which practitioners can iteratively ask and answer data-driven questions, quickly view results alongside their code, make plots and graphs, and record inline notes. However, the transition from this sort of interactive environment to developing and training large-scale neural network models is often bumpy, requiring the practitioner to switch development environments and refactor piecemeal analyses into a pipeline that can be connected to an offline model training framework.
Cloudera Workbench provides practitioners with a smooth transition from interactive data exploration to building pipelines to eventually executing large-scale deep learning on a traditional Hadoop cluster. The Workbench provides a secure, isolated environment for model development and collaboration and can help accelerate data science from exploration to production.
Josh Patterson, Vartika Singh, David Kale, and Tom Hanlon walk you through interactively developing and training deep neural networks to analyze digital health data using the Cloudera Workbench and Deeplearning4j (DL4J). You’ll learn how to use the Workbench to rapidly explore real-world clinical data, build data-preparation pipelines, and launch training of neural networks.
Josh Patterson is the director of field engineering for Skymind. Previously, Josh ran a big data consultancy, worked as a principal solutions architect at Cloudera, and was an engineer at the Tennessee Valley Authority, where he was responsible for bringing Hadoop into the smart grid during his involvement in the openPDC project. Josh is a cofounder of the DL4J open source deep learning project and is a coauthor of Deep Learning: A Practitioner’s Approach. Josh has over 15 years’ experience in software development and continues to contribute to projects such as DL4J, Canova, Apache Mahout, Metronome, IterativeReduce, openPDC, and JMotif. Josh holds a master’s degree in computer science from the University of Tennessee at Chattanooga, where he did research in mesh networks and social insect swarm algorithms.
Vartika Singh is a solutions architect at Cloudera with over 12 years of experience applying machine learning techniques to big data problems.
David Kale is a deep learning engineer at Skymind and a PhD candidate in computer science at the University of Southern California, where he is advised by Greg Ver Steeg of the USC Information Sciences Institute. His research uses machine learning to extract insights from digital data in high-impact domains, such as healthcare, and he collaborates with researchers from the Stanford Center for Biomedical Informatics Research and the YerevaNN Research Lab. Recently, David pioneered the application of deep learning to modern electronic health records data. At Skymind, he works with clients and partners to develop and deploy deep learning solutions for real-world problems. David co-organizes the Machine Learning for Healthcare Conference (MLHC) and has served as a judge in several XPRIZE competitions, including the upcoming IBM Watson AI XPRIZE. He is the recipient of the Alfred E. Mann Innovation in Engineering Fellowship.
Tom Hanlon is a senior instructor at Functional Media, where he delivers courses on the wonders of the Hadoop ecosystem. Before beginning his relationship with Hadoop and large distributed data, he had a happy and lengthy relationship with MySQL with a focus on web operations. He has been a trainer for MySQL, Sun, and Percona.
©2017, O'Reilly Media, Inc.