A revolution in DNA sequencing technology has dramatically changed our ability to examine the human genome and has become a crucial tool for research into fundamental human biology. Modern approaches to discovering the genetic basis for diseases such as cancer, diabetes, and Alzheimer’s disease apply powerful algorithmic methods to vast collections of genomic data from thousands of patients to uncover robust statistical signals of genomic variation and function. However, many existing bioinformatics tools and software packages are not designed to process the large amounts of data produced by these new technologies feasibly, reliably, or efficiently. Genomics software today runs much the way it did 10 years ago: individually developed tools, shell scripts to implement workflows, files instead of databases, file formats in place of data models, and little-to-no parallelism.
To enable the next generation of sequencing analysis, bioinformaticians will need to change the way they develop large-scale genomic analysis methods and, at the same time, rebuild the infrastructure on which they execute those algorithms. Spark is an ideal platform for organizing large genomics analysis pipelines and workflows. Its compatibility with the Hadoop platform makes it easy to deploy and support within existing bioinformatics IT infrastructures. Its support for languages such as R, Python, and SQL eases the learning curve for practicing bioinformaticians.
Widespread use of Spark for genomics, however, will require adapting and rewriting many of the common methods, tools, and algorithms in regular use today. Spark’s ability to parallelize pipelined analyses is a natural fit for some genomics workflows; however, the abstractions and interfaces presented by the Spark platform remain problematic for other analyses and algorithms.
Timothy Danford offers a case study of a cancer genomics analysis pipeline implemented as part of the open source genomics software project ADAM, which uses Apache Spark-generated abstractions executed on commodity computing infrastructure. Timothy will describe the integration of Spark and ADAM into the NCI Cloud Pilot, a contract awarded to the Broad Institute of Harvard and MIT and the University of California, Berkeley by the National Cancer Institute to implement a cloud-based platform for cancer genomics analysis.
Timothy Danford is a computer scientist working on advanced automation approaches to big data variety in the pharmaceutical and healthcare industries. Previously, Timothy was a software architect, engineer, and founding team member for Genome Bridge LLC, a Broad Institute subsidiary organized to develop cloud-based SaaS genomic analysis pipelines. He has experience in developing data-management services, applications, and ontologies for bioinformatics and genomics systems at Novartis and Massachusetts General Hospital. As a PhD student in computer science at MIT CSAIL, he focused on computational functional genomics. He is a contributor to ADAM, an open source project for bioinformatics on Spark.