Presented By O'Reilly and Cloudera
Make Data Work
Sept 29–Oct 1, 2015 • New York, NY

Next-generation genomics analysis with Apache Spark

Timothy Danford (Tamr, Inc.)
2:05pm–2:45pm Wednesday, 09/30/2015
Spark & Beyond
Location: 1 E20 / 1 E21
Level: Intermediate
Tags: health
Average rating: 3.52 (23 ratings)

The genome is “the blueprint of life,” a repeating string composed of four letters, whose order and configuration lay the plan for each individual’s growth and development. Genomics is the study of the structure, function, and evolution of genomes at a variety of scales: from the single cells of a cancer tumor to the genomes of an entire population of individuals. Scientists use “sequencers” to look at the molecular structure of the genome much the same way that astronomers use telescopes to examine the composition of stars, and what they see with these molecular telescopes holds the potential to find new drugs, diagnose patients, uncover the genealogy of entire populations, and discover the genetic bases for human disease.

Genomics is also in the middle of a massive technological revolution; over the past decade, the sequencers used by scientists have improved in cost, quality, and speed at exponential rates. Fifteen years ago, it took billions of dollars and years of work for an international consortium of researchers to produce a single human genome; today a single sequencing center can sequence a human genome in a single day for almost $1000. Thousands of human genomes have been sequenced, and projects to sequence hundreds of thousands or millions of genomes are already underway.

Even as the experimental machinery of genomics has advanced, however, its computational support — the tools and methods that convert raw data into clinical findings and research discoveries — has not kept pace. Genomics software today runs much the way it did ten years ago: discrete tools, scripting for workflow, files instead of databases, file formats in place of data models, and little-to-no parallelism.

Spark is an ideal platform for organizing large genomics analysis pipelines and workflows. Its compatibility with the Hadoop platform makes it easy to deploy and support within existing bioinformatics IT infrastructures, and its support for languages such as R, Python, and SQL eases the learning curve for practicing bioinformaticians. Widespread use of Spark for genomics, however, will require adapting and rewriting many of the common methods, tools, and algorithms that are in regular use today.

This talk will present ADAM, an open-source library for bioinformatics analysis, written for Spark and hosted by the AMPLab. We will discuss both the places where Spark’s ability to parallelize an analysis pipeline is a natural fit for genomics methods and some methods that have proven more difficult to adapt. We will also cover ADAM’s use of technologies like Avro, for schema specification, and Parquet, for compressed file formats, in conjunction with its Spark-based workflows.
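
To make the "natural fit" concrete, here is a minimal sketch of the kind of per-partition computation that a framework like Spark can parallelize. It uses only the Python standard library and is not ADAM's API or Spark code: the AlignedRead record, its fields, and the function names are all hypothetical and chosen for illustration. The record stands in for an explicit data model of the sort a schema language like Avro would define, in contrast to parsing a file format ad hoc.

```python
from collections import Counter
from dataclasses import dataclass
from functools import reduce
from typing import Iterable, List


@dataclass(frozen=True)
class AlignedRead:
    """Hypothetical, simplified read record; not ADAM's actual schema."""
    contig: str    # reference sequence name, e.g. "chr1"
    start: int     # start position of the alignment on the contig
    sequence: str  # read bases


def count_partition(reads: Iterable[AlignedRead]) -> Counter:
    """Map step: count reads per contig within a single partition."""
    return Counter(read.contig for read in reads)


def merge_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: combine two partial per-contig counts."""
    return left + right


def reads_per_contig(partitions: List[List[AlignedRead]]) -> Counter:
    # Each partition is counted independently of the others, which is
    # what lets a cluster framework run this step in parallel. Here the
    # map runs sequentially, purely for illustration.
    partial_counts = map(count_partition, partitions)
    return reduce(merge_counts, partial_counts, Counter())


# Toy input: two partitions of made-up reads.
partitions = [
    [AlignedRead("chr1", 100, "ACGT"), AlignedRead("chr2", 5, "GGCA")],
    [AlignedRead("chr1", 250, "TTAG")],
]
totals = reads_per_contig(partitions)
print(dict(totals))
```

Counting reads per contig parallelizes easily because each read can be processed without looking at any other. Methods that need a global view of the data do not decompose this cleanly, which is one plausible reason some tools are harder to adapt.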


Timothy Danford

Tamr, Inc.

Timothy Danford is a computer scientist working on advanced automation approaches to big data variety in the pharmaceutical and healthcare industries. Previously, Timothy was a software architect, engineer, and founding team member for Genome Bridge LLC, a Broad Institute subsidiary organized to develop cloud-based SaaS genomic analysis pipelines. He has experience in developing data-management services, applications, and ontologies for bioinformatics and genomics systems at Novartis and Massachusetts General Hospital. As a PhD student in computer science at MIT CSAIL, he focused on computational functional genomics. He is a contributor to ADAM, an open source project for bioinformatics on Spark.

Comments on this page are now closed.


saravanan selvam
11/19/2015 9:29pm EST

I think the ADAM library is cool. Congrats to the developers.

I would like to know whether Spark would be a good option for de novo genome assembly. Can anybody answer this question? If anybody can reply, I would be really grateful…

Darshak Bhatti
09/04/2015 3:55am EDT

Many things have been implemented in genomics with MapReduce and also with Apache Spark. So which areas are still to be explored?