Searching for the Genetic Causes of Disease with Hadoop

Hadoop: Case Studies, Gramercy Suite (NY Hilton)

The cost of reading every letter of the code in your DNA has fallen precipitously: a full sequence now costs under three thousand dollars, and in another year it will be about a thousand. The data coming from high throughput sequencing machines is already threatening to overwhelm researchers, and the rate at which laboratories produce it is growing faster than Moore’s Law. At UNC’s Renaissance Computing Institute, we’re using Hadoop to comb through the data from more than three thousand human genomes sequenced in the last two years. What we’re looking for is the genetic basis for disease.

If you represent your DNA as a sequence of letters, it would be a string composed of the letters “G”, “A”, “T”, and “C”. For a human, that string would be about three and a half billion characters long. If you compared your string to any other human’s, the two of you would be 99.9% identical. Finding that one-tenth of a percent difference, then repeating the comparison across many people with and without a particular disease state, gives us a statistical basis to determine which of the very few variations are actually responsible for a condition.
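To make the idea concrete, here is a toy sketch (not our actual pipeline) of the core comparison: find the positions where an aligned sample differs from a reference string, then tally how often each variant appears in a “case” group versus a “control” group. The sequences below are invented, and real genomes require alignment and far more bookkeeping than a character-by-character zip.

```python
from collections import Counter

def variant_positions(ref, sample):
    """Return {position: (ref_base, sample_base)} where the aligned strings differ."""
    return {i: (r, s) for i, (r, s) in enumerate(zip(ref, sample)) if r != s}

def variant_counts(ref, samples):
    """Count how many samples carry each (position, base) variant."""
    counts = Counter()
    for seq in samples:
        for pos, (_, base) in variant_positions(ref, seq).items():
            counts[(pos, base)] += 1
    return counts

reference = "GATTACAGATTACA"
cases     = ["GATTACAGATTACA", "GATTCCAGATTACA", "GATTCCAGATTACA"]
controls  = ["GATTACAGATTACA", "GATTACAGATTACA", "GATTACAGTTTACA"]

case_freq    = variant_counts(reference, cases)
control_freq = variant_counts(reference, controls)
# A variant common in cases but rare in controls -- here the "C" at
# position 4 -- becomes a candidate for association with the disease.
```

The real statistical tests are more involved, but the shape of the computation is the same: per-position comparisons, aggregated into frequency counts across thousands of genomes.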

All that combing through genomes, terabytes at a time, is tedious work. It’s work that benefits from commercial spillover – the task of finding disease-causing variants is often very similar to the “Big Data” problems faced by businesses at internet scale. Whether measuring the frequency of rare alterations, spotting variants very likely to occur together, or teasing out echoes of long-ago shared ancestors, the fundamental problem is to find signals in a mountain of statistical noise.

We’re using Hadoop for our analysis for the same reasons you are – we need the ability to work at uncomfortably large scale to produce results at all, and we need the performance gains from parallelism to do the work in a useful timeframe. This session will discuss how we have attacked some of these problems, the state of our efforts now, and where we hope to be able to go very soon.
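A variant-frequency count maps naturally onto MapReduce. The sketch below shows that shape in the style of a Hadoop Streaming job written in Python; the one-record-per-variant input format is invented for illustration and is not our actual data layout.

```python
from itertools import groupby

def mapper(lines):
    """Emit (variant, 1) for every observed variant record."""
    for line in lines:
        variant = line.strip()
        if variant:
            yield variant, 1

def reducer(pairs):
    """Sum counts per variant; assumes pairs arrive grouped by key,
    as Hadoop's shuffle/sort phase guarantees."""
    for variant, group in groupby(pairs, key=lambda kv: kv[0]):
        yield variant, sum(count for _, count in group)

# One "position<TAB>base" record per variant call, across all samples.
records = ["4\tC", "4\tC", "8\tT", "4\tC"]
shuffled = sorted(mapper(records))   # stand-in for Hadoop's shuffle/sort
freqs = dict(reducer(shuffled))
# freqs == {"4\tC": 3, "8\tT": 1}
```

In a real cluster the mapper and reducer run as separate processes over HDFS blocks, which is what lets the same logic scale from this toy input to terabytes of variant calls.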


Charles Schmitt

Renaissance Computing Institute

Dr. Charles Schmitt is the Director of Data Sciences at the Renaissance Computing Institute (RENCI), a research computing center at the University of North Carolina at Chapel Hill. As director, Dr. Schmitt is responsible for exploring and advancing the application of novel data technologies for national research agendas. This includes work in areas such as high throughput genomic sequencing, management of distributed research data, medical decision support, and data security.

Prior to joining RENCI, Dr. Schmitt worked as a computer scientist in industry in areas including data mining, bioinformatics, and software engineering. He holds a Ph.D. in Computer Science, where he focused on developing neural network models of the human visual system.

