Presented By O'Reilly and Cloudera
Make Data Work
22–23 May 2017: Training
23–25 May 2017: Tutorials & Conference
London, UK

How Apache Spark and AWS Lambda empower researchers to identify disease-causing mutations and engineer healthier genomes

Denis C. Bauer (Commonwealth Scientific and Industrial Research Organisation)
14:30–15:00 Tuesday, 23 May 2017
Data Case Studies, Strata Business Summit
Location: Capital Suite 15
Level: Beginner
Average rating: 4.50 (4 ratings)

Did you know that DNA regulates almost all functions in the body? All of this information is encoded in the 3-billion-letter genome. As a result, the multibillion-dollar bioinformatics market faces a data tsunami that differs from the one seen in customer and web analytics applications: the data is not only deep (many samples) but also extremely wide (many features, or letters, per sample). Concretely, the decreasing cost of sequencing a genome will enable an estimated 25% of the world’s population to have their 3-billion-letter genomes analyzed by 2025.

Denis C. Bauer shares lessons learned from analyzing data with more features than samples, discusses workarounds implemented to overcome the resource restrictions for AWS Lambda functions, and contrasts Spark- and Lambda-based parallelization. Denis first explores how genomic research uses Apache Spark to massively parallelize the generation of random forests to identify disease genes efficiently. VariantSpark can analyze 3,000 samples with 80 million features in under 30 minutes, enabling real-time diagnosis by finding similar patients. This platform is contributing to motor neuron disease research (publicized by the Ice Bucket Challenge) in Australia.
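
To make the "more features than samples" setting concrete, here is a minimal pure-Python sketch of random-forest-style feature ranking on wide data. It is not VariantSpark's implementation and does not use Spark: the trees are reduced to single splits (stumps), and the toy data, the 0/1/2 genotype coding, and all function names are assumptions made for illustration. Each tree is built independently of the others, which is the property that lets this kind of workload be spread across a cluster.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_stump(X, y, rows, features):
    """Return (feature, impurity_decrease) for the best single split.

    Assumed coding: genotypes are 0/1/2, and the split tested is
    'value == 0' versus 'value > 0'.
    """
    parent = gini([y[r] for r in rows])
    best_feature, best_gain = None, 0.0
    for f in features:
        left = [y[r] for r in rows if X[r][f] == 0]
        right = [y[r] for r in rows if X[r][f] > 0]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        gain = parent - weighted
        if gain > best_gain:
            best_feature, best_gain = f, gain
    return best_feature, best_gain

def stump_forest_importance(X, y, n_trees=500, seed=0):
    """Sum the impurity decrease per feature over many independent stumps."""
    rng = random.Random(seed)
    n_samples, n_features = len(X), len(X[0])
    mtry = max(1, int(n_features ** 0.5))  # features tried per tree
    importance = [0.0] * n_features
    for _ in range(n_trees):
        # Bootstrap the samples and draw a random feature subset for this tree.
        rows = [rng.randrange(n_samples) for _ in range(n_samples)]
        features = rng.sample(range(n_features), mtry)
        f, gain = best_stump(X, y, rows, features)
        if f is not None:
            importance[f] += gain
    return importance

def make_toy_data(n_samples=40, n_features=400, causal=7, seed=1):
    """Hypothetical wide dataset: 40 samples, 400 features, one causal feature."""
    rng = random.Random(seed)
    X = [[rng.choice([0, 1, 2]) for _ in range(n_features)]
         for _ in range(n_samples)]
    y = [1 if row[causal] > 0 else 0 for row in X]
    return X, y

X, y = make_toy_data()
importance = stump_forest_importance(X, y)
top_feature = max(range(len(importance)), key=importance.__getitem__)
print("top-ranked feature:", top_feature)
```

In this toy setup the label depends only on feature 7, so that feature should accumulate by far the largest importance even though there are ten times more features than samples.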

Denis then turns to real-time analysis with cloud-based solutions. Keeping runtime constant can be challenging for problems that vary in complexity, such as genome engineering. Here, the whole genome needs to be analyzed anew for every location where a beneficial genomic change can be introduced, so complexity varies by orders of magnitude. Denis shows how Lambda is used to break this task down into smaller subtasks that can be solved in parallel by instantaneously recruiting additional Lambda functions as the complexity increases. She also discusses GT-Scan2, featured on Jeff Barr’s AWS blog, which brings together novel scientific insights and unprecedented cloud-compute capacity.
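
The fan-out idea described above can be sketched in a few lines of Python. This is not GT-Scan2's code and it does not call AWS: a local thread pool stands in for Lambda invocations, exact-match counting stands in for the real genome analysis, and every function name and parameter here is hypothetical. The point is only that the number of parallel workers grows with the size of the task, so each subtask stays roughly the same size.

```python
from concurrent.futures import ThreadPoolExecutor

def count_sites(segment, target, n_starts):
    """Count occurrences of `target` that start within the first n_starts letters."""
    k = len(target)
    return sum(1 for i in range(n_starts) if segment[i:i + k] == target)

def make_subtasks(genome, target, chunk_size):
    """Split the search into chunks of start positions.

    Each segment is extended by len(target) - 1 letters so that matches
    spanning a chunk boundary are still seen by exactly one subtask.
    """
    k = len(target)
    tasks = []
    for start in range(0, len(genome), chunk_size):
        end = min(start + chunk_size, len(genome))
        segment = genome[start:end + k - 1]
        tasks.append((segment, target, end - start))
    return tasks

def fan_out(genome, target, chunk_size=1000):
    """Run one worker per subtask and combine the partial counts.

    A thread per subtask is only reasonable for a toy input; it mimics
    recruiting one function invocation per subtask.
    """
    tasks = make_subtasks(genome, target, chunk_size)
    with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
        partial_counts = list(pool.map(lambda t: count_sites(*t), tasks))
    return sum(partial_counts), len(tasks)

genome = "ACGT" * 2500          # toy 10,000-letter sequence
total, n_workers = fan_out(genome, "GTAC", chunk_size=1000)
print(total, "sites found using", n_workers, "parallel subtasks")
```

Doubling the input length doubles the number of subtasks rather than the work per subtask, which is the behavior that keeps per-request runtime roughly constant in this simplified model.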

Denis C. Bauer

Commonwealth Scientific and Industrial Research Organisation

Denis Bauer leads the Transformational Bioinformatics team at Australia’s national science agency, the Commonwealth Scientific and Industrial Research Organisation (CSIRO)—the research institution behind fast WiFi, the Hendra virus vaccine, and polymer banknotes. She is also involved in initiatives to bring genomics into medical practice. Denis holds a PhD in bioinformatics with expertise in machine learning and genomics.