Presented By O'Reilly and Cloudera
Make Data Work
31 May–1 June 2016: Training
1 June–3 June 2016: Conference
London, UK

Petascale genomics

Tom White (Cloudera)
16:35–17:15 Thursday, 2/06/2016
Data science & advanced analytics
Location: Capital Suite 8/9
Level: Intermediate
Tags: health, science
Average rating: 3.40 out of 5 (5 ratings)

Prerequisite knowledge

Attendees should have some background knowledge of big data technologies like Hadoop. No prior knowledge of biology or genomics is required.


The advent of next-generation DNA sequencing technologies is poised to revolutionize the way life sciences research is practiced. These new technologies are scaling significantly faster than Moore’s law and promise to catapult life sciences research and the biotech industry into the realm of big data. However, bioinformatics and data management in the life sciences have been slow to adopt the latest big data technologies pioneered by the Internet industry (e.g., Google and Facebook), in part because these tools are only beginning to become necessary today.

Tom White reviews several ways in which distributed computing tools (e.g., the Hadoop ecosystem) can be used to significantly advance the state of the art in life sciences research, including scaling genome-wide association studies to find connections between your genes and your traits, integrating the many public databases at scale, and assembling genome sequences from short snippets for use in cancer genomics. Tom also covers the new ADAM project for rebooting genomics ETL on top of Spark and the Eggo project for providing Parquet-formatted public datasets.

Tom White


Tom White is one of the foremost experts on Hadoop. Tom is a data scientist at Cloudera, where he has worked since its founding on the core distributions from Cloudera and Apache. Previously, he was an independent Hadoop consultant, working with companies to set up, use, and extend Hadoop. He has been an Apache Hadoop committer since February 2007 and is a member of the Apache Software Foundation. His book Hadoop: The Definitive Guide (O’Reilly) is recognized as the leading reference on the subject. He has written numerous articles for O’Reilly and IBM’s developerWorks and has spoken at several conferences, including ApacheCon, OSCON, and Strata + Hadoop World. Tom has a bachelor’s degree in mathematics from the University of Cambridge and a master’s in philosophy of science from the University of Leeds, UK.