The advent of next-generation DNA sequencing technologies is poised to revolutionize the way life sciences research is practiced. These new technologies are scaling significantly faster than Moore’s law and promise to catapult life sciences research and the biotech industry into the realm of big data. However, bioinformatics and data management in the life sciences have been slow to adopt the latest big data technologies pioneered by the Internet industry (e.g., Google and Facebook), in part because these tools are only beginning to become necessary today.
Tom White reviews several ways in which distributed computing tools (e.g., the Hadoop ecosystem) can significantly advance the state of the art in life sciences research: scaling genome-wide association studies to find connections between your genes and your traits, integrating the many public databases at scale, and assembling genome sequences from short snippets for use in cancer genomics. Tom also covers the new ADAM project for rebooting genomics ETL on top of Spark and the Eggo project for providing Parquet-formatted public datasets.
Tom White is one of the foremost experts on Hadoop. Tom is a data scientist at Cloudera, where he has worked since its foundation on the core distributions from Cloudera and Apache. Previously, he was an independent Hadoop consultant, working with companies to set up, use, and extend Hadoop. He has been an Apache Hadoop committer since February 2007 and is a member of the Apache Software Foundation. His book Hadoop: The Definitive Guide (O’Reilly) is recognized as the leading reference on the subject. He has written numerous articles for O’Reilly, Java.net, and IBM’s developerWorks and has spoken at several conferences including ApacheCon, OSCON, and Strata + Hadoop World. Tom has a bachelor’s degree in mathematics from the University of Cambridge and a master’s in philosophy of science from the University of Leeds, UK.
©2016, O’Reilly UK Ltd • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.