In the world of the $1,000 genome, researchers are inundated with data. The challenge now isn’t to get enough data; it is to distill it, to find the connections that lead to brilliant insights and breakthrough technologies.
Marc Carlson and Sean Taylor offer an overview of Project Rainier, which uses HDFS and the Hadoop and Spark ecosystem to help scientists at Seattle Children’s Research Institute quickly find new patterns and generate predictions that they can test later. The goal is to accelerate important pediatric research and increase scientific collaboration by highlighting where it is needed most.
Rainier is envisioned as a translational research tool that lets clinicians and scientists bring together patient clinical data, high-throughput research data, and public data repositories and annotations in a common platform. As a proof of concept, the project ingested a sample dataset consisting of 26 patients, four flow cytometry panels, and 15 exome sequences, as well as public datasets including the known variants from the 1000 Genomes Project, ExAC, Ensembl gene annotations, UniProt, Pfam, OMIM, GO, and PubMed. This small dataset represents over 500 million records brought together in a single searchable space, all with a friendly graphical interface.
Marc Carlson is a lead computational biologist in research informatics at Seattle Children’s Research Institute (SCRI). Marc divides his time between helping architect new cloud-based infrastructure to serve the scientists at SCRI, working to make sure that new compute resources are brought online and properly configured for immediate utility, and helping users with their data and analysis needs via the Bioinformatics Unit, the goal of which is to make sure that scientists at SCRI can learn the most from their data. Marc’s contributions include creating and running training courses, offering periodic consultations, and helping with the bioinformatics user group. Previously, he held a postdoc in computational biology at UCLA and worked on the Bioconductor core team at the Fred Hutchinson Cancer Research Center, where he served the needs of the R-based computational biology community. Marc holds a BS in genetics and cell biology from Washington State University and a PhD in developmental and cell biology from UC Irvine.
Sean Taylor is the manager for the bioinformatics and high-throughput analytics team at Seattle Children’s Research Institute (SCRI), where he manages the support delivery effort for bioinformatics and computational biology solutions for the eight research centers and almost 1,000 researchers at SCRI. Sean led design and development efforts for SCRI’s integrated precision medicine repository and is now expanding the open source approaches and big data technologies to additional centers and cores. Previously, Sean led the initiative to develop and implement a state-of-the-art bioinformatics core resource at SCRI; was a computational biologist at Amgen, customizing and driving usability in a range of end user interfaces and visualization tools while applying analytic code from multiple projects for areas such as immunotherapy and inflammation; and held a postdoc at the Fred Hutchinson Cancer Research Center, where he developed a new ultrasensitive assay to detect rare mitochondrial DNA mutations in cancer and aging. Sean holds a PhD from Yale University and a BS from Brigham Young University.
©2017, O'Reilly Media, Inc.