Managing Big Data Reaching Back to the 11th Century with Hadoop

Scott Sorensen (Ancestry)
Hadoop in Action Sutton Center - Sutton South

In 2013 it will be possible to map a person’s entire genome for less than $1,000, which represents a million-fold reduction in cost from the first such process only a few years ago. Consumers attracted by this affordable price and by the growing potential of gene sequencing in areas including personalized medicine, pharmacogenomic testing and family history research will generate vast amounts of searchable, highly personal data.

Much of this new data will be sent to Ancestry, a Big Data company that already manages more than 11 billion records (4 petabytes) of searchable structured and unstructured data consisting of birth, death, census, military, immigration and other records. In the past 17 years, Ancestry users have created more than 47 million family trees containing more than 5 billion profiles of relatives. Added to the current mass archive, the new flood of gene-sequencing data generated by Ancestry’s recently introduced DNA testing product will present Big Data challenges and opportunities of interest to many other companies whose business models are predicated on similarly massive data troves.

In this session, Scott Sorensen, Ancestry’s CTO, will explain how the company is leveraging its Big Data capabilities with Hadoop.

Specifically, Sorensen will give an overview of the two broad ways the company uses Hadoop: 1) analytics and 2) product features. Details include:


  • Marketing—The marketing team uses predictive modeling to customize and target messages for specific customers (e.g., if we know who is expected to cancel and we understand their behavior on the site, we can target these customers with assistance specific to their needs). We also mine customer behavior data to customize our marketing messages (customers who upload a high volume of old photos will receive information about Ancestry’s photo scanning features, messages about new content collections will be tailored to specific customers based on their family trees, etc.).
  • Content Investment—Content usage data is stored in Hadoop, and data models are then created to help the company make content (record) investment decisions that increase ROI.
  • Product—Customer behavior data is captured in Hadoop during controlled experiments, and that data is mined to provide direction for the product roadmap.

Product Features: Improving the customer experience

  • Search—Ancestry has an unmatched genealogy search algorithm that helps users distinguish their own “John Smith” records from the 70 million other “John Smith” records in the database.
  • Hinting—Similar to how Amazon suggests books you might be interested in, Ancestry uses record linking to serve up historical record and family tree hints that help users further their family history discoveries. Machine learning that leverages Hadoop is used to create the record linking algorithms that generate these discoveries.
  • Entity Extraction—Ancestry uses natural language processing to accurately find and classify key events and relationships in unstructured text and serve them up to customers through searches and hinting.
  • AncestryDNA—Science and engineering teams use Hadoop to scale the GERMLINE genealogy algorithm for ethnicity prediction and cousin matching, yielding a 1700% increase in performance.
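
To make the record linking idea behind Search and Hinting concrete, here is a minimal sketch in Python. It is not Ancestry’s algorithm: the abstract says the real linking models are created with machine learning on Hadoop, whereas this example uses hand-picked weights as a stand-in. The `Person` fields, the weights, the 10-year window, and the 0.8 threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Person:
    """Hypothetical minimal record; real records carry many more fields."""
    name: str
    birth_year: int
    birth_place: str


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_score(profile: Person, record: Person) -> float:
    """Weighted agreement between a tree profile and a historical record.

    The weights and the 10-year window are illustrative assumptions,
    standing in for parameters a trained model would supply.
    """
    name = similarity(profile.name, record.name)
    place = similarity(profile.birth_place, record.birth_place)
    # Birth-year agreement decays linearly to 0 at a 10-year gap.
    year = max(0.0, 1.0 - abs(profile.birth_year - record.birth_year) / 10)
    return 0.4 * name + 0.3 * year + 0.3 * place


def hints(profile, records, threshold=0.8):
    """Return records scoring at or above the threshold, best first."""
    scored = [(link_score(profile, r), r) for r in records]
    kept = [(s, r) for s, r in scored if s >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in kept]


if __name__ == "__main__":
    profile = Person("John Smith", 1850, "Boston, Massachusetts")
    records = [
        Person("John Smith", 1851, "Boston, Massachusetts"),
        Person("John Smith", 1902, "Leeds, England"),
        Person("Jon Smith", 1850, "Boston, Massachusetts"),
    ]
    for hint in hints(profile, records):
        print(hint)
```

With these assumed weights, a same-name record with a distant birth year and a different birthplace cannot reach the threshold, because the name alone contributes at most 0.4. That is the sense in which extra fields separate one “John Smith” from another.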

Scott Sorensen

Scott Sorensen has served as Ancestry’s Chief Technology Officer since April 2013. Since joining the family history search giant in 2002, Scott has held multiple positions, including Senior Vice President of Engineering, Vice President of Search and Vice President of Commerce. The first piece of code Scott wrote for the company is still used today. Prior to joining Ancestry, Scott was co-founder and Vice President of Engineering and then President at Coresoft Technologies. Scott was also an engineering manager at WordPerfect / Novell and a software engineer at IBM. He holds a B.S. in Computer Science from Brigham Young University.

Comments on this page are now closed.


Marek K Kolodziej
10/30/2013 4:30pm EDT

Would it be possible to post the slides here, like the other speakers have?

