Presented By O'Reilly and Cloudera
Make Data Work
31 May–1 June 2016: Training
1 June–3 June 2016: Conference
London, UK

A methodology for taxonomy generation and maintenance from large collections of textual data

Roxana Danger
16:00–16:30 Wednesday, 1/06/2016
Hardcore data science
Location: Capital Suite 4 Level: Intermediate
Tags: text
Average rating: 3.60 (5 ratings)

Prerequisite knowledge

Attendees should have experience with textual analytics.


One of the most significant challenges many organizations face is how to organize textual data semantically, so that analytical tools can benefit from a better understanding of the underlying domain knowledge. Most of this data consists of names and descriptions (e.g., product items, job advertisements, and company listings), and companies tend to organize it using a manually curated taxonomy of concepts, based on organizational knowledge, that is adjusted over time.

Millions of jobs have been advertised over the last 20 years on the job site for which ROOT, the reed online occupational taxonomy, was built, and the company behind it is exploring new ways of cataloguing its data to improve the quality of its products and services. Roxana Danger offers an overview of ROOT and discusses the semisupervised methodology used to generate (and maintain) taxonomies from large collections of textual data. Roxana outlines the importance of the methodology as well as the lessons learned during the taxonomy's construction.

The proposed methodology is composed of the following steps:

  1. Data collection: relevant data associated with the objects of analysis is captured.
  2. Named entity detection: ML models are trained to recognize the most important set of entities characterizing the objects.
  3. Object name detection and normalization: clustering techniques are applied, and a unique name is chosen for each type of object in the dataset.
  4. Taxonomy construction: based on the normalized object names, the taxonomy is constructed so that each level reflects a distinguishable new type of object.
  5. Taxonomy updating: active learning approaches are used to update the taxonomy incrementally.
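
As a rough illustration of steps 3 and 4, the sketch below normalizes raw job titles and groups the normalized names into a two-level taxonomy. It is not the ROOT implementation: the talk describes clustering and ML models, whereas this example swaps in simple rule-based normalization and a head-noun grouping to keep the idea visible. The noise-word list, the function names `normalize` and `build_taxonomy`, and the sample titles are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical seniority words to strip from titles; not taken from the talk.
NOISE = {"senior", "junior", "lead", "graduate", "trainee"}


def normalize(title):
    """Lowercase a title, keep alphabetic tokens only, and drop noise words."""
    tokens = re.findall(r"[a-z]+", title.lower())
    return " ".join(t for t in tokens if t not in NOISE)


def build_taxonomy(titles):
    """Group normalized titles under their last token (assumed head noun).

    Returns a dict mapping each head noun to a sorted list of the
    distinct normalized names that end with it.
    """
    taxonomy = defaultdict(set)
    for title in titles:
        name = normalize(title)
        if name:  # skip titles that consist only of noise words
            head = name.split()[-1]
            taxonomy[head].add(name)
    return {head: sorted(names) for head, names in sorted(taxonomy.items())}


# Made-up sample data for illustration.
titles = [
    "Senior Java Developer",
    "java developer",
    "Python Developer",
    "Junior Staff Nurse",
    "Staff Nurse",
]
print(build_taxonomy(titles))
# {'developer': ['java developer', 'python developer'], 'nurse': ['staff nurse']}
```

Using the last token as the parent concept is only a stand-in for the level-by-level construction described in step 4; a real system would presumably rely on the detected entities and clusters rather than on word position.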

Roxana Danger

In her research career, Roxana Danger has often pursued and achieved the dual goal of improving the performance of information extraction systems while proposing and validating novel mechanisms for storing and analyzing the extracted data in semantic knowledge databases. Roxana currently works as a data scientist at ReedOnline LTD, designing and applying machine learning and NLP techniques to provide data-driven insights to the company. She was previously a research associate at the Computing Department of Imperial College London, where she designed and implemented a provenance platform and data mining tools for diagnosis decision support in health care systems as part of the EU-FP7 project TRANSFoRm, and at the Department of Computer Systems and Computation at Universidad Politécnica de Valencia, Spain, where she worked on the development of an information extraction system for protein-protein interactions. Roxana holds a PhD from University Jaume I, Castellon, Spain, where her project aimed at extracting and analyzing semantic data from archaeological site excavation reports, and undergraduate and master’s degrees in computer science from Universidad de Oriente, Santiago de Cuba.