One of the most significant challenges many organizations face is organizing textual data semantically, so that analytical tools can benefit from a better understanding of the underlying domain knowledge. The vast majority of this data is described in terms of names and descriptions (e.g., product items, job advertisements, and company listings), and companies tend to organize it using a manually curated taxonomy of concepts, based on organizational knowledge, that evolves over time.
Millions of jobs have been advertised at reed.co.uk over the last 20 years. Reed.co.uk is currently exploring new ways of cataloguing its data to improve the quality of its products and services. Roxana Danger offers an overview of ROOT, the Reed Online Occupational Taxonomy, which was constructed to improve the quality of services at reed.co.uk, and discusses a semisupervised methodology for generating (and maintaining) taxonomies from large collections of textual data. Roxana outlines the importance of the methodology as well as the lessons learned during the taxonomy construction at reed.co.uk.
The proposed methodology is composed of the following steps:
In her research career, Roxana Danger has often pursued and achieved the dual goal of improving the performance of information extraction systems while proposing and validating novel mechanisms for storing and analyzing the extracted data in semantic knowledge databases. Roxana is currently working as a data scientist at ReedOnline LTD, designing and applying machine learning and NLP techniques to provide data-driven insights to the company. She was previously a research associate at the Computing Department of Imperial College London, where she designed and implemented a provenance platform and data mining tools for diagnosis decision support in health care systems as part of the EU-FP7 project TRANSFoRm, and at the Department of Computer Systems and Computation at Universidad Politécnica de Valencia, Spain, where she worked on the development of an information extraction system for protein-protein interactions. Roxana holds a PhD from University Jaume I, Castellon, Spain, where her project aimed at extracting and analyzing semantic data from archaeology site excavation reports, and undergraduate and master’s degrees in computer science from Universidad de Oriente, Santiago de Cuba.
©2016, O’Reilly UK Ltd