Mining Unstructured Data: Practical Applications

Alyona Medelyan (Pingar), Anna Divoli (Pingar)
Data Science, Mission City B1

The challenge of unstructured data is a top priority for organizations that are looking for ways to search, sort, analyze and extract knowledge from the masses of documents they store and create daily. Text mining uses knowledge-driven algorithms to make sense of documents much as a person would by reading them. Text mining and analytics tools have recently become available via APIs, meaning that organizations can take immediate advantage of these tools. We discuss three examples of how such APIs were used to solve key business challenges.

Most organizations dream of a paperless office, but still generate and receive millions of printed documents. Digitizing these documents and intelligently sharing them is a universal enterprise challenge. Major scanning providers offer solutions that analyze scanned and OCR’d documents and then store detected information in document management systems. This works well with pre-defined forms, but human interaction is required when scanning unstructured text. We describe a prototype built for the legal vertical that scans stacks of paper documents and, on the fly, categorizes them and generates meaningful metadata.

In forensics, intelligence and security, manual monitoring of masses of unstructured data is not feasible. The ability to automatically identify people’s names, addresses, credit card and bank account numbers and other entities is key. We will briefly describe a case study of how a major international financial institution is using text mining APIs to comply with recent legislation.

In healthcare, although Electronic Health Records (EHRs) have become increasingly available over the past two decades, patient confidentiality and privacy concerns have been obstacles to using the incredibly valuable information they contain to further medical research. Several approaches to assigning unique encrypted identifiers to patients’ IDs have been reported, but each comes with drawbacks. For a number of medical studies, consistent uniform ID mapping is not necessary, and automated text sanitization can serve as a solution. We will demonstrate how sanitization has practical use in a medical study.
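
To make the idea of automated text sanitization concrete, here is a minimal sketch in Python. It is not the approach presented in the talk and does not use Pingar's API. The patterns, the placeholder tags and the function names `luhn_valid` and `sanitize` are assumptions chosen for illustration. A production system would rely on trained entity recognizers rather than regular expressions alone.

```python
import re

# Illustrative patterns only (assumptions, not taken from the talk):
# 16-digit card numbers in groups of four, and ISO-style dates.
CARD_RE = re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b")
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def sanitize(text: str) -> str:
    """Replace card numbers and dates with placeholder tags."""

    def mask_card(match: re.Match) -> str:
        digits = re.sub(r"\D", "", match.group())
        # Mask only candidates that pass the checksum, to limit false positives.
        return "[CARD]" if luhn_valid(digits) else match.group()

    text = CARD_RE.sub(mask_card, text)
    return DATE_RE.sub("[DATE]", text)


if __name__ == "__main__":
    print(sanitize("Admitted 2012-03-14, paid with 4111 1111 1111 1111."))
```

Because every match is replaced by the same tag, this kind of sanitization does not preserve a consistent mapping between records, which matches the case described above where uniform ID mapping is not required.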

Alyona Medelyan


Alyona Medelyan holds a Master’s degree from the University of Freiburg and a PhD from the University of Waikato, both focused on Natural Language Processing. During her PhD, Medelyan developed Maui (Multi-purpose automatic topic indexing), an open-source tool that performs as well as professional librarians in identifying a document’s main topics. Maui is now used by companies and organizations around the world. Alyona has always been passionate about practical applications of her research, which led to internships at Google New York and Exorbyte Germany. She joined Pingar two years ago and now leads the research and development of API-based products that include semantic and faceted search, query analysis, text summarization, keyword extraction, and entity and entity relation extraction.

Anna Divoli


Anna Divoli holds a Master’s degree in Biosystems and Informatics from the University of Liverpool and a PhD in Biomedical Text Mining from the University of Manchester. For her doctoral research, Anna studied sentence extraction for semi-automatic annotation of biological databases. After her PhD, she carried out postdoctoral research, first in user search interfaces in the School of Information at the University of California at Berkeley and then in knowledge acquisition on cancer metastasis from expert opinions in the Department of Medicine at the University of Chicago. Her research focuses on developing methodologies for acquiring knowledge from textual data and studying the effect of human factors in that process. Anna joined Pingar in 2011 as Senior Software Researcher.


