Ontology-driven text-mining analysis and normalization of free-text specimen descriptions


  • Damion Dooley
  • Gurinder Gosal
  • William Hsiao

Workshop type:

  • Tutorial/Demo

Workshop Abstract

Researchers, in honouring FAIR principles, often require open-source software tools for cleaning up and harmonizing free-text specimen descriptions, but few options currently exist, and those that are available perform poorly on specimen descriptions, which are often fragmented, punctuation- and abbreviation-laden phrases or lists of keywords. We present a tutorial on our hybrid lexicon- and rule-based pipeline for converting short, grammatically incomplete sample descriptions into appropriate ontology terms. It was developed and tested over the past year against a training dataset of EnteroBase sample metadata related to food-borne pathogens, and will be further validated against the FDA's GenomeTrakr database of DNA-sequenced food-borne pathogens.

The pipeline illustrates and addresses many challenges in processing short textual data, such as the detection of entity boundaries, grammatical incorrectness, misspellings, plurality, coined acronyms, challenges of synonymy, the mingling of multiple languages, term context ambiguity, the recognition of new biomedical terms as they come into use, and ontology vocabulary deficits. Our pipeline combines basic lexicographic transformation with light Natural Language Processing, synonymy, ontology and other resource lookup to produce a tokenized equivalent description suitable for keyword and ontology-driven search of specimen database contents. Selected domain ontologies such as ENVO, UBERON, FoodOn, Unit Ontology, and NCBITaxon provide the breadth of terms sought for sample description. We foresee that other text-mining content domains can be addressed by adding select ontologies and abbreviation tables. The tutorial will review the design decisions made in interfacing ontology and lexicographic resources, and will compare results with other tools like the Google Ontomaton add-on, Extract (https://extract.jensenlab.org/), and the Monarch Initiative text annotator (https://monarchinitiative.org/annotate/text).
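
To make the general idea concrete, the following is a minimal Python sketch of a lexicon- and rule-based normalization step of the kind described above. It is not the pipeline's actual code: the abbreviation table, the lexicon, the function names, and the "EX:" identifiers are all hypothetical placeholders, and the plural rule is deliberately naive. It shows lexicographic cleanup (lowercasing, punctuation stripping), abbreviation expansion, simple singularization, and greedy longest-phrase lookup against a term lexicon.

```python
import re

# Hypothetical abbreviation table; a real one would be curated per domain.
ABBREVIATIONS = {"chkn": "chicken", "brst": "breast", "frz": "frozen"}

# Hypothetical lexicon mapping labels to placeholder IDs (not real ontology IDs).
LEXICON = {
    "chicken breast": "EX:0001",
    "chicken": "EX:0002",
    "frozen": "EX:0003",
    "egg": "EX:0004",
}


def singularize(token):
    """Naive plural handling; real text needs a proper inflection resource."""
    if token.endswith("ies") and len(token) > 4:
        return token[:-3] + "y"
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def normalize(text):
    """Lowercase, drop punctuation, expand abbreviations, singularize."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    return [singularize(t) for t in tokens]


def annotate(text, max_ngram=3):
    """Greedy longest-match lookup of token n-grams in the lexicon.

    Returns (matches, unmatched): matched (label, id) pairs, plus tokens
    with no lexicon entry, which may point to vocabulary gaps.
    """
    tokens = normalize(text)
    matches, unmatched = [], []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_ngram, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in LEXICON:
                matches.append((phrase, LEXICON[phrase]))
                i += n
                break
        else:
            unmatched.append(tokens[i])
            i += 1
    return matches, unmatched


matches, unmatched = annotate("Chkn brst, frz; eggs (raw)")
```

In this sketch, trying the longest phrase first means "chicken breast" is preferred over the shorter "chicken", which is one simple way to approach the entity-boundary problem. Tokens left unmatched are returned rather than discarded, since they may indicate misspellings or terms missing from the vocabulary.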


Researchers will become familiar with upgrading existing food-borne pathogen datasets that have free-text descriptions, so that the data can be queried and amalgamated more easily using Semantic Web tools. They will also see the challenges posed by the untidy structure of these kinds of datasets, and likely ontology-driven solutions for improving them in the future. We expect the tutorial to foster the vision of the Foundry's ontologies collectively as a content coding standard. We also see the tutorial as an opportunity for feedback on the pipeline's user interface and functionality, and look forward to hearing from participants about feature wish-lists and other sample metadata domains to which it could be extended.

Can-SHARE New Initiatives 2016