Solving Gaps in Biomedical NLP with Expert-Curated Ground-Truth Datasets

Our client, a leading precision medicine diagnostics provider based in the Bay Area and Switzerland, has maintained a longstanding relationship with Strand since 2017.
The client sought a ground-truth dataset to train and test an NLP algorithm they are developing to extract molecular and genetic data from publications.
Due to the limited availability of reliable ground-truth datasets in biomedical literature, organizations often find it challenging to train and evaluate NLP models effectively. Additionally, ambiguities and inconsistencies in biomedical terminology hinder accurate NLP relation extraction.
Strand curated a high-quality dataset by manually annotating over 1,000 client-provided research abstracts and images, using specified entity types and referencing well-known biomedical databases such as HGVS (for genes), DO (for diseases), ChEMBL (for drugs), and MeSH (for vocabulary). We used BRAT to annotate text passages for entities and relations, with normalization to standard ontologies where needed.
This dataset now serves as a gold standard for training and evaluating the client’s NLP algorithm.
Our expertise in biomedical curation, combined with extensive experience in molecular biology and clinical genomics, ensured precise annotation and timely project delivery.

Other Resources