17 Dec 2024

Metadata Curation using AI/ML Methods

WRITTEN BY

Sakshi Shinghal

In the age of Big Data, it’s important to have data that is not only of good quality but also well documented and easy to digest. The ‘data behind the data’, or metadata, can play as important a role in analyses as the data itself. This is where metadata curation comes in.

To better understand this important aspect of data-wrangling, one of our senior content analysts, Sakshi Shinghal, spoke to Lavanya Nemani, a resident data scientist here at Strand.


Sakshi: Hi, I’m Sakshi Shinghal, a bioinformatician and geneticist who has worked with Strand for around 3 years. I recently obtained an MSc in Genomic Medicine from King’s College London, in conjunction with St George’s, University of London. I have experience in creating webpages for bioinformatics tools, such as expression visualisation and variant analysis pipelines, as well as in developing variant analysis pipelines for different types of sequencing. I’m currently working at Strand as a senior content analyst, aiding in content development and marketing efforts.

Lavanya: Thanks, Sakshi. My name is Lavanya Nemani. I got a PhD in Machine Learning applications for Astrophysics from The National Institute for Astrophysics, Rome. With a background in astrophysics and computer science and 2 years of corporate experience, I have an interest in building AI applications in the healthcare field. I have worked on building models for the detection of cancerous cells in histopathology images and am currently developing foundational models for single-cell RNA sequencing data.

 

Sakshi: So let’s start with the basics, can you explain what metadata is and the types of metadata that tend to be curated?

Lavanya: Metadata is information about the experiment conducted, for example the tissue type, the organism studied and the age of the subject. Other important fields include the cell types studied and the disease condition. These help researchers narrow down their search for datasets relevant to their analysis.

 

Sakshi: I see, so metadata is really the data about our data. But why is metadata curation important, especially when working with public datasets?

Lavanya: When it comes to public datasets, metadata heterogeneity tends to be an issue. This could be, for instance, inconsistencies in terminologies or missing metadata fields caused by diverse data sources and a lack of standardization. This can cause issues in data integration and AI/ML applications. Hence, harmonization of such metadata is required. 
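To make the idea of harmonization concrete, here is a minimal Python sketch of how inconsistent field names and values from different sources might be mapped onto one shared vocabulary. The lookup tables and the `harmonize_record` function are hypothetical and purely illustrative; they are not Strand's curation rules, and a real workflow would more likely map terms to a curated ontology.

```python
# Hypothetical lookup tables for illustration only; a production workflow
# would more likely map terms to a curated ontology.
FIELD_SYNONYMS = {
    "organism": "organism",
    "species": "organism",
    "tissue": "tissue",
    "tissue_type": "tissue",
    "disease": "disease",
    "condition": "disease",
}

VALUE_SYNONYMS = {
    "organism": {"human": "Homo sapiens", "h. sapiens": "Homo sapiens"},
    "disease": {"uc": "Ulcerative Colitis", "ad": "Alzheimer's Disease"},
}


def harmonize_record(record):
    """Return (harmonized, unmapped) dictionaries for one metadata record."""
    harmonized, unmapped = {}, {}
    for raw_field, raw_value in record.items():
        # Normalize the field name before looking it up.
        key = raw_field.strip().lower().replace(" ", "_")
        field = FIELD_SYNONYMS.get(key)
        if field is None:
            # Keep unknown fields aside instead of silently dropping them.
            unmapped[raw_field] = raw_value
            continue
        value = str(raw_value).strip()
        # Replace known value synonyms; otherwise keep the cleaned value.
        harmonized[field] = VALUE_SYNONYMS.get(field, {}).get(value.lower(), value)
    return harmonized, unmapped


record = {"Species": "human", "Tissue Type": "colon", "Condition": "UC", "batch": "7"}
harmonized, unmapped = harmonize_record(record)
print(harmonized)
print(unmapped)
```

Fields that cannot be mapped are returned separately, so a curator can decide whether they are missing from the vocabulary or simply irrelevant.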

 

Sakshi: At Strand we have been working with metadata curation for years, curating it manually. Can you maybe talk us through projects where we’ve had to carry out this manual curation, and why it was so important that it was done? 

Lavanya: Our curation team is actively working with single-cell datasets and other public datasets. The aim with single-cell datasets is to publish disease-specific datasets that are ready for research use, and so we recently released Strand’s scRNA portal, which hosts meticulously curated datasets with 80+ metadata fields across 3 levels of curation. The portal is equipped with 26 filters for easy navigation of harmonized fields and a ‘Search’ bar to explore free-flow text. Developing it required time-intensive metadata curation: there were 97 unique metadata fields for just 2 diseases (Ulcerative Colitis and Alzheimer’s Disease).

 

Sakshi: That seems like a rather in-depth curation, with nearly 100 unique metadata fields! I’m sure using AI/ML could help in such a curation. Have we had any luck with this?

Lavanya: Our data science team is in the process of testing AI methods for metadata curation. The idea is to automate the process with a combination of traditional entity recognition methods, semantic search, retrieval-augmented generation (RAG) and large language models (LLMs). The last step will always be a manual quality control (QC) check to ensure the results are correct.
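The general shape of such a pipeline, automated matching first and manual QC last, can be sketched in a few lines of Python. This is only an assumption-laden illustration: the vocabulary, the threshold and the `curate_term` function are hypothetical, and plain string similarity from the standard library's `difflib` stands in for the semantic search, RAG and LLM steps, which are not reproduced here.

```python
from difflib import SequenceMatcher

# Hypothetical controlled vocabulary for a "tissue" field.
CONTROLLED_TISSUES = ["colon", "brain", "blood", "lung"]


def similarity(a, b):
    """String similarity in [0, 1]; a simple stand-in for semantic search."""
    return SequenceMatcher(None, a, b).ratio()


def curate_term(raw, vocabulary, threshold=0.8):
    """Map a raw value to a vocabulary term and flag uncertain cases for QC."""
    cleaned = raw.strip().lower()
    # Step 1: exact dictionary match, accepted without review.
    if cleaned in vocabulary:
        return {"raw": raw, "term": cleaned, "method": "exact", "needs_qc": False}
    # Step 2: closest vocabulary term by string similarity.
    best = max(vocabulary, key=lambda term: similarity(cleaned, term))
    if similarity(cleaned, best) >= threshold:
        return {"raw": raw, "term": best, "method": "fuzzy", "needs_qc": True}
    # Step 3: nothing close enough, so leave the decision to a curator.
    return {"raw": raw, "term": None, "method": "unresolved", "needs_qc": True}


for value in ["Colon", "colonn", "xyz"]:
    print(curate_term(value, CONTROLLED_TISSUES))
```

In this sketch only exact matches skip review; anything inferred or unresolved is routed to a curator, which mirrors the point that a manual check remains the final step.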

 

Sakshi: I see, so with this automated curation, the process will be a lot faster I’m sure. Would you have any preliminary results you could share with us?

Lavanya: We have seen turnaround time (TAT) improvements of ~8x with metadata curation for several fields, such as library preparation protocol, disease, tissue and organism. We also investigated the RAG method for biomedical text normalization, and the preliminary results are very promising, with accuracies of ~99% for the metadata field of disease and ~95% for the metadata field of tissue.
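As a rough illustration of how a per-field accuracy figure can be computed, the sketch below compares automatically curated values against manually curated reference labels. The `field_accuracy` function and the toy records are hypothetical; the numbers they produce are not Strand's results and say nothing about how the figures above were actually measured.

```python
def field_accuracy(predicted, gold, field):
    """Fraction of records where the predicted value equals the gold label."""
    # Only score records that have a manually curated label for this field.
    pairs = [(p.get(field), g[field]) for p, g in zip(predicted, gold) if field in g]
    if not pairs:
        raise ValueError(f"no gold labels available for field {field!r}")
    correct = sum(1 for pred, ref in pairs if pred == ref)
    return correct / len(pairs)


# Toy data, invented purely to exercise the function.
gold = [
    {"disease": "Ulcerative Colitis", "tissue": "colon"},
    {"disease": "Alzheimer's Disease", "tissue": "brain"},
    {"disease": "Ulcerative Colitis", "tissue": "colon"},
    {"disease": "Alzheimer's Disease", "tissue": "brain"},
]
predicted = [
    {"disease": "Ulcerative Colitis", "tissue": "colon"},
    {"disease": "Alzheimer's Disease", "tissue": "blood"},
    {"disease": "Ulcerative Colitis", "tissue": "colon"},
    {"disease": "Alzheimer's Disease", "tissue": "brain"},
]

print(field_accuracy(predicted, gold, "disease"))
print(field_accuracy(predicted, gold, "tissue"))
```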

 

Sakshi: Wow, that’s incredible! With such results, I’m certain we’re looking at integrating AI in future projects as well. Are there any you could tease us with?

Lavanya: We are working on several applications of foundational models for scRNA sequencing data. One example is the use of LLM-based models for cell type annotation, a crucial step in scRNA-Seq analysis. Other applications include:

  • gene function predictions
  • multiomics integrated analysis
  • gene perturbation prediction
  • drug response prediction  


Sakshi: Thanks so much, Lavanya, for taking the time to talk to me about this. It’s been a really illuminating conversation. It’s really cool to hear about how we’re integrating AI technology into our processes while still keeping the medical problem statement at the forefront. For readers who’d like to learn more about our metadata curation services or try out our scRNA portal for themselves, feel free to visit https://scrna-curation.mystrand.org/ or email us at bioinformatics@strandls.com.

 

 

 

 

 
