In the age of Big Data, it’s important not only to have good-quality data but also well-documented, easily digestible data. The ‘data behind the data’, or metadata, can play as important a role in analyses as the data itself. This is where metadata curation comes in.
To better understand this important aspect of data-wrangling, one of our senior content analysts, Sakshi Shinghal, spoke to Lavanya Nemani, a resident data scientist here at Strand.
Sakshi: Hi, I’m Sakshi Shinghal, a bioinformatician and geneticist who has worked with Strand for around 3 years. I recently obtained an MSc in Genomic Medicine from King’s College London, in conjunction with St George’s, University of London. I have experience creating webpages for bioinformatics tools such as expression visualisation and variant analysis pipelines, as well as developing variant analysis pipelines for different types of sequencing. I’m currently working at Strand as a senior content analyst, aiding in content development and marketing efforts.
Lavanya: Thanks, Sakshi. My name is Lavanya Nemani. I earned a PhD in machine learning applications for astrophysics from the National Institute for Astrophysics, Rome. With a background in astrophysics and computer science and two years of corporate experience, I am interested in building AI applications in the healthcare field. I have worked on building models for the detection of cancerous cells in histopathology images and am currently developing foundational models for single-cell RNA sequencing data.
Sakshi: So let’s start with the basics: can you explain what metadata is and the types of metadata that tend to be curated?
Lavanya: Metadata is information about the experiment conducted: for example, the tissue type, the organism studied, age and so on. Other important fields include the cell types studied and the disease condition. These fields help researchers narrow down their search for datasets relevant to their analysis.
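To make this concrete, here is a minimal, illustrative sketch of what a single curated metadata record might look like. The field names and values are hypothetical examples, not Strand’s actual schema:

```python
# A hypothetical metadata record for one single-cell RNA-seq dataset.
# Field names are illustrative; real curation schemas contain many more
# fields (Strand's portal curates 80+ fields per dataset).
metadata_record = {
    "dataset_id": "GSE000000",        # placeholder accession
    "organism": "Homo sapiens",
    "tissue": "colon",
    "cell_types": ["epithelial cell", "T cell", "macrophage"],
    "disease": "ulcerative colitis",
    "age": "45 years",
    "library_preparation_protocol": "10x 3' v3",
}

# Researchers can filter on harmonized fields like these to quickly
# find datasets relevant to their analysis.
print(metadata_record["disease"])
```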
Sakshi: I see, so metadata is really the data about our data. But why is metadata curation important, especially when working with public datasets?
Lavanya: When it comes to public datasets, metadata heterogeneity tends to be an issue. This can take the form of inconsistent terminology or missing metadata fields, caused by diverse data sources and a lack of standardization. These problems complicate data integration and AI/ML applications, so harmonization of the metadata is required.
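As a toy illustration of what harmonization means in practice, the sketch below maps a few inconsistent free-text disease labels onto a single standard term. A real pipeline would typically map to ontology identifiers (for example, MONDO terms) and cover many more fields, which this example does not attempt:

```python
# Toy harmonization: map heterogeneous free-text labels from different
# public sources onto one standard term. The synonym table is invented
# for illustration only.
SYNONYMS = {
    "uc": "ulcerative colitis",
    "colitis, ulcerative": "ulcerative colitis",
    "ulcerative colitis (uc)": "ulcerative colitis",
    "alzheimers": "Alzheimer's disease",
    "alzheimer disease": "Alzheimer's disease",
}

def harmonize(raw_label: str) -> str:
    """Return a standardized label, or the cleaned input if unknown."""
    key = raw_label.strip().lower()
    return SYNONYMS.get(key, key)

for raw in ["UC", "Alzheimers", "Ulcerative Colitis (UC)"]:
    print(raw, "->", harmonize(raw))
```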
Sakshi: At Strand we have been working with metadata curation for years, curating it manually. Can you maybe talk us through projects where we’ve had to carry out this manual curation, and why it was so important that it was done?
Lavanya: Our curation team is actively working with single-cell datasets and other public datasets. The aim with single-cell datasets is to publish disease-specific datasets that are ready for research use, so we recently released Strand’s scRNA portal, which hosts meticulously curated datasets with 80+ metadata fields across 3 levels of curation. The portal is equipped with 26 filters for easy navigation of the harmonized fields and a ‘Search’ bar for exploring free-text fields. Developing it took time-intensive metadata curation: it involved 97 unique metadata fields for just 2 diseases (Ulcerative Colitis and Alzheimer’s Disease).
Sakshi: That seems like a rather in-depth curation, with nearly 100 unique metadata fields! I’m sure using AI/ML could help with such a curation; have we had any luck with this?
Lavanya: Our data science team is in the process of testing AI methods for metadata curation. The idea is to automate the process with a combination of traditional entity recognition methods, semantic search, retrieval-augmented generation (RAG) and LLMs. The last step will always be a manual QC check to ensure the results are correct.
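A rough sketch of how such a pipeline could be wired together is shown below. It is not Strand’s actual implementation: the `call_llm` step is a hypothetical placeholder for whatever model would be queried, and a simple fuzzy match via `difflib` stands in for a proper NER model and embedding-based semantic search:

```python
import difflib

# Controlled vocabulary the pipeline normalizes against (illustrative).
TISSUE_VOCAB = ["colon", "brain", "liver", "lung", "peripheral blood"]

def extract_candidates(text: str) -> list[str]:
    """Very naive 'entity recognition': keep tokens resembling tissue mentions.
    A real pipeline would use a trained NER model here."""
    tokens = text.lower().replace(",", " ").split()
    return [t for t in tokens
            if difflib.get_close_matches(t, TISSUE_VOCAB, n=1, cutoff=0.6)]

def semantic_normalize(candidate: str) -> str | None:
    """Stand-in for semantic search: fuzzy-match against the vocabulary."""
    matches = difflib.get_close_matches(candidate, TISSUE_VOCAB, n=1, cutoff=0.6)
    return matches[0] if matches else None

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for a RAG/LLM call that resolves ambiguous cases."""
    return "colon"  # a real system would query a model with retrieved context

def curate_tissue(sample_description: str) -> str:
    """Fill the 'tissue' metadata field from a free-text sample description."""
    for cand in extract_candidates(sample_description):
        normalized = semantic_normalize(cand)
        if normalized:
            return normalized
    # Fall back to the LLM when the rule-based steps fail; a manual QC
    # check would still review the final value.
    return call_llm(sample_description)

print(curate_tissue("Biopsies were taken from the sigmoid coln of UC patients"))
```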
Sakshi: I see; with this automated curation, I’m sure the process will be a lot faster. Do you have any preliminary results you could share with us?
Lavanya: We have seen turnaround time (TAT) improvements of ~8x in metadata curation for several fields, such as library preparation protocol, disease, tissue and organism. We also investigated the RAG method for biomedical text normalization, and the preliminary results are very promising, with accuracies of ~99% for the disease field and ~95% for the tissue field.
Sakshi: Wow, that’s incredible! With such results, I’m certain we’re looking at integrating AI in future projects as well. Are there any you could tease us with?
Lavanya: We are working on several applications of foundational models for scRNA-seq data. For example, we are using LLM-based models for cell type annotation, a crucial step in scRNA-seq analysis. Other applications include:
- gene function predictions
- multiomics integrated analysis
- gene perturbation prediction
- drug response prediction
Sakshi: Thanks so much, Lavanya, for taking the time to talk to me about this. It’s been a really illuminating conversation. It’s really cool to hear how we’re integrating AI technology into our processes while still keeping the medical problem statement at the forefront. For readers who’d like to learn more about our metadata curation services or try out our scRNA portal for themselves, feel free to visit https://scrna-curation.mystrand.org/ or email us at bioinformatics@strandls.com.