Data Organization and Therapeutic Applications of AI

01/ Metadata Harmonization

Metadata Harmonization with LLMs

We have developed LLM-based methods to harmonize metadata from GEO (Gene Expression Omnibus) into a destination schema and ontology.

Broadly, the methods involve three components: a defined destination ontology for each schema field in a database; a source, which can be free text; and some combination of semantic search, retrieval-augmented generation (RAG), and an LLM that harmonizes the free text to the destination ontology for that schema field.
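As a rough illustration of the semantic-search step only, the sketch below maps a free-text value to the closest term in a destination ontology. It is not the method described above: character-trigram cosine similarity stands in for an embedding model, no RAG or LLM call is made, and the ontology IDs, synonyms, function names, and the `min_score` threshold are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical destination ontology for a "tissue" schema field.
# The IDs and synonyms are placeholders, not real ontology entries.
ONTOLOGY = {
    "EX:0001": ["colon", "large intestine", "colonic tissue"],
    "EX:0002": ["pancreas", "pancreatic tissue"],
    "EX:0003": ["lymph node", "lymphatic node"],
}


def _vectorize(text):
    """Bag of character trigrams: a crude stand-in for an embedding model."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    padded = f"  {' '.join(cleaned.split())}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def _cosine(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def harmonize(free_text, ontology=ONTOLOGY, min_score=0.3):
    """Return (term_id, score) for the best match, or (None, score) if too weak."""
    query = _vectorize(free_text)
    best_id, best_score = None, 0.0
    for term_id, labels in ontology.items():
        for label in labels:
            score = _cosine(query, _vectorize(label))
            if score > best_score:
                best_id, best_score = term_id, score
    if best_score < min_score:
        # Weak matches are left unmapped so they can be reviewed, or passed
        # to an LLM together with the top candidates.
        return None, best_score
    return best_id, best_score
```

With this toy ontology, `harmonize("Colon biopsy (sigmoid)")` maps to `EX:0001`, while an unrelated string such as `"qqq"` returns `None`. In a fuller pipeline, the unmapped or low-scoring values are the ones where retrieved candidates plus an LLM would plausibly do the most work.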

The methods are currently available for 15 destination ontologies pertaining to single-cell RNA-Seq.

For these, automated harmonization achieves 90-97% accuracy and costs an additional $1 per 20 datasets (about $0.05 per dataset) in GPT-related API calls.

See this poster for more details on the methods involved.

Get In Touch with Us

Contact us to adapt these to your favourite ontology or to access APIs for the harmonization tools we’ve already developed.

02/ Foundational Models

Fine-Tuning Foundational Models

Cell-Type Annotation

We fine-tuned two single-cell foundation models for cell-type annotation on ulcerative colitis cell types, hand-curating these cell types to a specified ontology prior to fine-tuning. Our models show improved accuracy (3.7% for colon and 0.8% for pancreas) and improved macro F1 score (7% for colon and 33% for pancreas), and they provide a probabilistic classification score for discovering potential novel cell types.
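To make the two metrics concrete, here is a minimal sketch of accuracy, macro F1, and a simple use of per-class probabilities to flag cells that might belong to a novel type. The function names, the toy labels, and the 0.5 threshold are assumptions for illustration and are not taken from the models described above.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of cells whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare cell types count equally."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def flag_candidate_novel(probabilities, threshold=0.5):
    """Indices of cells whose highest class probability is below threshold."""
    return [
        i for i, probs in enumerate(probabilities)
        if max(probs.values()) < threshold
    ]


# Toy data (made up): a classifier that never predicts the rare class.
y_true = ["enterocyte"] * 8 + ["goblet"] * 2
y_pred = ["enterocyte"] * 10
print(accuracy(y_true, y_pred))   # 0.8
print(macro_f1(y_true, y_pred))   # about 0.44
```

The toy example shows why the two metrics can move by very different amounts: missing a rare class barely changes accuracy but halves macro F1, so a model that recovers rare cell types can show a small accuracy gain alongside a large macro F1 gain.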

Improved Cell-Type Prediction and Target Discovery

We fine-tuned single-cell foundation models for cell-type classification, achieving up to 33% improvement in macro F1 scores (7% for colon, 33% for pancreas) along with higher accuracy. This enables more precise identification of disease-relevant cell types and molecular targets, which can in turn support more effective therapies, higher market uptake, and increased revenue.

Improved Single- and Two-Gene Perturbation Prediction

Our single-cell foundation models achieve a 10–20% reduction in mean squared error for two-gene perturbation prediction, helping pharma teams prioritize candidates for the most promising combination therapies. This cuts down experimental time, lowers costs, and streamlines drug development.
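For reference, the sketch below shows how a reduction in mean squared error can be computed, assuming the reduction is measured relative to a baseline model's MSE on the same held-out perturbations. The function names and the example values are hypothetical.

```python
def mean_squared_error(predicted, observed):
    """Average squared difference between predicted and observed values."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)


def relative_reduction(baseline_mse, model_mse):
    """Fractional MSE reduction relative to the baseline (0.15 means 15%)."""
    return (baseline_mse - model_mse) / baseline_mse


# Made-up expression changes for three genes after a two-gene perturbation.
observed = [1.0, 2.0, 3.0]
baseline_pred = [1.5, 2.5, 3.5]
model_pred = [1.0, 2.5, 3.5]

baseline_mse = mean_squared_error(baseline_pred, observed)
model_mse = mean_squared_error(model_pred, observed)
print(relative_reduction(baseline_mse, model_mse))
```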

Improved Predictive Power for Drug Response

We developed an integrated predictive model that combines embeddings from a cancer-specific foundation model (CancerFoundation), embeddings from a large single-cell foundation model (scLong), and drug chemical structure. The model was trained to predict a drug response metric (IC50) for 104 cancer drugs across 884 cell lines, and it provided a 10% improvement in Pearson’s correlation for IC50 value prediction. Foundation models (FMs) trained on large-scale single-cell data can predict how different cell types will respond based on basal gene expression and drug chemical structure, thus helping prioritize the most promising compounds and reduce the number of failed clinical trials.
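The sketch below illustrates two pieces of such a setup: building one input vector per (cell line, drug) pair and scoring predictions with Pearson's correlation. Concatenation is an assumption made here for illustration, since the way the embeddings are actually combined is not specified above; the function and parameter names are hypothetical, and the regression model itself is omitted.

```python
import math


def build_features(cancer_fm_embedding, single_cell_fm_embedding, drug_features):
    """Concatenate the three feature sources into one input vector.

    Simple concatenation is an assumed fusion strategy, not necessarily
    the one used by the integrated model.
    """
    return (
        list(cancer_fm_embedding)
        + list(single_cell_fm_embedding)
        + list(drug_features)
    )


def pearson_r(x, y):
    """Pearson's correlation between predicted and measured values."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)
```

A typical evaluation would call `pearson_r(predicted_ic50, measured_ic50)` on held-out (cell line, drug) pairs for both the baseline and the integrated model, then compare the two coefficients.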

Improved Gene-Function Classification for Target Discovery

Our single-cell foundation model-based approaches demonstrated ~15% improvements in macro F1 scores on gene-function classification tasks compared to conventional machine learning methods such as random forests and support vector machines. More accurate computational gene-function classification can link genes to specific biological processes, and thus accelerate large-scale genomic data analysis and drug target prioritization for therapeutic intervention while reducing manual experimental effort.

Multimodal Integration of Single-Cell Data

Our single-cell foundation models can help integrate multiple modalities, such as scRNA-seq and scATAC-seq, and subsequently enable downstream tasks based on the joint embeddings. This can lead to more accurate cell-type classification (~7% and ~22% improvements in Adjusted Rand Index for lymph node and PBMC data, respectively, compared to Seurat V5) and trajectory inference. Using multimodal data can lead to more comprehensive insights into disease mechanisms and disease cell types, and to improved target discovery.
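For readers unfamiliar with the metric, the Adjusted Rand Index (ARI) compares a predicted grouping of cells with reference labels and corrects for chance agreement. Below is a minimal standard-library sketch of the usual pair-counting formula; the function name is our own, and handling degenerate inputs by returning 1.0 is a simplifying assumption.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """ARI between two labelings of the same cells (1.0 = identical grouping)."""
    if len(labels_a) != len(labels_b) or len(labels_a) < 2:
        raise ValueError("need two labelings of equal length >= 2")
    n = len(labels_a)
    # Pairs of cells grouped together in both labelings.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs of cells grouped together in each labeling separately.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        # Degenerate case, e.g. every cell in one cluster in both labelings.
        return 1.0
    return (together_both - expected) / (maximum - expected)


# Cluster names do not matter, only which cells are grouped together.
print(adjusted_rand_index([0, 0, 1, 1], ["x", "x", "y", "y"]))  # 1.0
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))          # -0.5
```

Because ARI ignores the cluster names themselves, it suits comparing clusters derived from joint embeddings against annotated cell types without first matching cluster IDs to labels.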