

Data Organization and Therapeutic Applications of AI
Metadata Harmonization with LLMs

01
We have developed methods that use LLMs to harmonize metadata from the Gene Expression Omnibus (GEO) into a destination schema and ontology.

02
Broadly, each method involves three parts: a defined destination ontology for each schema field in a database; a source, which can be free text; and some combination of semantic search, retrieval-augmented generation (RAG), and a fine-tuned LLM that harmonizes the free text to the destination ontology for that schema field.

03
The methods are currently available for 15 destination ontologies pertaining to single-cell RNA-Seq.

04
For these ontologies, automated harmonization achieves 90-97% accuracy and costs an additional $1 per 20 datasets ($0.05 per dataset) in GPT-related API calls.

05
See this poster for more details on the methods involved.
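As a rough illustration of the harmonization step described above, the sketch below maps a free-text value to the closest term in a destination ontology for one schema field. It is a minimal stand-in under stated assumptions, not the method used here: plain string similarity from Python's `difflib` takes the place of semantic search, RAG, and the fine-tuned LLM, and the ontology IDs, synonyms, and the `harmonize` function are hypothetical placeholders.

```python
from difflib import SequenceMatcher

# Hypothetical destination ontology for one schema field ("cell_type").
# The IDs and synonyms are placeholders, not real ontology entries.
ONTOLOGY = {
    "EX:0001": ["T cell", "T lymphocyte"],
    "EX:0002": ["B cell", "B lymphocyte"],
    "EX:0003": ["macrophage"],
}


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (stand-in for semantic search)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def harmonize(free_text, ontology=ONTOLOGY, threshold=0.6):
    """Return (term_id, score) for the best-matching term, or (None, score)
    when no label is similar enough to be trusted."""
    best_id, best_score = None, 0.0
    for term_id, labels in ontology.items():
        for label in labels:
            score = similarity(free_text, label)
            if score > best_score:
                best_id, best_score = term_id, score
    if best_score < threshold:
        return None, best_score  # leave for manual curation
    return best_id, best_score


if __name__ == "__main__":
    for raw in ["t lymphocyte", "Macrophages", "xyz123"]:
        print(raw, "->", harmonize(raw))
```

Values that fall below the threshold are returned as unmatched rather than forced onto a term, which is one simple way to keep low-confidence mappings out of the harmonized database.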
Fine Tuning Foundational Models

Cell-Type Annotation
We fine-tuned two single-cell foundation models for cell-type annotation on ulcerative colitis cell types, hand-curating these cell types to a specified ontology before fine-tuning. Our models show improved accuracy and macro F1 (3.7% and 0.8% improvement in accuracy for colon and pancreas, respectively, and 7% and 33% improvement in macro F1 for the same tissues). They also provide a probabilistic classification score for discovering potential novel cell types.
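One common way a probabilistic classification score can be used to surface candidate novel cell types is to leave a cell unassigned when the top class probability is low. The sketch below shows that idea only; how the score is actually computed and thresholded is not described here, so the softmax step, the `min_confidence` value, and the function names are all assumptions.

```python
import math


def softmax(logits):
    """Convert raw classifier scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_with_rejection(logits, labels, min_confidence=0.7):
    """Return (label, probability); the label is "unassigned" when the top
    probability is below min_confidence (a hypothetical cutoff)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < min_confidence:
        return "unassigned", probs[best]
    return labels[best], probs[best]


if __name__ == "__main__":
    labels = ["enterocyte", "goblet cell", "T cell"]
    print(classify_with_rejection([5.0, 0.0, 0.0], labels))  # confident call
    print(classify_with_rejection([1.0, 1.0, 1.0], labels))  # ambiguous call
```

Cells flagged as unassigned are candidates for follow-up rather than confirmed novel types; the cutoff trades off missed novel populations against false alarms.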

Improved Cell-Type Prediction and Target Discovery
We fine-tuned single-cell foundation models for cell-type classification, achieving improvements of up to 33% in macro F1 score (7% for colon, 33% for pancreas) along with higher accuracy. This enables more precise identification of disease-relevant cell types and molecular targets, which can lead to more effective therapies, higher market uptake, and increased revenue.
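Accuracy and macro F1 can move by very different amounts because macro F1 weights every cell type equally, however rare. The sketch below computes both from scratch on toy labels invented for illustration (they are not the colon or pancreas data); in practice a library such as scikit-learn would typically be used instead.

```python
def accuracy(y_true, y_pred):
    """Fraction of cells whose predicted label matches the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy data: nine cells of a common type, one of a rare type,
    # and a classifier that always predicts the common type.
    y_true = ["common"] * 9 + ["rare"]
    y_pred = ["common"] * 10
    print("accuracy:", accuracy(y_true, y_pred))
    print("macro F1:", macro_f1(y_true, y_pred))
```

In this toy case accuracy is 0.9 while macro F1 is about 0.47, because the rare class is never recovered. A model that improves mainly on rare cell types can therefore show a large macro F1 gain with only a small change in accuracy.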

Improved Single- and Two-Gene Perturbation Prediction
Our single-cell foundation models achieve a 10–20% reduction in mean squared error for two-gene perturbation prediction, helping pharma teams prioritize candidates for the most promising combination therapies. This cuts down experimental time, lowers costs, and streamlines drug development.
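For readers less familiar with the metric, the sketch below shows how mean squared error and a relative reduction against a baseline are computed. The numbers are invented for illustration and the function names are hypothetical; the evaluation setup behind the figure above is not specified here.

```python
def mse(y_true, y_pred):
    """Mean squared error between measured and predicted expression changes."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_reduction(baseline_error, model_error):
    """Fractional error reduction relative to a baseline (0.2 means 20% lower)."""
    return (baseline_error - model_error) / baseline_error


if __name__ == "__main__":
    # Illustrative values only, not measured results.
    measured = [1.0, 2.0, 3.0]
    baseline_pred = [1.0, 2.0, 5.0]
    model_pred = [1.0, 2.0, 4.5]
    base = mse(measured, baseline_pred)
    new = mse(measured, model_pred)
    print("baseline MSE:", base)
    print("model MSE:", new)
    print("relative reduction:", relative_reduction(base, new))
```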

Improved Predictive Power for Drug Response
We developed an integrated predictive model that combines embeddings from a cancer-specific foundation model (CancerFoundation), a large single-cell foundation model (scLong), and drug chemical structure. The model was trained to predict a drug response metric (IC50) for 104 cancer drugs across 884 cell lines, and it achieved a 10% improvement in Pearson's correlation for IC50 prediction. Foundation models trained on large-scale single-cell data can predict how different cell types will respond based on basal gene expression and drug chemical structure, which helps prioritize the most promising compounds and can reduce the number of failed clinical trials.
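The sketch below illustrates two pieces of the setup described above: building one input vector per (cell line, drug) pair from several embeddings, and scoring predictions with Pearson's correlation. Simple concatenation is an assumption, since how the embeddings are combined is not stated here, and the function names and toy values are hypothetical.

```python
import math


def build_features(cancer_emb, cell_emb, drug_emb):
    """Concatenate three embeddings into one feature vector (assumed fusion scheme)."""
    return list(cancer_emb) + list(cell_emb) + list(drug_emb)


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


if __name__ == "__main__":
    # Toy embeddings; real ones would come from the respective models.
    features = build_features([0.1, 0.4], [0.2, 0.3, 0.9], [1.0, 0.0])
    print("feature vector length:", len(features))

    # Toy measured vs. predicted IC50 values, invented for illustration.
    measured = [0.5, 1.2, 2.0, 3.1]
    predicted = [0.7, 1.0, 2.2, 2.9]
    print("Pearson r:", pearson_r(measured, predicted))
```

Note that `pearson_r` is undefined when either input is constant, since the denominator is then zero; a production version would need to handle that case.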

Improved Gene-Function Classification for Target Discovery
Our approaches based on single-cell foundation models demonstrated ~15% improvements in macro F1 scores on gene-function classification tasks compared with conventional machine learning methods such as random forests and support vector machines. More accurate computational gene-function classification can link genes to specific biological processes, which accelerates large-scale genomic data analysis and drug target prioritization for therapeutic intervention and reduces the amount of manual experimental effort.

Multimodal Integration of Single-Cell Data
Our single-cell foundation models can help integrate multiple modalities, such as scRNA-seq and scATAC-seq, and enable downstream tasks based on the joint embeddings. This can lead to more accurate cell-type classification (~7% and ~22% improvements in Adjusted Rand Index for lymph node and PBMC data, respectively, compared with Seurat V5) and trajectory inference. Using multimodal data can provide more comprehensive insights into disease mechanisms and disease cell types, and can improve target discovery.
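The Adjusted Rand Index (ARI) cited above measures how well a grouping of cells agrees with reference labels, corrected for chance agreement. The sketch below is a from-scratch implementation of the standard formula on toy labels; it is meant only to make the metric concrete and says nothing about the evaluation pipeline actually used.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """ARI between two labelings: 1.0 for identical partitions,
    about 0 for chance-level agreement, negative for worse than chance."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare; treat as trivially identical
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of cell pairs that share a group under each view.
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / comb(n, 2)
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. every cell in a single group
    return (sum_pairs - expected) / (max_index - expected)


if __name__ == "__main__":
    reference = ["B", "B", "T", "T"]
    print(adjusted_rand_index(reference, [0, 0, 1, 1]))  # same partition, renamed
    print(adjusted_rand_index(reference, [0, 1, 0, 1]))  # every pair split apart
```

Because ARI depends only on which cells are grouped together, cluster names do not need to match the reference label names.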