A Python module for cell type annotation using various LLMs.
Project description
mLLMCelltype
Overview
mLLMCelltype is a Python package for cell type annotation in single-cell RNA sequencing data using an iterative multi-LLM consensus approach. It combines predictions from multiple large language models and provides uncertainty quantification. The package is compatible with the scverse ecosystem, including AnnData objects and Scanpy workflows.
Installation
pip install mllmcelltype
For development:
git clone https://github.com/cafferychen777/mLLMCelltype.git
cd mLLMCelltype/python
pip install -e .
Requirements: Python >= 3.9, internet connection for LLM API access.
Quick Start
import pandas as pd
from mllmcelltype import annotate_clusters, setup_logging
setup_logging()
marker_genes_df = pd.read_csv('marker_genes.csv')
import os
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
annotations = annotate_clusters(
marker_genes=marker_genes_df,
species='human',
provider='openai',
model='gpt-5.2',
tissue='brain'
)
for cluster, annotation in annotations.items():
print(f"Cluster {cluster}: {annotation}")
Supported Providers and Models
| Provider | Models | API Key Variable |
|---|---|---|
| OpenAI | GPT-5.2, GPT-5, O3-Pro, etc. | OPENAI_API_KEY |
| Anthropic | Claude 4.6 Opus, Claude 4.5 Sonnet/Haiku, etc. | ANTHROPIC_API_KEY |
| Gemini 3 Pro, Gemini 3 Flash, etc. | GEMINI_API_KEY (also supports GOOGLE_API_KEY) |
|
| Alibaba | Qwen3-Max, Qwen-Plus, etc. | QWEN_API_KEY |
| DeepSeek | DeepSeek-Chat, DeepSeek-Reasoner | DEEPSEEK_API_KEY |
| StepFun | Step-3, Step-2-16k, Step-2-Mini | STEPFUN_API_KEY |
| Zhipu AI | GLM-4.7, GLM-4-Plus | ZHIPU_API_KEY |
| MiniMax | MiniMax-M2.1, MiniMax-M2 | MINIMAX_API_KEY |
| X.AI | Grok-4 | GROK_API_KEY |
| OpenRouter | Access to multiple models via single API | OPENROUTER_API_KEY |
API keys can be set via environment variables, passed directly as parameters, or loaded from a .env file.
Annotation Features
- Iterative Consensus: Multiple rounds of comparison between LLM outputs to resolve disagreements
- Uncertainty Quantification: Consensus Proportion (CP) and Shannon Entropy (H) metrics
- Cross-model Comparison: Helps identify inconsistent predictions across models
- Hierarchical Annotation: Optional multi-resolution analysis with parent-child consistency
- Caching: Avoids redundant API calls to reduce costs
- Custom Base URLs: Configure custom API endpoints for proxy servers or enterprise gateways
Multi-LLM Consensus Annotation
from mllmcelltype import format_discussion_report, interactive_consensus_annotation
result = interactive_consensus_annotation(
marker_genes=marker_genes,
species='human',
tissue='peripheral blood',
models=['gpt-5.2', 'claude-sonnet-4-5-20250929', 'gemini-3-pro', 'qwen3-max'],
consensus_threshold=0.7,
max_discussion_rounds=3,
verbose=True
)
print(result["consensus"])
print(format_discussion_report(result))
Consensus Model Selection
The consensus_model parameter specifies which LLM evaluates semantic similarity, calculates consensus metrics, and moderates discussions. Recommended models for consensus checking:
- Anthropic:
claude-sonnet-4-5-20250929,claude-opus-4-1-20250805 - OpenAI:
o1,gpt-5.2,gpt-4.1 - Google:
gemini-3-pro,gemini-3-flash - Other:
deepseek-r1,qwen3-max,grok-4
result = interactive_consensus_annotation(
marker_genes=marker_genes,
species="human",
tissue="brain",
models=["gpt-5.2", "claude-sonnet-4-5-20250929", "gemini-3-pro"],
consensus_model="claude-sonnet-4-5-20250929",
consensus_threshold=0.7,
entropy_threshold=1.0
)
If not specified, defaults to qwen3-max with claude-sonnet-4-5-20250929 as fallback.
Targeted Analysis
Analyze Specific Clusters
result = interactive_consensus_annotation(
marker_genes=all_marker_genes,
species="human",
models=["gpt-5.2", "claude-sonnet-4-5-20250929", "gemini-3-pro"],
clusters_to_analyze=["cluster_0", "cluster_1", "cluster_2"],
tissue="peripheral blood"
)
Force Fresh Analysis (Bypass Cache)
result = interactive_consensus_annotation(
marker_genes=marker_genes,
species="human",
models=["gpt-5.2", "claude-sonnet-4-5-20250929"],
tissue="peripheral blood",
additional_context="Patient with autoimmune disease",
force_rerun=True
)
OpenRouter Integration
OpenRouter provides a unified API for accessing models from multiple providers.
from mllmcelltype import annotate_clusters
annotations = annotate_clusters(
marker_genes=marker_genes,
species='human',
tissue='peripheral blood',
provider_config={"provider": "openrouter", "model": "openai/gpt-5.2"}
)
Free models are available with the :free suffix (e.g., meta-llama/llama-4-maverick:free).
Custom Base URL Configuration
base_urls = {
'openai': 'https://openai-proxy.com/v1/chat/completions',
'anthropic': 'https://anthropic-proxy.com/v1/messages',
'qwen': 'https://qwen-proxy.com/compatible-mode/v1/chat/completions'
}
result = interactive_consensus_annotation(
marker_genes=marker_genes,
species='human',
models=['gpt-5.2', 'claude-sonnet-4-5-20250929', 'qwen3-max'],
api_keys=your_api_keys,
base_urls=base_urls
)
For Qwen models, endpoint selection between international and domestic endpoints is handled automatically.
Scanpy/AnnData Integration
import scanpy as sc
import mllmcelltype as mct
# After standard preprocessing and clustering...
sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')
marker_genes = {}
for cluster in adata.obs['leiden'].unique():
genes = sc.get.rank_genes_groups_df(adata, group=cluster)['names'].tolist()[:20]
marker_genes[cluster] = genes
# Single model annotation
annotations = mct.annotate_clusters(
marker_genes=marker_genes,
species='human',
provider='openai',
model='gpt-5.2'
)
adata.obs['cell_type'] = adata.obs['leiden'].astype(str).map(annotations)
# Or multi-model consensus
consensus_results = mct.interactive_consensus_annotation(
marker_genes=marker_genes,
species='human',
models=['gpt-5.2', 'claude-sonnet-4-5-20250929', 'gemini-3-pro'],
consensus_threshold=0.7
)
adata.obs['consensus_cell_type'] = adata.obs['leiden'].astype(str).map(consensus_results["consensus"])
adata.obs['consensus_proportion'] = adata.obs['leiden'].astype(str).map(consensus_results["consensus_proportion"])
adata.obs['entropy'] = adata.obs['leiden'].astype(str).map(consensus_results["entropy"])
See the examples directory for complete workflow examples.
Contributing
We welcome contributions. Please submit issues or pull requests on our GitHub repository.
License
MIT License
Citation
If you use mLLMCelltype in your research, please cite:
@article{Yang2025.04.10.647852,
author = {Yang, Chen and Zhang, Xianyang and Chen, Jun},
title = {Large Language Model Consensus Substantially Improves the Cell Type Annotation Accuracy for scRNA-seq Data},
elocation-id = {2025.04.10.647852},
year = {2025},
doi = {10.1101/2025.04.10.647852},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2025/04/17/2025.04.10.647852},
journal = {bioRxiv}
}
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file mllmcelltype-2.0.4.tar.gz.
File metadata
- Download URL: mllmcelltype-2.0.4.tar.gz
- Upload date:
- Size: 87.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a8fc0145b31dba5bc3426a9097811d53de5325d289233b084a27fb355d73bd2f
|
|
| MD5 |
1a79455f555674bd739cb69a80014633
|
|
| BLAKE2b-256 |
721cee3ff1510efe17fc3c72c815c88c5f52e30286e98f32b600901ec23eac04
|
File details
Details for the file mllmcelltype-2.0.4-py3-none-any.whl.
File metadata
- Download URL: mllmcelltype-2.0.4-py3-none-any.whl
- Upload date:
- Size: 63.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
19087c04ae4df80ceca33a5cf9f6fca0e1c2d0aa41be8eb5b5a66c8f05c1d3eb
|
|
| MD5 |
e72ea26302696e34c97d168f40d37694
|
|
| BLAKE2b-256 |
ef6d25c4c22dd7b1fbc82589082d5da142d1a20edfc4f5539b565f081b987b96
|