DNA RAG
DNA analysis RAG pipeline powered by LLMs
Analyse your personal DNA data using Large Language Models.
⚠️ Not medical advice. This tool is for educational and research purposes only. Do not make health decisions based on its output. Always consult a qualified healthcare provider or genetic counselor for medical interpretation of genetic data.
Try it live on Hugging Face Spaces — bring your own API key from DeepSeek or any OpenAI-compatible provider.
💡 Cost: two days of active testing with the OpenAI API cost less than $0.01 in tokens.
DNA RAG is a Python pipeline that answers questions about personal genetic data from consumer DNA testing services (23andMe, AncestryDNA, MyHeritage, VCF). It uses a two-step LLM approach:
- SNP identification — the LLM determines which genetic variants (SNPs) are relevant to the user's question.
- Interpretation — the user's DNA file is filtered for those variants, and the LLM interprets the matched genotypes.
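The two steps can be sketched in a few lines of plain Python. This is an illustrative stub, not the actual engine: the helper names are invented, and the hard-coded question-to-RSID mapping stands in for a real LLM call.

```python
# Illustrative sketch of the two-step pipeline; the real implementation
# lives in dna_rag.engine and calls an actual LLM in both steps.

def identify_snps(question: str) -> list[str]:
    """Step 1 stub: the real pipeline asks an LLM which RSIDs are relevant."""
    return ["rs4988235"] if "lactose" in question.lower() else []

def filter_by_rsids(dna_rows: list[dict], rsids: list[str]) -> list[dict]:
    """Keep only the user's genotype rows for the requested variants."""
    wanted = set(rsids)
    return [row for row in dna_rows if row["rsid"] in wanted]

def interpret(matches: list[dict]) -> str:
    """Step 2 stub: the real pipeline sends matched genotypes to an LLM."""
    return "; ".join(f"{m['rsid']}={m['genotype']}" for m in matches) or "no matches"

dna_rows = [
    {"rsid": "rs4988235", "genotype": "AA"},  # lactase persistence variant
    {"rsid": "rs429358", "genotype": "CT"},
]
print(interpret(filter_by_rsids(dna_rows, identify_snps("lactose tolerance"))))
# rs4988235=AA
```

The key property this illustrates: the full DNA file never reaches the LLM, only the handful of genotype rows matched in the filter step.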
Quick Start
1. Install
# Engine only (no FastAPI, no Streamlit)
pip install dna-rag
# With Streamlit UI
pip install dna-rag[ui]
# With API server
pip install dna-rag[api]
# Everything
pip install dna-rag[api,ui,rag]
Development (from source):
pip install -e ".[dev]"
pip install -e ".[dev,api,ui]"
2. Configure
cp .env.example .env
Edit .env — pick your provider:
DeepSeek (default):
DNA_RAG_LLM_PROVIDER=deepseek
DNA_RAG_LLM_API_KEY=your-deepseek-key
DNA_RAG_LLM_MODEL=deepseek-r1:free
DNA_RAG_LLM_BASE_URL=https://api.deepseek.com/v1
OpenAI (or any OpenAI-compatible API):
DNA_RAG_LLM_PROVIDER=openai_compat
DNA_RAG_LLM_API_KEY=sk-your-openai-key
DNA_RAG_LLM_MODEL=gpt-4o-mini
DNA_RAG_LLM_BASE_URL=https://api.openai.com/v1
The openai_compat provider works with any API that implements the OpenAI /chat/completions format. Only OpenAI and DeepSeek have been tested with real DNA data.
Per-step LLM (optional) — use a different model for the interpretation step:
# Interpretation step overrides (falls back to primary if not set)
DNA_RAG_LLM_INTERP_PROVIDER=openai_compat
DNA_RAG_LLM_INTERP_API_KEY=sk-your-openai-key
DNA_RAG_LLM_INTERP_MODEL=gpt-4o-mini
DNA_RAG_LLM_INTERP_BASE_URL=https://api.openai.com/v1
3. Run Tests
# All tests (194 tests, ~82% coverage)
pytest
# Quick run without coverage
pytest --override-ini="addopts=-v" --no-header
# Only unit tests
pytest tests/unit/ -v
# Only API tests
pytest tests/api/ -v
# Only integration tests
pytest tests/integration/ -v
# Specific module
pytest tests/test_vcf_parser.py -v
pytest tests/test_polygenic.py -v
pytest tests/test_snp_database.py -v
4. Lint & Type Check
ruff check src/ tests/
mypy src/dna_rag/ --exclude vector_store.py
5. Use the CLI
# Single question
dna-rag ask --dna-file path/to/genome.csv --question "lactose tolerance"
# JSON output
dna-rag ask --dna-file path/to/genome.csv --question "lactose tolerance" --output-format json
# Interactive session
dna-rag interactive --dna-file path/to/genome.csv
6. Run the API Server
# Direct
dna-rag-api
# Or via Docker
make docker-build
make docker-up
The API is available at http://localhost:8000:
# Health check
curl http://localhost:8000/health
# Analyze (with file upload)
curl -X POST http://localhost:8000/api/v1/analyze \
-F "file=@genome_data.csv" \
-F "question=lactose tolerance"
# Supported formats
curl http://localhost:8000/api/v1/formats
Architecture
graph LR
Q["User question"] --> S1["Step 1: LLM identifies SNPs"]
S1 --> F["Filter DNA file by RSIDs"]
F --> S2["Step 2: LLM interprets genotypes"]
S2 --> R["AnalysisResult"]
DNA["DNA file<br/>(23andMe / Ancestry / MyHeritage / VCF)"] --> F
Key Design Principles
- LLM-agnostic — each pipeline step can use a different LLM provider via Python Protocols
- Pluggable — cache backends, LLM providers, and DNA parsers are all injected via constructor
- Structured output — Pydantic models validate LLM responses and pipeline results
- Lightweight core — only 7 runtime deps; heavy libs (chromadb, sentence-transformers) sit behind the [rag] extra
Python API
from pathlib import Path
from dna_rag import DNAAnalysisEngine, Settings
from dna_rag.llm.deepseek import DeepSeekProvider
from dna_rag.cache import InMemoryCache
settings = Settings() # reads DNA_RAG_* env vars
engine = DNAAnalysisEngine(
snp_llm=DeepSeekProvider(settings),
cache=InMemoryCache(),
)
result = engine.analyze("lactose tolerance", Path("genome_data.csv"))
print(result.interpretation)
print(f"Matched {result.snp_count_matched}/{result.snp_count_requested} SNPs")
Per-Step LLM Selection
from dna_rag import DNAAnalysisEngine
from dna_rag.cache import InMemoryCache
from dna_rag.llm.deepseek import DeepSeekProvider
from dna_rag.llm.openai_compat import OpenAICompatProvider
# snp_settings / interp_settings are Settings instances configured per step
engine = DNAAnalysisEngine(
    snp_llm=DeepSeekProvider(snp_settings),                    # reasoning model
    interpretation_llm=OpenAICompatProvider(interp_settings),  # cheaper model
    cache=InMemoryCache(),
)
Polygenic Risk Scores
from pathlib import Path
from dna_rag.polygenic import PolygenicScoreCalculator
from dna_rag.parsers.detector import detect_and_parse
df = detect_and_parse(Path("genome_data.csv"))
calc = PolygenicScoreCalculator()
result = calc.calculate("alzheimers_risk", df)
print(result.interpretation)
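Under the hood, a polygenic risk score is typically an additive weighted sum over risk alleles. The sketch below illustrates only that arithmetic; the RSIDs and weights are made up, and the shipped PolygenicScoreCalculator may use a different model.

```python
# Additive polygenic-score arithmetic (illustrative; RSIDs and weights
# are invented for the example, not taken from the library's data).

WEIGHTS = {"rs0000001": 0.30, "rs0000002": 0.15}   # effect weight per risk allele
RISK_ALLELES = {"rs0000001": "A", "rs0000002": "T"}

def score(genotypes: dict[str, str]) -> float:
    """Sum weight * (number of risk alleles carried) over known variants."""
    total = 0.0
    for rsid, weight in WEIGHTS.items():
        geno = genotypes.get(rsid, "")              # e.g. "AA", "CT", or absent
        total += weight * geno.count(RISK_ALLELES[rsid])
    return total

print(score({"rs0000001": "AA", "rs0000002": "CT"}))  # 0.30*2 + 0.15*1 = 0.75
```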
SNP Validation
from dna_rag.snp_database import SNPDatabase
db = SNPDatabase()
info = db.validate_rsid("rs429358")
print(f"{info.rsid}: gene={info.gene}, chr={info.chromosome}")
Supported DNA Formats
Input files are tabular data (TSV/CSV) exported by DNA testing services. Format is auto-detected by file content (header), not extension.
| Service | File type | Delimiter | Example file |
|---|---|---|---|
| VCF | .vcf, .vcf.gz | Tab | genome.vcf |
| 23andMe | .txt (TSV) | Tab | genome_John_Doe.txt |
| AncestryDNA | .txt (TSV) | Tab | AncestryDNA_raw.txt |
| MyHeritage | .csv | Comma | MyHeritage_raw.csv |
Tested with real DNA data purchased from MyHeritage.
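Header-based detection can be approximated as below. This is a simplified sketch with assumed header markers; the real logic lives in dna_rag.parsers.detector and handles more edge cases.

```python
def sniff_format(first_lines: list[str]) -> str:
    """Guess the DNA file format from its header lines, not the extension.

    The marker strings are plausible assumptions (VCF's ##fileformat line,
    vendor names in the comment header, comma-delimited MyHeritage CSV);
    the shipped detector may key on different details.
    """
    head = "\n".join(first_lines)
    if head.startswith("##fileformat=VCF"):
        return "vcf"
    if "AncestryDNA" in head:
        return "ancestrydna"
    if "23andMe" in head:
        return "23andme"
    if first_lines and first_lines[0].count(",") >= 3:  # comma-delimited export
        return "myheritage"
    return "unknown"

print(sniff_format(["##fileformat=VCFv4.2"]))                 # vcf
print(sniff_format(['"RSID","CHROMOSOME","POSITION","RESULT"']))  # myheritage
```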
Configuration
All settings are supplied via DNA_RAG_-prefixed environment variables or a .env file.
Primary LLM (SNP identification + default)
| Variable | Default | Description |
|---|---|---|
| DNA_RAG_LLM_API_KEY | required | API key for the LLM provider |
| DNA_RAG_LLM_PROVIDER | deepseek | deepseek or openai_compat |
| DNA_RAG_LLM_MODEL | deepseek-r1:free | Model name |
| DNA_RAG_LLM_BASE_URL | https://api.deepseek.com/v1 | API base URL |
| DNA_RAG_LLM_TEMPERATURE | 0.0 | Sampling temperature (0.0–2.0) |
| DNA_RAG_LLM_MAX_TOKENS | — | Max response tokens (provider default if unset) |
| DNA_RAG_LLM_TIMEOUT | 60.0 | Request timeout in seconds |
| DNA_RAG_LLM_MAX_RETRIES | 3 | Retries on connection/rate-limit errors (0–10) |
Interpretation LLM (optional, overrides primary for step 2)
If not set, the primary LLM settings are used for both steps.
| Variable | Default | Description |
|---|---|---|
| DNA_RAG_LLM_INTERP_PROVIDER | — | deepseek or openai_compat |
| DNA_RAG_LLM_INTERP_API_KEY | — | API key (falls back to primary) |
| DNA_RAG_LLM_INTERP_MODEL | — | Model name (falls back to primary) |
| DNA_RAG_LLM_INTERP_BASE_URL | — | API base URL (falls back to primary) |
| DNA_RAG_LLM_INTERP_TEMPERATURE | 0.0 | Sampling temperature |
| DNA_RAG_LLM_INTERP_MAX_TOKENS | — | Max response tokens |
| DNA_RAG_LLM_INTERP_TIMEOUT | 60.0 | Request timeout in seconds |
| DNA_RAG_LLM_INTERP_MAX_RETRIES | 3 | Retries on connection/rate-limit errors |
Cache, Logging, Parser
| Variable | Default | Description |
|---|---|---|
| DNA_RAG_CACHE_BACKEND | memory | memory or none |
| DNA_RAG_CACHE_MAX_SIZE | 1000 | Max cached entries |
| DNA_RAG_CACHE_TTL_SECONDS | 3600 | Cache entry lifetime in seconds |
| DNA_RAG_LOG_LEVEL | INFO | Logging level |
| DNA_RAG_LOG_FORMAT | console | console or json |
| DNA_RAG_DEFAULT_DNA_FORMAT | auto | auto, 23andme, ancestrydna, or myheritage |
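The memory backend implied by DNA_RAG_CACHE_MAX_SIZE and DNA_RAG_CACHE_TTL_SECONDS is a bounded TTL cache. Below is a minimal sketch of those semantics with an injectable clock; it is illustrative only, and the shipped InMemoryCache may be implemented differently.

```python
import time
from collections import OrderedDict

class TTLCache:
    """Bounded in-memory cache with per-entry expiry (illustrative)."""

    def __init__(self, max_size: int = 1000, ttl: float = 3600.0,
                 clock=time.monotonic):
        self.max_size = max_size
        self.ttl = ttl
        self.clock = clock
        self._data: OrderedDict[str, tuple[float, object]] = OrderedDict()

    def set(self, key: str, value: object) -> None:
        self._data[key] = (self.clock() + self.ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_size:   # evict oldest entry
            self._data.popitem(last=False)

    def get(self, key: str):
        item = self._data.get(key)
        if item is None:
            return None
        expires, value = item
        if self.clock() >= expires:              # expired: drop and miss
            del self._data[key]
            return None
        return value

# Deterministic demo with a fake clock
now = [0.0]
cache = TTLCache(max_size=2, ttl=10.0, clock=lambda: now[0])
cache.set("q1", "answer")
print(cache.get("q1"))   # answer
now[0] = 11.0
print(cache.get("q1"))   # None (past the 10 s TTL)
```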
Project Structure
src/dna_rag/
engine.py # Core 2-step LLM pipeline
config.py # Pydantic Settings
models.py # Data models (SNPResult, AnalysisResult)
exceptions.py # Exception hierarchy
polygenic.py # Polygenic risk score calculator
snp_database.py # NCBI dbSNP validation client
vector_store.py # Optional ChromaDB RAG (requires [rag])
cli.py # Click CLI
llm/ # LLM protocol + providers (DeepSeek, OpenAI-compat)
cache/ # Cache protocol + in-memory backend
parsers/ # DNA parsers (23andMe, AncestryDNA, MyHeritage, VCF)
api/ # FastAPI server
routes/ # REST + WebSocket endpoints
middleware/ # Auth, rate-limit, request-id
services/ # Analysis, file management, async jobs
schemas/ # Request/response models
tests/
unit/ # Unit tests for all modules
api/ # API endpoint tests
integration/ # CLI + engine integration tests
test_vcf_parser.py # VCF parser tests
test_polygenic.py # Polygenic calculator tests
test_snp_database.py # SNP database client tests
Makefile
make help # Show all targets
make install # pip install -e ".[dev,api]"
make test # pytest
make lint # ruff check
make typecheck # mypy
make check # lint + typecheck + test
make serve # Run API server
make docker-build # Build Docker image
make docker-up # Start via docker-compose
API Documentation
- docs/API.md — endpoint reference, request/response examples
- ARCHITECTURE.md — FastAPI design document and target architecture
Interactive docs are available at http://localhost:8000/docs when the server is running.
Privacy & Data
Your genetic data is sensitive. Understand how it is processed:
- You provide your own API key. DNA data is sent to your chosen LLM provider and is subject to that provider's privacy policy and data retention rules. Review your provider's terms: OpenAI Privacy Policy, DeepSeek Privacy Policy.
- No data is stored by this tool. DNA RAG does not collect, store, or transmit your genetic data to any third party. All processing happens in your session.
- Every response includes a medical disclaimer (configurable via DNA_RAG_MEDICAL_DISCLAIMER) reminding users that genetic predisposition is not deterministic and recommending consultation with a healthcare professional. The LLM translates it into the response language.
NCBI Verification
When enabled, each SNP identified by the LLM is verified against real biomedical databases before interpretation:
LLM identifies SNPs → dbSNP confirms they exist → ClinVar adds clinical data → LLM interprets with verified context
What it does
| Step | Source | Data |
|---|---|---|
| 1. dbSNP lookup | NCBI dbSNP | Confirms RSID exists, corrects gene name, retrieves alleles and MAF |
| 2. ClinVar lookup | NCBI ClinVar | Clinical significance (Benign / Pathogenic / VUS), associated trait |
| 3. Gene correction | dbSNP → engine | If the LLM claimed a wrong gene, it is silently replaced with the authoritative one |
| 4. Prompt injection | engine → LLM | A VERIFIED DATA block with MAF, ClinVar, and gene is added to the interpretation prompt |
| 5. UI display | engine → UI | ClinVar verification expander shows both LLM opinion and NCBI data side by side |
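The dbSNP lookups in the table above go through NCBI E-utilities. The snippet below only builds the esummary URL for one RSID and performs no network call; the actual client is dna_rag.snp_database.SNPDatabase, so treat this as a sketch of the request shape.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def dbsnp_esummary_url(rsid: str) -> str:
    """Build an E-utilities esummary URL for one dbSNP record.

    dbSNP is queried by the numeric part of the RSID (rs429358 -> 429358).
    """
    if not rsid.startswith("rs") or not rsid[2:].isdigit():
        raise ValueError(f"not a valid RSID: {rsid!r}")
    query = urlencode({"db": "snp", "id": rsid[2:], "retmode": "json"})
    return f"{EUTILS}/esummary.fcgi?{query}"

print(dbsnp_esummary_url("rs429358"))
# https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=snp&id=429358&retmode=json
```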
How to enable
Streamlit UI — use the 🔬 NCBI verification toggle in the sidebar. Switching it on or off rebuilds the engine instantly; no restart is needed.
Environment variable — set before starting the app:
DNA_RAG_VALIDATION_ENABLED=true # enable NCBI verification by default
Python API:
from dna_rag import DNAAnalysisEngine, Settings
from dna_rag.llm.deepseek import DeepSeekProvider
from dna_rag.snp_database import SNPDatabase
settings = Settings()  # reads DNA_RAG_* env vars
engine = DNAAnalysisEngine(
    snp_llm=DeepSeekProvider(settings),
    snp_database=SNPDatabase(),  # enables NCBI verification
)
What the user sees
| Toggle state | Metric column | ClinVar expander | Speed |
|---|---|---|---|
| OFF | Validated: Disabled | Hidden | Fast (~2-5s) |
| ON | Validated: ✅ NCBI | Shows per-SNP clinical significance, trait, MAF | Slower (~5-15s) |
Note: NCBI E-utilities rate limit is 3 requests/second without an API key. For batch validation of many SNPs this adds ~3-10 seconds per query.
Configuration
| Variable | Default | Description |
|---|---|---|
| DNA_RAG_VALIDATION_ENABLED | false | Enable NCBI dbSNP + ClinVar verification |
| DNA_RAG_VALIDATION_TIMEOUT | 10.0 | Timeout per NCBI request in seconds |
| DNA_RAG_VALIDATION_RATE_LIMIT_DELAY | 0.34 | Delay between NCBI requests (seconds) |
Guardrails
This tool is not a medical device and does not replace professional genetic counseling. Built-in safeguards:
- Structured LLM output — Pydantic models validate every LLM response; malformed or unexpected output is rejected, not silently passed through.
- RSID format validation — only SNP identifiers matching the rs* format are accepted; arbitrary text from the LLM is filtered out.
- NCBI verification — when enabled (see NCBI Verification above), each LLM-identified RSID is verified against NCBI dbSNP and ClinVar. Invalid RSIDs are removed; gene names are corrected; clinical significance is surfaced to the user.
- Anti-hallucination prompt — the interpretation LLM receives a VERIFIED DATA block from NCBI and CRITICAL RULES that forbid inventing gene associations not supported by evidence.
- Medical disclaimer in every response — a configurable disclaimer is appended to each interpretation, translated into the user's language.
- No diagnosis or treatment recommendations — the LLM prompt asks for genotype interpretation only, not medical advice.
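The RSID format check described above amounts to a one-line regex filter. This sketch is consistent with the rs* rule stated here, though the library's exact pattern may differ.

```python
import re

RSID_RE = re.compile(r"^rs\d+$")  # "rs" followed by digits only

def filter_rsids(candidates: list[str]) -> list[str]:
    """Keep only well-formed RSIDs; drop any free text the LLM emitted."""
    return [c for c in candidates if RSID_RE.fullmatch(c)]

print(filter_rsids(["rs4988235", "the LCT gene", "rs429358", "rs12x"]))
# ['rs4988235', 'rs429358']
```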
License
MIT