DNA RAG

Analyse your personal DNA data using Large Language Models.

⚠️ Not medical advice. This tool is for educational and research purposes only. Do not make health decisions based on its output. Always consult a qualified healthcare provider or genetic counselor for medical interpretation of genetic data.

Try it live on Hugging Face Spaces — bring your own API key from DeepSeek or any OpenAI-compatible provider.

💡 Cost: two days of active testing against the OpenAI API cost less than $0.01 in tokens.

Screenshot: OpenAI API usage — $0.00 for 21 requests.

DNA RAG is a Python pipeline that answers questions about personal genetic data from consumer DNA testing services (23andMe, AncestryDNA, MyHeritage, VCF). It uses a two-step LLM approach, sketched below:

  1. SNP identification — the LLM determines which genetic variants (SNPs) are relevant to the user's question.
  2. Interpretation — the user's DNA file is filtered for those variants, and the LLM interprets the matched genotypes.
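
Conceptually, the pipeline is two LLM calls around a filtering step. A minimal sketch of the flow — the function names here are illustrative, not the package's internal API:

from pathlib import Path

def analyze(question: str, dna_file: Path) -> str:
    # Step 1: ask the LLM which variants (RSIDs) relate to the question
    rsids = llm_identify_snps(question)           # e.g. ["rs4988235", "rs182549"]
    # Filter the raw DNA file down to just those variants
    genotypes = filter_dna_file(dna_file, rsids)  # e.g. {"rs4988235": "AA"}
    # Step 2: ask the LLM to interpret only the matched genotypes
    return llm_interpret(question, genotypes)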

Quick Start

1. Install

# Engine only (no FastAPI, no Streamlit)
pip install dna-rag

# With Streamlit UI
pip install dna-rag[ui]

# With API server
pip install dna-rag[api]

# Everything
pip install dna-rag[api,ui,rag]

Development (from source):

# Core development dependencies
pip install -e ".[dev]"

# Development plus API server and Streamlit UI
pip install -e ".[dev,api,ui]"

2. Configure

cp .env.example .env

Edit .env — pick your provider:

DeepSeek (default):

DNA_RAG_LLM_PROVIDER=deepseek
DNA_RAG_LLM_API_KEY=your-deepseek-key
DNA_RAG_LLM_MODEL=deepseek-r1:free
DNA_RAG_LLM_BASE_URL=https://api.deepseek.com/v1

OpenAI (or any OpenAI-compatible API):

DNA_RAG_LLM_PROVIDER=openai_compat
DNA_RAG_LLM_API_KEY=sk-your-openai-key
DNA_RAG_LLM_MODEL=gpt-4o-mini
DNA_RAG_LLM_BASE_URL=https://api.openai.com/v1

The openai_compat provider works with any API that implements the OpenAI /chat/completions format. Only OpenAI and DeepSeek have been tested with real DNA data.
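
As an example of that flexibility, an untested but plausible configuration for a local OpenAI-compatible server (values illustrative; shown here for Ollama's OpenAI-compatible endpoint):

DNA_RAG_LLM_PROVIDER=openai_compat
DNA_RAG_LLM_API_KEY=not-needed-for-local-servers
DNA_RAG_LLM_MODEL=llama3
DNA_RAG_LLM_BASE_URL=http://localhost:11434/v1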

Per-step LLM (optional) — use a different model for the interpretation step:

# Interpretation step overrides (falls back to primary if not set)
DNA_RAG_LLM_INTERP_PROVIDER=openai_compat
DNA_RAG_LLM_INTERP_API_KEY=sk-your-openai-key
DNA_RAG_LLM_INTERP_MODEL=gpt-4o-mini
DNA_RAG_LLM_INTERP_BASE_URL=https://api.openai.com/v1

3. Run Tests

# All tests (194 tests, ~82% coverage)
pytest

# Quick run without coverage
pytest --override-ini="addopts=-v" --no-header

# Only unit tests
pytest tests/unit/ -v

# Only API tests
pytest tests/api/ -v

# Only integration tests
pytest tests/integration/ -v

# Specific module
pytest tests/test_vcf_parser.py -v
pytest tests/test_polygenic.py -v
pytest tests/test_snp_database.py -v

4. Lint & Type Check

ruff check src/ tests/
mypy src/dna_rag/ --exclude vector_store.py

5. Use the CLI

# Single question
dna-rag ask --dna-file path/to/genome.csv --question "lactose tolerance"

# JSON output
dna-rag ask --dna-file path/to/genome.csv --question "lactose tolerance" --output-format json

# Interactive session
dna-rag interactive --dna-file path/to/genome.csv

6. Run the API Server

# Direct
dna-rag-api

# Or via Docker
make docker-build
make docker-up

API available at http://localhost:8000:

# Health check
curl http://localhost:8000/health

# Analyze (with file upload)
curl -X POST http://localhost:8000/api/v1/analyze \
  -F "file=@genome_data.csv" \
  -F "question=lactose tolerance"

# Supported formats
curl http://localhost:8000/api/v1/formats
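
The same analyze call from Python, using requests (a sketch built on the curl example above — the multipart field names file and question come straight from it; the response schema is documented in docs/API.md):

import requests

with open("genome_data.csv", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/api/v1/analyze",
        files={"file": f},                        # multipart file upload
        data={"question": "lactose tolerance"},   # plain form field
        timeout=120,
    )
resp.raise_for_status()
print(resp.json())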

Architecture

graph LR
    Q["User question"] --> S1["Step 1: LLM identifies SNPs"]
    S1 --> F["Filter DNA file by RSIDs"]
    F --> S2["Step 2: LLM interprets genotypes"]
    S2 --> R["AnalysisResult"]

    DNA["DNA file<br/>(23andMe / Ancestry / MyHeritage / VCF)"] --> F

Key Design Principles

  • LLM-agnostic — each pipeline step can use a different LLM provider via Python Protocols (a minimal sketch follows this list)
  • Pluggable — cache backends, LLM providers, and DNA parsers are all injected via the constructor
  • Structured output — Pydantic models validate LLM responses and pipeline results
  • Lightweight core — only 7 runtime dependencies; heavy libraries (chromadb, sentence-transformers) sit behind the [rag] extra
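
The Protocol approach means any object with the right method shape can be injected — no inheritance required. A minimal sketch of what such a protocol can look like (illustrative; the package's actual protocol definitions live under dna_rag/llm/ and may differ):

from typing import Protocol

class LLMProvider(Protocol):
    """Anything exposing complete() can serve as a provider."""
    def complete(self, prompt: str) -> str: ...

class FakeLLM:
    """A stub provider for tests — no network calls."""
    def complete(self, prompt: str) -> str:
        return '{"rsids": ["rs4988235"]}'

# Structural typing: FakeLLM never subclasses LLMProvider,
# yet it satisfies the protocol.
provider: LLMProvider = FakeLLM()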

Python API

from pathlib import Path
from dna_rag import DNAAnalysisEngine, Settings
from dna_rag.llm.deepseek import DeepSeekProvider
from dna_rag.cache import InMemoryCache

settings = Settings()  # reads DNA_RAG_* env vars
engine = DNAAnalysisEngine(
    snp_llm=DeepSeekProvider(settings),
    cache=InMemoryCache(),
)

result = engine.analyze("lactose tolerance", Path("genome_data.csv"))
print(result.interpretation)
print(f"Matched {result.snp_count_matched}/{result.snp_count_requested} SNPs")

Per-Step LLM Selection

from dna_rag.llm.deepseek import DeepSeekProvider
from dna_rag.llm.openai_compat import OpenAICompatProvider
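
# snp_settings / interp_settings: assumed to be two separately configured
# Settings instances, one per provider (construction omitted here)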

engine = DNAAnalysisEngine(
    snp_llm=DeepSeekProvider(snp_settings),           # reasoning model
    interpretation_llm=OpenAICompatProvider(interp_settings),  # cheaper model
    cache=InMemoryCache(),
)

Polygenic Risk Scores

from dna_rag.polygenic import PolygenicScoreCalculator
from dna_rag.parsers.detector import detect_and_parse

df = detect_and_parse(Path("genome_data.csv"))
calc = PolygenicScoreCalculator()
result = calc.calculate("alzheimers_risk", df)
print(result.interpretation)

SNP Validation

from dna_rag.snp_database import SNPDatabase

db = SNPDatabase()
info = db.validate_rsid("rs429358")
print(f"{info.rsid}: gene={info.gene}, chr={info.chromosome}")

Supported DNA Formats

Input files are tabular data (TSV/CSV) exported by DNA testing services. The format is auto-detected from the file content (header), not from the file extension.

| Service | File type | Delimiter | Example file |
|---|---|---|---|
| VCF | .vcf, .vcf.gz | Tab | genome.vcf |
| 23andMe | .txt (TSV) | Tab | genome_John_Doe.txt |
| AncestryDNA | .txt (TSV) | Tab | AncestryDNA_raw.txt |
| MyHeritage | .csv | Comma | MyHeritage_raw.csv |

Tested with real DNA data purchased from MyHeritage.
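
To illustrate what the content-based detection keys on, here is roughly what the first lines of these exports look like (abridged and illustrative — real files vary by export version):

# 23andMe-style TSV: comment lines, then tab-separated columns
# rsid	chromosome	position	genotype
rs4988235	2	136608646	AA

# MyHeritage-style CSV: quoted, comma-separated columns
"RSID","CHROMOSOME","POSITION","RESULT"
"rs4988235","2","136608646","AA"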

Configuration

All settings via DNA_RAG_-prefixed env vars or .env file.

Primary LLM (SNP identification + default)

| Variable | Default | Description |
|---|---|---|
| DNA_RAG_LLM_API_KEY | required | API key for the LLM provider |
| DNA_RAG_LLM_PROVIDER | deepseek | deepseek or openai_compat |
| DNA_RAG_LLM_MODEL | deepseek-r1:free | Model name |
| DNA_RAG_LLM_BASE_URL | https://api.deepseek.com/v1 | API base URL |
| DNA_RAG_LLM_TEMPERATURE | 0.0 | Sampling temperature (0.0–2.0) |
| DNA_RAG_LLM_MAX_TOKENS | — | Max response tokens (provider default if unset) |
| DNA_RAG_LLM_TIMEOUT | 60.0 | Request timeout in seconds |
| DNA_RAG_LLM_MAX_RETRIES | 3 | Retries on connection/rate-limit errors (0–10) |

Interpretation LLM (optional, overrides primary for step 2)

If not set, the primary LLM settings are used for both steps.

| Variable | Default | Description |
|---|---|---|
| DNA_RAG_LLM_INTERP_PROVIDER | — | deepseek or openai_compat |
| DNA_RAG_LLM_INTERP_API_KEY | — | API key (falls back to primary) |
| DNA_RAG_LLM_INTERP_MODEL | — | Model name (falls back to primary) |
| DNA_RAG_LLM_INTERP_BASE_URL | — | API base URL (falls back to primary) |
| DNA_RAG_LLM_INTERP_TEMPERATURE | 0.0 | Sampling temperature |
| DNA_RAG_LLM_INTERP_MAX_TOKENS | — | Max response tokens |
| DNA_RAG_LLM_INTERP_TIMEOUT | 60.0 | Request timeout in seconds |
| DNA_RAG_LLM_INTERP_MAX_RETRIES | 3 | Retries on connection/rate-limit errors |

Cache, Logging, Parser

| Variable | Default | Description |
|---|---|---|
| DNA_RAG_CACHE_BACKEND | memory | memory or none |
| DNA_RAG_CACHE_MAX_SIZE | 1000 | Max cached entries |
| DNA_RAG_CACHE_TTL_SECONDS | 3600 | Cache entry lifetime in seconds |
| DNA_RAG_LOG_LEVEL | INFO | Logging level |
| DNA_RAG_LOG_FORMAT | console | console or json |
| DNA_RAG_DEFAULT_DNA_FORMAT | auto | auto, 23andme, ancestrydna, or myheritage |

Project Structure

src/dna_rag/
    engine.py            # Core 2-step LLM pipeline
    config.py            # Pydantic Settings
    models.py            # Data models (SNPResult, AnalysisResult)
    exceptions.py        # Exception hierarchy
    polygenic.py         # Polygenic risk score calculator
    snp_database.py      # NCBI dbSNP validation client
    vector_store.py      # Optional ChromaDB RAG (requires [rag])
    cli.py               # Click CLI
    llm/                 # LLM protocol + providers (DeepSeek, OpenAI-compat)
    cache/               # Cache protocol + in-memory backend
    parsers/             # DNA parsers (23andMe, AncestryDNA, MyHeritage, VCF)
    api/                 # FastAPI server
        routes/          #   REST + WebSocket endpoints
        middleware/      #   Auth, rate-limit, request-id
        services/        #   Analysis, file management, async jobs
        schemas/         #   Request/response models
tests/
    unit/                # Unit tests for all modules
    api/                 # API endpoint tests
    integration/         # CLI + engine integration tests
    test_vcf_parser.py   # VCF parser tests
    test_polygenic.py    # Polygenic calculator tests
    test_snp_database.py # SNP database client tests

Makefile

make help          # Show all targets
make install       # pip install -e ".[dev,api]"
make test          # pytest
make lint          # ruff check
make typecheck     # mypy
make check         # lint + typecheck + test
make serve         # Run API server
make docker-build  # Build Docker image
make docker-up     # Start via docker-compose

API Documentation

  • docs/API.md — endpoint reference, request/response examples
  • ARCHITECTURE.md — FastAPI design document and target architecture

Interactive docs are available at http://localhost:8000/docs when the server is running.

Privacy & Data

Your genetic data is sensitive. Understand how it is processed:

  • You provide your own API key. DNA data is sent to your chosen LLM provider and is subject to that provider's privacy policy and data retention rules. Review your provider's terms: OpenAI Privacy Policy, DeepSeek Privacy Policy.
  • No data is stored by this tool. DNA RAG does not collect, store, or transmit your genetic data to any third party. All processing happens in your session.
  • Every response includes a medical disclaimer (configurable via DNA_RAG_MEDICAL_DISCLAIMER) reminding the user that genetic predisposition is not deterministic and recommending consultation with a healthcare professional. The LLM translates the disclaimer into the response language.

NCBI Verification

When enabled, each SNP identified by the LLM is verified against real biomedical databases before interpretation:

LLM identifies SNPs → dbSNP confirms they exist → ClinVar adds clinical data → LLM interprets with verified context

What it does

| Step | Source | Data |
|---|---|---|
| 1. dbSNP lookup | NCBI dbSNP | Confirms the RSID exists, corrects the gene name, retrieves alleles and MAF |
| 2. ClinVar lookup | NCBI ClinVar | Clinical significance (Benign / Pathogenic / VUS), associated trait |
| 3. Gene correction | dbSNP → engine | If the LLM claimed a wrong gene, it is silently replaced with the authoritative one |
| 4. Prompt injection | engine → LLM | A VERIFIED DATA block with MAF, ClinVar, and gene is added to the interpretation prompt |
| 5. UI display | engine → UI | ClinVar verification expander shows the LLM opinion and NCBI data side by side |

How to enable

Streamlit UI — use the 🔬 NCBI verification toggle in the sidebar. Switching it on/off rebuilds the engine instantly, no restart needed.

Environment variable — set before starting the app:

DNA_RAG_VALIDATION_ENABLED=true   # enable NCBI verification by default

Python API:

from dna_rag import DNAAnalysisEngine, Settings
from dna_rag.llm.deepseek import DeepSeekProvider
from dna_rag.snp_database import SNPDatabase

settings = Settings()
engine = DNAAnalysisEngine(
    snp_llm=DeepSeekProvider(settings),
    snp_database=SNPDatabase(),  # passing a SNPDatabase enables NCBI verification
)

What the user sees

| Toggle state | Metric column | ClinVar expander | Speed |
|---|---|---|---|
| OFF | Validated: Disabled | Hidden | Fast (~2–5 s) |
| ON | Validated: ✅ NCBI | Shows per-SNP clinical significance, trait, MAF | Slower (~5–15 s) |

Note: NCBI E-utilities rate limit is 3 requests/second without an API key. For batch validation of many SNPs this adds ~3-10 seconds per query.
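
The dbSNP half of the lookup can be reproduced directly against NCBI's public E-utilities — a standalone sketch that bypasses the package's SNPDatabase client and shows only the underlying request:

import requests

# esummary for dbSNP; the id is the RSID without the "rs" prefix
resp = requests.get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi",
    params={"db": "snp", "id": "429358", "retmode": "json"},
    timeout=10,
)
resp.raise_for_status()
doc = resp.json()["result"]["429358"]
print(doc.get("chrpos"), doc.get("genes"))  # inspect doc.keys() for all fields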

Configuration

| Variable | Default | Description |
|---|---|---|
| DNA_RAG_VALIDATION_ENABLED | false | Enable NCBI dbSNP + ClinVar verification |
| DNA_RAG_VALIDATION_TIMEOUT | 10.0 | Timeout per NCBI request in seconds |
| DNA_RAG_VALIDATION_RATE_LIMIT_DELAY | 0.34 | Delay between NCBI requests (seconds) |

Guardrails

This tool is not a medical device and does not replace professional genetic counseling. Built-in safeguards:

  • Structured LLM output — Pydantic models validate every LLM response; malformed or unexpected output is rejected, not silently passed through.
  • RSID format validation — only SNP identifiers matching the rs* format are accepted; arbitrary text from the LLM is filtered out (a minimal sketch of this check follows the list)
  • NCBI verification — when enabled (see NCBI Verification above), each LLM-identified RSID is verified against NCBI dbSNP and ClinVar. Invalid RSIDs are removed; gene names are corrected; clinical significance is surfaced to the user.
  • Anti-hallucination prompt — the interpretation LLM receives a VERIFIED DATA block from NCBI and CRITICAL RULES that forbid inventing gene associations not supported by evidence.
  • Medical disclaimer in every response — a configurable disclaimer is appended to each interpretation, translated into the user's language.
  • No diagnosis or treatment recommendations — the LLM prompt asks for genotype interpretation only, not medical advice.
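
The RSID filter is simple enough to show inline. A minimal sketch of the kind of check described above (illustrative — not the package's exact code):

import re

_RSID_RE = re.compile(r"^rs\d+$")

def keep_valid_rsids(candidates: list[str]) -> list[str]:
    """Drop anything the LLM returned that is not a plain rsNNN identifier."""
    return [c for c in candidates if _RSID_RE.match(c.strip().lower())]

print(keep_valid_rsids(["rs429358", "APOE gene", "rs7412 "]))
# ['rs429358', 'rs7412']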

License

MIT
