DNA RAG

Analyse your personal DNA data using Large Language Models.

⚠️ Not medical advice. This tool is for educational and research purposes only. Do not make health decisions based on its output. Always consult a qualified healthcare provider or genetic counselor for medical interpretation of genetic data.

Try it live on Hugging Face Spaces — bring your own API key from DeepSeek or any OpenAI-compatible provider.

💡 Cost: two days of active testing against the OpenAI API cost less than $0.01 in tokens.

Screenshot: OpenAI API usage — $0.00 for 21 requests.

DNA RAG is a Python pipeline that answers questions about personal genetic data from consumer DNA testing services (23andMe, AncestryDNA, MyHeritage, VCF). It uses a two-step LLM approach, sketched below:

  1. SNP identification — the LLM determines which genetic variants (SNPs) are relevant to the user's question.
  2. Interpretation — the user's DNA file is filtered for those variants, and the LLM interprets the matched genotypes.
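
Conceptually, the pipeline is two LLM calls around a filtering step. A minimal sketch of the flow — the function names here are illustrative, not the package's internal API:

from pathlib import Path

def analyze(question: str, dna_file: Path) -> str:
    # Step 1: ask the LLM which variants (RSIDs) relate to the question
    rsids = llm_identify_snps(question)           # e.g. ["rs4988235", "rs182549"]
    # Filter the raw DNA file down to just those variants
    genotypes = filter_dna_file(dna_file, rsids)  # e.g. {"rs4988235": "AA"}
    # Step 2: ask the LLM to interpret only the matched genotypes
    return llm_interpret(question, genotypes)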

Quick Start

1. Install

# Engine only (no FastAPI, no Streamlit)
pip install dna-rag

# With Streamlit UI
pip install dna-rag[ui]

# With API server
pip install dna-rag[api]

# Everything
pip install dna-rag[api,ui,rag]

Development (from source):

# Core development dependencies
pip install -e ".[dev]"

# Development plus API server and Streamlit UI
pip install -e ".[dev,api,ui]"

2. Configure

cp .env.example .env

Edit .env — pick your provider:

DeepSeek (default):

DNA_RAG_LLM_PROVIDER=deepseek
DNA_RAG_LLM_API_KEY=your-deepseek-key
DNA_RAG_LLM_MODEL=deepseek-r1:free
DNA_RAG_LLM_BASE_URL=https://api.deepseek.com/v1

OpenAI (or any OpenAI-compatible API):

DNA_RAG_LLM_PROVIDER=openai_compat
DNA_RAG_LLM_API_KEY=sk-your-openai-key
DNA_RAG_LLM_MODEL=gpt-4o-mini
DNA_RAG_LLM_BASE_URL=https://api.openai.com/v1

The openai_compat provider works with any API that implements the OpenAI /chat/completions format. Only OpenAI and DeepSeek have been tested with real DNA data.
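
As an example of that flexibility, an untested but plausible configuration for a local OpenAI-compatible server (values illustrative; shown here for Ollama's OpenAI-compatible endpoint):

DNA_RAG_LLM_PROVIDER=openai_compat
DNA_RAG_LLM_API_KEY=not-needed-for-local-servers
DNA_RAG_LLM_MODEL=llama3
DNA_RAG_LLM_BASE_URL=http://localhost:11434/v1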

Per-step LLM (optional) — use a different model for the interpretation step:

# Interpretation step overrides (falls back to primary if not set)
DNA_RAG_LLM_INTERP_PROVIDER=openai_compat
DNA_RAG_LLM_INTERP_API_KEY=sk-your-openai-key
DNA_RAG_LLM_INTERP_MODEL=gpt-4o-mini
DNA_RAG_LLM_INTERP_BASE_URL=https://api.openai.com/v1

3. Run Tests

# All tests (194 tests, ~82% coverage)
pytest

# Quick run without coverage
pytest --override-ini="addopts=-v" --no-header

# Only unit tests
pytest tests/unit/ -v

# Only API tests
pytest tests/api/ -v

# Only integration tests
pytest tests/integration/ -v

# Specific module
pytest tests/test_vcf_parser.py -v
pytest tests/test_polygenic.py -v
pytest tests/test_snp_database.py -v

4. Lint & Type Check

ruff check src/ tests/
mypy src/dna_rag/ --exclude vector_store.py

5. Use the CLI

# Single question
dna-rag ask --dna-file path/to/genome.csv --question "lactose tolerance"

# JSON output
dna-rag ask --dna-file path/to/genome.csv --question "lactose tolerance" --output-format json

# Interactive session
dna-rag interactive --dna-file path/to/genome.csv

6. Run the API Server

# Direct
dna-rag-api

# Or via Docker
make docker-build
make docker-up

API available at http://localhost:8000:

# Health check
curl http://localhost:8000/health

# Analyze (with file upload)
curl -X POST http://localhost:8000/api/v1/analyze \
  -F "file=@genome_data.csv" \
  -F "question=lactose tolerance"

# Supported formats
curl http://localhost:8000/api/v1/formats
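
The same analyze call from Python, using requests (a sketch built on the curl example above — the multipart field names file and question come straight from it; the response schema is documented in docs/API.md):

import requests

with open("genome_data.csv", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/api/v1/analyze",
        files={"file": f},                        # multipart file upload
        data={"question": "lactose tolerance"},   # plain form field
        timeout=120,
    )
resp.raise_for_status()
print(resp.json())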

Architecture

graph LR
    Q["User question"] --> S1["Step 1: LLM identifies SNPs"]
    S1 --> F["Filter DNA file by RSIDs"]
    F --> S2["Step 2: LLM interprets genotypes"]
    S2 --> R["AnalysisResult"]

    DNA["DNA file<br/>(23andMe / Ancestry / MyHeritage / VCF)"] --> F

Key Design Principles

  • LLM-agnostic — each pipeline step can use a different LLM provider via Python Protocols (a minimal sketch follows this list)
  • Pluggable — cache backends, LLM providers, and DNA parsers are all injected via the constructor
  • Structured output — Pydantic models validate LLM responses and pipeline results
  • Lightweight core — only 7 runtime dependencies; heavy libraries (chromadb, sentence-transformers) sit behind the [rag] extra
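
The Protocol approach means any object with the right method shape can be injected — no inheritance required. A minimal sketch of what such a protocol can look like (illustrative; the package's actual protocol definitions live under dna_rag/llm/ and may differ):

from typing import Protocol

class LLMProvider(Protocol):
    """Anything exposing complete() can serve as a provider."""
    def complete(self, prompt: str) -> str: ...

class FakeLLM:
    """A stub provider for tests — no network calls."""
    def complete(self, prompt: str) -> str:
        return '{"rsids": ["rs4988235"]}'

# Structural typing: FakeLLM never subclasses LLMProvider,
# yet it satisfies the protocol.
provider: LLMProvider = FakeLLM()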

Python API

from pathlib import Path
from dna_rag import DNAAnalysisEngine, Settings
from dna_rag.llm.deepseek import DeepSeekProvider
from dna_rag.cache import InMemoryCache

settings = Settings()  # reads DNA_RAG_* env vars
engine = DNAAnalysisEngine(
    snp_llm=DeepSeekProvider(settings),
    cache=InMemoryCache(),
)

result = engine.analyze("lactose tolerance", Path("genome_data.csv"))
print(result.interpretation)
print(f"Matched {result.snp_count_matched}/{result.snp_count_requested} SNPs")

Per-Step LLM Selection

from dna_rag.llm.deepseek import DeepSeekProvider
from dna_rag.llm.openai_compat import OpenAICompatProvider
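
# snp_settings / interp_settings: assumed to be two separately configured
# Settings instances, one per provider (construction omitted here)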

engine = DNAAnalysisEngine(
    snp_llm=DeepSeekProvider(snp_settings),           # reasoning model
    interpretation_llm=OpenAICompatProvider(interp_settings),  # cheaper model
    cache=InMemoryCache(),
)

Polygenic Risk Scores

from dna_rag.polygenic import PolygenicScoreCalculator
from dna_rag.parsers.detector import detect_and_parse

df = detect_and_parse(Path("genome_data.csv"))
calc = PolygenicScoreCalculator()
result = calc.calculate("alzheimers_risk", df)
print(result.interpretation)

SNP Validation

from dna_rag.snp_database import SNPDatabase

db = SNPDatabase()
info = db.validate_rsid("rs429358")
print(f"{info.rsid}: gene={info.gene}, chr={info.chromosome}")

Supported DNA Formats

Input files are tabular data (TSV/CSV) exported by DNA testing services. The format is auto-detected from the file content (header), not from the file extension.

| Service | File type | Delimiter | Example file |
|---|---|---|---|
| VCF | .vcf, .vcf.gz | Tab | genome.vcf |
| 23andMe | .txt (TSV) | Tab | genome_John_Doe.txt |
| AncestryDNA | .txt (TSV) | Tab | AncestryDNA_raw.txt |
| MyHeritage | .csv | Comma | MyHeritage_raw.csv |

Tested with real DNA data purchased from MyHeritage.
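
To illustrate what the content-based detection keys on, here is roughly what the first lines of these exports look like (abridged and illustrative — real files vary by export version):

# 23andMe-style TSV: comment lines, then tab-separated columns
# rsid	chromosome	position	genotype
rs4988235	2	136608646	AA

# MyHeritage-style CSV: quoted, comma-separated columns
"RSID","CHROMOSOME","POSITION","RESULT"
"rs4988235","2","136608646","AA"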

Configuration

All settings via DNA_RAG_-prefixed env vars or .env file.

Primary LLM (SNP identification + default)

| Variable | Default | Description |
|---|---|---|
| DNA_RAG_LLM_API_KEY | required | API key for the LLM provider |
| DNA_RAG_LLM_PROVIDER | deepseek | deepseek or openai_compat |
| DNA_RAG_LLM_MODEL | deepseek-r1:free | Model name |
| DNA_RAG_LLM_BASE_URL | https://api.deepseek.com/v1 | API base URL |
| DNA_RAG_LLM_TEMPERATURE | 0.0 | Sampling temperature (0.0–2.0) |
| DNA_RAG_LLM_MAX_TOKENS | — | Max response tokens (provider default if unset) |
| DNA_RAG_LLM_TIMEOUT | 60.0 | Request timeout in seconds |
| DNA_RAG_LLM_MAX_RETRIES | 3 | Retries on connection/rate-limit errors (0–10) |

Interpretation LLM (optional, overrides primary for step 2)

If not set, the primary LLM settings are used for both steps.

| Variable | Default | Description |
|---|---|---|
| DNA_RAG_LLM_INTERP_PROVIDER | — | deepseek or openai_compat |
| DNA_RAG_LLM_INTERP_API_KEY | — | API key (falls back to primary) |
| DNA_RAG_LLM_INTERP_MODEL | — | Model name (falls back to primary) |
| DNA_RAG_LLM_INTERP_BASE_URL | — | API base URL (falls back to primary) |
| DNA_RAG_LLM_INTERP_TEMPERATURE | 0.0 | Sampling temperature |
| DNA_RAG_LLM_INTERP_MAX_TOKENS | — | Max response tokens |
| DNA_RAG_LLM_INTERP_TIMEOUT | 60.0 | Request timeout in seconds |
| DNA_RAG_LLM_INTERP_MAX_RETRIES | 3 | Retries on connection/rate-limit errors |

Cache, Logging, Parser

| Variable | Default | Description |
|---|---|---|
| DNA_RAG_CACHE_BACKEND | memory | memory or none |
| DNA_RAG_CACHE_MAX_SIZE | 1000 | Max cached entries |
| DNA_RAG_CACHE_TTL_SECONDS | 3600 | Cache entry lifetime in seconds |
| DNA_RAG_LOG_LEVEL | INFO | Logging level |
| DNA_RAG_LOG_FORMAT | console | console or json |
| DNA_RAG_DEFAULT_DNA_FORMAT | auto | auto, 23andme, ancestrydna, or myheritage |

Project Structure

src/dna_rag/
    engine.py            # Core 2-step LLM pipeline
    config.py            # Pydantic Settings
    models.py            # Data models (SNPResult, AnalysisResult)
    exceptions.py        # Exception hierarchy
    polygenic.py         # Polygenic risk score calculator
    snp_database.py      # NCBI dbSNP validation client
    vector_store.py      # Optional ChromaDB RAG (requires [rag])
    cli.py               # Click CLI
    llm/                 # LLM protocol + providers (DeepSeek, OpenAI-compat)
    cache/               # Cache protocol + in-memory backend
    parsers/             # DNA parsers (23andMe, AncestryDNA, MyHeritage, VCF)
    api/                 # FastAPI server
        routes/          #   REST + WebSocket endpoints
        middleware/      #   Auth, rate-limit, request-id
        services/        #   Analysis, file management, async jobs
        schemas/         #   Request/response models
tests/
    unit/                # Unit tests for all modules
    api/                 # API endpoint tests
    integration/         # CLI + engine integration tests
    test_vcf_parser.py   # VCF parser tests
    test_polygenic.py    # Polygenic calculator tests
    test_snp_database.py # SNP database client tests

Makefile

make help          # Show all targets
make install       # pip install -e ".[dev,api]"
make test          # pytest
make lint          # ruff check
make typecheck     # mypy
make check         # lint + typecheck + test
make serve         # Run API server
make docker-build  # Build Docker image
make docker-up     # Start via docker-compose

API Documentation

  • docs/API.md — endpoint reference, request/response examples
  • ARCHITECTURE.md — FastAPI design document and target architecture

Interactive docs are available at http://localhost:8000/docs when the server is running.

Privacy & Data

Your genetic data is sensitive. Understand how it is processed:

  • You provide your own API key. DNA data is sent to your chosen LLM provider and is subject to that provider's privacy policy and data retention rules. Review your provider's terms: OpenAI Privacy Policy, DeepSeek Privacy Policy.
  • No data is stored by this tool. DNA RAG does not collect, store, or transmit your genetic data to any third party. All processing happens in your session.
  • Every response includes a medical disclaimer (configurable via DNA_RAG_MEDICAL_DISCLAIMER) reminding the user that genetic predisposition is not deterministic and recommending consultation with a healthcare professional. The LLM translates the disclaimer into the response language.

NCBI Verification

When enabled, each SNP identified by the LLM is verified against real biomedical databases before interpretation:

LLM identifies SNPs → dbSNP confirms they exist → ClinVar adds clinical data → LLM interprets with verified context

What it does

| Step | Source | Data |
|---|---|---|
| 1. dbSNP lookup | NCBI dbSNP | Confirms the RSID exists, corrects the gene name, retrieves alleles and MAF |
| 2. ClinVar lookup | NCBI ClinVar | Clinical significance (Benign / Pathogenic / VUS), associated trait |
| 3. Gene correction | dbSNP → engine | If the LLM claimed a wrong gene, it is silently replaced with the authoritative one |
| 4. Prompt injection | engine → LLM | A VERIFIED DATA block with MAF, ClinVar, and gene is added to the interpretation prompt |
| 5. UI display | engine → UI | ClinVar verification expander shows the LLM opinion and NCBI data side by side |

How to enable

Streamlit UI — use the 🔬 NCBI verification toggle in the sidebar. Switching it on/off rebuilds the engine instantly, no restart needed.

Environment variable — set before starting the app:

DNA_RAG_VALIDATION_ENABLED=true   # enable NCBI verification by default

Python API:

from dna_rag import DNAAnalysisEngine, Settings
from dna_rag.llm.deepseek import DeepSeekProvider
from dna_rag.snp_database import SNPDatabase

settings = Settings()
engine = DNAAnalysisEngine(
    snp_llm=DeepSeekProvider(settings),
    snp_database=SNPDatabase(),  # passing a SNPDatabase enables NCBI verification
)

What the user sees

| Toggle state | Metric column | ClinVar expander | Speed |
|---|---|---|---|
| OFF | Validated: Disabled | Hidden | Fast (~2–5 s) |
| ON | Validated: ✅ NCBI | Shows per-SNP clinical significance, trait, MAF | Slower (~5–15 s) |

Note: NCBI E-utilities rate limit is 3 requests/second without an API key. For batch validation of many SNPs this adds ~3-10 seconds per query.
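
The dbSNP half of the lookup can be reproduced directly against NCBI's public E-utilities — a standalone sketch that bypasses the package's SNPDatabase client and shows only the underlying request:

import requests

# esummary for dbSNP; the id is the RSID without the "rs" prefix
resp = requests.get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi",
    params={"db": "snp", "id": "429358", "retmode": "json"},
    timeout=10,
)
resp.raise_for_status()
doc = resp.json()["result"]["429358"]
print(doc.get("chrpos"), doc.get("genes"))  # inspect doc.keys() for all fields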

Configuration

| Variable | Default | Description |
|---|---|---|
| DNA_RAG_VALIDATION_ENABLED | false | Enable NCBI dbSNP + ClinVar verification |
| DNA_RAG_VALIDATION_TIMEOUT | 10.0 | Timeout per NCBI request in seconds |
| DNA_RAG_VALIDATION_RATE_LIMIT_DELAY | 0.34 | Delay between NCBI requests (seconds) |

Guardrails

This tool is not a medical device and does not replace professional genetic counseling. Built-in safeguards:

  • Structured LLM output — Pydantic models validate every LLM response; malformed or unexpected output is rejected, not silently passed through.
  • RSID format validation — only SNP identifiers matching the rs* format are accepted; arbitrary text from the LLM is filtered out (a minimal sketch of this check follows the list)
  • NCBI verification — when enabled (see NCBI Verification above), each LLM-identified RSID is verified against NCBI dbSNP and ClinVar. Invalid RSIDs are removed; gene names are corrected; clinical significance is surfaced to the user.
  • Anti-hallucination prompt — the interpretation LLM receives a VERIFIED DATA block from NCBI and CRITICAL RULES that forbid inventing gene associations not supported by evidence.
  • Medical disclaimer in every response — a configurable disclaimer is appended to each interpretation, translated into the user's language.
  • No diagnosis or treatment recommendations — the LLM prompt asks for genotype interpretation only, not medical advice.
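
The RSID filter is simple enough to show inline. A minimal sketch of the kind of check described above (illustrative — not the package's exact code):

import re

_RSID_RE = re.compile(r"^rs\d+$")

def keep_valid_rsids(candidates: list[str]) -> list[str]:
    """Drop anything the LLM returned that is not a plain rsNNN identifier."""
    return [c for c in candidates if _RSID_RE.match(c.strip().lower())]

print(keep_valid_rsids(["rs429358", "APOE gene", "rs7412 "]))
# ['rs429358', 'rs7412']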

License

MIT
