LLM-DNA

Requires Python 3.10+.

Extract LLM DNA vectors — low-dimensional, training-free representations that capture functional behavior and evolutionary relationships between language models.

📄 Paper: LLM DNA: Tracing Model Evolution via Functional Representations (ICLR 2026 Oral)

Overview

The explosive growth of large language models has created a vast but opaque landscape: millions of models exist, yet their evolutionary relationships through fine-tuning, distillation, or adaptation are often undocumented. LLM-DNA provides a general, scalable, training-free pipeline for extracting LLM DNA: mathematically grounded representations that satisfy inheritance and genetic determinism properties.

Key Features:

  • 🧬 Extract DNA vectors from any HuggingFace or local model
  • 🚀 Training-free, works across architectures and tokenizers
  • 📊 Tested on 305+ LLMs with superior or competitive performance
  • 🔍 Uncover undocumented relationships between models
  • 🌳 Build evolutionary trees using phylogenetic algorithms

Installation

pip install llm-dna

The distribution name is llm-dna (use it with pip); the Python import name is llm_dna.

Optional extras are available for model families that need additional runtime dependencies:

# Apple Silicon / MLX-backed models
pip install "llm-dna[apple]"

# Quantized HuggingFace models (bitsandbytes, GPTQ, compressed-tensors, optimum)
pip install "llm-dna[quantization]"

# Architecture-specific model families such as Mamba or TIMM-backed models
pip install "llm-dna[model_families]"

# Everything above
pip install "llm-dna[full]"

When to install each extra:

  • apple: required for MLX and mlx-community/* style model families on Apple Silicon.
  • quantization: required for many GPTQ, bitsandbytes, and compressed-tensors model families.
  • model_families: required for specific architectures whose modeling code depends on packages like mamba-ssm or timm.
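
To see which extras may still be missing in an environment, the sketch below probes for one representative module per extra. The module names in EXTRA_PROBES are illustrative assumptions, not taken from the package metadata, so adjust them to the dependencies you actually need.

```python
from importlib.util import find_spec

# Assumed mapping from each extra to one import name it is expected to provide.
# These module names are illustrative guesses, not read from llm-dna's metadata.
EXTRA_PROBES = {
    "apple": "mlx",
    "quantization": "bitsandbytes",
    "model_families": "timm",
}


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return find_spec(module_name) is not None


def missing_extras(probes: dict[str, str] = EXTRA_PROBES) -> list[str]:
    """List the extras whose probe module is not installed."""
    return [extra for extra, module in probes.items() if not is_importable(module)]


if __name__ == "__main__":
    for extra in missing_extras():
        print(f'Probe module not found; consider: pip install "llm-dna[{extra}]"')
```

A missing probe module only matters if you plan to load a model family that needs it.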

Quick Start

from llm_dna import DNAExtractionConfig, calc_dna

config = DNAExtractionConfig(
    model_name="distilgpt2",
    dataset="rand",
    gpu_id=0,
    max_samples=100,
)

result = calc_dna(config)
print(f"DNA shape: {result.vector.shape}")  # (128,)
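
Once DNA vectors exist for several models, a natural next step is to compare them. The sketch below uses cosine similarity, which is an assumption made for illustration and not necessarily the metric used in the paper. The helper names are hypothetical and not part of llm_dna. It also assumes all vectors were extracted with the same dataset, dna_dim, and reduction_method, which seems necessary for them to be comparable.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equally sized vectors."""
    if len(a) != len(b):
        raise ValueError("DNA vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def nearest_model(query: str, dna: dict[str, Sequence[float]]) -> str:
    """Return the name of the other model whose vector is most similar to `query`."""
    others = (name for name in dna if name != query)
    return max(others, key=lambda name: cosine_similarity(dna[query], dna[name]))


# Toy vectors standing in for result.vector from calc_dna (hypothetical values).
toy_dna = {
    "base": [1.0, 0.0, 0.0],
    "finetune": [0.9, 0.1, 0.0],
    "unrelated": [0.0, 0.0, 1.0],
}
closest = nearest_model("base", toy_dna)
```

In real use, the dictionary values would be the result.vector arrays returned by calc_dna; NumPy arrays work here because the helpers only iterate over the elements.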

Python API

from llm_dna import DNAExtractionConfig, calc_dna

config = DNAExtractionConfig(
    model_name="Qwen/Qwen2.5-0.5B-Instruct",
    dataset="rand",
    gpu_id=0,
    max_samples=100,
    dna_dim=128,
    reduction_method="random_projection",  # or "pca", "svd"
    trust_remote_code=True,
)

result = calc_dna(config)

# DNA vector (numpy.ndarray)
vector = result.vector

# Saved paths (when save=True)
print(result.output_path)
print(result.summary_path)
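
For intuition about reduction_method="random_projection", here is a generic sketch of Gaussian random projection. It illustrates the general technique only and is not llm_dna's implementation; the function names, the seed handling, and the 1/sqrt(out_dim) scaling are all assumptions. The point to notice is that a fixed seed yields the same projection matrix, so vectors from different models land in the same low-dimensional space.

```python
import math
import random
from typing import Sequence


def random_projection_matrix(in_dim: int, out_dim: int, seed: int = 0) -> list[list[float]]:
    """Build an out_dim x in_dim Gaussian matrix, deterministic for a given seed."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)  # keeps projected norms on a similar scale
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)] for _ in range(out_dim)]


def project(features: Sequence[float], matrix: Sequence[Sequence[float]]) -> list[float]:
    """Map a high-dimensional feature vector to len(matrix) dimensions."""
    for row in matrix:
        if len(row) != len(features):
            raise ValueError("feature dimension does not match the projection matrix")
    return [sum(w * x for w, x in zip(row, features)) for row in matrix]


# Hypothetical sizes: 16 input features reduced to 4 dimensions.
matrix = random_projection_matrix(in_dim=16, out_dim=4, seed=42)
reduced = project([1.0] * 16, matrix)
```

Unlike PCA or SVD, this kind of projection needs no fitting step, which is one plausible reason to offer it as an option in a training-free pipeline.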

CLI

# Single model
calc-dna --model-name distilgpt2 --dataset rand --gpus 0

# Multiple models with round-robin GPU assignment
calc-dna --llm-list ./configs/llm_list.txt --gpus 0,1

# With hyperparameters
calc-dna \
  --model-name mistralai/Mistral-7B-v0.1 \
  --dna-dim 256 \
  --max-samples 200 \
  --reduction-method pca \
  --load-in-8bit
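
The round-robin assignment mentioned above can be pictured with the following sketch. It is not the CLI's actual code. The list file format (one model name per line, with blank lines and # comments skipped) is an assumption, and both function names are hypothetical.

```python
from itertools import cycle


def parse_model_list(text: str) -> list[str]:
    """Assumed format: one model name per line; blank lines and '#' comments are skipped."""
    models = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            models.append(line)
    return models


def assign_round_robin(models: list[str], gpus: list[int]) -> list[tuple[str, int]]:
    """Pair each model with a GPU id, cycling through the GPU ids in order."""
    if not gpus:
        raise ValueError("at least one GPU id is required")
    return list(zip(models, cycle(gpus)))


models = parse_model_list("distilgpt2\n\n# a comment\nQwen/Qwen2.5-0.5B-Instruct\n")
plan = assign_round_robin(models + ["mistralai/Mistral-7B-v0.1"], gpus=[0, 1])
```

With two GPUs and three models, the first and third models share GPU 0 while the second runs on GPU 1.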

Notes

  • Metadata auto-fetched: Model metadata is automatically retrieved from the HuggingFace Hub and cached.
  • Auth token: Pass via token=... or set the HF_TOKEN environment variable.
  • Chat templates: Disabled by default. Enable with --use-chat-template (CLI) or use_chat_template=True (API).

Tests

# All tests (including integration tests with real model loading)
pytest tests/ -v

# Fast tests only (skip real model loading)
pytest tests/ -m "not slow"

Citation

If you use LLM-DNA in your research, please cite:

@inproceedings{wu2026llmdna,
  title={LLM DNA: Tracing Model Evolution via Functional Representations},
  author={Wu, Zhaomin and Zhao, Haodong and Wang, Ziyang and Guo, Jizhou and Wang, Qian and He, Bingsheng},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/pdf?id=UIxHaAqFqQ}
}

License

Apache 2.0
