LLM-DNA

Requires Python 3.10+.

Extract LLM DNA vectors — low-dimensional, training-free representations that capture functional behavior and evolutionary relationships between language models.

📄 Paper: LLM DNA: Tracing Model Evolution via Functional Representations (ICLR 2026 Oral)

Overview

The explosive growth of large language models has created a vast but opaque landscape: millions of models exist, yet their evolutionary relationships through fine-tuning, distillation, or adaptation are often undocumented. LLM-DNA provides a general, scalable, training-free pipeline for extracting LLM DNA, a mathematically grounded representation that satisfies inheritance and genetic determinism properties.

Key Features:

  • 🧬 Extract DNA vectors from any HuggingFace or local model
  • 🚀 Training-free, works across architectures and tokenizers
  • 📊 Tested on 305+ LLMs with superior or competitive performance
  • 🔍 Uncover undocumented relationships between models
  • 🌳 Build evolutionary trees using phylogenetic algorithms
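
To make the tree-building feature concrete, here is a minimal sketch of one standard phylogenetic algorithm, UPGMA (average-linkage clustering), applied to a handful of DNA vectors. It is an illustration only: this README does not say which algorithm or distance metric the package or paper uses, so the Euclidean distance, the function name `upgma_newick`, and the toy vectors are all assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def upgma_newick(vectors: Dict[str, Sequence[float]]) -> str:
    """Cluster named vectors with UPGMA and return the topology as a Newick string.

    Hypothetical helper, not part of llm_dna. Names are written into the
    Newick string as-is, so they should not contain parentheses or commas.
    """
    if not vectors:
        raise ValueError("need at least one vector")

    # Each cluster is (newick_label, member_names); start with one per model.
    clusters: List[Tuple[str, List[str]]] = [(n, [n]) for n in sorted(vectors)]

    def average_distance(left: List[str], right: List[str]) -> float:
        # UPGMA: mean pairwise distance between the members of two clusters.
        total = sum(euclidean(vectors[p], vectors[q]) for p in left for q in right)
        return total / (len(left) * len(right))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_distance(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best

        # Merge the pair into a new cluster; delete j first because j > i.
        merged = (
            f"({clusters[i][0]},{clusters[j][0]})",
            clusters[i][1] + clusters[j][1],
        )
        del clusters[j]
        del clusters[i]
        clusters.append(merged)

    return clusters[0][0] + ";"


# Toy vectors standing in for real DNA vectors.
toy = {"base": [0.0, 0.0], "finetune": [0.1, 0.0], "other": [5.0, 5.0]}
print(upgma_newick(toy))
```

With real data, the dictionary values would be the `result.vector` arrays returned by `calc_dna` for each model.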

Installation

pip install llm-dna

The package is installed as llm-dna but imported in Python as llm_dna.

Quick Start

from llm_dna import DNAExtractionConfig, calc_dna

config = DNAExtractionConfig(
    model_name="distilgpt2",
    dataset="rand",
    gpu_id=0,
    max_samples=100,
)

result = calc_dna(config)
print(f"DNA shape: {result.vector.shape}")  # (128,)
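
Once DNA vectors have been extracted for two models, a natural next step is to compare them. The sketch below uses cosine similarity, a common choice for fixed-length vectors. The README does not state which metric the package or paper uses, so treat the metric and the helper name `cosine_similarity` as assumptions. The function accepts any sequence of floats, including a 1-D `numpy.ndarray` such as `result.vector`.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors.

    Hypothetical helper, not part of llm_dna.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for two extracted DNA vectors.
parent = [1.0, 2.0, 3.0]
child = [1.1, 1.9, 3.2]
print(f"similarity: {cosine_similarity(parent, child):.3f}")
```

Vectors are only comparable if they were extracted with the same settings (for example the same `dna_dim`), since the helper rejects mismatched lengths.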

Python API

from llm_dna import DNAExtractionConfig, calc_dna

config = DNAExtractionConfig(
    model_name="Qwen/Qwen2.5-0.5B-Instruct",
    dataset="rand",
    gpu_id=0,
    max_samples=100,
    dna_dim=128,
    reduction_method="random_projection",  # or "pca", "svd"
    trust_remote_code=True,
)

result = calc_dna(config)

# DNA vector (numpy.ndarray)
vector = result.vector

# Saved paths (when save=True)
print(result.output_path)
print(result.summary_path)
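
For intuition about what `reduction_method="random_projection"` refers to, the following is a generic Gaussian random projection written with the standard library only. It is a sketch of the general technique, not the package's implementation. The function name, the `seed` parameter, and the input layout (one high-dimensional feature row per model) are assumptions made for illustration.

```python
import math
import random
from typing import List, Sequence


def random_projection(
    features: Sequence[Sequence[float]], out_dim: int, seed: int = 0
) -> List[List[float]]:
    """Project each row of `features` down to `out_dim` dimensions.

    Hypothetical helper, not part of llm_dna. Uses a Gaussian matrix scaled
    by 1/sqrt(out_dim), which roughly preserves pairwise distances.
    """
    if out_dim <= 0:
        raise ValueError("out_dim must be positive")
    if not features:
        return []
    in_dim = len(features[0])
    if any(len(row) != in_dim for row in features):
        raise ValueError("all rows must have the same length")

    # A fixed seed makes the projection matrix reproducible, so vectors
    # projected in separate runs land in the same space.
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    matrix = [
        [rng.gauss(0.0, 1.0) * scale for _ in range(out_dim)]
        for _ in range(in_dim)
    ]

    # Plain matrix product: (n_rows x in_dim) @ (in_dim x out_dim).
    return [
        [sum(row[i] * matrix[i][j] for i in range(in_dim)) for j in range(out_dim)]
        for row in features
    ]


rows = [[1.0] * 1000, [0.5] * 1000]
reduced = random_projection(rows, out_dim=8, seed=42)
print(len(reduced), len(reduced[0]))
```

Unlike PCA or SVD, a random projection does not need to be fitted to the data, which is why every model must share the same projection matrix for the resulting vectors to be comparable.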

CLI

# Single model
calc-dna --model-name distilgpt2 --dataset rand --gpus 0

# Multiple models with round-robin GPU assignment
calc-dna --llm-list ./configs/llm_list.txt --gpus 0,1

# With hyperparameters
calc-dna \
  --model-name mistralai/Mistral-7B-v0.1 \
  --dna-dim 256 \
  --max-samples 200 \
  --reduction-method pca \
  --load-in-8bit
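
The round-robin GPU assignment mentioned for `--llm-list` can be pictured with the short sketch below: each model in the list is paired with the next GPU ID, wrapping around when the IDs run out. This illustrates the general scheduling idea, not the CLI's actual code, and the helper name `assign_round_robin` is hypothetical.

```python
from typing import List, Sequence, Tuple


def assign_round_robin(
    models: Sequence[str], gpu_ids: Sequence[int]
) -> List[Tuple[str, int]]:
    """Pair each model with a GPU ID, cycling through `gpu_ids` in order.

    Hypothetical helper, not part of llm_dna.
    """
    if not gpu_ids:
        raise ValueError("at least one GPU ID is required")
    return [(model, gpu_ids[i % len(gpu_ids)]) for i, model in enumerate(models)]


models = ["distilgpt2", "Qwen/Qwen2.5-0.5B-Instruct", "mistralai/Mistral-7B-v0.1"]
for model, gpu in assign_round_robin(models, [0, 1]):
    print(f"{model} -> GPU {gpu}")
```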

Notes

  • Metadata: Model metadata is fetched automatically from the HuggingFace Hub and cached.
  • Auth token: Pass it via token=... or set the HF_TOKEN environment variable.
  • Chat templates: Applied automatically when the tokenizer supports them.
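
The two ways of supplying an auth token described above can be combined with a small lookup helper. The sketch assumes that an explicitly passed token takes precedence over the `HF_TOKEN` environment variable. That ordering is an assumption, not something this README states, and `resolve_hf_token` is a hypothetical name.

```python
import os
from typing import Mapping, Optional


def resolve_hf_token(
    explicit: Optional[str] = None, env: Optional[Mapping[str, str]] = None
) -> Optional[str]:
    """Return the explicit token if given, else HF_TOKEN from the environment.

    Hypothetical helper, not part of llm_dna. The precedence order is assumed.
    """
    if explicit:
        return explicit
    source = os.environ if env is None else env
    # Treat an empty HF_TOKEN the same as an unset one.
    return source.get("HF_TOKEN") or None


token = resolve_hf_token()
print("token found" if token else "no token configured")
```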

Tests

# All tests (including integration tests with real model loading)
pytest tests/ -v

# Fast tests only (skip real model loading)
pytest tests/ -m "not slow"

Citation

If you use LLM-DNA in your research, please cite:

@inproceedings{wu2026llmdna,
  title={LLM DNA: Tracing Model Evolution via Functional Representations},
  author={Wu, Zhaomin and Zhao, Haodong and Wang, Ziyang and Guo, Jizhou and Wang, Qian and He, Bingsheng},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/pdf?id=UIxHaAqFqQ}
}

License

Apache 2.0
