Skip to main content

Extract LLM DNA vectors — low-dimensional representations that capture functional behavior and model evolution.

Project description

LLM-DNA

Python 3.10+ PyPI version License Tests

Extract LLM DNA vectors — low-dimensional, training-free representations that capture functional behavior and evolutionary relationships between language models.

📄 Paper: LLM DNA: Tracing Model Evolution via Functional Representations (ICLR 2026 Oral)

Overview

The explosive growth of large language models has created a vast but opaque landscape: millions of models exist, yet their evolutionary relationships through fine-tuning, distillation, or adaptation are often undocumented. LLM-DNA provides a general, scalable, training-free pipeline for extracting LLM DNA — mathematically-grounded representations that satisfy inheritance and genetic determinism properties.

Key Features:

  • 🧬 Extract DNA vectors from any HuggingFace or local model
  • 🚀 Training-free, works across architectures and tokenizers
  • 📊 Tested on 305+ LLMs with superior or competitive performance
  • 🔍 Uncover undocumented relationships between models
  • 🌳 Build evolutionary trees using phylogenetic algorithms

Installation

pip install llm-dna

Use llm-dna for install/package naming, and llm_dna for Python imports.

Quick Start

from llm_dna import DNAExtractionConfig, calc_dna

config = DNAExtractionConfig(
    model_name="distilgpt2",
    dataset="rand",
    gpu_id=0,
    max_samples=100,
)

result = calc_dna(config)
print(f"DNA shape: {result.vector.shape}")  # (128,)

Python API

from llm_dna import DNAExtractionConfig, calc_dna

config = DNAExtractionConfig(
    model_name="Qwen/Qwen2.5-0.5B-Instruct",
    dataset="rand",
    gpu_id=0,
    max_samples=100,
    dna_dim=128,
    reduction_method="random_projection",  # or "pca", "svd"
    trust_remote_code=True,
)

result = calc_dna(config)

# DNA vector (numpy.ndarray)
vector = result.vector

# Saved paths (when save=True)
print(result.output_path)
print(result.summary_path)

CLI

# Single model
calc-dna --model-name distilgpt2 --dataset rand --gpus 0

# Multiple models with round-robin GPU assignment
calc-dna --llm-list ./configs/llm_list.txt --gpus 0,1

# With hyperparameters
calc-dna \
  --model-name mistralai/Mistral-7B-v0.1 \
  --dna-dim 256 \
  --max-samples 200 \
  --reduction-method pca \
  --load-in-8bit

Notes

  • Metadata auto-fetched: Model metadata is automatically retrieved from HuggingFace Hub and cached.
  • Auth token: Pass via token=... or set HF_TOKEN environment variable.
  • Chat templates: Applied automatically when supported by the tokenizer.

Tests

# All tests (including integration tests with real model loading)
pytest tests/ -v

# Fast tests only (skip real model loading)
pytest tests/ -m "not slow"

Citation

If you use LLM-DNA in your research, please cite:

@inproceedings{wu2026llmdna,
  title={LLM DNA: Tracing Model Evolution via Functional Representations},
  author={Wu, Zhaomin and Zhao, Haodong and Wang, Ziyang and Guo, Jizhou and Wang, Qian and He, Bingsheng},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/pdf?id=UIxHaAqFqQ}
}

License

Apache 2.0

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

llm_dna-0.1.0.tar.gz (9.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

llm_dna-0.1.0-py3-none-any.whl (10.8 kB view details)

Uploaded Python 3

File details

Details for the file llm_dna-0.1.0.tar.gz.

File metadata

  • Download URL: llm_dna-0.1.0.tar.gz
  • Upload date:
  • Size: 9.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for llm_dna-0.1.0.tar.gz
Algorithm Hash digest
SHA256 daeecccdf2710747f846d3aaddc6f4f8638b60d99155073522c327707898e057
MD5 6d27cce0843d51675ec106d56c439ee8
BLAKE2b-256 ed6f8c1c662785b94bd6f675e8336084dabc669ff701ef32955ff4574cb2e9f0

See more details on using hashes here.

Provenance

The following attestation bundles were made for llm_dna-0.1.0.tar.gz:

Publisher: release.yml on Xtra-Computing/LLM-DNA

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file llm_dna-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: llm_dna-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 10.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for llm_dna-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 a1289a5a7a418b387e10d03c07df1e00ad194ad7947342366e42323fab5b3164
MD5 ec4fae63a1806a781fe28cd9713caa4d
BLAKE2b-256 7fd426d2e228f0c1c73ca68ed501a85b0cd98029254a911e848bf2768e707552

See more details on using hashes here.

Provenance

The following attestation bundles were made for llm_dna-0.1.0-py3-none-any.whl:

Publisher: release.yml on Xtra-Computing/LLM-DNA

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page