

LLM-DNA

Requires Python 3.10+.

Extract LLM DNA vectors — low-dimensional, training-free representations that capture functional behavior and evolutionary relationships between language models.

📄 Paper: LLM DNA: Tracing Model Evolution via Functional Representations (ICLR 2026 Oral)

Overview

The explosive growth of large language models has created a vast but opaque landscape: millions of models exist, yet their evolutionary relationships through fine-tuning, distillation, or adaptation are often undocumented. LLM-DNA provides a general, scalable, training-free pipeline for extracting LLM DNA: mathematically grounded representations that satisfy inheritance and genetic determinism properties.

Key Features:

  • 🧬 Extract DNA vectors from any Hugging Face Hub or local model
  • 🚀 Training-free, works across architectures and tokenizers
  • 📊 Tested on 305+ LLMs with superior or competitive performance
  • 🔍 Uncover undocumented relationships between models
  • 🌳 Build evolutionary trees using phylogenetic algorithms
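The last two features both come down to comparing DNA vectors pairwise. As a rough illustration only, the sketch below clusters a few made-up vectors with average-linkage agglomerative clustering and returns the result as a Newick string. The vectors, the model names, and the helpers `euclidean`, `average_distance`, and `build_tree` are all hypothetical, and this is not necessarily the phylogenetic algorithm used by the paper or the package.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_distance(members_a, members_b, vectors):
    """Mean pairwise distance between the members of two clusters."""
    dists = [euclidean(vectors[m], vectors[n]) for m in members_a for n in members_b]
    return sum(dists) / len(dists)


def build_tree(vectors):
    """Average-linkage agglomerative clustering; returns a Newick string."""
    # Map each subtree (written as Newick text) to the model names it contains.
    clusters = {name: [name] for name in vectors}
    while len(clusters) > 1:
        # Merge the two clusters with the smallest average distance.
        left, right = min(
            combinations(sorted(clusters), 2),
            key=lambda pair: average_distance(
                clusters[pair[0]], clusters[pair[1]], vectors
            ),
        )
        clusters[f"({left},{right})"] = clusters.pop(left) + clusters.pop(right)
    (newick,) = clusters
    return newick + ";"


# Toy 2-D vectors standing in for real DNA vectors (assumption).
toy = {
    "base": [0.0, 0.0],
    "base-ft": [0.1, 0.0],
    "other": [5.0, 5.0],
}
print(build_tree(toy))  # ((base,base-ft),other);
```

With real data, the dictionary values would be the `result.vector` arrays returned by `calc_dna`, one per model.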

Installation

pip install llm-dna

The package is installed as llm-dna (with a hyphen) but imported in Python as llm_dna (with an underscore).

Quick Start

from llm_dna import DNAExtractionConfig, calc_dna

config = DNAExtractionConfig(
    model_name="distilgpt2",
    dataset="rand",
    gpu_id=0,
    max_samples=100,
)

result = calc_dna(config)
print(f"DNA shape: {result.vector.shape}")  # (128,)
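A natural next step after extracting vectors for two models is to compare them. The sketch below computes cosine similarity between two vectors. The toy values stand in for real `result.vector` arrays, `cosine_similarity` is a hypothetical helper, and cosine similarity is an assumption here rather than the metric the paper necessarily uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (assumption); real DNA vectors would come from calc_dna().
dna_a = [1.0, 2.0, 3.0]
dna_b = [2.0, 4.0, 6.0]   # same direction as dna_a
dna_c = [-3.0, 0.0, 1.0]  # orthogonal to dna_a

print(round(cosine_similarity(dna_a, dna_b), 6))  # 1.0
print(round(cosine_similarity(dna_a, dna_c), 6))  # 0.0
```

The function only iterates over its inputs, so it accepts NumPy arrays as well as plain lists.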

Python API

from llm_dna import DNAExtractionConfig, calc_dna

config = DNAExtractionConfig(
    model_name="Qwen/Qwen2.5-0.5B-Instruct",
    dataset="rand",
    gpu_id=0,
    max_samples=100,
    dna_dim=128,
    reduction_method="random_projection",  # or "pca", "svd"
    trust_remote_code=True,
)

result = calc_dna(config)

# DNA vector (numpy.ndarray)
vector = result.vector

# Saved paths (when save=True)
print(result.output_path)
print(result.summary_path)
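To give a feel for what `reduction_method="random_projection"` and `dna_dim` refer to, here is a generic Gaussian random projection in plain Python. It sketches the general technique, not the package's implementation: the function name, the seed handling, and the scaling factor are assumptions.

```python
import math
import random


def random_projection(features, out_dim, seed=0):
    """Project a feature vector to out_dim dimensions with a Gaussian random matrix."""
    rng = random.Random(seed)  # fixed seed, so the same matrix is regenerated each call
    scale = 1.0 / math.sqrt(out_dim)
    projected = []
    for _ in range(out_dim):
        # One row of the random matrix, drawn from N(0, 1).
        row = [rng.gauss(0.0, 1.0) for _ in range(len(features))]
        projected.append(scale * sum(w * x for w, x in zip(row, features)))
    return projected


# Toy high-dimensional input (assumption), reduced to 128 dimensions.
high_dim = [float(i % 7) for i in range(1000)]
low_dim = random_projection(high_dim, out_dim=128)
print(len(low_dim))  # 128
```

Random projections approximately preserve pairwise distances, which is why no training step is needed. For projected vectors to be comparable across models, the same projection matrix (here, the same seed) has to be used for all of them.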

CLI

# Single model
calc-dna --model-name distilgpt2 --dataset rand --gpus 0

# Multiple models with round-robin GPU assignment
calc-dna --llm-list ./configs/llm_list.txt --gpus 0,1

# With hyperparameters
calc-dna \
  --model-name mistralai/Mistral-7B-v0.1 \
  --dna-dim 256 \
  --max-samples 200 \
  --reduction-method pca \
  --load-in-8bit
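Round-robin assignment cycles through the available GPU ids as models are scheduled. The sketch below shows that idea for anyone driving the Python API from their own script. `assign_gpus` is a hypothetical helper, and the CLI's actual scheduling may differ.

```python
from itertools import cycle


def assign_gpus(model_names, gpu_ids):
    """Pair each model with a GPU id, cycling through gpu_ids in order."""
    if not gpu_ids:
        raise ValueError("at least one GPU id is required")
    return list(zip(model_names, cycle(gpu_ids)))


models = [
    "distilgpt2",
    "Qwen/Qwen2.5-0.5B-Instruct",
    "mistralai/Mistral-7B-v0.1",
]
for name, gpu in assign_gpus(models, [0, 1]):
    print(f"{name} -> GPU {gpu}")
# distilgpt2 -> GPU 0
# Qwen/Qwen2.5-0.5B-Instruct -> GPU 1
# mistralai/Mistral-7B-v0.1 -> GPU 0
```

Each pair can then be passed as `model_name` and `gpu_id` to a `DNAExtractionConfig`.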

Notes

  • Metadata: Model metadata is retrieved automatically from the Hugging Face Hub and cached.
  • Auth token: Pass via token=... or set the HF_TOKEN environment variable.
  • Chat templates: Applied automatically when supported by the tokenizer.
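When relying on the environment variable, it can help to check that it is set before starting a long run. The sketch below is one way to do that. `require_hf_token` is a hypothetical helper and not part of llm_dna, and the token value shown is a placeholder.

```python
import os


def require_hf_token(env=os.environ):
    """Return the HF_TOKEN value from env, or raise if it is missing or blank."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; gated models may fail to download")
    return token


# Demonstrated with an explicit mapping so the example runs anywhere;
# call require_hf_token() with no arguments to check the real environment.
print(require_hf_token({"HF_TOKEN": "hf_placeholder"}))  # hf_placeholder
```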

Tests

# All tests (including integration tests with real model loading)
pytest tests/ -v

# Fast tests only (skip real model loading)
pytest tests/ -m "not slow"

Citation

If you use LLM-DNA in your research, please cite:

@inproceedings{wu2026llmdna,
  title={LLM DNA: Tracing Model Evolution via Functional Representations},
  author={Wu, Zhaomin and Zhao, Haodong and Wang, Ziyang and Guo, Jizhou and Wang, Qian and He, Bingsheng},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/pdf?id=UIxHaAqFqQ}
}

License

Apache 2.0
