Tools for merging pre-trained large language models

mergekitty

mergekitty is a toolkit for merging pre-trained language models. It uses an out-of-core approach so you can run surprisingly complex merges on modest hardware — entirely on CPU, or with as little as 8 GB of VRAM.

What's this fork?

Forked from mergekit (originally by Charles Goddard, then maintained by Arcee.ai). The original project switched to a BSL license after a ton of community contribution, then switched back to LGPL but added a CLA that lets them relicense at will. So here we are.

What changed?

A few things from upstream mergekit:

  • All names/imports/scripts renamed to mergekitty (find-replace and you're good)
  • VLM support with templated pre/post-weights (architecture files are incompatible with mergekit's)
  • tokenizer_source now defaults to "base"; legacy tokenizer copying is gone
  • nuslerp is now slerp (the old slerp implementation is removed). Supports both t (SLERP) and weight (NuSLERP) params
  • bakllama, mergekit-legacy, and mergekit-evolve removed
  • LoRA merging script via mergekitty-merge-lora
  • Switched to ruff for formatting/linting and hatch for builds

Why merge models?

Model merging is chaos magick. Done right, the result can be better than any of its inputs. The effect has been demonstrated repeatedly, and nobody fully understands why. Ship it.

Features

  • Works with Llama 3, Qwen 3 (Dense & MoE), Mistral, GLM4, GPT-NeoX, BERT, and more
  • Tons of merge methods — arguably too many
  • GPU or CPU — your call
  • Lazy tensor loading for low memory use
  • Interpolated gradient parameters for fine control
  • Layer-stacking / "Frankenmerging" (à la Goliath, Midnight Miqu)
  • MoE merging and LoRA extraction

Install

# recommended — isolated tool install
uv tool install mergekitty

# or just pip
pip install mergekitty

# from source
git clone https://github.com/allura-org/mergekitty.git
cd mergekitty
pip install -e .

Usage

mergekitty-yaml path/to/config.yml ./output-model [--cuda] [--lazy-unpickle] [--allow-crimes]

Run mergekitty-yaml --help for the full list of options.

Sharing on Huggingface

mergekitty generates a README.md for your merge. Edit it, keep it as-is, whatever — then upload:

huggingface-cli login
huggingface-cli upload your_username/my-cool-model ./output-model .

Merge Configuration

Configs are YAML. The main fields:

Field            Description
merge_method     Which algorithm to use (see below)
slices / models  Input model definitions (mutually exclusive)
base_model       Base model, for methods that need one
parameters       Weights, densities, etc. — specifiable at multiple levels
dtype            Data type for the merge
tokenizer        Vocabulary and embedding configuration
chat_template    Override the output chat template
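For orientation, a minimal two-model linear merge might look like this (the model IDs are placeholders, not recommendations):

```yaml
# hypothetical minimal config: average two models evenly
models:
  - model: org/model-a        # placeholder model IDs
    parameters:
      weight: 0.5
  - model: org/model-b
    parameters:
      weight: 0.5
merge_method: linear
dtype: bfloat16
```

Save it as config.yml and run mergekitty-yaml config.yml ./output-model.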

Parameters

Parameters (weight, density, etc.) can be set at four levels, most-specific wins:

  1. slices.*.sources.parameters — per input slice
  2. slices.*.parameters — per output slice
  3. models.*.parameters — per input model
  4. parameters — global fallback

Values can be scalars or interpolated gradients (a list of floats for smooth transitions across layers).
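As a sketch, here is a gradient combined with a tensor-name filter (the filter syntax follows upstream mergekit; model definitions omitted):

```yaml
parameters:
  t:
    - filter: self_attn                   # applies only to attention tensors
      value: [0.0, 0.3, 0.5, 0.7, 1.0]    # interpolated smoothly across layers
    - filter: mlp
      value: [1.0, 0.7, 0.5, 0.3, 0.0]
    - value: 0.5                          # fallback for all other tensors
```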

Tokenizer

Use the tokenizer field for full control, or tokenizer_source for the simple legacy behavior.

tokenizer:
  source: union          # "union", "base", or a model path
  tokens:                # optional: per-token embedding overrides
    <|im_start|>:
      source: "chatml_model"
    <|start_header_id|>:
      source: "llama3_model"
      force: true
  pad_to_multiple_of: null

The defaults are sensible: if a token exists in the base model, the base embedding is used; if exactly one input model has the token, that model's embedding is used; otherwise the embeddings are averaged. Any of this can be overridden per token.

Chat Template

chat_template: "auto"    # picks the most common template from inputs
# or: "alpaca", "chatml", "llama3", "mistral", "exaone"
# or: a raw Jinja2 template string

Examples

Check examples/ for real configs.

Merge Methods

Method                merge_method       Multi-Model  Needs Base
Linear (Model Soups)  linear             ✅           ❌
SLERP                 slerp              ✅*          ✅
Nearswap              nearswap           ❌           ✅
Task Arithmetic       task_arithmetic    ✅           ✅
TIES                  ties               ✅           ✅
DARE + TIES           dare_ties          ✅           ✅
DARE + Linear         dare_linear        ✅           ✅
Passthrough           passthrough        ❌           ❌
Model Breadcrumbs     breadcrumbs        ✅           ✅
Breadcrumbs + TIES    breadcrumbs_ties   ✅           ✅
Model Stock           model_stock        ✅           ✅
DELLA                 della              ✅           ✅
DELLA + Linear        della_linear       ✅           ✅
SCE                   sce                ✅           ✅

* SLERP supports two to three models.

Linear

Weighted average. Simple, classic, effective.

  • weight — relative weighting per tensor
  • normalize — normalize weights across models (default: true)

SLERP

Spherical interpolation. Supports t (classic SLERP, 0 = base, 1 = other) or weight (NuSLERP-style per-tensor weighting).

  • nuslerp_flatten — set to false to interpolate row- or column-wise instead of treating each tensor as a flat vector
  • nuslerp_row_wise — SLERP row vectors instead of column vectors
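A classic-SLERP config sketch (model paths are placeholders):

```yaml
models:
  - model: org/base-model
  - model: org/other-model
merge_method: slerp
base_model: org/base-model
parameters:
  t: 0.4          # 0 = base, 1 = other
dtype: bfloat16
```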

Nearswap

Interpolates between base and secondary model when similarity drops below threshold t.

Task Arithmetic

Subtract base model → get "task vectors" → merge them linearly → add base back. Great for models fine-tuned from a common ancestor. Also the mental model behind most of the fancier methods.
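A task-arithmetic sketch, assuming both finetunes share the same base (names are placeholders):

```yaml
models:
  - model: org/finetune-a
    parameters:
      weight: 1.0
  - model: org/finetune-b
    parameters:
      weight: 0.7    # scale this task vector down
merge_method: task_arithmetic
base_model: org/shared-base
dtype: bfloat16
```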

TIES

Task arithmetic + sparsification + sign consensus. Lets you merge more models without them stepping on each other.

  • density — fraction of task vector weights to keep
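A TIES sketch keeping 60% of each task vector (placeholder names; values illustrative, not tuned):

```yaml
models:
  - model: org/finetune-a
    parameters:
      weight: 0.5
      density: 0.6   # keep the top 60% of each task vector by magnitude
  - model: org/finetune-b
    parameters:
      weight: 0.5
      density: 0.6
merge_method: ties
base_model: org/shared-base
parameters:
  normalize: true
dtype: bfloat16
```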

DARE

Random pruning with rescaling, instead of TIES's magnitude-based sparsification. Works with TIES sign consensus (dare_ties) or without (dare_linear).

Passthrough

No-op. Passes tensors through unchanged. Useful for layer-stacking / frankenmerging where you only have one input per slice.
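A layer-stacking sketch: the bottom of one model stacked with the top of another, with overlap (layer counts are illustrative, assuming 32-layer inputs):

```yaml
slices:
  - sources:
      - model: org/model-a
        layer_range: [0, 24]    # layers 0-23 from model-a
  - sources:
      - model: org/model-b
        layer_range: [8, 32]    # layers 8-31 from model-b
merge_method: passthrough
dtype: bfloat16
```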

Model Breadcrumbs

Drops both tiny and huge differences from base. Works with (breadcrumbs_ties) or without (breadcrumbs) TIES.

  • density — fraction of weights to keep
  • gamma — fraction of largest-magnitude differences to remove (paper's β)
  • Defaults: density: 0.9, gamma: 0.01

Model Stock

Geometric trick to compute good linear weights. Needs at least three models including a base.

DELLA

Adaptive pruning based on magnitude ranking — keeps important changes, drops the rest. Like DARE but smarter about what it prunes.

  • density — fraction of weights to keep
  • epsilon — spread of drop probabilities (range: density ± epsilon)
  • lambda — scaling factor for merged deltas
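A DELLA sketch (placeholder names; parameter values are illustrative, not tuned):

```yaml
models:
  - model: org/finetune-a
    parameters:
      weight: 1.0
      density: 0.7
      epsilon: 0.1   # drop probabilities spread over density ± epsilon
  - model: org/finetune-b
    parameters:
      weight: 1.0
      density: 0.7
      epsilon: 0.1
merge_method: della
base_model: org/shared-base
parameters:
  lambda: 1.1        # scale the merged deltas slightly up
dtype: bfloat16
```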

SCE

Selects high-variance elements, computes matrix-level weights, erases minority contributions.

  • select_topk — fraction of high-variance elements to retain

LoRA Extraction

Extract PEFT-compatible LoRA adapters from finetuned models:

mergekitty-extract-lora finetuned_model base_model output_path --rank=32

MoE Merging

Merge dense models into a Mixture of Experts with mergekitty-moe. See the MoE docs.

Development

Uses Hatch + uv:

uv tool install hatch
hatch test              # run tests
hatch run lint          # ruff linting
hatch run format        # ruff formatting
hatch run mergekitty-yaml examples/bio-merge.yml ./bio-merge --cuda

Citation

If you use mergekitty in research, please cite the original mergekit paper:

@inproceedings{goddard-etal-2024-arcees,
    title = "Arcee{'}s {M}erge{K}it: A Toolkit for Merging Large Language Models",
    author = "Goddard, Charles  and
      Siriwardhana, Shamane  and
      Ehghaghi, Malikeh  and
      Meyers, Luke  and
      Karpukhin, Vladimir  and
      Benedict, Brian  and
      McQuade, Mark  and
      Solawetz, Jacob",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track",
    month = nov,
    year = "2024",
    pages = "477--485",
    url = "https://aclanthology.org/2024.emnlp-industry.36",
}
