
mergekitty

mergekitty is a toolkit for merging pre-trained language models. It uses an out-of-core approach so you can run surprisingly complex merges on modest hardware — entirely on CPU, or with as little as 8 GB of VRAM.

What's this fork?

Forked from mergekit (originally by Charles Goddard, then maintained by Arcee.ai). The original project switched to the Business Source License after a ton of community contribution, then switched back to LGPL but added a CLA that lets them relicense at will. So here we are.

What changed?

A few things from upstream mergekit:

  • All names/imports/scripts renamed to mergekitty (find-replace and you're good)
  • VLM support with templated pre/post-weights (architecture files are incompatible with mergekit's)
  • tokenizer_source now defaults to "base"; legacy tokenizer copying is gone
  • nuslerp is now slerp (the old slerp implementation is removed); supports both t (SLERP) and weight (NuSLERP) params
  • bakllama, mergekit-legacy, and mergekit-evolve removed
  • LoRA merging script via mergekitty-merge-lora
  • Switched to ruff for formatting/linting and hatch for builds

Why merge models?

Model merging is chaos magick. Done right, the result is better than any of its inputs. It's been proven repeatedly and nobody fully understands why. Ship it.

Features

  • Works with Llama 3, Qwen 3 (Dense & MoE), Mistral, GLM4, GPT-NeoX, BERT, and more
  • Tons of merge methods — arguably too many
  • GPU or CPU — your call
  • Lazy tensor loading for low memory use
  • Interpolated gradient parameters for fine control
  • Layer-stacking / "Frankenmerging" (à la Goliath, Midnight Miqu)
  • MoE merging and LoRA extraction

Install

# recommended — isolated tool install
uv tool install mergekitty

# or just pip
pip install mergekitty

# from source
git clone https://github.com/allura-org/mergekitty.git
cd mergekitty
pip install -e .

Usage

mergekitty-yaml path/to/config.yml ./output-model [--cuda] [--lazy-unpickle] [--allow-crimes]

Run mergekitty-yaml --help for the full list of options.

Sharing on Huggingface

mergekitty generates a README.md for your merge. Edit it, keep it as-is, whatever — then upload:

huggingface-cli login
huggingface-cli upload your_username/my-cool-model ./output-model .

Merge Configuration

Configs are YAML. The main fields:

| Field | Description |
| --- | --- |
| merge_method | Which algorithm to use (see below) |
| slices / models | Input model definitions (mutually exclusive) |
| base_model | Base model, for methods that need one |
| parameters | Weights, densities, etc., specifiable at multiple levels |
| dtype | Data type for the merge |
| tokenizer | Vocabulary and embedding configuration |
| chat_template | Override the output chat template |
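
For orientation, here's a sketch of a complete config exercising most of these fields. Model names are placeholders, and the layout assumes upstream mergekit's schema:

merge_method: ties
base_model: org/base-model        # hypothetical paths throughout
models:
  - model: org/finetune-a
    parameters:
      weight: 0.5
      density: 0.6
  - model: org/finetune-b
    parameters:
      weight: 0.5
      density: 0.6
dtype: bfloat16
tokenizer:
  source: base
chat_template: auto

Save it as config.yml and point mergekitty-yaml at it, as in the Usage section above.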

Parameters

Parameters (weight, density, etc.) can be set at four levels, most-specific wins:

  1. slices.*.sources.parameters — per input slice
  2. slices.*.parameters — per output slice
  3. models.*.parameters — per input model
  4. parameters — global fallback

Values can be scalars or interpolated gradients (a list of floats for smooth transitions across layers).
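
A minimal sketch of both forms, with hypothetical model paths; the global value applies wherever nothing more specific is set:

merge_method: linear
models:
  - model: org/model-a            # hypothetical paths
    parameters:
      weight: [0, 0.3, 0.7, 1]    # gradient: interpolated smoothly across layers
  - model: org/model-b
parameters:
  weight: 0.5                     # global fallback, used for model-b
dtype: float16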

Tokenizer

Use the tokenizer field for full control, or tokenizer_source for the simple legacy behavior.

tokenizer:
  source: union          # "union", "base", or a model path
  tokens:                # optional: per-token embedding overrides
    <|im_start|>:
      source: "chatml_model"
    <|start_header_id|>:
      source: "llama3_model"
      force: true
  pad_to_multiple_of: null

Defaults are sensible: if a token exists in the base model, its embedding wins; a token present in only one model uses that model's embedding; otherwise embeddings are averaged across models. You can override any of this per-token.

Chat Template

chat_template: "auto"    # picks the most common template from inputs
# or: "alpaca", "chatml", "llama3", "mistral", "exaone"
# or: a raw Jinja2 template string

Examples

Check examples/ for real configs.

Merge Methods

| Method | merge_method | Multi-Model | Needs Base |
| --- | --- | --- | --- |
| Linear (Model Soups) | linear | ✅ | ❌ |
| SLERP | slerp | ✅* | ✅ |
| Nearswap | nearswap | ❌ | ✅ |
| Task Arithmetic | task_arithmetic | ✅ | ✅ |
| TIES | ties | ✅ | ✅ |
| DARE + TIES | dare_ties | ✅ | ✅ |
| DARE + Linear | dare_linear | ✅ | ✅ |
| Passthrough | passthrough | ❌ | ❌ |
| Model Breadcrumbs | breadcrumbs | ✅ | ✅ |
| Breadcrumbs + TIES | breadcrumbs_ties | ✅ | ✅ |
| Model Stock | model_stock | ✅ | ✅ |
| DELLA | della | ✅ | ✅ |
| DELLA + Linear | della_linear | ✅ | ✅ |
| SCE | sce | ✅ | ✅ |

* SLERP supports two to three models.

Linear

Weighted average. Simple, classic, effective.

  • weight — relative weighting per tensor
  • normalize — normalize weights across models (default: true)
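
A minimal linear soup, with hypothetical model paths:

merge_method: linear
models:
  - model: org/model-a            # hypothetical paths
    parameters:
      weight: 0.6
  - model: org/model-b
    parameters:
      weight: 0.4
parameters:
  normalize: true
dtype: float16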

SLERP

Spherical interpolation. Supports t (classic SLERP, 0 = base, 1 = other) or weight (NuSLERP-style per-tensor weighting).

  • nuslerp_flatten — treat tensor as flat vector vs. row/column-wise
  • nuslerp_row_wise — SLERP row vectors instead of column vectors
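
A sketch of the classic t form, interpolating between two hypothetical models; the gradient sweeps t across layers:

merge_method: slerp
base_model: org/model-a           # hypothetical paths
slices:
  - sources:
      - model: org/model-a
        layer_range: [0, 32]
      - model: org/model-b
        layer_range: [0, 32]
parameters:
  t: [0, 0.5, 1, 0.5, 0]          # 0 = base, 1 = other
dtype: bfloat16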

Nearswap

Interpolates between base and secondary model when similarity drops below threshold t.
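
A sketch with hypothetical paths; the right scale for t depends on your models, and small values are typical:

merge_method: nearswap
base_model: org/base-model        # hypothetical paths
models:
  - model: org/secondary-model
parameters:
  t: 0.0001                       # similarity threshold
dtype: float16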

Task Arithmetic

Subtract base model → get "task vectors" → merge them linearly → add base back. Great for models fine-tuned from a common ancestor. Also the mental model behind most of the fancier methods.
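
A sketch, assuming two finetunes of the same hypothetical base:

merge_method: task_arithmetic
base_model: org/base-model        # the common ancestor (hypothetical paths)
models:
  - model: org/finetune-a
    parameters:
      weight: 0.7
  - model: org/finetune-b
    parameters:
      weight: 0.3
dtype: float16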

TIES

Task arithmetic + sparsification + sign consensus. Lets you merge more models without them stepping on each other.

  • density — fraction of task vector weights to keep
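
A sketch with hypothetical paths, keeping half of each task vector:

merge_method: ties
base_model: org/base-model        # hypothetical paths
models:
  - model: org/finetune-a
    parameters:
      weight: 0.5
      density: 0.5                # keep 50% of the task vector
  - model: org/finetune-b
    parameters:
      weight: 0.5
      density: 0.5
parameters:
  normalize: true
dtype: float16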

DARE

Random pruning with rescaling, instead of TIES's magnitude-based sparsification. Works with TIES sign consensus (dare_ties) or without (dare_linear).
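
Same shape as a TIES config, just a different method name (hypothetical paths):

merge_method: dare_ties
base_model: org/base-model        # hypothetical paths
models:
  - model: org/finetune-a
    parameters:
      weight: 0.5
      density: 0.4                # fraction surviving random pruning
  - model: org/finetune-b
    parameters:
      weight: 0.5
      density: 0.4
dtype: float16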

Passthrough

No-op. Passes tensors through unchanged. Useful for layer-stacking / frankenmerging where you only have one input per slice.
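
A frankenmerge sketch, stacking slices of two hypothetical 32-layer models into one taller model:

merge_method: passthrough
slices:
  - sources:
      - model: org/model-a        # hypothetical paths
        layer_range: [0, 24]
  - sources:
      - model: org/model-b
        layer_range: [8, 32]
dtype: float16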

Model Breadcrumbs

Drops both tiny and huge differences from base. Works with (breadcrumbs_ties) or without (breadcrumbs) TIES.

  • density — fraction of weights to keep
  • gamma — fraction of largest-magnitude differences to remove (paper's β)
  • Defaults: density: 0.9, gamma: 0.01
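
A sketch with hypothetical paths, spelling out the defaults:

merge_method: breadcrumbs_ties
base_model: org/base-model        # hypothetical paths
models:
  - model: org/finetune-a
    parameters:
      weight: 0.5
      density: 0.9                # keep 90% of weights
      gamma: 0.01                 # drop the top 1% largest differences
  - model: org/finetune-b
    parameters:
      weight: 0.5
      density: 0.9
      gamma: 0.01
dtype: float16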

Model Stock

Geometric trick to compute good linear weights. Needs at least three models including a base.
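
A sketch with three hypothetical finetunes of a common base; no per-model weights needed, the method computes them:

merge_method: model_stock
base_model: org/base-model        # hypothetical paths
models:
  - model: org/finetune-a
  - model: org/finetune-b
  - model: org/finetune-c
dtype: float16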

DELLA

Adaptive pruning based on magnitude ranking — keeps important changes, drops the rest. Like DARE but smarter about what it prunes.

  • density — fraction of weights to keep
  • epsilon — spread of drop probabilities (range: density ± epsilon)
  • lambda — scaling factor for merged deltas
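
A sketch with hypothetical paths:

merge_method: della
base_model: org/base-model        # hypothetical paths
models:
  - model: org/finetune-a
    parameters:
      weight: 0.5
      density: 0.6
  - model: org/finetune-b
    parameters:
      weight: 0.5
      density: 0.6
parameters:
  epsilon: 0.05                   # drop probabilities span density ± 0.05
  lambda: 1.0                     # no extra rescaling of merged deltas
dtype: float16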

SCE

Selects high-variance elements, computes matrix-level weights, erases minority contributions.

  • select_topk — fraction of high-variance elements to retain
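
A sketch with hypothetical paths:

merge_method: sce
base_model: org/base-model        # hypothetical paths
models:
  - model: org/finetune-a
  - model: org/finetune-b
parameters:
  select_topk: 0.1                # retain the top 10% highest-variance elements
dtype: float16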

LoRA Extraction

Extract PEFT-compatible LoRA adapters from finetuned models:

mergekitty-extract-lora finetuned_model base_model output_path --rank=32

MoE Merging

Merge dense models into a Mixture of Experts with mergekitty-moe. See the MoE docs.
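
As a sketch only: upstream mergekit-moe configs pair a base model with experts selected by prompt affinity, and presumably that schema carries over here, but defer to the MoE docs for the authoritative format:

base_model: org/base-model        # hypothetical paths
gate_mode: hidden                 # route via hidden-state affinity to the prompts
dtype: bfloat16
experts:
  - source_model: org/code-finetune
    positive_prompts:
      - "Write a Python function"
  - source_model: org/story-finetune
    positive_prompts:
      - "Tell me a story"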

Development

Uses Hatch + uv:

uv tool install hatch
hatch test              # run tests
hatch run lint          # ruff linting
hatch run format        # ruff formatting
hatch run mergekitty-yaml examples/bio-merge.yml ./bio-merge --cuda

Citation

If you use mergekitty in research, please cite the original mergekit paper:

@inproceedings{goddard-etal-2024-arcees,
    title = "Arcee{'}s {M}erge{K}it: A Toolkit for Merging Large Language Models",
    author = "Goddard, Charles  and
      Siriwardhana, Shamane  and
      Ehghaghi, Malikeh  and
      Meyers, Luke  and
      Karpukhin, Vladimir  and
      Benedict, Brian  and
      McQuade, Mark  and
      Solawetz, Jacob",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track",
    month = nov,
    year = "2024",
    pages = "477--485",
    url = "https://aclanthology.org/2024.emnlp-industry.36",
}
