Tools for merging pre-trained large language models
mergekitty
mergekitty is a toolkit for merging pre-trained language models. It uses an out-of-core approach so you can run surprisingly complex merges on modest hardware — entirely on CPU, or with as little as 8 GB of VRAM.
What's this fork?
Forked from mergekit (originally by Charles Goddard, then maintained by Arcee.ai). The original project switched to a BSL license after a ton of community contribution, then switched back to LGPL but added a CLA that lets them relicense at will. So here we are.
What changed?
A few things changed from upstream mergekit:
- All names/imports/scripts renamed to `mergekitty` (find-replace and you're good)
- VLM support with templated pre/post-weights (architecture files are incompatible with mergekit's)
- `tokenizer_source` now defaults to `"base"`; legacy tokenizer copying is gone
- `nuslerp` → `slerp` (old slerp removed). Supports both `t` (SLERP) and `weight` (NuSLERP) params
- `bakllama`, `mergekit-legacy`, and `mergekit-evolve` removed
- LoRA merging script via `mergekitty-merge-lora`
- Switched to `ruff` for formatting/linting and `hatch` for builds
Why merge models?
Model merging is chaos magick. Done right, the result comes out better than any of its inputs. It keeps happening, and nobody fully understands why. Ship it.
Features
- Works with Llama 3, Qwen 3 (Dense & MoE), Mistral, GLM4, GPT-NeoX, BERT, and more
- Tons of merge methods — arguably too many
- GPU or CPU — your call
- Lazy tensor loading for low memory use
- Interpolated gradient parameters for fine control
- Layer-stacking / "Frankenmerging" (à la Goliath, Midnight Miqu)
- MoE merging and LoRA extraction
Install
```sh
# recommended — isolated tool install
uv tool install mergekitty

# or just pip
pip install mergekitty

# from source
git clone https://github.com/allura-org/mergekitty.git
cd mergekitty
pip install -e .
```
Usage
```sh
mergekitty-yaml path/to/config.yml ./output-model [--cuda] [--lazy-unpickle] [--allow-crimes]
```
Run `mergekitty-yaml --help` for the full list of options.
Sharing on Huggingface
mergekitty generates a README.md for your merge. Edit it, keep it as-is, whatever — then upload:
```sh
huggingface-cli login
huggingface-cli upload your_username/my-cool-model ./output-model .
```
Merge Configuration
Configs are YAML. The main fields:
| Field | Description |
|---|---|
| `merge_method` | Which algorithm to use (see below) |
| `slices` / `models` | Input model definitions (mutually exclusive) |
| `base_model` | Base model, for methods that need one |
| `parameters` | Weights, densities, etc. — specifiable at multiple levels |
| `dtype` | Data type for the merge |
| `tokenizer` | Vocabulary and embedding configuration |
| `chat_template` | Override the output chat template |
Parameters
Parameters (weight, density, etc.) can be set at four levels, most-specific wins:
- `slices.*.sources.parameters` — per input slice
- `slices.*.parameters` — per output slice
- `models.*.parameters` — per input model
- `parameters` — global fallback
Values can be scalars or interpolated gradients (a list of floats for smooth transitions across layers).
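As a sketch, with hypothetical model names: one model gets a layer-wise gradient, the other a scalar, and `normalize` is set once globally.

```yaml
models:
  - model: org/model-a
    parameters:
      weight: [0.0, 0.5, 1.0]   # interpolated gradient across layers
  - model: org/model-b
    parameters:
      weight: 0.5               # plain scalar
merge_method: linear
parameters:
  normalize: true               # global fallback for anything not set above
dtype: bfloat16
```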
Tokenizer
Use the `tokenizer` field for full control, or `tokenizer_source` for the simple legacy behavior.
```yaml
tokenizer:
  source: union              # "union", "base", or a model path
  tokens:                    # optional: per-token embedding overrides
    :
      source: "chatml_model"
    <|start_header_id|>:
      source: "llama3_model"
      force: true
  pad_to_multiple_of: null
```
Defaults are sensible: if the base model has a token, its embedding wins; a token found in only one model uses that model's embedding; anything else gets averaged across the models that have it. You can override any of this per-token.
Chat Template
```yaml
chat_template: "auto"  # picks the most common template from inputs
# or: "alpaca", "chatml", "llama3", "mistral", "exaone"
# or: a raw Jinja2 template string
```
Examples
Check examples/ for real configs.
Merge Methods
| Method | `merge_method` | Multi-Model | Needs Base |
|---|---|---|---|
| Linear (Model Soups) | `linear` | ✅ | ❌ |
| SLERP | `slerp` | ✅* | ✅ |
| Nearswap | `nearswap` | ❌ | ✅ |
| Task Arithmetic | `task_arithmetic` | ✅ | ✅ |
| TIES | `ties` | ✅ | ✅ |
| DARE + TIES | `dare_ties` | ✅ | ✅ |
| DARE + Linear | `dare_linear` | ✅ | ✅ |
| Passthrough | `passthrough` | ❌ | ❌ |
| Model Breadcrumbs | `breadcrumbs` | ✅ | ✅ |
| Breadcrumbs + TIES | `breadcrumbs_ties` | ✅ | ✅ |
| Model Stock | `model_stock` | ✅ | ✅ |
| DELLA | `della` | ✅ | ✅ |
| DELLA + Linear | `della_linear` | ✅ | ✅ |
| SCE | `sce` | ✅ | ✅ |
* SLERP supports two to three models.
Linear
Weighted average. Simple, classic, effective.
- `weight` — relative weighting per tensor
- `normalize` — normalize weights across models (default: true)
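The simplest possible config, as a sketch with placeholder model names:

```yaml
models:
  - model: org/model-a
    parameters:
      weight: 0.6
  - model: org/model-b
    parameters:
      weight: 0.4
merge_method: linear
dtype: float16
```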
SLERP
Spherical interpolation. Supports `t` (classic SLERP, 0 = base, 1 = other) or `weight` (NuSLERP-style per-tensor weighting).
- `nuslerp_flatten` — treat tensor as flat vector vs. row/column-wise
- `nuslerp_row_wise` — SLERP row vectors instead of column vectors
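For example, a classic two-model SLERP over a hypothetical 32-layer model, with per-filter `t` gradients (model names are placeholders):

```yaml
slices:
  - sources:
      - model: org/base-model
        layer_range: [0, 32]
      - model: org/other-model
        layer_range: [0, 32]
merge_method: slerp
base_model: org/base-model
parameters:
  t:
    - filter: self_attn
      value: [0, 0.5, 0.3, 0.7, 1]   # gradient across layers for attention
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]   # mirrored gradient for MLP
    - value: 0.5                     # default for everything else
dtype: bfloat16
```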
Nearswap
Interpolates between the base and secondary model when their similarity drops below threshold `t`.
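A sketch, assuming one secondary model alongside the base (names are placeholders):

```yaml
models:
  - model: org/secondary-model
merge_method: nearswap
base_model: org/base-model
parameters:
  t: 0.0001   # similarity threshold
dtype: bfloat16
```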
Task Arithmetic
Subtract base model → get "task vectors" → merge them linearly → add base back. Great for models fine-tuned from a common ancestor. Also the mental model behind most of the fancier methods.
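A sketch, assuming both models were fine-tuned from the same hypothetical base:

```yaml
models:
  - model: org/finetune-a
    parameters:
      weight: 1.0
  - model: org/finetune-b
    parameters:
      weight: 0.6
merge_method: task_arithmetic
base_model: org/common-base
dtype: bfloat16
```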
TIES
Task arithmetic + sparsification + sign consensus. Lets you merge more models without them stepping on each other.
- `density` — fraction of task vector weights to keep
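Sketch with placeholder names; `weight` and `density` are set per model:

```yaml
models:
  - model: org/finetune-a
    parameters:
      weight: 0.5
      density: 0.6
  - model: org/finetune-b
    parameters:
      weight: 0.5
      density: 0.6
merge_method: ties
base_model: org/common-base
parameters:
  normalize: true
dtype: bfloat16
```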
DARE
Random pruning with rescaling, instead of TIES's magnitude-based sparsification. Works with TIES sign consensus (dare_ties) or without (dare_linear).
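Same shape as the TIES sketch above, just a different `merge_method` (placeholder names):

```yaml
models:
  - model: org/finetune-a
    parameters:
      weight: 0.5
      density: 0.5   # fraction of delta weights kept after random pruning
  - model: org/finetune-b
    parameters:
      weight: 0.5
      density: 0.5
merge_method: dare_ties   # or dare_linear to skip the sign consensus
base_model: org/common-base
dtype: bfloat16
```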
Passthrough
No-op. Passes tensors through unchanged. Useful for layer-stacking / frankenmerging where you only have one input per slice.
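For example, a hypothetical frankenmerge that stacks the first 24 layers of one model on top of layers 8–32 of another (layer counts are made up):

```yaml
slices:
  - sources:
      - model: org/model-a
        layer_range: [0, 24]
  - sources:
      - model: org/model-b
        layer_range: [8, 32]
merge_method: passthrough
dtype: float16
```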
Model Breadcrumbs
Drops both tiny and huge differences from base. Works with (breadcrumbs_ties) or without (breadcrumbs) TIES.
- `density` — fraction of weights to keep
- `gamma` — fraction of largest-magnitude differences to remove (paper's β)
- Defaults: `density: 0.9`, `gamma: 0.01`
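Sketch with placeholder names, using the default hyperparameters:

```yaml
models:
  - model: org/finetune-a
    parameters:
      weight: 0.5
      density: 0.9
      gamma: 0.01
  - model: org/finetune-b
    parameters:
      weight: 0.5
      density: 0.9
      gamma: 0.01
merge_method: breadcrumbs_ties   # or breadcrumbs to skip sign consensus
base_model: org/common-base
dtype: bfloat16
```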
Model Stock
Geometric trick to compute good linear weights. Needs at least three models including a base.
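Sketch (placeholder names); no per-model parameters needed:

```yaml
models:
  - model: org/finetune-a
  - model: org/finetune-b
  - model: org/finetune-c
merge_method: model_stock
base_model: org/common-base
dtype: bfloat16
```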
DELLA
Adaptive pruning based on magnitude ranking — keeps important changes, drops the rest. Like DARE but smarter about what it prunes.
- `density` — fraction of weights to keep
- `epsilon` — spread of drop probabilities (range: `density ± epsilon`)
- `lambda` — scaling factor for merged deltas
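Sketch with placeholder names; here `lambda` sits at the global level:

```yaml
models:
  - model: org/finetune-a
    parameters:
      weight: 0.6
      density: 0.5
      epsilon: 0.1   # drop probabilities range over density ± epsilon
  - model: org/finetune-b
    parameters:
      weight: 0.4
      density: 0.5
      epsilon: 0.1
merge_method: della
base_model: org/common-base
parameters:
  lambda: 1.0        # scaling for the merged deltas
dtype: bfloat16
```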
SCE
Selects high-variance elements, computes matrix-level weights, erases minority contributions.
- `select_topk` — fraction of high-variance elements to retain
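Sketch (placeholder names):

```yaml
models:
  - model: org/finetune-a
  - model: org/finetune-b
merge_method: sce
base_model: org/common-base
parameters:
  select_topk: 0.1   # keep the top 10% highest-variance elements
dtype: bfloat16
```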
LoRA Extraction
Extract PEFT-compatible LoRA adapters from finetuned models:
```sh
mergekitty-extract-lora finetuned_model base_model output_path --rank=32
```
MoE Merging
Merge dense models into a Mixture of Experts with `mergekitty-moe`. See the MoE docs.
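The authoritative schema is in the MoE docs; as a rough sketch, assuming the config format carried over from upstream mergekit's MoE tooling (model names and prompts are placeholders):

```yaml
base_model: org/dense-base            # supplies attention and shared weights
gate_mode: hidden                     # how router weights get initialized
dtype: bfloat16
experts:
  - source_model: org/dense-code-model
    positive_prompts:
      - "write a python function"
  - source_model: org/dense-chat-model
    positive_prompts:
      - "hold a friendly conversation"
```

Check `mergekitty-moe --help` (or the MoE docs) for the exact invocation.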
Development
Uses Hatch + uv:
```sh
uv tool install hatch

hatch test           # run tests
hatch run lint       # ruff linting
hatch run format     # ruff formatting
hatch run mergekitty-yaml examples/bio-merge.yml ./bio-merge --cuda
```
Citation
If you use mergekitty in research, please cite the original mergekit paper:
```bibtex
@inproceedings{goddard-etal-2024-arcees,
    title = "Arcee{'}s {M}erge{K}it: A Toolkit for Merging Large Language Models",
    author = "Goddard, Charles and
      Siriwardhana, Shamane and
      Ehghaghi, Malikeh and
      Meyers, Luke and
      Karpukhin, Vladimir and
      Benedict, Brian and
      McQuade, Mark and
      Solawetz, Jacob",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track",
    month = nov,
    year = "2024",
    pages = "477--485",
    url = "https://aclanthology.org/2024.emnlp-industry.36",
}
```