Run and export HT-Demucs / Demucs music source separation as ONNX. Pure numpy + onnxruntime inference (no PyTorch). Fixes the 4 blockers that prevent vanilla torch.onnx.export from working on htdemucs.
demucs-onnx
The canonical way to run and export HT-Demucs / Demucs music source separation as ONNX. Pure numpy + onnxruntime at inference (no PyTorch), and a one-liner export pipeline that fixes the four known blockers in
torch.onnx.export. Powers the StemSplit production stack.
pip install demucs-onnx
demucs-onnx separate song.mp3 stems/ # writes drums/bass/other/vocals.wav
That's the whole thing. Models auto-download from the Hugging Face Hub on first run and are cached locally for later runs.
Why this package exists
For the entire history of the demucs
repo (2021 – 2026) nobody on PyPI has shipped working ONNX export
tooling for HT-Demucs. Searching GitHub turns up half a dozen abandoned
forks, all stuck on one of four blockers, all without a working .onnx
file to show for it. The official demucs README has no mention of ONNX.
We solved it. This package ships:
- A pure-numpy + onnxruntime inference path that runs the official HT-Demucs FT models with no PyTorch dependency. Install footprint drops from ~2 GB (PyTorch) to ~50 MB (onnxruntime).
- A one-call export pipeline, export_to_onnx("htdemucs_ft", ...), that applies all four patches, parity-checks the output against PyTorch fp32, and only writes the file if max abs diff < 1e-3.
- The same patches as independent, grep-able modules (stft.py, mha.py, pos_embed.py, segment.py) so you can debug your own exports of related architectures.
The exported models are also mirrored as five Hugging Face repos under StemSplitio for direct download.
| Want to … | Use this |
|---|---|
| Run htdemucs_ft on CPU / mobile / web with no PyTorch | from demucs_onnx import separate |
| Convert your own demucs checkpoint to ONNX | from demucs_onnx.export import export_to_onnx |
| Skip the infrastructure entirely | The hosted StemSplit API |
Comparison vs alternatives
| Project | Working ONNX export? | Working ONNX inference? | PyPI? |
|---|---|---|---|
| demucs-onnx (this) | Yes, parity-verified (max abs diff < 2e-4) | Yes, no torch needed | Yes |
| facebookresearch/demucs | No (4 blockers, see below) | n/a | Yes (PyTorch only) |
| lstm-mode/demucs-onnx (GH fork) | Stuck on STFT complex blocker | n/a | No |
| Various Stack Overflow gists | Each stuck on one of the 4 blockers | n/a | No |
| mvsep / Audio Separator GUIs | Use bundled MDX/UVR ONNX, not htdemucs | Yes for MDX, not htdemucs | n/a |
If you find a comparable working solution published after this package, please open an issue so we can update this table.
Quick start
Install
pip install demucs-onnx # inference only — onnxruntime + numpy + soundfile
pip install "demucs-onnx[export]" # also installs torch + demucs for the export pipeline
Separate (Python)
from demucs_onnx import separate
# Full 4-stem bag (default). Auto-downloads from HF on first run.
stems = separate("song.mp3")
# stems: {"drums": ndarray (2, S), "bass": ..., "other": ..., "vocals": ...}
# Just one stem — 4× faster, 75% less RAM, model size 316 MB instead of 1.26 GB.
from demucs_onnx import separate_stem
vocals = separate_stem("song.mp3", "vocals")
# Write the WAVs out as you separate.
separate("song.mp3", "stems/", model="htdemucs_ft", verbose=True)
Separate (CLI)
demucs-onnx separate song.mp3 stems/
demucs-onnx separate song.mp3 stems/ --stem vocals
demucs-onnx separate song.mp3 stems/ --stems drums vocals
demucs-onnx separate song.mp3 stems/ --provider coreml # macOS GPU
demucs-onnx separate song.mp3 stems/ --provider cuda # NVIDIA
demucs-onnx separate song.mp3 stems/ --provider dml # any DX12 GPU
demucs-onnx list-models
Export (Python)
from demucs_onnx.export import export_to_onnx
# Export every specialist of htdemucs_ft into out/ as 4 .onnx files.
paths = export_to_onnx("htdemucs_ft", "out/")
# paths == {"drums": Path("out/htdemucs_ft_drums.onnx"), "bass": ..., ...}
# Export just the vocals specialist to a single file.
export_to_onnx("htdemucs_ft", "vocals.onnx", stem="vocals")
# Export your own fine-tuned checkpoint.
from pathlib import Path
export_to_onnx(Path("my_finetune.th"), "my_finetune.onnx")
Export (CLI)
demucs-onnx export htdemucs_ft out/ # all 4 specialists
demucs-onnx export htdemucs_ft drums.onnx --stem drums # one stem -> single file
demucs-onnx export htdemucs_ft out/ --opset 17 # change opset
demucs-onnx export htdemucs_ft out/ --no-parity-check # advanced (don't)
Mobile / web (after exporting)
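// iOS / Swift, ORT 1.17+
import Foundation
import onnxruntime_objc
// Create the ORT environment that the session below is bound to.
let env = try ORTEnv(loggingLevel: .warning)
let opts = try ORTSessionOptions()
try opts.appendCoreMLExecutionProvider(with: ORTCoreMLExecutionProviderOptions())
// Assumes the exported model file was added to the app's main bundle.
let session = try ORTSession(env: env,
modelPath: Bundle.main.path(forResource: "htdemucs_ft_vocals",
ofType: "onnx")!,
sessionOptions: opts)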
// Browser / web, onnxruntime-web
import * as ort from "onnxruntime-web";
const session = await ort.InferenceSession.create("htdemucs_ft_vocals.onnx", {
executionProviders: ["wasm"],
graphOptimizationLevel: "all",
});
const tensor = new ort.Tensor("float32", audioBuffer, [1, 2, 343980]);
const out = await session.run({ mix: tensor });
The 4 blockers explained
These are the four things that break vanilla torch.onnx.export on
HT-Demucs (PyTorch 2.4 / opset 17). Each lives in its own grep-able
module so you can lift the fix into a different project.
Blocker 1 — torch.stft returns complex tensors
# demucs/htdemucs.py
z = torch.stft(x, n_fft, hop_length, return_complex=True) # complex64 output
torch.onnx.export raises Exporting STFT does not currently support complex types. The dynamo exporter sometimes lowers it, but the resulting
graph fails ORT shape inference.
Fix — demucs_onnx/export/stft.py.
Replace torch.stft with a Conv1d whose kernels are precomputed
sin/cos DFT bases for n_fft = 4096, hop = 1024, hann window,
normalized=True. The output is two real channels (real, imag) instead
of one complex channel. Inverse: a matching ConvTranspose1d plus an
OLA(window²) envelope normalisation. The class also overrides demucs's
own _spec / _ispec / _magnitude / _mask methods so the rest of
the network sees (B, C, 2, F, T) real tensors throughout.
Verified to 5×10⁻⁶ max abs diff against torch.stft on real audio.
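The idea behind this fix can be illustrated without PyTorch. The block below is a minimal, dependency-free sketch and not the code in stft.py: it assumes a toy n_fft = 8 and hop = 2 instead of 4096 / 1024, skips padding and the normalized=True scaling, and every function name in it is hypothetical. Each frame is multiplied by precomputed windowed cos and -sin bases (the same strided dot products a Conv1d would compute), and the two real outputs are compared against a complex DFT.

```python
import cmath
import math

def hann(n_fft):
    # Periodic Hann window.
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]

def dft_bases(n_fft):
    # One (cos, -sin) kernel pair per frequency bin 0..n_fft//2, window folded in.
    w = hann(n_fft)
    bins = range(n_fft // 2 + 1)
    cos_k = [[w[n] * math.cos(2 * math.pi * k * n / n_fft) for n in range(n_fft)]
             for k in bins]
    sin_k = [[-w[n] * math.sin(2 * math.pi * k * n / n_fft) for n in range(n_fft)]
             for k in bins]
    return cos_k, sin_k

def real_stft(x, n_fft, hop):
    # Slide over x with stride `hop` and take a dot product with every kernel,
    # as a strided Conv1d would. Returns (real, imag), each indexed [frame][bin].
    cos_k, sin_k = dft_bases(n_fft)
    real, imag = [], []
    for start in range(0, len(x) - n_fft + 1, hop):
        frame = x[start:start + n_fft]
        real.append([sum(c * s for c, s in zip(kern, frame)) for kern in cos_k])
        imag.append([sum(c * s for c, s in zip(kern, frame)) for kern in sin_k])
    return real, imag

def complex_stft(x, n_fft, hop):
    # Reference: windowed complex DFT of each frame.
    w = hann(n_fft)
    out = []
    for start in range(0, len(x) - n_fft + 1, hop):
        frame = [w[n] * x[start + n] for n in range(n_fft)]
        out.append([sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                        for n in range(n_fft))
                    for k in range(n_fft // 2 + 1)])
    return out

signal = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(32)]
re, im = real_stft(signal, n_fft=8, hop=2)
ref = complex_stft(signal, n_fft=8, hop=2)
max_diff = max(abs(complex(re[t][k], im[t][k]) - ref[t][k])
               for t in range(len(ref)) for k in range(len(ref[0])))
```

In this toy setting max_diff stays at floating-point round-off level, which is the property the export relies on: the real and imaginary channels carry the same information as the complex tensor.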
Blocker 2 — model.segment is a fractions.Fraction
# demucs/htdemucs.py
self.segment = Fraction(39, 5) # = 7.8 seconds
torch._dynamo allow-lists a small set of "user-defined classes" it can
trace through. Fraction is not on it (PyTorch 2.4) and graph capture
crashes. The legacy exporter is more permissive but still produces a
wrong graph because Fraction arithmetic is opaque to it.
Fix — demucs_onnx/export/segment.py.
Coerce to float. Mathematically identical at inference, side-steps both
exporter limitations.
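As a rough illustration of why the coercion is harmless, the sketch below uses a stand-in class rather than the real model, and a hypothetical helper name (coerce_segment_sketch, not the package's own function). It assumes the 44.1 kHz sample rate mentioned in the API section and shows that the segment length in samples comes out the same either way.

```python
from fractions import Fraction

SAMPLE_RATE = 44_100  # assumed: 44.1 kHz audio

class ToyModel:
    # Stand-in for the real model; only the attribute of interest is kept.
    def __init__(self):
        self.segment = Fraction(39, 5)

def coerce_segment_sketch(model):
    # Hypothetical helper: swap the Fraction for a plain float, in place.
    if isinstance(model.segment, Fraction):
        model.segment = float(model.segment)
    return model

model = ToyModel()
exact_samples = int(model.segment * SAMPLE_RATE)         # exact Fraction arithmetic
coerce_segment_sketch(model)
float_samples = int(round(model.segment * SAMPLE_RATE))  # float arithmetic
```

Both computations give 343980 samples, the same segment length that appears in the [1, 2, 343980] input shape used elsewhere in this README.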
Blocker 3 — random.randrange in the transformer pos-embedding
# demucs/transformer.py
shift = random.randrange(self.sin_random_shift + 1) # = 0 at eval
Used during training for positional-embedding augmentation. At eval,
sin_random_shift = 0 so the call always returns 0, but neither the
legacy exporter nor dynamo can trace through a call to random —
UnsupportedOperatorError and graph break, respectively.
Fix — demucs_onnx/export/pos_embed.py.
Monkey-patch CrossTransformerEncoder._get_pos_embedding with a
deterministic version that hardcodes shift = 0. Mathematically
identical at inference time.
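The patching pattern can be sketched on a toy class. Everything below is illustrative: ToyEncoder and its method body are invented stand-ins that only mirror the names used above, and the patch is applied to a single instance with types.MethodType, which may differ from how pos_embed.py applies it.

```python
import random
import types

class ToyEncoder:
    # Stand-in for the real encoder; the method body is invented for illustration.
    def __init__(self, sin_random_shift=0):
        self.sin_random_shift = sin_random_shift

    def _get_pos_embedding(self, length):
        # Original pattern: a Python-level random call on the forward path.
        shift = random.randrange(self.sin_random_shift + 1)
        return [t + shift for t in range(length)]

def deterministic_pos_embedding(self, length):
    # Patched pattern: the shift is a constant, so no call to random is traced.
    shift = 0
    return [t + shift for t in range(length)]

encoder = ToyEncoder()
before = encoder._get_pos_embedding(4)
encoder._get_pos_embedding = types.MethodType(deterministic_pos_embedding, encoder)
after = encoder._get_pos_embedding(4)
```

With sin_random_shift = 0, random.randrange(1) can only return 0, so before and after are the same list; the patch changes what the exporter sees, not what the model computes.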
Blocker 4 — aten::_native_multi_head_attention has no ONNX symbolic
# torch/nn/functional.py — internally
return torch._native_multi_head_attention(...) # fused C++ kernel
nn.MultiheadAttention dispatches to a fast fused C++ kernel when its
inputs satisfy a fast-path check. The fused kernel has no ONNX symbolic:
the exporter raises UnsupportedOperatorError: Exporting the operator 'aten::_native_multi_head_attention' to ONNX opset version 17 is not supported.
Fix — demucs_onnx/export/mha.py.
Replace nn.MultiheadAttention.forward (per instance, via
types.MethodType) with a manual scaled-dot-product attention built
from Linear / bmm / softmax. The exporter handles those primitives
without complaint. Output matches the fused kernel to within
fp32 round-off.
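For reference, the core computation that replaces the fused kernel is softmax(Q Kᵀ / √d) V. The sketch below is a single-head, pure-Python illustration of that formula only; it omits the input and output Linear projections, multiple heads, and batching that a real nn.MultiheadAttention forward needs, and all function names are hypothetical.

```python
import math

def matmul(a, b):
    # (n x k) @ (k x m) on nested lists.
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def softmax_rows(scores):
    out = []
    for row in scores:
        m = max(row)  # subtract the row max for numerical stability
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(q, k, v):
    # softmax(Q K^T / sqrt(d)) V, built only from matmul and softmax.
    d = len(q[0])
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, transpose(k))]
    weights = softmax_rows(scores)
    return matmul(weights, v), weights

q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]  # identical keys give uniform weights
v = [[3.0, 0.0], [0.0, 6.0], [0.0, 0.0]]
out, weights = attention(q, k, v)
```

Because every key is identical, each query attends uniformly and each output row is the mean of the value rows, which makes the result easy to check by hand.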
Net result
After all four patches, end-to-end parity vs PyTorch fp32:
| Stem | max abs diff (1×2×343980 random input) |
|---|---|
| drums | 1.63 × 10⁻⁴ |
| bass | 1.42 × 10⁻⁴ |
| other | 1.71 × 10⁻⁴ |
| vocals | 1.55 × 10⁻⁴ |
…and the ONNX graph runs in onnxruntime CPU at 1.31× the speed of
PyTorch CPU on Apple M4 Pro (no GPU).
Pre-trained ONNX models on Hugging Face
We host five companion model repos. The Python package downloads from these automatically on first run; you can also fetch them by hand.
| Repo | Stems | Size | Use case |
|---|---|---|---|
| StemSplitio/htdemucs-ft-onnx | all 4 | 1.26 GB | Full bag, single download |
| StemSplitio/htdemucs-ft-drums-onnx | drums | 316 MB | Drum extraction, beat transcription |
| StemSplitio/htdemucs-ft-bass-onnx | bass | 316 MB | Bassline isolation, mix rebalancing |
| StemSplitio/htdemucs-ft-other-onnx | other | 316 MB | Karaoke instrumental, sample-flipping |
| StemSplitio/htdemucs-ft-vocals-onnx | vocals | 316 MB | #1 open-source vocal SDR — vocal removal, acapella, karaoke |
All five are MIT-licensed and parity-verified to < 1e-3 vs PyTorch fp32.
Performance
Real measurements on Apple M4 Pro (8-core CPU, no GPU):
| Mode | Per 7.8-s segment | Per 3-min song | RTF |
|---|---|---|---|
| demucs-onnx, single specialist (CPU) | 1.59 s | ~22 s | 0.20 |
| demucs-onnx, full bag (CPU) | 6.4 s | ~88 s | 0.49 |
| PyTorch CPU (single specialist) | 2.09 s | ~29 s | 0.26 |
| PyTorch MPS (full bag) | 1.0 s | ~12 s | 0.07 |
CUDA / DirectML / CoreML ONNX EPs are all ≥ 5× faster than the CPU EP on real GPUs — see the model card on each HF repo for hardware-specific numbers.
API
demucs_onnx.separate(input, output_dir=None, *, model="htdemucs_ft", stems=None, providers=None, cache_dir=None, token=None, verbose=False) -> dict[str, np.ndarray]
Run separation on an audio file. Returns {stem_name: (channels, samples)}
in float32 at 44.1 kHz. If output_dir is given, also writes
<stem>.wav files into it.
model accepts:
- "htdemucs_ft" (default): full 4-stem bag.
- "htdemucs_ft_<stem>" or just "<stem>": single specialist (drums / bass / other / vocals).
providers accepts a short alias ("cpu", "coreml", "cuda",
"dml"), an explicit ORT provider name, or a list of either.
demucs_onnx.separate_stem(input, stem, output_dir=None, **kwargs) -> np.ndarray
Shorthand: run only one specialist and return the single stem as a numpy array. ~4× faster than running the full bag when you only need one stem.
demucs_onnx.separate_all(input, output_dir=None, **kwargs) -> dict[str, np.ndarray]
Shorthand for separate(..., model="htdemucs_ft").
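The providers argument of separate is described above as accepting a short alias, an explicit ORT provider name, or a list of either. One way such a resolution step could look is sketched below. The function name resolve_providers, the CPU default, and the CPU fallback are assumptions of this sketch, not the package's documented behaviour; the provider strings on the right are the standard onnxruntime execution provider names.

```python
ALIASES = {
    "cpu": "CPUExecutionProvider",
    "coreml": "CoreMLExecutionProvider",
    "cuda": "CUDAExecutionProvider",
    "dml": "DmlExecutionProvider",
}

def resolve_providers(providers=None):
    # Hypothetical resolver: accept None, a single name, or a list of names.
    if providers is None:
        return ["CPUExecutionProvider"]
    if isinstance(providers, str):
        providers = [providers]
    # Short aliases are expanded; anything else is passed through unchanged.
    resolved = [ALIASES.get(p.lower(), p) for p in providers]
    # Assumption of this sketch: always keep the CPU provider as a last resort.
    if "CPUExecutionProvider" not in resolved:
        resolved.append("CPUExecutionProvider")
    return resolved

cuda_first = resolve_providers("cuda")
mixed = resolve_providers(["coreml", "CPUExecutionProvider"])
```

A list built this way can be handed to onnxruntime.InferenceSession(..., providers=...), which tries the providers in order.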
demucs_onnx.export.export_to_onnx(checkpoint, output, *, stem=None, stems=None, opset=17, parity_check=True, parity_tolerance=1e-3, ...) -> dict[str, Path]
Convert a demucs/htdemucs PyTorch checkpoint (by name or .th path) to
one or more ONNX files. Applies all four patches, runs a numerical
parity check before writing, and aborts if max abs diff > tolerance.
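The parity gate described here boils down to a max-abs-diff comparison against a tolerance. The sketch below shows that logic on plain Python lists; the function names and the sample values are made up for illustration, and the real pipeline compares PyTorch and onnxruntime outputs rather than hand-written numbers.

```python
def max_abs_diff(reference, candidate):
    # Largest element-wise absolute difference between two equal-length sequences.
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same length")
    return max(abs(r - c) for r, c in zip(reference, candidate))

def parity_gate(reference, candidate, tolerance=1e-3):
    # Return the measured diff, or raise so the caller never writes the file.
    diff = max_abs_diff(reference, candidate)
    if diff > tolerance:
        raise RuntimeError(f"parity check failed: {diff:.3e} > {tolerance:.1e}")
    return diff

torch_out = [0.1000, -0.2500, 0.7300]  # made-up stand-in values
onnx_out = [0.1001, -0.2499, 0.7302]
diff = parity_gate(torch_out, onnx_out)
```

Raising before any file is written means a failed export cannot leave a plausible-looking but numerically wrong .onnx file on disk.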
demucs_onnx.export.patch_htdemucs_for_onnx(model) -> nn.Module
Apply all four patches in place, return the same model. Useful when you want to keep the patched model around for alternative tracers.
Individual patches
Each blocker is a single-purpose module so you can pull just one fix into a different project:
- demucs_onnx.export.coerce_segment_to_float: Fraction → float
- demucs_onnx.export.disable_random_pos_shift: drop random.randrange
- demucs_onnx.export.onnx_friendly_mha_forward: manual MHA forward
- demucs_onnx.export.RealSTFT / RealISTFT: complex STFT replacement
Skip the infrastructure — use the StemSplit API
Don't want to bundle a 316 MB model in your app, manage a GPU pool, or write overlap-add chunking? Use the StemSplit API instead — same models under the hood, hosted for you, with credits and a dashboard.
License & attribution
This package is MIT-licensed, matching the original HT-Demucs.
Please cite the original authors if you use the model in research:
@inproceedings{rouard2023hybrid,
title = {Hybrid Transformers for Music Source Separation},
author = {Rouard, Simon and Massa, Francisco and D{\'e}fossez, Alexandre},
booktitle = {ICASSP},
year = {2023}
}
- Original PyTorch model: facebookresearch/demucs
- ONNX export, parity verification, packaging, and hosted inference by StemSplit