
Run and export HT-Demucs / Demucs music source separation as ONNX. Pure numpy + onnxruntime inference (no PyTorch). Karaoke / acapella CLI, auto-resampling, auto execution provider routing, fp16-weight downloads, MP3 output. Fixes the 4 blockers that prevent vanilla torch.onnx.export from working on htdemucs.


demucs-onnx

PyPI Python License: MIT

The canonical way to run and export HT-Demucs / Demucs music source separation as ONNX. Pure numpy + onnxruntime at inference (no PyTorch), and a one-liner export pipeline that fixes the four known blockers in torch.onnx.export. Powers the StemSplit production stack.

pip install 'demucs-onnx[mp3]'

# One command -> karaoke instrumental as a shareable MP3.
demucs-onnx separate song.mp3 out/ --karaoke --mp3
# writes out/karaoke.mp3  (drums + bass + other, vocals removed)

# Or every stem, automatically picking the best GPU on this host.
demucs-onnx separate song.mp3 out/
# writes out/drums.wav out/bass.wav out/other.wav out/vocals.wav

That's the whole thing. Models auto-download from the Hugging Face Hub on first run and are cached forever. Inputs at any sample rate (48 kHz, 22 kHz, mono, anything) work transparently — we resample in for inference and resample back out so the file you get matches the file you put in.


Why this package exists

For the entire history of the demucs repo (2021 – 2026) nobody on PyPI has shipped working ONNX export tooling for HT-Demucs. Searching GitHub turns up half a dozen abandoned forks, all stuck on one of four blockers, all without a working .onnx file to show for it. The official demucs README has no mention of ONNX.

We solved it. This package ships:

  1. A pure-numpy + onnxruntime inference path that runs the official HT-Demucs FT models with no PyTorch dependency. Install footprint drops from ~2 GB (PyTorch) to ~50 MB (onnxruntime).
  2. A one-call export pipeline, export_to_onnx("htdemucs_ft", ...), that applies all four patches, parity-checks the output against PyTorch fp32, and only writes the file if max abs diff < 1e-3.
  3. The same patches as independent, grep-able modules (stft.py, mha.py, pos_embed.py, segment.py) so you can debug your own exports of related architectures.

The exported models are also mirrored as five Hugging Face repos under StemSplitio for direct download.

Want to … | Use this
Run htdemucs_ft on CPU / mobile / web with no PyTorch | from demucs_onnx import separate
Convert your own demucs checkpoint to ONNX | from demucs_onnx.export import export_to_onnx
Skip the infrastructure entirely | The hosted StemSplit API

What's new in v0.2.0 — the UX bundle

  • 🎤 --karaoke shortcut — one flag, instant karaoke instrumental (sum of drums/bass/other, vocals removed).
  • 🔀 --mix-stems vocals,drums — write a single file that's the sum of whichever stems you list. Great for "vocals + drums only" remix beds, acapella + drums tracks, etc.
  • 🎧 --mp3 output with --bitrate 192k (32-320 kbps). Powered by the tiny lameenc wheel — no ffmpeg required.
  • Auto execution-provider routing: providers="auto" (the new default) picks CoreML on macOS arm64, CUDA on Linux+NVIDIA, DML on Windows DX12, CPU otherwise. No more --provider coreml boilerplate.
  • 🪶 fp16-weight downloads with --small / precision="fp16weights": 166 MB per model instead of 316 MB (1.91× smaller). Same runtime memory and latency, max abs diff vs fp32 is ~6e-5.
  • 🎚️ Auto-resampling: any sample rate input (8 kHz to 192 kHz, mono or multi-channel) is transparently resampled to 44.1 kHz for inference and back to the input rate before writing.
  • 📊 Progress bar via tqdm when stdout is a TTY (--quiet to silence everything, --verbose for the old chunk-by-chunk log).

See CHANGELOG.md for the full diff vs v0.1.0.
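
The --karaoke and --mix-stems options above are described as a plain sum of the selected stems. A minimal numpy sketch of that mix-down, assuming stems shaped (channels, samples) as separate() returns them. The helper name mix_down and the final clip to [-1, 1] are illustrative assumptions, not the package's own code:

```python
import numpy as np

def mix_down(stems, names):
    """Sum the named stems sample by sample (hypothetical helper)."""
    missing = [n for n in names if n not in stems]
    if missing:
        raise KeyError(f"unknown stems: {missing}")
    mix = np.sum([stems[n] for n in names], axis=0, dtype=np.float32)
    # Clipping is an assumption of this sketch; it only guards against
    # overshoot before the mix is written to a fixed-point format.
    return np.clip(mix, -1.0, 1.0)

# Toy stems: 2 channels, 4 samples each.
stems = {
    "drums": np.full((2, 4), 0.1, dtype=np.float32),
    "bass": np.full((2, 4), 0.2, dtype=np.float32),
    "other": np.full((2, 4), 0.3, dtype=np.float32),
    "vocals": np.full((2, 4), 0.4, dtype=np.float32),
}
karaoke = mix_down(stems, ("drums", "bass", "other"))  # vocals left out
```

Passing ("vocals", "drums") instead would give the "vocals + drums only" bed mentioned for --mix-stems.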


Comparison vs alternatives

Project | Working ONNX export? | Working ONNX inference? | PyPI?
demucs-onnx (this) | Yes, parity-verified to 1.6e-4 | Yes, no torch needed | Yes
facebookresearch/demucs | No (4 blockers, see below) | n/a | Yes (PyTorch only)
lstm-mode/demucs-onnx (GH fork) | Stuck on STFT complex blocker | n/a | No
Various Stack Overflow gists | Each stuck on one of the 4 blockers | n/a | No
mvsep / Audio Separator GUIs | Use bundled MDX/UVR ONNX, not htdemucs | Yes for MDX, not htdemucs | n/a

If you find a comparable working solution published after this package, please open an issue so we can update this table.


Quick start

Install

pip install demucs-onnx                # inference only — onnxruntime + numpy + soundfile + soxr
pip install "demucs-onnx[mp3]"         # adds the lameenc encoder for --mp3 output
pip install "demucs-onnx[export]"      # adds torch + demucs for the export pipeline

Separate (Python)

from demucs_onnx import separate

# Full 4-stem bag (default). Auto-downloads from HF on first run, auto
# picks the best execution provider for this host (CoreML / CUDA / DML).
stems = separate("song.mp3")
# stems: {"drums": ndarray (2, S), "bass": ..., "other": ..., "vocals": ...}

# Just one stem — 4× faster, 75% less RAM, model size 316 MB instead of 1.26 GB.
from demucs_onnx import separate_stem
vocals = separate_stem("song.mp3", "vocals")

# Smaller download (166 MB per stem instead of 316 MB), no runtime cost.
stems = separate("song.mp3", precision="fp16weights")

# Write straight to MP3, including a karaoke instrumental mix.
separate(
    "song.mp3", "stems/",
    output_format="mp3", bitrate_kbps=192,
    mix_stems=("drums", "bass", "other"), mix_output_name="karaoke",
)

Separate (CLI)

# Killer feature — one command -> karaoke.mp3 ready to share.
demucs-onnx separate song.mp3 stems/ --karaoke --mp3

# All 4 stems, auto provider (CoreML on macOS, CUDA on Linux, etc).
demucs-onnx separate song.mp3 stems/

# Single specialist mode — 4x faster than the bag.
demucs-onnx separate song.mp3 stems/ --stem vocals

# Smaller download (1.91x), same runtime cost.
demucs-onnx separate song.mp3 stems/ --small

# Custom mix-down: write one file that's vocals + drums only.
demucs-onnx separate song.mp3 stems/ --mix-stems vocals,drums --mp3

# Explicit provider override (auto is the default).
demucs-onnx separate song.mp3 stems/ --providers coreml
demucs-onnx separate song.mp3 stems/ --providers cuda
demucs-onnx separate song.mp3 stems/ --providers dml

demucs-onnx list-models

Export (Python)

from demucs_onnx.export import export_to_onnx

# Export every specialist of htdemucs_ft into out/ as 4 .onnx files.
paths = export_to_onnx("htdemucs_ft", "out/")
# paths == {"drums": Path("out/htdemucs_ft_drums.onnx"), "bass": ..., ...}

# Export just the vocals specialist to a single file.
export_to_onnx("htdemucs_ft", "vocals.onnx", stem="vocals")

# Export your own fine-tuned checkpoint.
from pathlib import Path
export_to_onnx(Path("my_finetune.th"), "my_finetune.onnx")

Export (CLI)

demucs-onnx export htdemucs_ft out/                    # all 4 specialists
demucs-onnx export htdemucs_ft drums.onnx --stem drums # one stem -> single file
demucs-onnx export htdemucs_ft out/ --opset 17         # change opset
demucs-onnx export htdemucs_ft out/ --no-parity-check  # advanced (don't)

Mobile / web (after exporting)

// iOS / Swift, ORT 1.17+
import Foundation
import onnxruntime_objc

// An ORTEnv must exist before any session is created.
let env = try ORTEnv(loggingLevel: ORTLoggingLevel.warning)
let opts = try ORTSessionOptions()
try opts.appendCoreMLExecutionProvider(with: ORTCoreMLExecutionProviderOptions())
let session = try ORTSession(env: env,
                              modelPath: Bundle.main.path(forResource: "htdemucs_ft_vocals",
                                                          ofType: "onnx")!,
                              sessionOptions: opts)

// Browser / web, onnxruntime-web
import * as ort from "onnxruntime-web";
const session = await ort.InferenceSession.create("htdemucs_ft_vocals.onnx", {
  executionProviders: ["wasm"],
  graphOptimizationLevel: "all",
});
// audioBuffer: Float32Array of length 1 * 2 * 343980, channel-major
// (one 7.8-s stereo segment at 44.1 kHz).
const tensor = new ort.Tensor("float32", audioBuffer, [1, 2, 343980]);
const out = await session.run({ mix: tensor });

The 4 blockers explained

These are the four things that break vanilla torch.onnx.export on HT-Demucs (PyTorch 2.4 / opset 17). Each lives in its own grep-able module so you can lift the fix into a different project.

Blocker 1 — torch.stft returns complex tensors

# demucs/htdemucs.py
z = torch.stft(x, n_fft, hop_length, return_complex=True)  # complex64 output

torch.onnx.export raises Exporting STFT does not currently support complex types. The dynamo exporter sometimes lowers it, but the resulting graph fails ORT shape inference.

Fix: demucs_onnx/export/stft.py. Replace torch.stft with a Conv1d whose kernels are precomputed sin/cos DFT bases for n_fft = 4096, hop = 1024, Hann window, normalized=True. The output is two real channels (real, imag) instead of one complex channel. The inverse is a matching ConvTranspose1d plus an overlap-add (OLA) window² envelope normalisation. The class also overrides demucs's own _spec / _ispec / _magnitude / _mask methods so the rest of the network sees (B, C, 2, F, T) real tensors throughout.

Verified to 5×10⁻⁶ max abs diff against torch.stft on real audio.
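
The idea behind this fix can be checked without PyTorch. The sketch below is a numpy analogue, not the package's RealSTFT: it builds windowed cos/sin bases, applies them to strided frames (which is what a Conv1d with kernel_size = n_fft and stride = hop computes), and compares the two real outputs against a complex FFT. The small n_fft = 64 and hop = 16, the helper names, and the absence of centre padding are simplifications assumed here.

```python
import numpy as np

def frame(x, n_fft, hop):
    """Cut a 1-D signal into overlapping frames, shape (T, n_fft)."""
    starts = range(0, len(x) - n_fft + 1, hop)
    return np.stack([x[s:s + n_fft] for s in starts])

def real_stft(x, n_fft=64, hop=16):
    """STFT using only real multiplies; returns (real, imag, window)."""
    n = np.arange(n_fft)
    k = np.arange(n_fft // 2 + 1)[:, None]
    window = 0.5 - 0.5 * np.cos(2 * np.pi * n / n_fft)   # periodic Hann
    angle = 2 * np.pi * k * n / n_fft
    scale = 1.0 / np.sqrt(n_fft)                          # normalized=True
    cos_basis = window * np.cos(angle) * scale            # real-part kernels
    sin_basis = -window * np.sin(angle) * scale           # imag-part kernels
    frames = frame(x, n_fft, hop)
    return frames @ cos_basis.T, frames @ sin_basis.T, window

rng = np.random.default_rng(0)
x = rng.standard_normal(256)
real, imag, window = real_stft(x)

# Reference: ordinary complex FFT of the same windowed frames.
reference = np.fft.rfft(frame(x, 64, 16) * window, axis=1) / np.sqrt(64)
max_err = max(np.abs(real - reference.real).max(),
              np.abs(imag - reference.imag).max())
```

Because both outputs are plain real matrices, nothing complex-valued ever has to cross the exporter.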

Blocker 2 — model.segment is a fractions.Fraction

# demucs/htdemucs.py
self.segment = Fraction(39, 5)  # = 7.8 seconds

torch._dynamo allow-lists a small set of "user-defined classes" it can trace through. Fraction is not on it (PyTorch 2.4) and graph capture crashes. The legacy exporter is more permissive but still produces a wrong graph because Fraction arithmetic is opaque to it.

Fix: demucs_onnx/export/segment.py. Coerce to float. Mathematically identical at inference, and it side-steps both exporter limitations.
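
A minimal sketch of the coercion on a stand-in object, assuming the segment length in samples is segment × sample rate at 44.1 kHz. ToyModel and coerce_segment are hypothetical names used for illustration; they are not the package's coerce_segment_to_float.

```python
from fractions import Fraction

class ToyModel:
    """Stand-in for a model object carrying a Fraction attribute."""
    segment = Fraction(39, 5)   # 7.8 seconds
    samplerate = 44100

def coerce_segment(model):
    # Replace the Fraction with a plain float so tracers only see builtins.
    model.segment = float(model.segment)
    return model

model = ToyModel()
exact_samples = int(model.segment * model.samplerate)     # Fraction arithmetic
model = coerce_segment(model)
float_samples = round(model.segment * model.samplerate)   # float arithmetic
```

Both routes give the same segment length, which is the 343980-sample input used elsewhere on this page.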

Blocker 3 — random.randrange in the transformer pos-embedding

# demucs/transformer.py
shift = random.randrange(self.sin_random_shift + 1)  # = 0 at eval

Used during training for positional-embedding augmentation. At eval, sin_random_shift = 0 so the call always returns 0, but neither the legacy exporter nor dynamo can trace through a call to random: they fail with UnsupportedOperatorError and a graph break, respectively.

Fix: demucs_onnx/export/pos_embed.py. Monkey-patch CrossTransformerEncoder._get_pos_embedding with a deterministic version that hardcodes shift = 0. Mathematically identical at inference time.
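
The monkey-patch pattern can be illustrated on a toy class. ToyEncoder and get_shift are hypothetical stand-ins, not the real CrossTransformerEncoder; the point is only that randrange(0 + 1) can return nothing but 0, so replacing the call with a constant changes no values.

```python
import random

class ToyEncoder:
    """Stand-in for a module with a randomly shifted positional embedding."""
    sin_random_shift = 0

    def get_shift(self):
        return random.randrange(self.sin_random_shift + 1)

def deterministic_shift(self):
    # Same value the original returns when sin_random_shift == 0,
    # but with no call into the random module for a tracer to trip over.
    return 0

original = {ToyEncoder().get_shift() for _ in range(100)}
ToyEncoder.get_shift = deterministic_shift      # monkey-patch the class
patched = ToyEncoder().get_shift()
```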

Blocker 4 — aten::_native_multi_head_attention has no ONNX symbolic

# torch/nn/functional.py — internally
return torch._native_multi_head_attention(...)  # fused C++ kernel

nn.MultiheadAttention dispatches to a fast fused C++ kernel when its inputs satisfy a fast-path check. The fused kernel has no ONNX symbolic: the exporter raises UnsupportedOperatorError: Exporting the operator 'aten::_native_multi_head_attention' to ONNX opset version 17 is not supported.

Fix: demucs_onnx/export/mha.py. Replace nn.MultiheadAttention.forward (per instance, via types.MethodType) with a manual scaled-dot-product attention built from Linear / bmm / softmax. The exporter handles those primitives without complaint. Output matches the fused kernel to within fp32 round-off.
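
For reference, here is what "manual scaled-dot-product attention" amounts to, written in numpy rather than PyTorch so it runs without torch. It is a sketch of the arithmetic only, assuming batch-first inputs, a packed (3E, E) input projection in q, k, v order, and no masks or dropout; it is not the package's onnx_friendly_mha_forward.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def manual_mha(query, key, value, w_in, b_in, w_out, b_out, num_heads):
    """Multi-head attention from matmul + softmax only (batch-first)."""
    batch, q_len, embed = query.shape
    head_dim = embed // num_heads
    w_q, w_k, w_v = np.split(w_in, 3, axis=0)   # unpack the (3E, E) projection
    b_q, b_k, b_v = np.split(b_in, 3)

    def heads(x, w, b):
        y = x @ w.T + b                                   # Linear
        y = y.reshape(batch, -1, num_heads, head_dim)
        return y.transpose(0, 2, 1, 3)                    # (B, H, L, D)

    q = heads(query, w_q, b_q)
    k = heads(key, w_k, b_k)
    v = heads(value, w_v, b_v)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(head_dim)   # (B, H, Lq, Lk)
    attn = softmax(scores)
    ctx = (attn @ v).transpose(0, 2, 1, 3).reshape(batch, q_len, embed)
    return ctx @ w_out.T + b_out, attn                    # output Linear

rng = np.random.default_rng(0)
E, H = 8, 2
x = rng.standard_normal((1, 5, E))      # queries
mem = rng.standard_normal((1, 7, E))    # keys / values
out, attn = manual_mha(
    x, mem, mem,
    rng.standard_normal((3 * E, E)), np.zeros(3 * E),
    rng.standard_normal((E, E)), np.zeros(E), num_heads=H,
)
```

Every step is a matmul, reshape, transpose, or softmax, all of which have standard ONNX operators.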

Net result

After all four patches, end-to-end parity vs PyTorch fp32:

Stem | max abs diff (1×2×343980 random input)
drums | 1.63 × 10⁻⁴
bass | 1.42 × 10⁻⁴
other | 1.71 × 10⁻⁴
vocals | 1.55 × 10⁻⁴

…and the ONNX graph runs in onnxruntime CPU at 1.31× the speed of PyTorch CPU on Apple M4 Pro (no GPU).


Pre-trained ONNX models on Hugging Face

We host five companion model repos. The Python package downloads from these automatically on first run; you can also fetch them by hand.

Repo | Stems | Size | Use case
StemSplitio/htdemucs-ft-onnx | all 4 | 1.26 GB | Full bag, single download
StemSplitio/htdemucs-ft-drums-onnx | drums | 316 MB | Drum extraction, beat transcription
StemSplitio/htdemucs-ft-bass-onnx | bass | 316 MB | Bassline isolation, mix rebalancing
StemSplitio/htdemucs-ft-other-onnx | other | 316 MB | Karaoke instrumental, sample-flipping
StemSplitio/htdemucs-ft-vocals-onnx | vocals | 316 MB | #1 open-source vocal SDR: vocal removal, acapella, karaoke

All five are MIT-licensed and parity-verified to < 1e-3 vs PyTorch fp32.


Performance

Real measurements on Apple M4 Pro (8-core CPU, no GPU):

Mode | Per 7.8-s segment | Per 3-min song | RTF
demucs-onnx, single specialist (CPU) | 1.59 s | ~22 s | 0.20
demucs-onnx, full bag (CPU) | 6.4 s | ~88 s | 0.49
PyTorch CPU (single specialist) | 2.09 s | ~29 s | 0.26
PyTorch MPS (full bag) | 1.0 s | ~12 s | 0.07

CUDA / DirectML / CoreML ONNX EPs are all ≥ 5× faster than the CPU EP on real GPUs — see the model card on each HF repo for hardware-specific numbers.


API

demucs_onnx.separate(input, output_dir=None, *, model="htdemucs_ft", stems=None, providers="auto", precision="fp32", cache_dir=None, token=None, verbose=False, progress=True, output_format="wav", bitrate_kbps=192, mix_stems=None, mix_output_name="mix") -> dict[str, np.ndarray]

Run separation on an audio file. Returns {stem_name: (channels, samples)} in float32 at the input file's native sample rate (we auto-resample for inference and back). If output_dir is given, also writes <stem>.wav (or .mp3) files into it; pass mix_stems=("drums","bass","other") to additionally write a single karaoke instrumental file.
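
The resample-in / resample-out contract described above can be sketched as follows. The install notes list soxr as a dependency, so the package presumably uses it; this sketch swaps in plain linear interpolation (np.interp) to stay dependency-free, and the names resample_linear and run_at_model_rate are hypothetical.

```python
import numpy as np

MODEL_RATE = 44_100  # rate the models run at, per the text above

def resample_linear(audio, src_rate, dst_rate):
    """Resample (channels, samples) audio by linear interpolation."""
    if src_rate == dst_rate:
        return audio
    n_out = int(round(audio.shape[1] * dst_rate / src_rate))
    t_in = np.arange(audio.shape[1]) / src_rate
    t_out = np.arange(n_out) / dst_rate
    return np.stack([np.interp(t_out, t_in, ch) for ch in audio]).astype(np.float32)

def run_at_model_rate(audio, rate, process):
    up = resample_linear(audio, rate, MODEL_RATE)      # resample in
    out = process(up)                                  # inference stand-in
    back = resample_linear(out, MODEL_RATE, rate)      # resample out
    return back[:, :audio.shape[1]]                    # trim to input length

rate = 48_000
t = np.arange(rate) / rate
audio = np.stack([np.sin(2 * np.pi * 440 * t)] * 2).astype(np.float32)
result = run_at_model_rate(audio, rate, process=lambda a: a)
round_trip_error = float(np.max(np.abs(result - audio)))
```

With an identity function in place of the model, the round trip returns an array of the original shape at the original rate, which is the behaviour the caller sees from separate().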

model accepts:

  • "htdemucs_ft" (default) — full 4-stem bag.
  • "htdemucs_ft_<stem>" or just "<stem>" — single specialist (drums / bass / other / vocals).

providers accepts:

  • "auto" (default, new in v0.2.0) — auto-detect the best EP for this host (CoreML / CUDA / DML / CPU).
  • A short alias ("cpu", "coreml", "cuda", "dml"), an explicit ORT provider name, or a list of either.

precision accepts "fp32" (default) or "fp16weights". The latter downloads a 166 MB variant per stem (1.91× smaller) with identical runtime memory and latency; max abs diff vs fp32 is ~6e-5.
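
One reading of "fp16weights" that fits the numbers above is that weights are stored as float16 and upcast to float32 before compute. That reading is an assumption of this sketch, not something the package documents here. Under it, storage halves per tensor while runtime arithmetic is unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
weights_fp32 = rng.uniform(-1.0, 1.0, size=10_000).astype(np.float32)

stored = weights_fp32.astype(np.float16)     # what would sit on disk
restored = stored.astype(np.float32)         # what the runtime computes with

size_ratio = weights_fp32.nbytes / stored.nbytes
max_weight_error = float(np.max(np.abs(restored - weights_fp32)))
```

For weights in [-1, 1] the rounding error per weight stays below about 2.5e-4. The reported file-size ratio of 1.91× is slightly under the 2× seen here; the source does not say why.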

demucs_onnx.auto_select_providers() -> list[str]

Return the EP list separate() would pick on this host. Useful for debugging — print it from your code if auto selects something surprising.
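
A rough sketch of the kind of routing "auto" is described as doing: CoreML on macOS arm64, CUDA on Linux, DML on Windows, CPU otherwise. The function pick_providers and its exact rules are assumptions for illustration; only the execution-provider name strings are real onnxruntime identifiers.

```python
import platform

def pick_providers(system, machine, available):
    """Choose an ordered EP list; CPU is always the final fallback."""
    preferred = []
    if system == "Darwin" and machine == "arm64":
        preferred.append("CoreMLExecutionProvider")
    elif system == "Linux":
        preferred.append("CUDAExecutionProvider")
    elif system == "Windows":
        preferred.append("DmlExecutionProvider")
    # Keep only providers this onnxruntime build actually offers.
    chosen = [p for p in preferred if p in available]
    return chosen + ["CPUExecutionProvider"]

# On a real host, `available` would come from
# onnxruntime.get_available_providers().
mac = pick_providers("Darwin", "arm64",
                     ["CoreMLExecutionProvider", "CPUExecutionProvider"])
linux_no_gpu = pick_providers("Linux", "x86_64", ["CPUExecutionProvider"])
here = pick_providers(platform.system(), platform.machine(),
                      ["CPUExecutionProvider"])
```

Keeping the CPU provider last means a session can still be created when the preferred provider is missing from the installed build.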

demucs_onnx.describe_runtime() -> dict[str, object]

Returns {system, machine, python, onnxruntime, available_providers, in_browser}. Print this if auto doesn't pick the EP you expect.

demucs_onnx.separate_stem(input, stem, output_dir=None, **kwargs) -> np.ndarray

Shorthand: run only one specialist and return the single stem as a numpy array. ~4× faster than running the full bag when you only need one stem.

demucs_onnx.separate_all(input, output_dir=None, **kwargs) -> dict[str, np.ndarray]

Shorthand for separate(..., model="htdemucs_ft").

demucs_onnx.export.export_to_onnx(checkpoint, output, *, stem=None, stems=None, opset=17, parity_check=True, parity_tolerance=1e-3, ...) -> dict[str, Path]

Convert a demucs/htdemucs PyTorch checkpoint (by name or .th path) to one or more ONNX files. Applies all four patches, runs a numerical parity check before writing, and aborts if max abs diff > tolerance.
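
The parity gate described above reduces to a max-abs-diff comparison against a tolerance. A minimal sketch, with ParityError and check_parity as hypothetical names and toy arrays standing in for the PyTorch and ONNX outputs:

```python
import numpy as np

class ParityError(RuntimeError):
    """Raised when the exported graph drifts too far from the reference."""

def check_parity(reference, candidate, tolerance=1e-3):
    """Return the max abs diff, or raise if it exceeds `tolerance`."""
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    if reference.shape != candidate.shape:
        raise ParityError(f"shape mismatch: {reference.shape} vs {candidate.shape}")
    diff = float(np.max(np.abs(reference - candidate)))
    if diff > tolerance:
        raise ParityError(f"max abs diff {diff:.3g} > {tolerance:.3g}")
    return diff

ref = np.zeros((1, 4, 2, 8), dtype=np.float32)
good = check_parity(ref, ref + 1.6e-4)      # within tolerance
try:
    check_parity(ref, ref + 5e-3)           # too far off: caller would not write
    rejected = False
except ParityError:
    rejected = True
```

Raising before any file is written is what makes a failed export leave no stale .onnx behind.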

demucs_onnx.export.patch_htdemucs_for_onnx(model) -> nn.Module

Apply all four patches in place, return the same model. Useful when you want to keep the patched model around for alternative tracers.

Individual patches

Each blocker is a single-purpose module so you can pull just one fix into a different project:

  • demucs_onnx.export.coerce_segment_to_float — Fraction → float
  • demucs_onnx.export.disable_random_pos_shift — drop random.randrange
  • demucs_onnx.export.onnx_friendly_mha_forward — manual MHA forward
  • demucs_onnx.export.RealSTFT / RealISTFT — complex STFT replacement

Skip the infrastructure — use the StemSplit API

Don't want to bundle a 316 MB model in your app, manage a GPU pool, or write overlap-add chunking? Use the StemSplit API instead — same models under the hood, hosted for you, with credits and a dashboard.


License & attribution

This package is MIT-licensed, matching the original HT-Demucs.

Please cite the original authors if you use the model in research:

@inproceedings{rouard2023hybrid,
  title     = {Hybrid Transformers for Music Source Separation},
  author    = {Rouard, Simon and Massa, Francisco and D{\'e}fossez, Alexandre},
  booktitle = {ICASSP},
  year      = {2023}
}
  • Original PyTorch model: facebookresearch/demucs
  • ONNX export, parity verification, packaging, and host inference by StemSplit
