Strip disfluencies (um, uh, er, ah, hmm) from spoken audio.
erm
Local CLI that strips disfluencies (um, uh, er, erm, ah, hmm, mhm,
mm, uh-huh, plus any-length elongations like ummmm / uhhhhh) from
recordings of English speech.
It uses faster-whisper (running
the large-v3 Whisper model by default — override with --model) for
word-level timestamps, three audio-domain detectors that catch fillers Whisper
hides, and ffmpeg for the cuts. Each splice is snapped to a local energy
minimum and zero-crossing, optionally crossfaded with a length that scales
with the cut size, and laid over a constant looped sample of the recording's
own room tone so the noise floor stays uniform across edits.
Full docs at dougcalobrisi.github.io/erm (source in
docs/): usage guides for getting good results — tuning & workflow, recipes, troubleshooting — plus maintainer-facing design docs on the detection passes, render pipeline, denoise/room-tone, and transcription.
Install
Requires Python 3.11+ and ffmpeg / ffprobe on PATH.
python3.13 -m venv .venv
source .venv/bin/activate
pip install -e '.[dev]'
Transcription device (GPU vs CPU)
Transcription runs on CPU by default and needs no extra setup. If you have an
NVIDIA GPU, faster-whisper can use it — but only when the CUDA runtime libraries
(libcublas, libcudnn) are installed. A machine with an NVIDIA GPU and driver
but no CUDA runtime is the common case that produces:
RuntimeError: Library libcublas.so.12 is not found or cannot be loaded
erm handles this automatically: with the default --device auto, if the GPU
can't be loaded it prints a warning and falls back to CPU, so transcription
still completes. You have two ways to make it explicit:
- Force CPU (no warning, skips the GPU probe):
  erm input.wav --device cpu
- Enable the GPU by installing the CUDA wheels into the same environment:
  pip install nvidia-cublas-cu12 nvidia-cudnn-cu12
  faster-whisper's CUDA backend needs CUDA 12 / cuDNN 9. See the faster-whisper GPU notes for details.
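The fallback behaviour can be pictured as a try/except around model loading. The sketch below is an assumption about the general shape of that logic, not erm's actual code: load_model is a hypothetical callable standing in for whatever constructs the faster-whisper model, and the handling of an explicit --device cuda (errors propagate) is also assumed.

```python
import sys
from typing import Callable, TypeVar

T = TypeVar("T")


def load_with_fallback(load_model: Callable[[str], T], device: str = "auto") -> tuple[T, str]:
    """Return (model, device_used).

    `load_model` is a hypothetical callable that builds the model on the
    given device and raises RuntimeError when the CUDA runtime is missing.
    """
    if device == "cpu":
        # Explicit CPU: skip the GPU probe entirely, so no warning is printed.
        return load_model("cpu"), "cpu"
    if device == "cuda":
        # Explicit GPU (assumed behaviour): let any failure reach the caller.
        return load_model("cuda"), "cuda"
    # device == "auto": try the GPU first, fall back to CPU if it cannot load.
    try:
        return load_model("cuda"), "cuda"
    except RuntimeError as exc:
        print(f"warning: GPU unavailable ({exc}); falling back to CPU", file=sys.stderr)
        return load_model("cpu"), "cpu"
```

Passing the loader in as a callable keeps the fallback logic testable without faster-whisper or a GPU present.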
Usage
# Remove fillers; output and cut-list paths are auto-generated next to the input.
erm input.wav
# Specify output explicitly.
erm input.wav -o cleaned.wav
# Inspect what would be cut without rendering.
erm input.wav --dry-run
# Validate a rendered output against its source.
erm validate input.wav cleaned.wav --cuts cuts.json
When -o / --json are omitted, output paths are written next to the input as
{stem}-cleaned-{YYYYMMDD-HHMMSS}.wav and {stem}-cuts-{YYYYMMDD-HHMMSS}.json.
Use inside AI coding agents
erm ships agent guidance so an AI assistant can install, run, and tune it for you.
Claude Code / Cowork (plugin):
/plugin marketplace add dougcalobrisi/erm
/plugin install erm@erm
This adds two skills: erm (install + clean a file, asks which mode/recipe
fits) and erm-tune (diagnoses a bad result and maps the symptom to the
right knob). Ask in plain language — "install erm", "clean up this podcast", "the
splices sound smeared" — and the right skill triggers.
Other agents (Codex, Copilot, OpenCode, Cursor, Gemini CLI, pi.dev):
- A root AGENTS.md gives install/use/tune guidance, read automatically by these tools.
- The skills in skills/ are open-format Agent Skills. Copy them into your agent's skills directory: ~/.claude/skills/ (Claude, OpenCode) or ~/.agents/skills/ (Codex, OpenCode, pi.dev). Codex reads only .agents/skills/.
How it works
- Transcribe. faster-whisper runs with word_timestamps=True and a verbatim-bias initial_prompt so it emits filler tokens instead of silently cleaning them up.
- Detect. Four passes produce candidate cut ranges:
  - Word-list match: words whose normalized text is in --fillers, including arbitrary-length elongations (e.g. ummmm matches the um stem).
  - Gap fillers: voiced regions in inter-word gaps longer than --gap-min-ms. Catches fillers Whisper drops entirely.
  - Intra-word fillers: long words whose interior splits across a silence dip into multiple voiced runs. The non-vowel run whose duration best matches the word's expected duration is treated as the real word; siblings become cuts. Catches "in, uhhhhh" that Whisper rolls into one 'in' token.
  - Overlong words: words much longer than expected_max_word_duration for their text. The trailing portion is scanned for voiced runs. Optionally pitch-confirmed (--confirm-pitch) by checking the cut region looks like a sustained filler vowel (stable spectral centroid, voiced ZCR), so we don't trim slow-but-real speech.
- Refine. Each cut endpoint snaps to a local RMS-energy minimum within ±--search-ms, then to the nearest zero-crossing. Refinement is clamped so it never crosses a neighboring word's timestamp.
- Merge. Cuts whose surviving fragment would be shorter than --merge-gap-ms are collapsed into one: a 40ms surviving fragment between two cuts gets eaten by the surrounding crossfades and would otherwise blurp.
- Render. In remove mode (default), ffmpeg atrim + acrossfade renders the kept segments. Each splice's crossfade length scales with that splice's cut size: clamp(min, cut_ms * factor, max). Crossfades are also clamped so they never reach back across a real word boundary. --mode silence instead mutes the cut spans in place, preserving the original duration (see Modes).
- Room tone (optional, on by default). A quiet region of the original recording is sampled and looped under the output at --room-tone-level-db. This keeps the noise floor identical everywhere, masking the residual noise-floor mismatch at each splice.
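The Refine step is the least obvious of these, so here is a minimal pure-Python sketch of it: slide an endpoint to the quietest nearby window, then to the nearest zero-crossing, without leaving an allowed range. Everything here is illustrative and assumed: snap_endpoint, its window size, and the lo/hi clamp (standing in for neighbouring word timestamps) are not erm's actual API, and the search radius is expressed in samples rather than milliseconds.

```python
import math


def rms(samples: list[float]) -> float:
    """Root-mean-square energy of a window (0.0 for an empty window)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def snap_endpoint(samples, index, search, window=4, lo=0, hi=None):
    """Snap `index` to the quietest window within +/- `search` samples,
    then to the nearest zero-crossing, never leaving [lo, hi]."""
    hi = len(samples) - 1 if hi is None else hi
    start = max(lo, index - search)
    stop = min(hi, index + search)
    # Step 1: pick the candidate whose surrounding window has the lowest RMS.
    best = min(
        range(start, stop + 1),
        key=lambda i: rms(samples[max(0, i - window): i + window + 1]),
    )
    # Step 2: walk outward from `best` until the signal changes sign.
    for offset in range(stop - start + 1):
        for i in (best - offset, best + offset):
            if start < i <= stop and (samples[i - 1] <= 0.0) != (samples[i] <= 0.0):
                return i
    return best  # no zero-crossing in range: keep the energy minimum
```

With a sample rate sr, a --search-ms value would translate to search = int(sr * search_ms / 1000). Cutting at a zero-crossing inside a low-energy window is what keeps the splice from clicking.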
Modes
--mode chooses how detected cuts are applied to the audio:
| Mode | Timeline | What happens |
|---|---|---|
| remove (default) | shrinks | Each cut span is excised and the survivors are spliced together with crossfades. |
| silence | preserved | Each cut span is muted in place (a single ffmpeg volume pass); the output keeps the input's exact duration. |
Use silence when timing must be preserved — A/V sync, multi-track alignment
(you can't excise one mic without de-syncing the others), or caption/transcript
timestamps. It removes the sound of the filler but leaves a hole of the
original length.
silence depends on a floor in the hole. The muted spans are filled by the
room-tone overlay so the noise floor stays uniform. Muting zeroes the span and
denoising only reduces signal (it never backfills a zeroed hole), so room tone
is the only thing that restores a floor — with --mode silence --no-room-tone
the holes are bare digital silence (an audible "drop out") in any denoise mode,
and erm warns whenever room tone is off. Keep room tone on (the default) for
natural-sounding mutes.
silence mode ignores --pad-pause-factor and --min-gap-ms — those only
shape the splices that remove mode creates, and silence makes no splices.
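One way to express "mute these spans in a single volume pass" is ffmpeg's timeline editing, where a filter is only active while its enable expression is non-zero. The sketch below builds such a filter string; build_mute_filter is a hypothetical name, and erm's own filter may be constructed differently.

```python
def build_mute_filter(cuts: list[tuple[float, float]]) -> str:
    """Build an ffmpeg `volume` filter that zeroes each (start, end) span.

    `between(t, a, b)` evaluates to 1 while the timestamp t is inside the
    span, so summing the terms enables the filter inside any cut.
    """
    if not cuts:
        return "anull"  # pass-through audio filter when nothing is muted
    expr = "+".join(f"between(t,{start:.3f},{end:.3f})" for start, end in cuts)
    return f"volume=volume=0:enable='{expr}'"


print(build_mute_filter([(1.0, 1.5), (3.2, 3.45)]))
# volume=volume=0:enable='between(t,1.000,1.500)+between(t,3.200,3.450)'
```

The resulting string would be passed to ffmpeg via -af. Because no samples are removed, the output duration matches the input.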
Denoising
--denoise picks how ffmpeg's afftdn denoiser is used:
| Mode | Detection sees | ffmpeg cuts from | Notes |
|---|---|---|---|
| none | original | original | No denoising. |
| pre | denoised | denoised | Cleanest splices, but detection less sensitive (denoising flattens energy/pitch signals). |
| post | original | original; output denoised at end | Full detection sensitivity; splice noise-floor mismatch smoothed afterward. |
| hybrid (default) | original | denoised | Full detection sensitivity and clean splices. Recommended. |
Tune with --denoise-nr (reduction strength dB) and --denoise-nf (noise
floor dB).
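The table can be read as three independent switches per mode. The sketch below encodes it that way and formats the afftdn filter string; DenoisePlan, PLANS, and afftdn_filter are illustrative names, not erm's internals, while nr and nf are the afftdn options the two flags correspond to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DenoisePlan:
    detect_on_denoised: bool  # detectors analyse the denoised signal
    cut_from_denoised: bool   # ffmpeg splices the denoised signal
    denoise_output: bool      # a final denoise pass runs on the rendered output


# One row per --denoise mode, mirroring the table above.
PLANS = {
    "none": DenoisePlan(False, False, False),
    "pre": DenoisePlan(True, True, False),
    "post": DenoisePlan(False, False, True),
    "hybrid": DenoisePlan(False, True, False),
}


def afftdn_filter(nr_db: float = 12.0, nf_db: float = -25.0) -> str:
    """Format ffmpeg's afftdn filter: noise reduction and noise floor, both in dB."""
    return f"afftdn=nr={nr_db:g}:nf={nf_db:g}"


print(afftdn_filter())  # afftdn=nr=12:nf=-25
```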
Flags
Detection
| Flag | Default | Notes |
|---|---|---|
| --model | large-v3 | Any faster-whisper model. medium.en / small.en faster but less accurate. |
| --device | auto | auto / cpu / cuda. auto uses the GPU when available and falls back to CPU if the CUDA runtime can't be loaded (see Transcription device). |
| --compute-type | auto | faster-whisper compute type (e.g. int8, float16). auto lets the backend choose. |
| --fillers | ah,er,erm,hmm,mhm,mm,uh,uh-huh,um | Comma-separated stems. Elongations matched dynamically. |
| --detect-gaps / --no-detect-gaps | on | Run gap + intra-word + overlong detectors. |
| --gap-min-ms | 350 | Minimum inter-word gap to scan for fillers. |
| --gap-min-voiced-ms / --gap-max-voiced-ms | 100 / 1500 | Voiced-run length bounds. |
| --intraword-min-ms | 550 | Minimum word length to scan internally. |
| --confirm-pitch / --no-confirm-pitch | on | Drop overlong/intra candidates that don't look like sustained filler vowels. |
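
To make "elongations matched dynamically" concrete, here is one plausible way to match stems with arbitrary letter repetition: turn each stem into a regex where every letter may repeat. This is a sketch under that assumption, not erm's find_fillers; the helper names are hypothetical and the normalization (lowercase, ASCII letters only) is a guess.

```python
import re

DEFAULT_FILLERS = "ah,er,erm,hmm,mhm,mm,uh,uh-huh,um"


def normalize(token: str) -> str:
    """Lowercase and keep only ASCII letters (drops spaces, commas, hyphens)."""
    return re.sub(r"[^a-z]", "", token.lower())


def compile_fillers(fillers: str = DEFAULT_FILLERS) -> re.Pattern[str]:
    """One regex matching any stem with each of its letters repeated 1+ times."""
    alternatives = []
    for stem in fillers.split(","):
        letters = normalize(stem)
        if letters:
            # "um" becomes "u+m+", so "ummmm" and "uuum" both match.
            alternatives.append("".join(f"{re.escape(ch)}+" for ch in letters))
    return re.compile("|".join(alternatives))


def is_filler(token: str, pattern: re.Pattern[str]) -> bool:
    word = normalize(token)
    return bool(word) and pattern.fullmatch(word) is not None


pattern = compile_fillers()
print(is_filler(" Ummmm,", pattern))   # True
print(is_filler("umbrella", pattern))  # False
```

Using fullmatch rather than search is what stops ordinary words that merely start with a filler stem from being cut.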
Cuts and splices
Two independent knobs control the spacing left behind by a remove-mode cut —
they compose but do different things:
- --pad-pause-factor retains a fraction of the silence that already existed inside a cut (the bit refine snapped over). It's context-aware and never adds time: a tight mid-sentence "um" with no surrounding silence gets ~0 padding and its flanking words still butt together. Bounded per side by --pad-min-ms / --pad-max-ms and by the silence that actually exists.
- --min-gap-ms guarantees at least N ms between the two words flanking a cut, injecting silence at the splice when the natural pause is below N. This is what fixes "words too close after cutting an um." It adds a little duration when it engages. The injected silence is filled by the room-tone overlay (bare digital silence if room tone is off; erm warns).
The factor shapes how much existing pause survives; the floor puts a hard minimum under it.
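The difference between the two knobs is easiest to see as arithmetic. The following is a simplified reading of the descriptions above, not erm's pad_cuts or inject_min_gaps: both function names are hypothetical, and the exact order in which the clamps apply is an assumption.

```python
def retained_pause_ms(silence_ms: float, factor: float = 0.0,
                      pad_min_ms: float = 0.0, pad_max_ms: float = 120.0) -> float:
    """Pause kept on one side of a cut: a clamped fraction of the silence
    that already existed there, never more than that silence."""
    wanted = min(max(silence_ms * factor, pad_min_ms), pad_max_ms)
    return min(wanted, silence_ms)  # cannot retain silence that is not there


def injected_gap_ms(natural_pause_ms: float, min_gap_ms: float = 0.0) -> float:
    """Silence to insert at a splice so the flanking words sit at least
    `min_gap_ms` apart; zero when the natural pause is already long enough."""
    return max(0.0, min_gap_ms - natural_pause_ms)


print(retained_pause_ms(400.0, factor=0.5))  # 120.0 (200 ms wanted, capped at pad-max)
print(retained_pause_ms(0.0, factor=0.5))    # 0.0   (tight "um": nothing to retain)
print(injected_gap_ms(30.0, min_gap_ms=80))  # 50.0  (tops a 30 ms pause up to 80 ms)
```

Only the second function can lengthen the output, which is why the min-gap floor is the one that changes the expected duration.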
| Flag | Default | Notes |
|---|---|---|
| --mode | remove | remove (excise + splice) or silence (mute in place, duration preserved). See Modes. |
| --search-ms | 60 | How far each endpoint may slide to find a local energy minimum. |
| --crossfade-ms | (unset) | Force a fixed crossfade length for every splice. When unset, per-splice scaling is used. |
| --min-crossfade-ms / --max-crossfade-ms | 50 / 120 | Floor and ceiling for the per-splice crossfade scaling. |
| --crossfade-factor | 0.15 | cut_ms * factor, clamped to [min, max]. Higher = smoother but blurrier. |
| --merge-gap-ms | 120 | Merge two cuts whose surviving fragment would be shorter than this. |
| --pad-pause-factor | 0.0 | (remove mode) Fraction of each cut's snapped silence to retain. 0 removes the whole cut. Never adds time beyond the cut's own silence. |
| --pad-min-ms / --pad-max-ms | 0 / 120 | Lower/upper clamp on the retained pause, per side (ms). |
| --min-gap-ms | 0.0 | (remove mode) Guarantee at least this much gap between the words flanking each splice, injecting silence when the natural pause is shorter. 0 injects nothing. Mono/stereo input only (the injected silence must match the channel layout). |
Audio cleanup
| Flag | Default | Notes |
|---|---|---|
| --denoise | hybrid | none / pre / post / hybrid (see table above). |
| --denoise-nr | 12.0 | afftdn noise reduction (dB). |
| --denoise-nf | -25.0 | afftdn noise floor (dB). |
| --room-tone / --no-room-tone | on | Loop a quiet sample of the original under the output. |
| --room-tone-level-db | -12.0 | Attenuation applied to the looped tone. -12 to -20 is usually right. |
| --room-tone-source | auto | auto finds a quiet region; otherwise START-END in seconds (e.g. 0.05-1.4). |
Output
| Flag | Default | Notes |
|---|---|---|
| -o, --output | auto-named next to input | Output .wav path. |
| --json PATH | auto-named next to input | Cut list JSON. |
| --dry-run | off | Print the cut list and exit; no audio rendered. |
validate subcommand
erm validate input.wav cleaned.wav --cuts cuts.json
Runs three deterministic checks:
- Container sanity: ffprobe reads the output without errors.
- Duration math: within 50ms of the per-mode expectation, read from the cut-list JSON's mode and injected_gap_s fields (both default to remove / 0.0 when absent, so older cut lists validate unchanged):
  - remove: output ≈ input − sum(cut lengths) + injected_gap_s.
  - silence: output ≈ input (nothing is excised; cuts are muted in place).
- No-filler invariant: re-transcribe the output; assert no token in the filler set survives.
Writes a JSON report to --report PATH (or auto-named next to the output)
and exits non-zero if any check fails.
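The duration check reduces to a few lines. In this sketch the mode and injected_gap_s fields and their defaults come from the description above, but the layout of the cuts themselves (a "cuts" list of objects with "start" and "end" in seconds) is an assumption about the JSON, and the function names are hypothetical.

```python
TOLERANCE_S = 0.050  # "within 50ms"


def expected_duration_s(input_s: float, cut_list: dict) -> float:
    """Expected output duration for a cut list, per mode."""
    mode = cut_list.get("mode", "remove")  # older cut lists have no mode field
    if mode == "silence":
        return input_s  # cuts are muted in place, nothing is excised
    injected = cut_list.get("injected_gap_s", 0.0)
    # ASSUMPTION: each cut is stored as {"start": seconds, "end": seconds}.
    removed = sum(c["end"] - c["start"] for c in cut_list.get("cuts", []))
    return input_s - removed + injected


def duration_ok(input_s: float, output_s: float, cut_list: dict) -> bool:
    return abs(output_s - expected_duration_s(input_s, cut_list)) <= TOLERANCE_S


cuts = {"cuts": [{"start": 1.0, "end": 1.4}, {"start": 5.0, "end": 5.25}],
        "injected_gap_s": 0.08}
print(duration_ok(10.0, 9.45, cuts))  # True  (expected about 9.43 s)
print(duration_ok(10.0, 9.30, cuts))  # False (0.13 s short)
```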
Tests
pytest
The pure helpers (find_fillers, invert_to_keep_ranges,
refine_boundaries, merge_close_cuts, expected_max_word_duration,
_voiced_runs_in_region, …) run without faster-whisper or librosa imported.
Heavy deps are imported lazily inside transcribe, render,
load_audio_mono, and is_sustained_vowel.
The suite is split into:
- test_pure.py: pure logic, no heavy imports: filler matching, range inversion, boundary refinement, close-cut merging (merge_close_cuts), the per-word duration bound (expected_max_word_duration), room-tone region selection (find_quiet_region), the per-splice crossfade clamp (_splice_crossfade_s), pause-proportional padding (pad_cuts), min-gap injection (inject_min_gaps), and the mute filter (_mute_filter).
- test_render_modes.py: real-ffmpeg checks of render_silenced (duration preserved) and render(..., gap_inserts=...) (injected gap lands exactly). Skipped automatically when ffmpeg / ffprobe aren't on PATH.
- test_asr_fallback.py: the CUDA → CPU fallback in transcribe, with faster-whisper mocked.
- test_cli.py: argument parsing, defaults, and main() subcommand routing (remove / validate / bare-input). The pipeline handlers are monkeypatched, so nothing heavy runs.
- test_integration.py: a golden-path --dry-run over a synthesized WAV with a stubbed transcriber, wiring transcription → filler detection → refinement → range inversion → JSON. Gated on librosa (the audio loader); skipped automatically if it isn't installed.
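To give a feel for what a pure-logic test looks like, here is a self-contained sketch in the spirit of test_pure.py. The two helpers are hypothetical re-implementations that borrow the names merge_close_cuts and invert_to_keep_ranges; erm's real signatures, units, and edge-case handling may differ.

```python
Range = tuple[float, float]


def merge_close_cuts(cuts: list[Range], merge_gap_s: float) -> list[Range]:
    """Collapse neighbouring cuts when the audio surviving between them
    would be shorter than `merge_gap_s`."""
    merged: list[Range] = []
    for start, end in sorted(cuts):
        if merged and start - merged[-1][1] < merge_gap_s:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def invert_to_keep_ranges(cuts: list[Range], duration_s: float) -> list[Range]:
    """Turn cut ranges into the complementary ranges of audio to keep."""
    keep: list[Range] = []
    cursor = 0.0
    for start, end in sorted(cuts):
        if start > cursor:
            keep.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < duration_s:
        keep.append((cursor, duration_s))
    return keep


def test_short_survivor_is_merged():
    # The 40 ms fragment between the first two cuts is below the 120 ms gap.
    cuts = [(1.0, 1.5), (1.54, 2.0), (4.0, 4.5)]
    assert merge_close_cuts(cuts, 0.120) == [(1.0, 2.0), (4.0, 4.5)]


def test_keep_ranges_cover_everything_but_the_cuts():
    keep = invert_to_keep_ranges([(1.0, 2.0), (4.0, 4.5)], 6.0)
    assert keep == [(0.0, 1.0), (2.0, 4.0), (4.5, 6.0)]
```

Nothing here touches audio or a model, which is why tests of this kind run in milliseconds under pytest.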
Out of scope
- Removing "like", "you know", "I mean": too risky for meaning.
- Languages other than English.
- Real-time / streaming.