
Strip disfluencies (um, uh, er, ah, hmm) from spoken audio.


erm

Local CLI that strips disfluencies (um, uh, er, erm, ah, hmm, mhm, mm, uh-huh, plus any-length elongations like ummmm / uhhhhh) from recordings of English speech.

It uses faster-whisper (running the medium.en Whisper model by default — override with --model) for word-level timestamps, three audio-domain detectors that catch fillers Whisper hides, and ffmpeg for the cuts. Each splice is snapped to a local energy minimum and zero-crossing, optionally crossfaded with a length that scales with the cut size, and laid over a constant looped sample of the recording's own room tone so the noise floor stays uniform across edits.

Install

Requires Python 3.11+ and ffmpeg / ffprobe on PATH.

python3.13 -m venv .venv
source .venv/bin/activate
pip install -e '.[dev]'

Usage

# Remove fillers; output and cut-list paths are auto-generated next to the input.
erm input.wav

# Specify output explicitly.
erm input.wav -o cleaned.wav

# Inspect what would be cut without rendering.
erm input.wav --dry-run

# Validate a rendered output against its source.
erm validate input.wav cleaned.wav --cuts cuts.json

When -o / --json are omitted, the cleaned audio and the cut list are written next to the input as {stem}-cleaned-{YYYYMMDD-HHMMSS}.wav and {stem}-cuts-{YYYYMMDD-HHMMSS}.json.

How it works

  1. Transcribe. faster-whisper runs with word_timestamps=True and a verbatim-bias initial_prompt so it emits filler tokens instead of silently cleaning them up.
  2. Detect. Four passes produce candidate cut ranges:
    • Word-list match — words whose normalized text is in --fillers, including arbitrary-length elongations (e.g. ummmm matches the um stem).
    • Gap fillers — voiced regions in inter-word gaps longer than --gap-min-ms. Catches fillers Whisper drops entirely.
    • Intra-word fillers — long words whose interior splits across a silence dip into multiple voiced runs. The non-vowel run whose duration best matches the word's expected duration is treated as the real word; siblings become cuts. Catches "in, uhhhhh" that Whisper rolls into one 'in' token.
    • Overlong words — words much longer than expected_max_word_duration for their text. The trailing portion is scanned for voiced runs. Optionally pitch-confirmed (--confirm-pitch) by checking the cut region looks like a sustained filler vowel (stable spectral centroid, voiced ZCR), so we don't trim slow-but-real speech.
  3. Refine. Each cut endpoint snaps to a local RMS-energy minimum within ±--search-ms, then to the nearest zero-crossing. Refinement is clamped so it never crosses a neighboring word's timestamp.
  4. Merge. Adjacent cuts whose surviving fragment would be shorter than --merge-gap-ms are collapsed into one. A 40 ms fragment left between two cuts, for example, would be swallowed by the surrounding crossfades and heard as a blip.
  5. Render. ffmpeg atrim + acrossfade renders the kept segments. Each splice's crossfade length scales with that splice's cut size: clamp(min, cut_ms * factor, max). Crossfades are also clamped so they never reach back across a real word boundary.
  6. Room tone (optional, on by default). A quiet region of the original recording is sampled and looped under the output at --room-tone-level-db. This keeps the noise floor identical everywhere, masking the residual noise-floor mismatch at each splice.
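
To make the Refine step concrete, here is a minimal pure-Python sketch of snapping one cut endpoint: first to the quietest short frame inside the search window, then to the nearest zero-crossing, with hard limits standing in for the neighboring word timestamps. The function names, the 10 ms frame size, and the half-frame hop are assumptions made for illustration; the project's refine_boundaries may work differently.

```python
import math


def frame_rms(samples, start, size):
    """RMS energy of samples[start:start + size]."""
    frame = samples[start:start + size]
    if not frame:
        return 0.0
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def is_zero_crossing(samples, i):
    """True when the signal changes sign (or touches zero) between i-1 and i."""
    a, b = samples[i - 1], samples[i]
    return a <= 0.0 <= b or a >= 0.0 >= b


def snap_endpoint(samples, sample_rate, t, search_ms=60.0, frame_ms=10.0,
                  lo_s=0.0, hi_s=None):
    """Snap time t (seconds) to a low-energy point, then to a zero-crossing.

    lo_s / hi_s are hard limits, e.g. the neighboring words' timestamps,
    so the endpoint never slides into real speech.
    """
    n = len(samples)
    if hi_s is None:
        hi_s = n / sample_rate
    frame = max(1, int(sample_rate * frame_ms / 1000))
    radius = int(sample_rate * search_ms / 1000)
    center = int(round(t * sample_rate))
    lo = max(0, int(lo_s * sample_rate), center - radius)
    hi = min(n - 1, int(hi_s * sample_rate), center + radius)
    if lo >= hi:
        return t  # window is empty after clamping; leave the endpoint alone

    # 1. Quietest frame inside the window (frames hop by half their length).
    step = max(1, frame // 2)
    best = min(range(lo, hi + 1, step), key=lambda i: frame_rms(samples, i, frame))
    quiet = min(hi, best + frame // 2)

    # 2. Nearest zero-crossing to the middle of that frame, still inside the window.
    for offset in range(hi - lo + 1):
        for i in (quiet - offset, quiet + offset):
            if lo < i <= hi and is_zero_crossing(samples, i):
                return i / sample_rate
    return quiet / sample_rate  # no crossing found; keep the energy minimum
```

Cutting at a zero-crossing inside a quiet frame keeps the sample value on both sides of the splice close to zero, which is what avoids a click before any crossfade is applied.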

Denoising

--denoise picks how ffmpeg's afftdn denoiser is used:

| Mode | Detection sees | ffmpeg cuts from | Notes |
|---|---|---|---|
| none | original | original | No denoising. |
| pre | denoised | denoised | Cleanest splices, but detection less sensitive (denoising flattens energy/pitch signals). |
| post | original | original; output denoised at end | Full detection sensitivity; splice noise-floor mismatch smoothed afterward. |
| hybrid (default) | original | denoised | Full detection sensitivity and clean splices. Recommended. |

Tune with --denoise-nr (reduction strength dB) and --denoise-nf (noise floor dB).
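
The mode table can be read as a small lookup, sketched below. The function names and dictionary keys are hypothetical and chosen for illustration; the only external detail relied on is that ffmpeg's afftdn filter accepts nr (noise reduction, dB) and nf (noise floor, dB) options.

```python
def denoise_plan(mode: str) -> dict[str, str]:
    """Which audio each stage reads for a given --denoise mode (per the table)."""
    plans = {
        "none":   {"detect": "original", "cut": "original", "final_pass": "none"},
        "pre":    {"detect": "denoised", "cut": "denoised", "final_pass": "none"},
        "post":   {"detect": "original", "cut": "original", "final_pass": "denoise"},
        "hybrid": {"detect": "original", "cut": "denoised", "final_pass": "none"},
    }
    if mode not in plans:
        raise ValueError(f"unknown denoise mode: {mode!r}")
    return plans[mode]


def afftdn_filter(nr_db: float = 12.0, nf_db: float = -25.0) -> str:
    """Build an ffmpeg audio-filter string for afftdn from the two tuning flags."""
    return f"afftdn=nr={nr_db:g}:nf={nf_db:g}"


if __name__ == "__main__":
    print(denoise_plan("hybrid"))
    print(afftdn_filter())  # afftdn=nr=12:nf=-25
```

The resulting string is the kind of value one would pass to ffmpeg's -af option; how erm actually assembles its filter graph is not shown here.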

Flags

Detection

| Flag | Default | Notes |
|---|---|---|
| --model | medium.en | Any faster-whisper model. small.en is faster; large-v3 is more accurate. |
| --fillers | ah,er,erm,hmm,mhm,mm,uh,uh-huh,um | Comma-separated stems. Elongations matched dynamically. |
| --detect-gaps / --no-detect-gaps | on | Run gap + intra-word + overlong detectors. |
| --gap-min-ms | 350 | Minimum inter-word gap to scan for fillers. |
| --gap-min-voiced-ms / --gap-max-voiced-ms | 100 / 1500 | Voiced-run length bounds. |
| --intraword-min-ms | 550 | Minimum word length to scan internally. |
| --confirm-pitch / --no-confirm-pitch | on | Drop overlong/intra candidates that don't look like sustained filler vowels. |
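
One plausible way to match elongations dynamically is to turn each stem into a regular expression in which every letter may repeat. The sketch below assumes that approach and a simple lowercase-and-strip-punctuation normalization; it is not taken from the project's find_fillers, and the helper names are hypothetical.

```python
import re

DEFAULT_FILLERS = ("ah", "er", "erm", "hmm", "mhm", "mm", "uh", "uh-huh", "um")


def normalize(word: str) -> str:
    """Lowercase and drop everything except letters and inner hyphens."""
    return re.sub(r"[^a-z-]", "", word.lower()).strip("-")


def build_matcher(stems=DEFAULT_FILLERS) -> re.Pattern[str]:
    """Compile one pattern where each letter of each stem may repeat (um -> u+m+)."""
    alternatives = []
    for stem in stems:
        parts = [re.escape(ch) + "+" if ch.isalpha() else re.escape(ch) for ch in stem]
        alternatives.append("".join(parts))
    return re.compile("(?:" + "|".join(alternatives) + ")")


DEFAULT_MATCHER = build_matcher()


def is_filler(word: str, matcher: re.Pattern[str] = DEFAULT_MATCHER) -> bool:
    """True when the normalized word is a stem or an elongation of one."""
    return matcher.fullmatch(normalize(word)) is not None


if __name__ == "__main__":
    for token in ["Ummmm,", "uhhhhh", "Uh-huh.", "umbrella", "her"]:
        print(token, is_filler(token))
```

Using fullmatch rather than search is what keeps ordinary words such as "umbrella" or "her" from matching a stem they merely contain.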

Cuts and splices

| Flag | Default | Notes |
|---|---|---|
| --search-ms | 60 | How far each endpoint may slide to find a local energy minimum. |
| --crossfade-ms | (unset) | Force a fixed crossfade length for every splice. When unset, per-splice scaling is used. |
| --min-crossfade-ms / --max-crossfade-ms | 50 / 120 | Floor and ceiling for the per-splice crossfade scaling. |
| --crossfade-factor | 0.15 | cut_ms * factor, clamped to [min, max]. Higher = smoother but blurrier. |
| --merge-gap-ms | 120 | Merge two cuts whose surviving fragment would be shorter than this. |
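
As a worked example of the defaults in this table, the sketch below computes the per-splice crossfade length and merges cuts that would leave too short a fragment. The function names are hypothetical, and representing cuts as (start, end) pairs in seconds is an assumption.

```python
def crossfade_ms(cut_ms, factor=0.15, min_ms=50.0, max_ms=120.0, fixed_ms=None):
    """Crossfade length for one splice: clamp(min, cut_ms * factor, max).

    fixed_ms mirrors --crossfade-ms: when set, it overrides the scaling.
    """
    if fixed_ms is not None:
        return fixed_ms
    return max(min_ms, min(cut_ms * factor, max_ms))


def merge_close(cuts, merge_gap_ms=120.0):
    """Merge cuts (start_s, end_s) whose surviving gap is shorter than merge_gap_ms."""
    merged = []
    for start, end in sorted(cuts):
        if merged and (start - merged[-1][1]) * 1000 < merge_gap_ms:
            # The fragment between the two cuts is too short to keep; absorb it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    print(crossfade_ms(200))    # 200 * 0.15 = 30, raised to the 50 ms floor
    print(crossfade_ms(2000))   # 2000 * 0.15 = 300, capped at the 120 ms ceiling
    print(merge_close([(1.0, 1.3), (1.34, 1.6), (3.0, 3.2)]))
```

In the last call the 40 ms fragment between 1.30 s and 1.34 s is below the 120 ms default, so the first two cuts become a single cut from 1.0 s to 1.6 s.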

Audio cleanup

| Flag | Default | Notes |
|---|---|---|
| --denoise | hybrid | none / pre / post / hybrid (see table above). |
| --denoise-nr | 12.0 | afftdn noise reduction (dB). |
| --denoise-nf | -25.0 | afftdn noise floor (dB). |
| --room-tone / --no-room-tone | on | Loop a quiet sample of the original under the output. |
| --room-tone-level-db | -12.0 | Attenuation applied to the looped tone. -12 to -20 is usually right. |
| --room-tone-source | auto | auto finds a quiet region; otherwise START-END in seconds (e.g. 0.05-1.4). |
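
How auto picks its quiet region is not documented here. A simple way to approximate it is to slide a fixed-length window over the recording and keep the one with the least energy, as in the sketch below. The 0.5 s window length and both function names are assumptions for illustration only.

```python
def parse_room_tone_source(value: str):
    """Return None for 'auto', else (start_s, end_s) parsed from 'START-END'."""
    if value.strip().lower() == "auto":
        return None
    start_text, sep, end_text = value.partition("-")
    if not sep:
        raise ValueError(f"expected 'auto' or START-END, got {value!r}")
    start, end = float(start_text), float(end_text)
    if start < 0 or end <= start:
        raise ValueError(f"invalid room-tone range: {value!r}")
    return start, end


def quietest_window(samples, sample_rate, window_s=0.5):
    """Return (start_s, end_s) of the lowest-energy window of length window_s."""
    size = max(1, int(window_s * sample_rate))
    if len(samples) <= size:
        return 0.0, len(samples) / sample_rate
    # Sliding sum of squares: add the sample entering, drop the one leaving.
    energy = sum(x * x for x in samples[:size])
    best_energy, best_start = energy, 0
    for start in range(1, len(samples) - size + 1):
        energy += samples[start + size - 1] ** 2 - samples[start - 1] ** 2
        if energy < best_energy:
            best_energy, best_start = energy, start
    return best_start / sample_rate, (best_start + size) / sample_rate
```

A real implementation would likely also want to reject digital silence and windows that overlap detected speech, which this sketch does not attempt.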

Output

| Flag | Default | Notes |
|---|---|---|
| -o, --output | auto-named next to input | Output .wav path. |
| --json PATH | auto-named next to input | Cut list JSON. |
| --dry-run | off | Print the cut list and exit; no audio rendered. |

validate subcommand

erm validate input.wav cleaned.wav --cuts cuts.json

Runs three deterministic checks:

  • Container sanity: ffprobe reads the output without errors.
  • Duration math: output_duration ≈ input_duration - sum(cut lengths), within 50 ms.
  • No-filler invariant: re-transcribe the output; assert no token in the filler set survives.

Writes a JSON report to --report PATH (or auto-named next to the output) and exits non-zero if any check fails.
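
The duration check is plain arithmetic and can be sketched as follows. The function name, the report keys, and the representation of cuts as (start, end) pairs in seconds are assumptions; the actual cut-list JSON schema is not described here.

```python
def duration_check(input_s, output_s, cuts, tolerance_ms=50.0):
    """Check output_duration ≈ input_duration - sum(cut lengths), within tolerance."""
    removed_s = sum(end - start for start, end in cuts)
    expected_s = input_s - removed_s
    error_ms = abs(output_s - expected_s) * 1000
    return {
        "expected_s": expected_s,
        "actual_s": output_s,
        "error_ms": error_ms,
        "ok": error_ms <= tolerance_ms,
    }


if __name__ == "__main__":
    cuts = [(1.0, 1.5), (3.0, 3.5)]          # one second removed in total
    print(duration_check(10.0, 9.0, cuts))   # passes
    print(duration_check(10.0, 9.2, cuts))   # 200 ms off, fails
```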

Tests

pytest

The pure helpers (find_fillers, invert_to_keep_ranges, refine_boundaries, merge_close_cuts, expected_max_word_duration, _voiced_runs_in_region, …) run without faster-whisper or librosa imported. Heavy deps are imported lazily inside transcribe, render, load_audio_mono, and is_sustained_vowel.
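
The lazy-import pattern looks roughly like the sketch below: the heavy module is imported inside the function that needs it, so importing the surrounding module (and running tests on the pure helpers) does not load it. The function name transcribe_words and the prompt wording are hypothetical; the faster-whisper calls used are WhisperModel(...) and model.transcribe(..., word_timestamps=True, initial_prompt=...).

```python
import sys

# Hypothetical verbatim-bias prompt; the project's actual wording is not shown here.
VERBATIM_PROMPT = "Um, uh, er, hmm, so, like, you know."


def transcribe_words(path: str, model_name: str = "medium.en") -> list[dict]:
    """Return word-level timestamps; faster-whisper is imported only when called."""
    from faster_whisper import WhisperModel  # lazy: keeps module import cheap

    model = WhisperModel(model_name)
    segments, _info = model.transcribe(
        path, word_timestamps=True, initial_prompt=VERBATIM_PROMPT
    )
    words = []
    for segment in segments:
        for word in segment.words or []:
            words.append({"text": word.word.strip(), "start": word.start, "end": word.end})
    return words


if __name__ == "__main__":
    # Defining the function above does not import the heavy dependency.
    print("faster_whisper" in sys.modules)
```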

Out of scope

  • Removing like, you know, I mean — too risky for meaning.
  • Languages other than English.
  • Real-time / streaming.
