HPSS-based voice denoiser optimized for ASR preprocessing (STT, diarization, speaker embedding)
HPSS Voice Denoiser
A production-ready audio denoising pipeline optimized for ASR preprocessing (Speech-to-Text, Speaker Diarization, Voice Embedding).
Built on Harmonic-Percussive Source Separation (HPSS) with context-aware mixing to preserve voice quality while removing environmental noise.
Features
- Optimized for ASR: Preserves voice characteristics critical for STT, diarization, and speaker embedding
- Stateless Processing: Each audio chunk is processed independently (perfect for streaming)
- Voice-Preserving: Keeps consonants intact and retains most voice-band energy (see Benchmark Results)
- Low Latency: Processes audio roughly 23x faster than real time (see Performance)
- Simple API: Easy to integrate as a library or use via CLI
Benchmark Results
Tested on real audio (88 seconds total). Run benchmarks/benchmark.py for full analysis.
| Metric | Value | Description |
|---|---|---|
| STT Confidence | +16% improvement | Whisper word probability increased after denoising |
| Speaker Embedding | 93.5% similar | Voice identity preserved (cosine similarity before/after) |
| Diarization | 98% consistent | Speaker segments unchanged by denoising |
| Voice Band (300 Hz-3 kHz) | 75% preserved | Mid frequencies carrying the main speech formants |
| High Freq (3-8 kHz) | 48% preserved | Reduced by design (noise lives here) |
Installation
From PyPI (recommended)
pip install hpss-voice-denoiser
With visualization support
pip install hpss-voice-denoiser[visualization]
From source
git clone https://github.com/42atomys/hpss-voice-denoiser.git
cd hpss-voice-denoiser
pip install -e .
Quick Start
As a Library
from hpss_denoiser import HPSSDenoiser
# Create denoiser with default settings
denoiser = HPSSDenoiser()
# Process PCM audio (16kHz, 16-bit, mono)
with open("input.pcm", "rb") as f:
    pcm_data = f.read()
# Denoise
cleaned_pcm = denoiser.process(pcm_data)
# Save result
with open("output.pcm", "wb") as f:
    f.write(cleaned_pcm)
With NumPy Arrays
import numpy as np
from hpss_denoiser import HPSSDenoiser
denoiser = HPSSDenoiser()
# Float audio (-1.0 to 1.0)
audio = np.random.randn(16000).astype(np.float64) * 0.1
# Process
cleaned = denoiser.process_array(audio)
Custom Configuration
from hpss_denoiser import HPSSDenoiser, DenoiserConfig
# Adjust for your use case
config = DenoiserConfig(
    sample_rate=16000,
    # More aggressive noise reduction in silence
    no_context_perc_gain=0.02,
    # Preserve more consonants
    voice_context_perc_gain=0.25,
)
denoiser = HPSSDenoiser(config)
CLI Usage
# Basic usage
hpss-denoise input.pcm output.pcm
# Process with intermediate stages (for debugging)
hpss-denoise input.pcm output.pcm --stages
# Generate analysis visualization
hpss-denoise input.pcm --analyze --output-image analysis.png
# Custom sample rate
hpss-denoise input.pcm output.pcm --sample-rate 8000
# Show all options
hpss-denoise --help
Audio Format
The denoiser expects and produces:
- Format: Raw PCM
- Sample Rate: 16000 Hz (configurable)
- Bit Depth: 16-bit signed integer
- Channels: Mono
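The relationship between the raw 16-bit PCM format above and float audio in the -1.0 to 1.0 range can be sketched with the standard library alone. The helper names `pcm16_to_float` and `float_to_pcm16` are hypothetical, and the 32768 scale factor is a common convention assumed here, not something confirmed for this package:

```python
import sys
from array import array


def pcm16_to_float(pcm: bytes) -> list[float]:
    """Decode little-endian signed 16-bit mono PCM into floats in [-1.0, 1.0)."""
    samples = array("h")
    samples.frombytes(pcm)  # raises ValueError if len(pcm) is odd
    if sys.byteorder == "big":
        samples.byteswap()  # raw PCM from ffmpeg's s16le is little-endian
    return [s / 32768.0 for s in samples]


def float_to_pcm16(audio: list[float]) -> bytes:
    """Encode floats as little-endian signed 16-bit PCM, clipping out-of-range values."""
    ints = array("h", (max(-32768, min(32767, round(x * 32768.0))) for x in audio))
    if sys.byteorder == "big":
        ints.byteswap()
    return ints.tobytes()
```

Clipping before packing matters because a float of exactly 1.0 would otherwise overflow the signed 16-bit range.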
Converting from other formats
# WAV to PCM
ffmpeg -i input.wav -f s16le -acodec pcm_s16le -ar 16000 -ac 1 input.pcm
# MP3 to PCM
ffmpeg -i input.mp3 -f s16le -acodec pcm_s16le -ar 16000 -ac 1 input.pcm
# PCM to WAV (for playback)
ffmpeg -f s16le -ar 16000 -ac 1 -i output.pcm output.wav
How It Works
Pipeline Architecture
Audio Input (PCM 16kHz, 16-bit)
        │
        ▼
┌─────────────────────────────────────┐
│ High-pass Filter (80 Hz)            │  Remove DC offset & rumble
└─────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────┐
│ STFT Analysis                       │  Time-frequency representation
│ (25ms frames, 6ms hop)              │
└─────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────┐
│ HPSS Separation                     │  Split into harmonic (voice)
│ (median filtering)                  │  and percussive (transients)
└─────────────────────────────────────┘
        │
        ├─── Harmonic ───┐
        │                ▼
        │    ┌─────────────────────────────┐
        │    │ Envelope Tightening         │  Reduce HPSS echo artifacts
        │    │ (asymmetric follower)       │
        │    └─────────────────────────────┘
        │                │
        │                ▼
        │    ┌─────────────────────────────┐
        ├───▶│ Context-Based Mixing        │  Detect voice activity
        │    │ - Voice: keep 20% perc      │  Mix based on context
        │    │ - Silence: keep 4% perc     │
        │    └─────────────────────────────┘
        │                │
        └── Percussive ──┘
        │
        ▼
┌─────────────────────────────────────┐
│ Low-Frequency Denoising             │  Spectral subtraction <350Hz
│ (percentile-based)                  │
└─────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────┐
│ ISTFT Synthesis                     │  Reconstruct audio
└─────────────────────────────────────┘
        │
        ▼
Audio Output (PCM 16kHz, 16-bit)
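The low-frequency denoising stage in the diagram can be illustrated with a minimal percentile-based spectral subtraction. This is a sketch under stated assumptions, not the package's implementation: the function names, the 10th-percentile noise estimate, and the 5% spectral floor are all hypothetical choices; only the 350 Hz limit and the 0.8 strength mirror the defaults listed in the configuration reference.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile (pct in 0..100) of a non-empty sequence."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100.0 * len(ordered)) - 1
    return ordered[max(0, min(len(ordered) - 1, rank))]


def subtract_low_freq_noise(spec, bin_hz, max_freq=350.0, strength=0.8,
                            noise_pct=10.0, floor=0.05):
    """Subtract a per-bin noise estimate from magnitudes below max_freq.

    spec[f][t] is a magnitude spectrogram whose bin f sits at f * bin_hz.
    noise_pct and floor are illustrative assumptions.
    """
    out = [list(row) for row in spec]
    for f, row in enumerate(spec):
        if f * bin_hz >= max_freq:
            break  # bins at or above max_freq are left untouched
        # A low percentile over time approximates the stationary noise level.
        noise = percentile(row, noise_pct)
        # Keep a small fraction of the original magnitude to avoid hard zeros.
        out[f] = [max(m - strength * noise, floor * m) for m in row]
    return out
```

The spectral floor is a common guard against the "musical noise" that plain subtraction can introduce.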
Why HPSS?
Harmonic-Percussive Source Separation uses median filtering on the spectrogram:
- Harmonic components (voice fundamentals, vowels) appear as horizontal lines
- Percussive components (transients, consonants, noise) appear as vertical lines
By separating these, we can:
- Keep the harmonic component (clean voice)
- Selectively mix the percussive component based on voice context:
  - During speech: include percussive (consonants like 't', 's', 'k')
  - During silence: suppress percussive (noise transients)
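The median-filtering idea can be sketched in pure Python on a small magnitude spectrogram. This is an illustration of the general technique from Fitzgerald (2010), not this package's code: the function names are hypothetical, the window is simply truncated at the edges, soft masks are used, and the `hpss_margin` parameter is not modeled.

```python
from statistics import median


def median_filter(values, kernel):
    """Median-filter a sequence with an odd kernel, truncating the window at the edges."""
    half = kernel // 2
    n = len(values)
    return [median(values[max(0, i - half):min(n, i + half + 1)])
            for i in range(n)]


def hpss_soft_masks(spec, harmonic_kernel=9, percussive_kernel=9, eps=1e-10):
    """Return (harmonic_mask, percussive_mask) for a spectrogram spec[f][t]."""
    n_freq, n_time = len(spec), len(spec[0])
    # Harmonic estimate: median along time keeps horizontal lines.
    harm = [median_filter(row, harmonic_kernel) for row in spec]
    # Percussive estimate: median along frequency keeps vertical lines.
    columns = [median_filter([spec[f][t] for f in range(n_freq)], percussive_kernel)
               for t in range(n_time)]
    perc = [[columns[t][f] for t in range(n_time)] for f in range(n_freq)]
    # Soft masks from the squared estimates.
    h_mask = [[harm[f][t] ** 2 / (harm[f][t] ** 2 + perc[f][t] ** 2 + eps)
               for t in range(n_time)] for f in range(n_freq)]
    p_mask = [[perc[f][t] ** 2 / (harm[f][t] ** 2 + perc[f][t] ** 2 + eps)
               for t in range(n_time)] for f in range(n_freq)]
    return h_mask, p_mask
```

Multiplying the original spectrogram by each mask gives the harmonic and percussive components; a sustained tone ends up almost entirely in the harmonic mask and a click almost entirely in the percussive one.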
Key Innovation: Context-Aware Mixing
The challenge with HPSS for voice is that consonants are percussive. Naive suppression of the percussive component removes 't', 's', 'f', etc.
Our solution: detect voice context using harmonic energy in the 200-4000 Hz band:
- If voice is present: mix more percussive (preserve consonants)
- If silence: aggressively suppress percussive (remove noise)
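A minimal sketch of this decision, assuming per-frame harmonic energy in the 200-4000 Hz band has already been computed. The function name is hypothetical, and two details are assumptions rather than documented behavior: the threshold is taken relative to the loudest frame, and the voice context is extended symmetrically by `context_window` frames.

```python
import math


def percussive_gains(harmonic_energy, threshold_db=-20.0, context_window=10,
                     voice_gain=0.20, silence_gain=0.04):
    """Return one percussive mixing gain per frame."""
    peak = max(harmonic_energy, default=0.0)
    if peak <= 0.0:
        # No harmonic energy anywhere: treat every frame as silence.
        return [silence_gain] * len(harmonic_energy)
    # Frame is "voiced" if its energy is within threshold_db of the peak (assumption).
    voiced = [10.0 * math.log10(max(e, 1e-12) / peak) >= threshold_db
              for e in harmonic_energy]
    gains = []
    for i in range(len(voiced)):
        # Extend voice context so consonants next to vowels are kept.
        nearby = voiced[max(0, i - context_window):i + context_window + 1]
        gains.append(voice_gain if any(nearby) else silence_gain)
    return gains
```

Each gain would then scale the percussive component of its frame before it is added back to the harmonic component.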
Configuration Reference
@dataclass
class DenoiserConfig:
    """Configuration for HPSS voice denoiser."""

    # Audio parameters
    sample_rate: int = 16000               # Input/output sample rate

    # STFT parameters
    frame_size_ms: int = 25                # Analysis frame size
    hop_size_ms: int = 6                   # Frame hop size

    # HPSS separation
    harmonic_kernel: int = 9               # Median filter size (time)
    percussive_kernel: int = 9             # Median filter size (freq)
    hpss_margin: float = 2.5               # Separation hardness

    # Context detection
    context_window: int = 10               # Frames to extend voice context
    harmonic_threshold_db: float = -20.0   # Voice detection threshold

    # Percussive mixing
    voice_context_perc_gain: float = 0.20  # Keep 20% during voice
    no_context_perc_gain: float = 0.04     # Keep 4% during silence

    # Envelope tightening (echo reduction)
    envelope_tightening: bool = True
    envelope_attack_frames: int = 2
    envelope_release_frames: int = 3
    envelope_min_gain: float = 0.15

    # Low-frequency denoising
    noise_reduction_strength: float = 0.8
    noise_reduction_max_freq: float = 350.0
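To make the millisecond-based STFT parameters concrete, the following sketch converts them to sample counts at the default 16 kHz rate. The helper names are hypothetical, and the frame count assumes no padding at the signal edges, which may differ from what the library does:

```python
def ms_to_samples(duration_ms: float, sample_rate: int) -> int:
    """Convert a duration in milliseconds to a whole number of samples."""
    return int(round(duration_ms * sample_rate / 1000))


def frame_count(num_samples: int, frame_len: int, hop_len: int) -> int:
    """Number of full analysis frames, assuming no edge padding."""
    if num_samples < frame_len:
        return 0
    return 1 + (num_samples - frame_len) // hop_len


frame_len = ms_to_samples(25, 16000)   # 400 samples per frame
hop_len = ms_to_samples(6, 16000)      # 96 samples between frame starts
overlap = 1.0 - hop_len / frame_len    # 0.76, i.e. 76% overlap
frames_per_second = frame_count(16000, frame_len, hop_len)  # 163
```

With a 6 ms hop, the default `context_window` of 10 frames corresponds to about 60 ms of audio.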
Use Cases
Speech-to-Text (STT)
import wave
import whisper
from hpss_denoiser import HPSSDenoiser
denoiser = HPSSDenoiser()
# Denoise before transcription
with open("noisy_audio.pcm", "rb") as f:
    noisy = f.read()
cleaned = denoiser.process(noisy)
# Wrap the raw PCM in a WAV container so Whisper can decode it
with wave.open("cleaned.wav", "wb") as wf:
    wf.setnchannels(1)      # mono
    wf.setsampwidth(2)      # 16-bit samples
    wf.setframerate(16000)  # 16 kHz
    wf.writeframes(cleaned)
# Use with Whisper
model = whisper.load_model("base")
result = model.transcribe("cleaned.wav")
Speaker Diarization
from hpss_denoiser import HPSSDenoiser
# Denoising improves speaker boundary detection
denoiser = HPSSDenoiser()
# Process chunks for streaming diarization
chunk_size = 30 * 16000 * 2  # 30 seconds of 16 kHz, 16-bit mono audio, in bytes
with open("meeting.pcm", "rb") as f:
    while chunk := f.read(chunk_size):
        cleaned_chunk = denoiser.process(chunk)
        # Send cleaned_chunk to the diarization pipeline
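When chunks come from a stream instead of a file of known length, the final read can end in the middle of a sample. The hypothetical helper below, a sketch only, yields chunks that always contain whole 16-bit samples by dropping a trailing partial sample:

```python
import io
from typing import BinaryIO, Iterator


def iter_pcm_chunks(stream: BinaryIO, seconds: float, sample_rate: int = 16000,
                    bytes_per_sample: int = 2) -> Iterator[bytes]:
    """Yield chunks of up to `seconds` of audio, each a whole number of samples."""
    chunk_bytes = int(seconds * sample_rate) * bytes_per_sample
    while chunk := stream.read(chunk_bytes):
        # Drop a trailing partial sample, which can only occur at end of stream.
        usable = len(chunk) - (len(chunk) % bytes_per_sample)
        if usable:
            yield chunk[:usable]


# Usage with an in-memory stream standing in for a real recording
demo = io.BytesIO(bytes(21))  # 10 samples plus one stray byte
chunk_lengths = [len(c) for c in iter_pcm_chunks(demo, seconds=1, sample_rate=4)]
```

Each yielded chunk can be passed to `denoiser.process` as in the loop above.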
Voice Embedding
from hpss_denoiser import HPSSDenoiser
# Clean audio produces more stable embeddings
denoiser = HPSSDenoiser()
# Process enrollment audio (enrollment_pcm: raw 16-bit mono PCM bytes loaded beforehand)
enrollment_clean = denoiser.process(enrollment_pcm)
# Process verification audio (verification_pcm: same format)
verification_clean = denoiser.process(verification_pcm)
# Compare embeddings (using your embedding model)
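Embeddings are usually compared with cosine similarity, the same measure reported in the benchmark table. A minimal stdlib sketch follows; the function name is hypothetical, and the vectors are assumed to come from whichever embedding model you use:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

A value near 1.0 between the embedding of the original audio and that of the denoised audio suggests the speaker identity survived processing.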
Performance
Processing Speed
| Audio Duration | Processing Time | Real-time Factor |
|---|---|---|
| 1 second | ~44 ms | ~23x |
| 10 seconds | ~420 ms | ~23x |
| 88 seconds | ~3.8 s | ~23x |
Tested on macOS (Darwin), Python 3.12, single-threaded
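The real-time factor in the table is audio duration divided by processing time. The sketch below recomputes it from the table's figures; the function name is hypothetical:

```python
def real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """How many seconds of audio are processed per second of wall-clock time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# (audio duration, processing time) pairs taken from the table above
measurements = [(1.0, 0.044), (10.0, 0.420), (88.0, 3.8)]
factors = [round(real_time_factor(a, p), 1) for a, p in measurements]
# factors is [22.7, 23.8, 23.2]
```

To measure your own hardware, time a call to `denoiser.process` with `time.perf_counter()` and pass the elapsed seconds as `processing_seconds`.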
Memory Usage
- ~50 MB base memory
- ~2 MB per second of audio being processed
- Streaming-friendly: process in chunks
Troubleshooting
Muffled output
Increase voice_context_perc_gain:
config = DenoiserConfig(voice_context_perc_gain=0.30)
Too much noise remaining
Decrease no_context_perc_gain:
config = DenoiserConfig(no_context_perc_gain=0.02)
Echo/reverb artifacts
Reduce envelope release time:
config = DenoiserConfig(envelope_release_frames=2)
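To see why a shorter release shortens echo tails, consider a generic asymmetric envelope follower. This is a sketch of the general technique only: the function name is hypothetical, the mapping from frame counts to smoothing coefficients is an assumption, and `envelope_min_gain` is not modeled.

```python
def follow_envelope(levels, attack_frames=2, release_frames=3):
    """Smooth per-frame levels with separate attack and release speeds."""
    attack = 1.0 / max(1, attack_frames)    # assumed coefficient mapping
    release = 1.0 / max(1, release_frames)  # assumed coefficient mapping
    envelope, out = 0.0, []
    for level in levels:
        # Rise quickly when the signal grows, fall more slowly when it drops.
        coeff = attack if level > envelope else release
        envelope += coeff * (level - envelope)
        out.append(envelope)
    return out


burst = [1.0] * 4 + [0.0] * 4
tail_default = follow_envelope(burst, release_frames=3)[-1]
tail_shorter = follow_envelope(burst, release_frames=2)[-1]
```

With fewer release frames the envelope decays faster after the burst ends, so less of the trailing energy is kept.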
Consonants being cut
Increase context window:
config = DenoiserConfig(context_window=15)
Development
Setup
git clone https://github.com/42atomys/hpss-voice-denoiser.git
cd hpss-voice-denoiser
pip install -e ".[dev]"
Run tests
pytest
Type checking
mypy src/hpss_denoiser
Linting
ruff check src/
ruff format src/
Algorithm References
- HPSS: Fitzgerald, D. (2010). "Harmonic/Percussive Separation using Median Filtering"
- Spectral Subtraction: Boll, S. (1979). "Suppression of Acoustic Noise in Speech Using Spectral Subtraction"
License
MIT License - see LICENSE for details.
Contributing
Contributions welcome! Please open an issue first to discuss what you would like to change.
Acknowledgments
Developed to improve audio captured by a wearable device project.