HPSS-based voice denoiser optimized for ASR preprocessing (STT, diarization, speaker embedding)

HPSS Voice Denoiser

A production-ready audio denoising pipeline optimized for ASR preprocessing (Speech-to-Text, Speaker Diarization, Voice Embedding).

Built on Harmonic-Percussive Source Separation (HPSS) with context-aware mixing to preserve voice quality while removing environmental noise.

Features

  • Optimized for ASR: Preserves voice characteristics critical for STT, diarization, and speaker embedding
  • Stateless Processing: Each audio chunk is processed independently (perfect for streaming)
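  • Voice-Preserving: Keeps the voice band and consonants largely intact (measured figures are under Benchmark Results)
  • Low Latency: Runs roughly 23x faster than real time in single-threaded tests, suitable for real-time applications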
  • Simple API: Easy to integrate as a library or use via CLI

Benchmark Results

Tested on real audio (88 seconds total). Run benchmarks/benchmark.py for full analysis.

Metric                       Value             Description
---------------------------  ----------------  ---------------------------------------------------------
STT Confidence               +16% improvement  Whisper word probability increased after denoising
Speaker Embedding            93.5% similar     Voice identity preserved (cosine similarity before/after)
Diarization                  98% consistent    Speaker segments unchanged by denoising
Voice Band (300 Hz-3 kHz)    75% preserved     Mid frequencies containing voice fundamentals
High Freq (3-8 kHz)          48% preserved     Reduced by design (noise lives here)

Installation

From PyPI (recommended)

pip install hpss-voice-denoiser

With visualization support

pip install hpss-voice-denoiser[visualization]

From source

git clone https://github.com/42atomys/hpss-voice-denoiser.git
cd hpss-voice-denoiser
pip install -e .

Quick Start

As a Library

from hpss_denoiser import HPSSDenoiser

# Create denoiser with default settings
denoiser = HPSSDenoiser()

# Process PCM audio (16kHz, 16-bit, mono)
with open("input.pcm", "rb") as f:
    pcm_data = f.read()

# Denoise
cleaned_pcm = denoiser.process(pcm_data)

# Save result
with open("output.pcm", "wb") as f:
    f.write(cleaned_pcm)

With NumPy Arrays

import numpy as np
from hpss_denoiser import HPSSDenoiser

denoiser = HPSSDenoiser()

# Float audio (-1.0 to 1.0)
audio = np.random.randn(16000).astype(np.float64) * 0.1

# Process
cleaned = denoiser.process_array(audio)

Custom Configuration

from hpss_denoiser import HPSSDenoiser, DenoiserConfig

# Adjust for your use case
config = DenoiserConfig(
    sample_rate=16000,
    
    # More aggressive noise reduction in silence
    no_context_perc_gain=0.02,
    
    # Preserve more consonants
    voice_context_perc_gain=0.25,
)

denoiser = HPSSDenoiser(config)

CLI Usage

# Basic usage
hpss-denoise input.pcm output.pcm

# Process with intermediate stages (for debugging)
hpss-denoise input.pcm output.pcm --stages

# Generate analysis visualization
hpss-denoise input.pcm --analyze --output-image analysis.png

# Custom sample rate
hpss-denoise input.pcm output.pcm --sample-rate 8000

# Show all options
hpss-denoise --help

Audio Format

The denoiser expects and produces:

  • Format: Raw PCM
  • Sample Rate: 16000 Hz (configurable)
  • Bit Depth: 16-bit signed integer
  • Channels: Mono
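To make the format concrete, here is a minimal standard-library sketch of moving between raw 16-bit PCM bytes and floats in the -1.0 to 1.0 range. The helper names `pcm16_to_floats` and `floats_to_pcm16` are hypothetical and not part of the package; little-endian byte order is assumed, matching the `s16le` format used in the ffmpeg commands in this README.

```python
import array
import sys


def pcm16_to_floats(pcm: bytes) -> list:
    """Decode little-endian 16-bit mono PCM into floats in [-1.0, 1.0)."""
    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(pcm)      # raises ValueError if len(pcm) is odd
    if sys.byteorder == "big":
        samples.byteswap()      # the raw data is assumed little-endian
    return [s / 32768.0 for s in samples]


def floats_to_pcm16(values) -> bytes:
    """Encode floats as little-endian 16-bit PCM, clipping out-of-range values."""
    ints = [max(-32768, min(32767, round(v * 32768.0))) for v in values]
    samples = array.array("h", ints)
    if sys.byteorder == "big":
        samples.byteswap()
    return samples.tobytes()


# Four samples become eight bytes; 2.0 is out of range and gets clipped.
pcm = floats_to_pcm16([0.0, 0.5, -1.0, 2.0])
decoded = pcm16_to_floats(pcm)
```

Each sample occupies two bytes, so one second of audio at 16000 Hz is 32000 bytes.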

Converting from other formats

# WAV to PCM
ffmpeg -i input.wav -f s16le -acodec pcm_s16le -ar 16000 -ac 1 input.pcm

# MP3 to PCM
ffmpeg -i input.mp3 -f s16le -acodec pcm_s16le -ar 16000 -ac 1 input.pcm

# PCM to WAV (for playback)
ffmpeg -f s16le -ar 16000 -ac 1 -i output.pcm output.wav
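If ffmpeg is not available and the WAV file is already 16-bit mono at the target rate, Python's standard `wave` module can add or strip the header. This is a sketch under that assumption: the helper names are hypothetical, and no resampling or channel mixing is done.

```python
import os
import tempfile
import wave


def pcm_to_wav(pcm_path, wav_path, sample_rate=16000):
    """Wrap headerless 16-bit mono PCM in a WAV container."""
    with open(pcm_path, "rb") as src:
        frames = src.read()
    with wave.open(wav_path, "wb") as wav:
        wav.setnchannels(1)       # mono
        wav.setsampwidth(2)       # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(frames)


def wav_to_pcm(wav_path, pcm_path, expected_rate=16000):
    """Strip the WAV header; returns the number of samples written."""
    with wave.open(wav_path, "rb") as wav:
        fmt = (wav.getnchannels(), wav.getsampwidth(), wav.getframerate())
        if fmt != (1, 2, expected_rate):
            raise ValueError("expected 16-bit mono WAV at %d Hz" % expected_rate)
        frames = wav.readframes(wav.getnframes())
    with open(pcm_path, "wb") as out:
        out.write(frames)
    return len(frames) // 2


# Round trip on three samples (0, 32767, -32768) in a temporary directory.
original = bytes([0, 0, 255, 127, 0, 128])
with tempfile.TemporaryDirectory() as workdir:
    pcm_in = os.path.join(workdir, "input.pcm")
    wav_path = os.path.join(workdir, "audio.wav")
    pcm_out = os.path.join(workdir, "output.pcm")
    with open(pcm_in, "wb") as f:
        f.write(original)
    pcm_to_wav(pcm_in, wav_path)
    samples_written = wav_to_pcm(wav_path, pcm_out)
    with open(pcm_out, "rb") as f:
        restored = f.read()
```

For any other input format (MP3, stereo, a different sample rate), the ffmpeg commands above remain the simpler route.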

How It Works

Pipeline Architecture

Audio Input (PCM 16kHz, 16-bit)
    │
    ▼
┌─────────────────────────────────────┐
│  High-pass Filter (80 Hz)           │  Remove DC offset & rumble
└─────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────┐
│  STFT Analysis                      │  Time-frequency representation
│  (25ms frames, 6ms hop)             │
└─────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────┐
│  HPSS Separation                    │  Split into harmonic (voice)
│  (median filtering)                 │  and percussive (transients)
└─────────────────────────────────────┘
    │
    ├─── Harmonic ───┐
    │                ▼
    │    ┌─────────────────────────────┐
    │    │  Envelope Tightening        │  Reduce HPSS echo artifacts
    │    │  (asymmetric follower)      │
    │    └─────────────────────────────┘
    │                │
    │                ▼
    │    ┌─────────────────────────────┐
    ├───▶│  Context-Based Mixing       │  Detect voice activity
    │    │  - Voice: keep 20% perc     │  Mix based on context
    │    │  - Silence: keep 4% perc    │
    │    └─────────────────────────────┘
    │                │
    └── Percussive ──┘
                     │
                     ▼
┌─────────────────────────────────────┐
│  Low-Frequency Denoising            │  Spectral subtraction <350Hz
│  (percentile-based)                 │
└─────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────┐
│  ISTFT Synthesis                    │  Reconstruct audio
└─────────────────────────────────────┘
    │
    ▼
Audio Output (PCM 16kHz, 16-bit)
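The diagram names an asymmetric follower without spelling it out. As a rough illustration only, one common way to build such a follower is a one-pole smoother that uses a different coefficient when the signal rises than when it falls. The function name and the mapping from frame counts to coefficients below are assumptions, not the library's implementation.

```python
def follow_envelope(levels, attack_frames=2, release_frames=3):
    """Track per-frame levels with separate attack and release speeds.

    Assumed mapping: the coefficient is 1 / frames, so fewer frames
    means the envelope reacts faster in that direction.
    """
    attack = 1.0 / attack_frames
    release = 1.0 / release_frames
    env = 0.0
    out = []
    for x in levels:
        coeff = attack if x > env else release
        env += coeff * (x - env)  # move a fraction of the way toward x
        out.append(env)
    return out


# A burst of energy followed by silence: the envelope rises in steps
# of half the remaining distance, then decays by a third per frame.
envelope = follow_envelope([0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
```

With these defaults the envelope climbs faster than it decays, which is the asymmetry the stage name refers to.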

Why HPSS?

Harmonic-Percussive Source Separation uses median filtering on the spectrogram:

  • Harmonic components (voice fundamentals, vowels) appear as horizontal lines
  • Percussive components (transients, consonants, noise) appear as vertical lines

By separating these, we can:

  1. Keep the harmonic component (clean voice)
  2. Selectively mix the percussive component based on voice context:
     • During speech: include percussive (consonants like 't', 's', 'k')
     • During silence: suppress percussive (noise transients)
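The median-filtering idea can be sketched on a toy magnitude spectrogram in pure Python. This is a simplified illustration with binary masks and edge-replicated windows; it does not use the `hpss_margin` setting and is not the package's code. The function names are hypothetical.

```python
from statistics import median


def median_filter_1d(values, kernel):
    """Median-filter a sequence with an odd kernel, repeating edge values."""
    half = kernel // 2
    last = len(values) - 1
    return [
        median(values[min(max(j, 0), last)] for j in range(i - half, i + half + 1))
        for i in range(len(values))
    ]


def hpss_masks(spec, kernel=3):
    """Return binary (harmonic, percussive) masks for spec[freq][time]."""
    n_freq, n_time = len(spec), len(spec[0])
    # Harmonic estimate: smooth each frequency row along time.
    harm = [median_filter_1d(row, kernel) for row in spec]
    # Percussive estimate: smooth each time column along frequency.
    cols = [
        median_filter_1d([spec[f][t] for f in range(n_freq)], kernel)
        for t in range(n_time)
    ]
    h_mask = [[int(harm[f][t] > cols[t][f]) for t in range(n_time)] for f in range(n_freq)]
    p_mask = [[int(cols[t][f] > harm[f][t]) for t in range(n_time)] for f in range(n_freq)]
    return h_mask, p_mask


# Toy spectrogram: a sustained tone in row 2 and a click in column 3.
spec = [[1.0 if (f == 2 or t == 3) else 0.0 for t in range(5)] for f in range(5)]
h_mask, p_mask = hpss_masks(spec)
```

The tone row lands in the harmonic mask and the click column in the percussive mask; the single cell where they cross is claimed by neither, because both estimates are equal there.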

Key Innovation: Context-Aware Mixing

The challenge with HPSS for voice is that consonants are percussive. Naive suppression of the percussive component removes 't', 's', 'f', etc.

Our solution: detect voice context using harmonic energy in the 200-4000 Hz band:

  • If voice is present: mix more percussive (preserve consonants)
  • If silence: aggressively suppress percussive (remove noise)
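The decision rule above can be sketched as a per-frame gain lookup. This is an assumption-laden illustration: the function name is hypothetical, the input is taken to be harmonic-band energy already expressed in dB, and the voice context is extended symmetrically by `context_window` frames, which may differ from how the package extends it. The default values mirror the documented `DenoiserConfig` defaults.

```python
def percussive_gains(harmonic_db, threshold_db=-20.0, context_window=10,
                     voice_gain=0.20, silence_gain=0.04):
    """Choose how much percussive energy to keep in each frame."""
    n = len(harmonic_db)
    voiced = [level > threshold_db for level in harmonic_db]
    gains = []
    for i in range(n):
        # A frame is "in voice context" if any frame nearby is voiced.
        lo = max(0, i - context_window)
        hi = min(n, i + context_window + 1)
        gains.append(voice_gain if any(voiced[lo:hi]) else silence_gain)
    return gains


# One voiced frame in the middle, with a one-frame context on each side.
levels_db = [-60.0, -60.0, -60.0, -10.0, -60.0, -60.0, -60.0]
gains = percussive_gains(levels_db, context_window=1)

# Mix per-frame magnitudes: harmonic plus the scaled percussive part.
harmonic = [1.0] * 7
percussive = [0.5] * 7
mixed = [h + g * p for h, p, g in zip(harmonic, percussive, gains)]
```

Extending the context beyond the voiced frame is what keeps a consonant at the edge of a word from being treated as silence.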

Configuration Reference

@dataclass
class DenoiserConfig:
    """Configuration for HPSS voice denoiser."""
    
    # Audio parameters
    sample_rate: int = 16000          # Input/output sample rate
    
    # STFT parameters
    frame_size_ms: int = 25           # Analysis frame size
    hop_size_ms: int = 6              # Frame hop size
    
    # HPSS separation
    harmonic_kernel: int = 9          # Median filter size (time)
    percussive_kernel: int = 9        # Median filter size (freq)
    hpss_margin: float = 2.5          # Separation hardness
    
    # Context detection
    context_window: int = 10          # Frames to extend voice context
    harmonic_threshold_db: float = -20.0  # Voice detection threshold
    
    # Percussive mixing
    voice_context_perc_gain: float = 0.20  # Keep 20% during voice
    no_context_perc_gain: float = 0.04     # Keep 4% during silence
    
    # Envelope tightening (echo reduction)
    envelope_tightening: bool = True
    envelope_attack_frames: int = 2
    envelope_release_frames: int = 3
    envelope_min_gain: float = 0.15
    
    # Low-frequency denoising
    noise_reduction_strength: float = 0.8
    noise_reduction_max_freq: float = 350.0
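For orientation, the millisecond settings translate into sample counts as follows at the default 16000 Hz. The arithmetic assumes the FFT size equals the frame length and that the signal is not padded; the package may differ on both points, and the helper name is hypothetical.

```python
def ms_to_samples(duration_ms, sample_rate):
    """Convert a duration in milliseconds to a whole number of samples."""
    return round(sample_rate * duration_ms / 1000)


sample_rate = 16000
frame_len = ms_to_samples(25, sample_rate)  # frame_size_ms
hop_len = ms_to_samples(6, sample_rate)     # hop_size_ms

# Fraction of each frame shared with the next one.
overlap = 1 - hop_len / frame_len

# Frames that fit into one second of audio without padding.
n_frames = 1 + (sample_rate - frame_len) // hop_len

# Width of one frequency bin, and how many bins fall at or below
# noise_reduction_max_freq (350 Hz), counting the DC bin.
bin_hz = sample_rate / frame_len
low_bins = int(350.0 // bin_hz) + 1
```

Under these assumptions, `context_window=10` frames corresponds to about 60 ms of audio on each side of a voiced frame.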

Use Cases

Speech-to-Text (STT)

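import numpy as np
import whisper
from hpss_denoiser import HPSSDenoiser

denoiser = HPSSDenoiser()

# Denoise before transcription
with open("noisy_audio.pcm", "rb") as f:
    noisy = f.read()

cleaned = denoiser.process(noisy)

# Save the cleaned audio (optional)
with open("cleaned.pcm", "wb") as f:
    f.write(cleaned)

# Raw PCM has no header, so pass the samples to Whisper directly
# as a float32 array (16 kHz mono, scaled to the -1.0 to 1.0 range)
audio = np.frombuffer(cleaned, dtype="<i2").astype(np.float32) / 32768.0

model = whisper.load_model("base")
result = model.transcribe(audio)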

Speaker Diarization

from hpss_denoiser import HPSSDenoiser

# Denoising improves speaker boundary detection
denoiser = HPSSDenoiser()

# Process chunks for streaming diarization
chunk_size = 30 * 16000 * 2  # 30 seconds

with open("meeting.pcm", "rb") as f:
    while chunk := f.read(chunk_size):
        cleaned_chunk = denoiser.process(chunk)
        # Send to diarization pipeline
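When the source is not a regular file, a chunk can end on an odd byte and split a 16-bit sample. The sketch below shows one way to guard against that; the generator name is hypothetical and it simply drops a trailing partial sample rather than carrying it into the next chunk.

```python
import io


def iter_pcm_chunks(stream, samples_per_chunk, bytes_per_sample=2):
    """Yield chunks of whole samples from a binary stream."""
    chunk_bytes = samples_per_chunk * bytes_per_sample
    while True:
        chunk = stream.read(chunk_bytes)
        if not chunk:
            break
        # Trim a trailing partial sample so each chunk decodes cleanly.
        usable = len(chunk) - (len(chunk) % bytes_per_sample)
        if usable:
            yield chunk[:usable]


# Seven bytes with two samples per chunk: one full chunk of 4 bytes,
# then 3 leftover bytes trimmed to a single 2-byte sample.
stream = io.BytesIO(bytes(7))
chunk_lengths = [len(c) for c in iter_pcm_chunks(stream, samples_per_chunk=2)]

# Thirty seconds at 16000 Hz, as in the example above.
thirty_seconds = 30 * 16000
```

Each yielded chunk can be passed to `denoiser.process` as in the loop above.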

Voice Embedding

from hpss_denoiser import HPSSDenoiser

# Clean audio produces more stable embeddings
denoiser = HPSSDenoiser()

# Process enrollment audio
enrollment_clean = denoiser.process(enrollment_pcm)

# Process verification audio
verification_clean = denoiser.process(verification_pcm)

# Compare embeddings (using your embedding model)
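The benchmark table reports cosine similarity between embeddings before and after denoising. A minimal pure-Python version of that comparison is sketched here; the embedding vectors are made-up placeholders, since producing real ones depends on your embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


# Placeholder vectors standing in for enrollment and verification embeddings.
same_direction = cosine_similarity([1.0, 0.0, 1.0], [2.0, 0.0, 2.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

A value near 1.0 means the two embeddings point in nearly the same direction; the acceptance threshold is specific to the embedding model.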

Performance

Processing Speed

Audio Duration  Processing Time  Real-time Factor
--------------  ---------------  ----------------
1 second        ~44 ms           ~23x
10 seconds      ~420 ms          ~23x
88 seconds      ~3.8 s           ~23x

Tested on macOS (Darwin), Python 3.12, single-threaded
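The real-time factor in the table is audio duration divided by processing time. A small timing helper for reproducing it on your own hardware might look like the sketch below; the helper names are hypothetical, and `time_call` can wrap any callable, for example `denoiser.process`.

```python
import time


def real_time_factor(audio_seconds, processing_seconds):
    """How many times faster than real time the processing ran."""
    return audio_seconds / processing_seconds


def time_call(func, *args):
    """Run func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


# The three rows of the table above.
factors = [
    real_time_factor(1.0, 0.044),
    real_time_factor(10.0, 0.420),
    real_time_factor(88.0, 3.8),
]

# Timing an arbitrary callable (here a stand-in for the denoiser).
result, elapsed = time_call(sum, range(1000))
```

All three rows work out to between roughly 22.7x and 23.8x, which is where the rounded ~23x figure comes from.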

Memory Usage

  • ~50 MB base memory
  • ~2 MB per second of audio being processed
  • Streaming-friendly: process in chunks

Troubleshooting

Muffled output

Increase voice_context_perc_gain:

config = DenoiserConfig(voice_context_perc_gain=0.30)

Too much noise remaining

Decrease no_context_perc_gain:

config = DenoiserConfig(no_context_perc_gain=0.02)

Echo/reverb artifacts

Reduce envelope release time:

config = DenoiserConfig(envelope_release_frames=2)

Consonants being cut

Increase context window:

config = DenoiserConfig(context_window=15)

Development

Setup

git clone https://github.com/42atomys/hpss-voice-denoiser.git
cd hpss-voice-denoiser
pip install -e ".[dev]"

Run tests

pytest

Type checking

mypy src/hpss_denoiser

Linting

ruff check src/
ruff format src/

Algorithm References

  • HPSS: Fitzgerald, D. (2010). "Harmonic/Percussive Separation using Median Filtering"
  • Spectral Subtraction: Boll, S. (1979). "Suppression of Acoustic Noise in Speech Using Spectral Subtraction"

License

MIT License - see LICENSE for details.

Contributing

Contributions welcome! Please open an issue first to discuss what you would like to change.

Acknowledgments

Developed to improve audio captured by a wearable device project.
