torch-log-wmse-audio-quality

logWMSE is an audio quality metric and loss function with support for digital silence targets. Useful for training and evaluating audio source separation systems.

License
- OSI Approved :: Apache Software License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

This repository contains a PyTorch implementation of logWMSE, an audio quality metric originally proposed by Iver Jordal of Nomono. In addition to the original metric, this implementation can also be used as a loss function for training audio separation and denoising models.

logWMSE is a custom metric and loss function for audio signals that calculates the logarithm (log) of a frequency-weighted (W) Mean Squared Error (MSE). It is designed to address several shortcomings of common audio metrics, most importantly the lack of support for digital silence targets.

Installation

pip install torch-log-wmse

Usage Example

import torch
from torch_log_wmse import LogWMSE

# Tensor shapes
audio_length = 1.0
sample_rate = 44100
audio_stems = 4 # 4 audio stems (e.g. vocals, drums, bass, other)
audio_channels = 2 # stereo
batch = 4 # batch size

# Instantiate logWMSE
# Set `return_as_loss=False` to return a positive metric instead of a loss
log_wmse = LogWMSE(audio_length=audio_length, sample_rate=sample_rate, return_as_loss=True)

# Generate random inputs (scale between -1 and 1)
audio_lengths_samples = int(audio_length * sample_rate)
unprocessed_audio = 2 * torch.rand(batch, audio_channels, audio_lengths_samples) - 1
processed_audio = unprocessed_audio.unsqueeze(1).expand(-1, audio_stems, -1, -1) * 0.1
target_audio = torch.zeros(batch, audio_stems, audio_channels, audio_lengths_samples)

# Store the result under a new name so the LogWMSE instance is not overwritten
loss = log_wmse(unprocessed_audio, processed_audio, target_audio)
print(loss)  # Expected output: approx. -18.42

logWMSE accepts three torch tensors of the following shapes:

unprocessed_audio: [batch, audio_channels, samples]
processed_audio: [batch, audio_stems, audio_channels, samples]
target_audio: [batch, audio_stems, audio_channels, samples]

Each dimension being:

batch: Number of audio files in a batch (i.e. batch size).
audio_stems: Number of separate audio sources. For source separation, this could be multiple different instruments, vocals, etc. For denoising audio, this will be 1.
audio_channels: Number of channels (i.e. 1 for mono and 2 for stereo).
samples: Number of audio samples (e.g. 1 second of audio at 44.1 kHz is 44100 samples).
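
The shape contract above can be checked before calling the metric. The helper below is a minimal sketch: `check_log_wmse_shapes` is a hypothetical name that is not part of the package, and it only works on shape tuples, so it does not depend on torch.

```python
from typing import Sequence, Tuple


def check_log_wmse_shapes(
    unprocessed: Sequence[int],
    processed: Sequence[int],
    target: Sequence[int],
) -> Tuple[int, int, int, int]:
    """Hypothetical helper (not part of torch-log-wmse).

    Validates the three input shapes described above and returns
    (batch, audio_stems, audio_channels, samples).
    """
    if len(unprocessed) != 3:
        raise ValueError("unprocessed_audio must be [batch, audio_channels, samples]")
    if len(processed) != 4 or len(target) != 4:
        raise ValueError(
            "processed_audio and target_audio must be "
            "[batch, audio_stems, audio_channels, samples]"
        )
    if tuple(processed) != tuple(target):
        raise ValueError("processed_audio and target_audio must have identical shapes")
    batch, stems, channels, samples = processed
    if tuple(unprocessed) != (batch, channels, samples):
        raise ValueError("unprocessed_audio must match batch, audio_channels and samples")
    return batch, stems, channels, samples


# Source separation: 4 stems, stereo, 1 second at 44.1 kHz
print(check_log_wmse_shapes((4, 2, 44100), (4, 4, 2, 44100), (4, 4, 2, 44100)))

# Denoising: a single stem, mono
print(check_log_wmse_shapes((8, 1, 44100), (8, 1, 1, 44100), (8, 1, 1, 44100)))
```

With real tensors, the same check can be fed `tensor.shape` for each of the three inputs.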

Motivation

The goal of this metric is to account for several factors that current audio evaluation metrics do not handle, such as digital silence. Mean Squared Error (MSE) is well-defined for digital silence targets, but has its own set of drawbacks. To mitigate these issues, logWMSE has the following attributes:

Supports digital silence targets, which are not supported by other audio metrics such as (SI-)SDR, SIR, SAR, ISR, VISQOL_audio, STOI, CDPAM, and VISQOL.
Overcomes the small value range of MSE (e.g. between 1e-8 and 1e-3), which makes scores easier to format and read at a glance. It is scaled similarly to SI-SDR for consistency with current benchmark metrics (i.e. 3 is poor, 30 is very good).
Scale-invariant (when all inputs are scaled by the same gain) and aligned with the frequency sensitivity of human hearing.
Invariant to the tiny errors of MSE that are inaudible to humans.
Logarithmic, reflecting the logarithmic sensitivity of human hearing.
Tailored specifically for audio signals.
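
The digital silence problem in the first attribute can be made concrete with a small sketch. It uses the textbook SDR definition, 10 * log10(||target||^2 / ||target - estimate||^2), and a plain MSE; both functions are illustrative stand-ins written for this example, not code from the package.

```python
import math
from typing import Sequence


def mse(estimate: Sequence[float], target: Sequence[float]) -> float:
    """Plain mean squared error between two equally long signals."""
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)


def sdr_db(estimate: Sequence[float], target: Sequence[float]) -> float:
    """Textbook SDR in dB. Returns NaN when the target has no energy."""
    signal_energy = sum(t * t for t in target)
    error_energy = sum((t - e) ** 2 for e, t in zip(estimate, target))
    if signal_energy == 0.0:
        return float("nan")  # log10(0) is undefined for a silent target
    if error_energy == 0.0:
        return float("inf")  # perfect estimate
    return 10.0 * math.log10(signal_energy / error_energy)


# Target is digital silence; the estimate leaks a very quiet 440 Hz tone
silent_target = [0.0] * 1000
estimate = [1e-3 * math.sin(2 * math.pi * 440 * n / 44100) for n in range(1000)]

print("SDR:", sdr_db(estimate, silent_target))  # not a usable number
print("MSE:", mse(estimate, silent_target))     # defined, but a tiny value
```

The MSE stays defined but falls in the hard-to-read small range mentioned above, which is the gap a logarithmic scale is meant to close.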

Frequency Weighting

To weight the frequency content of a signal in a way that is closer to human hearing, the following frequency weighting is applied. This makes the model pay less attention to errors at frequencies that humans are not sensitive to (e.g. 50 Hz) and gives more weight to those that we are acutely tuned to (e.g. 3 kHz).

[Figure: Frequency Weighting]

This metric has been constructed with high-fidelity audio in mind (sample rates ≥ 44.1 kHz). It can in theory be used with lower sample rates, such as 16 kHz; note that the metric internally resamples the audio to 44.1 kHz for consistency across input sample rates.
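
To give a feel for what a hearing-inspired weighting does at the frequencies mentioned in this section, the sketch below evaluates the standard A-weighting curve. This is a stand-in chosen for illustration: the filter actually used by logWMSE is not specified here and may differ, and `a_weighting_db` is a name made up for this example.

```python
import math


def a_weighting_db(frequency_hz: float) -> float:
    """Standard A-weighting gain in dB (illustrative stand-in only)."""
    f2 = frequency_hz ** 2
    numerator = (12194.0 ** 2) * f2 ** 2
    denominator = (
        (f2 + 20.6 ** 2)
        * math.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
        * (f2 + 12194.0 ** 2)
    )
    # The +2.0 dB offset normalises the curve to about 0 dB at 1 kHz
    return 20.0 * math.log10(numerator / denominator) + 2.0


for frequency in (50.0, 1000.0, 3000.0):
    print(f"{frequency:7.0f} Hz: {a_weighting_db(frequency):+6.1f} dB")
```

Under this curve an error at 50 Hz is attenuated by roughly 30 dB relative to 1 kHz, while an error around 3 kHz is slightly boosted, which matches the behaviour described above.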

Inputs

Unlike many audio quality metrics, logWMSE accepts 3 audio inputs rather than 2:

Unprocessed audio (e.g. raw, noisy audio)
Processed audio (e.g. denoised or separated audio)
Target audio (e.g. ground truth, clean audio)

Typically, audio loss functions compare only the processed audio against the target audio. However, logWMSE also requires the initial, unprocessed audio because it needs to measure how well the processed audio was attenuated relative to the unprocessed version. This adds a factor that accounts for when the input contains silence (digital zero).

This also adds a factor of scale invariance in the sense that the processed audio needs to be scaled appropriately relative to both the unprocessed audio and the ground truth. Conceptually, this means that if all 3 inputs are scaled by the same arbitrary gain, the metric score stays the same.
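
The role of the unprocessed audio as a level reference can be sketched in a few lines of plain Python. This is not the package's implementation: it omits the frequency weighting entirely, and the `eps` and `scaler` values are assumptions picked so that the sketch lands near the approx. -18.42 quoted in the usage example.

```python
import math
import random
from typing import List


def rms(signal: List[float]) -> float:
    """Root mean square level of a signal."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def log_mse_sketch(
    unprocessed: List[float],
    processed: List[float],
    target: List[float],
    eps: float = 1e-7,    # assumed error floor, keeps the log finite
    scaler: float = 4.0,  # assumed multiplier
) -> float:
    """Illustrative only: log of an MSE normalised by the input level."""
    gain = 1.0 / max(rms(unprocessed), eps)  # the input level sets the scale
    error = sum(((p - t) * gain) ** 2 for p, t in zip(processed, target)) / len(target)
    return scaler * math.log(error + eps)  # lower is better, like a loss


random.seed(0)
unprocessed = [random.uniform(-1.0, 1.0) for _ in range(44100)]
processed = [0.1 * x for x in unprocessed]  # input attenuated by 20 dB
target = [0.0] * len(unprocessed)           # digital silence target

loss = log_mse_sketch(unprocessed, processed, target)
print(round(loss, 2))

# Applying the same gain to all three inputs leaves the score unchanged
g = 0.25
scaled_loss = log_mse_sketch(
    [g * x for x in unprocessed],
    [g * x for x in processed],
    [g * x for x in target],
)
print(math.isclose(loss, scaled_loss, rel_tol=1e-9))
```

Because the error is measured relative to the input level, the silent target still yields a finite, readable score, and a common gain on all three signals cancels out.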

Limitations

The metric isn't invariant to arbitrary scaling, polarity inversion, or offsets in the estimated audio relative to the target.
Although it incorporates frequency filtering inspired by human auditory sensitivity, it doesn't fully model human auditory perception. For instance, it doesn't consider auditory masking.
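
The first limitation follows from the metric being built on a squared error. The sketch below uses a plain, unweighted MSE as a stand-in (not the package's weighted computation) to show how scaling, polarity inversion, and a DC offset in the estimate each change the error against the same target.

```python
import math
from typing import Dict, List


def mse(estimate: List[float], target: List[float]) -> float:
    """Plain mean squared error between two equally long signals."""
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)


# 0.1 seconds of a 440 Hz sine at 44.1 kHz (a whole number of cycles)
target = [math.sin(2 * math.pi * 440 * n / 44100) for n in range(4410)]

candidates: Dict[str, List[float]] = {
    "exact copy": list(target),
    "half gain": [0.5 * t for t in target],
    "inverted polarity": [-t for t in target],
    "dc offset": [t + 0.1 for t in target],
}

errors = {name: mse(estimate, target) for name, estimate in candidates.items()}
for name, error in errors.items():
    print(f"{name:>18}: {error:.4f}")
```

Only the exact copy gives zero error; the inverted estimate, which can sound identical in isolation, is penalised the most.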

Contributing

Contributions are welcome! Please open an issue or submit a pull request if you have any improvements or new features to suggest.

License

This project is licensed under the Apache License 2.0. See LICENSE for details.

Acknowledgments

Thanks to Whitebalance for backing this project.

Release history

Version	Release date
0.3.1	Jan 29, 2026
0.3.0	Apr 24, 2025
0.2.9	Jul 31, 2024
0.2.8	Jul 31, 2024
0.2.7	Jun 21, 2024
0.2.6	Jun 21, 2024
0.2.5	Jun 21, 2024
0.2.4	Jun 21, 2024
0.2.3	Jun 19, 2024
0.2.2	Jun 18, 2024
0.2.1	Jun 18, 2024
0.2.0 (this version)	Jun 18, 2024
0.1.9	Jun 18, 2024
0.1.8	May 22, 2024
0.1.7	May 22, 2024
0.1.6	May 20, 2024
0.1.5	May 17, 2024
0.1.4	May 16, 2024
0.1.3	May 16, 2024
0.1.2	May 16, 2024
0.1.1	May 16, 2024
0.1.0	May 15, 2024

Download files

Source Distribution

torch_log_wmse_audio_quality-0.2.0.tar.gz (15.9 kB)

Uploaded Jun 18, 2024 Source

Built Distribution

torch_log_wmse_audio_quality-0.2.0-py3-none-any.whl (13.9 kB)

Uploaded Jun 18, 2024 Python 3

File details

Details for the file torch_log_wmse_audio_quality-0.2.0.tar.gz.

File metadata

Download URL: torch_log_wmse_audio_quality-0.2.0.tar.gz
Upload date: Jun 18, 2024
Size: 15.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.0 CPython/3.8.18

File hashes

Hashes for torch_log_wmse_audio_quality-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`7c7a67fb6b1dd09c659fc6b47d34053c232d7a77526fcf71413e406d69616b15`
MD5	`703575edc2d99bf3a5a9dede456beab8`
BLAKE2b-256	`4e181de445b4ea3e7697199d80ef4e9d695767fb6a43b5a7b9c4deba40822e17`

File details

Details for the file torch_log_wmse_audio_quality-0.2.0-py3-none-any.whl.

File metadata

Download URL: torch_log_wmse_audio_quality-0.2.0-py3-none-any.whl
Upload date: Jun 18, 2024
Size: 13.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.0 CPython/3.8.18

File hashes

Hashes for torch_log_wmse_audio_quality-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c374b76869fd9cfedd00160edd48684a89ed2490048d19f0a7fa0121578cc212`
MD5	`0f6a567deec2a3e3d929a0401d557539`
BLAKE2b-256	`7c375a9f9653452b262d91f39f05051b66312863afb1bad41a3ee8b7bad00091`
