Unofficial fairseq-free PyTorch implementation of UTMOS

UTMOS-PyTorch

About

This is an unofficial fairseq-free implementation of the UTMOS MOS Prediction system proposed in UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022.

The original implementation is based on fairseq. However, fairseq is difficult to install with recent Python, PyTorch, and dependency versions, which makes UTMOS hard to use in modern environments. A recent study from ICASSP 2026 highlights the high correlation of UTMOS with subjective listening scores for neural codecs. Modern neural audio codec and TTS research therefore benefits from an easy-to-install UTMOS implementation.

We provide a fairseq-free implementation written in PyTorch that matches the original system using converted weights and re-written modules.

We also provide a TorchScript variant that can be loaded with only PyTorch, without installing this package.

The PyTorch and TorchScript versions are validated against the original implementation and produce matching scores.

Note: As in the original version, we recommend running UTMOS with batch size 1 to avoid metric shifts caused by padding.

Usage

You can install the repo as a package:

pip install utmos-pytorch

Or from source:

git clone https://github.com/Blinorot/UTMOS-PyTorch.git
cd UTMOS-PyTorch
pip install -e .

The code requires:

Package          Version
Python           >=3.9
PyTorch          >=2.2.0
HuggingFace Hub  >=0.20

The TorchScript checkpoint was scripted with PyTorch 2.5.1. Loading it with older PyTorch versions is not guaranteed; PyTorch >=2.5.1 is recommended for the TorchScript variant.
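The version recommendation above can be checked before loading the TorchScript checkpoint. The following is a minimal sketch, not part of the package: the helper names `parse_version` and `meets_minimum` are hypothetical, and the parser only looks at the leading numeric part of the version string (local suffixes such as `+cu121` are ignored).

```python
import re


def parse_version(version: str) -> tuple:
    """Extract the leading numeric release, e.g. '2.5.1+cu121' -> (2, 5, 1)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    # A missing patch component (e.g. '2.6') is treated as 0.
    return tuple(int(part) if part is not None else 0 for part in match.groups())


def meets_minimum(version: str, minimum: tuple = (2, 5, 1)) -> bool:
    """Return True if `version` is at least `minimum` (hypothetical helper)."""
    return parse_version(version) >= minimum
```

In practice this would be called as `meets_minimum(torch.__version__)`; a `False` result does not mean loading will fail, only that the checkpoint is being loaded outside the recommended range.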

Then, you can run the model as follows:

import torchaudio
from utmos_pytorch import UTMOSScoreTorch

device = "cpu" # set to "cuda" to use on GPU
utmos = UTMOSScoreTorch(device=device) # already in eval mode

# load an audio file, e.g. using torchaudio
audio_path = ... # path to an audio file
wav, sr = torchaudio.load(audio_path)

# convert to mono (keep only the first channel) and resample to 16 kHz
TARGET_SR = 16000
if wav.shape[0] != 1:
    wav = wav[0:1]
if sr != TARGET_SR:
    wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=TARGET_SR)

# put on device
wav = wav.to(device)

# calculate the score
# accepts T, 1xT, Bx1xT
utmos_score = utmos.score(wav) # tensor of shape (batch_size,)

You can replace UTMOSScoreTorch with UTMOSScoreScripted to use the TorchScript variant instead. On first use, the package downloads converted UTMOS weights from Hugging Face Hub and caches them locally using the Hugging Face cache.
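To find where the downloaded weights end up, it can help to resolve the Hugging Face cache directory from the environment. The sketch below assumes the standard `huggingface_hub` conventions (`HF_HUB_CACHE`, then `HF_HOME`, then `XDG_CACHE_HOME`, then `~/.cache/huggingface/hub`); the function name `hub_cache_dir` is hypothetical, and the exact subfolder used for the UTMOS weights is not specified here.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def hub_cache_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Resolve the Hugging Face Hub cache directory from environment variables."""
    env = os.environ if env is None else env
    # An explicit hub cache location takes priority.
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    # Otherwise the hub cache lives under the Hugging Face home directory.
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]).expanduser() / "hub"
    # Fall back to the XDG cache root, or ~/.cache if it is not set.
    cache_root = env.get("XDG_CACHE_HOME") or "~/.cache"
    return Path(cache_root).expanduser() / "huggingface" / "hub"
```

Setting `HF_HOME` before the first import of the package is one way to redirect the cache, for example to a shared disk on a cluster.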

For TorchScript, you can skip installing the package and load the model directly:

import torch
import torchaudio
import wget

# download scripted checkpoint, e.g. using wget
checkpoint_url = "https://huggingface.co/Blinorot/UTMOS-PyTorch/resolve/main/utmos_scripted.pt"
checkpoint_path = ... # path to saved checkpoint
wget.download(checkpoint_url, checkpoint_path)

# load directly with torch.jit
device = "cpu" # set to "cuda" to use on GPU
utmos = torch.jit.load(checkpoint_path, map_location=device)
utmos.eval()

# load an audio file, e.g. using torchaudio
audio_path = ... # path to an audio file
wav, sr = torchaudio.load(audio_path)

# convert to mono (keep only the first channel) and resample to 16 kHz
TARGET_SR = 16000
if wav.shape[0] != 1:
    wav = wav[0:1]
if sr != TARGET_SR:
    wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=TARGET_SR)

# put on device
wav = wav.to(device)

# calculate the score
# accepts T, 1xT, Bx1xT
with torch.no_grad():
    utmos_score = utmos.score(wav) # tensor of shape (batch_size,)

Notes

The model expects audio sampled at 16 kHz.

Accepted tensor shapes:

Shape      Meaning
(T,)       single mono waveform
(1, T)     single mono waveform with channel dimension
(B, 1, T)  batch of mono waveforms
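The accepted shapes in the table can be checked before scoring. This is a minimal sketch of such a pre-check, assuming you pass `tuple(wav.shape)`; the function name `batch_size_for_shape` is hypothetical, and how the package itself reacts to other shapes is not stated here.

```python
def batch_size_for_shape(shape: tuple) -> int:
    """Return the batch size implied by an accepted input shape.

    Accepted shapes follow the table above: (T,), (1, T), (B, 1, T).
    Raises ValueError for anything else, e.g. a stereo (2, T) waveform.
    """
    if len(shape) == 1:
        return 1  # (T,): single mono waveform
    if len(shape) == 2 and shape[0] == 1:
        return 1  # (1, T): single mono waveform with channel dimension
    if len(shape) == 3 and shape[1] == 1:
        return shape[0]  # (B, 1, T): batch of mono waveforms
    raise ValueError(
        f"Unsupported shape {shape}; expected (T,), (1, T), or (B, 1, T)"
    )
```

A check like this catches unconverted stereo input early, before it reaches the model.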

The input should be a floating point PyTorch tensor. Stereo audio should be converted to mono before scoring. utmos.score(wav) returns a tensor of shape (batch_size,), where each value is a predicted MOS score. Higher is better. Batch size 1 is recommended to avoid padding-related score shifts.
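Following the batch size 1 recommendation, a set of files can be scored one waveform at a time and averaged afterwards. The sketch below is independent of the package: `score_one_by_one` and `mean_score` are hypothetical helpers, and `score_fn` stands in for any callable that maps one waveform to one number.

```python
from typing import Callable, Iterable, List


def score_one_by_one(
    score_fn: Callable[[object], float], waveforms: Iterable[object]
) -> List[float]:
    """Score each waveform separately so no padding is ever introduced."""
    return [float(score_fn(wav)) for wav in waveforms]


def mean_score(scores: List[float]) -> float:
    """Average per-file scores into a single corpus-level value."""
    if not scores:
        raise ValueError("No scores to average")
    return sum(scores) / len(scores)
```

With the model from the usage example, `score_fn` could be `lambda wav: utmos.score(wav).item()`, assuming each `wav` is a single mono waveform so that the returned tensor has exactly one element.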

API classes:

Class               Description
UTMOSScoreTorch     PyTorch implementation using converted weights.
UTMOSScoreScripted  Wrapper around the TorchScript checkpoint.

How To Reproduce

To reproduce PyTorch and Scripted checkpoints and validate them against the original UTMOS module, follow the steps below.

First, install all required packages in a new environment:

# Optional
conda create -n utmos python=3.9.7
conda activate utmos

pip install pip==22.0
pip install -r requirements.txt

Then, you need to export weights from the original UTMOS checkpoint:

# add --private to save privately
python extract_state_dict.py --repo-id USERNAME/REPO_NAME_ON_HUGGINGFACE

This will upload the state dict extracted from the original PyTorch Lightning UTMOS checkpoint to Hugging Face. The same state dict is used to load our fairseq-free PyTorch-only module.

To create a scripted version of the PyTorch model, which allows loading UTMOS without class definitions, run

# add --private to save privately
python create_scripted_model.py --repo-id USERNAME/REPO_NAME_ON_HUGGINGFACE

This uploads the scripted model to Hugging Face as well.

Finally, to test that all three variants (Original, PyTorch, Scripted) return the same scores, run

# set --device "cpu" to run on cpu
# set --batch-size to a value bigger than 1 to test batched version
python test.py --device "cuda" --batch-size 1

The models are tested on the test-clean partition of LibriSpeech.

UTMOS Version  Score (LibriSpeech Test-Clean)
Original       4.085875394599128
Torch          4.085875394599128
Scripted       4.085875394599128
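When repeating this comparison in another environment, small floating-point differences may appear, so an exact equality check can be too strict. The following sketch compares the variants with a tolerance; the function name `scores_match` and the default `abs_tol` are assumptions chosen for illustration, not values taken from `test.py`.

```python
import math


def scores_match(scores: dict, abs_tol: float = 1e-6) -> bool:
    """Check that every variant's score agrees with the first one within abs_tol."""
    if not scores:
        raise ValueError("No scores to compare")
    values = list(scores.values())
    reference = values[0]
    return all(
        math.isclose(value, reference, rel_tol=0.0, abs_tol=abs_tol)
        for value in values
    )


# Scores reported in the table above.
reported = {
    "Original": 4.085875394599128,
    "Torch": 4.085875394599128,
    "Scripted": 4.085875394599128,
}
all_equal = scores_match(reported)
```

The reported values are identical to the last digit, so the check passes for any non-negative tolerance.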

Credits

The code is based on the original UTMOS and fairseq repositories.

License

This project is released under the MIT License.

Parts of the implementation are adapted from the original UTMOS and fairseq repositories, which are also MIT licensed. See LICENSES for third-party license texts.

Converted checkpoints are derived from the original UTMOS checkpoint. Original authors retain copyright over the original model and weights.

Citation

If you use this package, please cite the original UTMOS paper:

@inproceedings{saeki22c_interspeech,
  title     = {{UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022}},
  author    = {Takaaki Saeki and Detai Xin and Wataru Nakata and Tomoki Koriyama and Shinnosuke Takamichi and Hiroshi Saruwatari},
  year      = {2022},
  booktitle = {{Interspeech 2022}},
  pages     = {4521--4525},
  doi       = {10.21437/Interspeech.2022-439},
  issn      = {2958-1796},
}
