Models for the paper 'Analysis of XLS-R for Speech Quality Assessment'.

xls-r-analysis-sqa

1. Overview

This repository hosts the models for the paper "Analysis of XLS-R for Speech Quality Assessment".

1.1. Performance On Unseen Datasets

Comparison of model performance on each unseen corpus individually (NISQA, IUB) and on both combined (Unseen). The metric is RMSE; lower is better.

V1 Results
| Model | NISQA | IUB | Unseen |
| --- | --- | --- | --- |
| XLS-R 300M Layer24 Bi-LSTM [1] | 0.5907 | 0.5067 | 0.5323 |
| DNSMOS [2] | 0.8718 | 0.5452 | 0.6565 |
| MFCC Transformer | 0.8280 | 0.7775 | 0.7924 |
| XLS-R 300M Layer5 Transformer | 0.6256 | 0.5049 | 0.5425 |
| XLS-R 300M Layer21 Transformer | 0.5694 | 0.5025 | 0.5227 |
| XLS-R 300M Layer5+21 Transformer | 0.5683 | 0.4886 | 0.5129 |
| XLS-R 1B Layer10 Transformer | 0.5456 | 0.5815 | 0.5713 |
| XLS-R 1B Layer41 Transformer | 0.5657 | 0.4656 | 0.4966 |
| XLS-R 1B Layer10+41 Transformer | 0.5748 | 0.5288 | 0.5425 |
| XLS-R 2B Layer10 Transformer | 0.6277 | 0.4899 | 0.5334 |
| XLS-R 2B Layer41 Transformer | 0.5724 | 0.4897 | 0.5150 |
| XLS-R 2B Layer10+41 Transformer | 0.6036 | 0.4743 | 0.5150 |
| Human | 0.6738 | 0.6573 | 0.6629 |

V2 Results

UPDATE: The code now uses version 2 of the models. Version 1 mistakenly used the final model checkpoint; version 2 uses the checkpoint with the minimum validation loss.
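The difference between the two versions comes down to which checkpoint is kept after training. As a minimal sketch of that idea (the `Checkpoint` record, file names, and loss values are hypothetical and are not taken from this repository's training code), selecting by minimum validation loss instead of taking the last epoch might look like this:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Checkpoint:
    """Hypothetical record of one saved checkpoint (illustration only)."""
    epoch: int
    val_loss: float
    path: str


def final_checkpoint(history: List[Checkpoint]) -> Checkpoint:
    """V1-style choice: the checkpoint from the last epoch."""
    return max(history, key=lambda c: c.epoch)


def best_checkpoint(history: List[Checkpoint]) -> Checkpoint:
    """V2-style choice: the checkpoint with the lowest validation loss."""
    return min(history, key=lambda c: c.val_loss)


# Made-up training history in which the validation loss rises again at the end.
history = [
    Checkpoint(epoch=1, val_loss=0.61, path="epoch1.pt"),
    Checkpoint(epoch=2, val_loss=0.48, path="epoch2.pt"),
    Checkpoint(epoch=3, val_loss=0.52, path="epoch3.pt"),
]

print(final_checkpoint(history).path)
print(best_checkpoint(history).path)
```

With this made-up history the two rules pick different files, which is the situation the V2 update addresses.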

| Model | NISQA | IUB | Unseen |
| --- | --- | --- | --- |
| XLS-R 300M Layer24 Bi-LSTM [1] | 0.5907 | 0.5067 | 0.5323 |
| DNSMOS [2] | 0.8718 | 0.5452 | 0.6565 |
| MFCC Transformer | 0.9291 | 0.7415 | 0.8003 |
| XLS-R 300M Layer5 Transformer | 0.6494 | 0.5117 | 0.5550 |
| XLS-R 300M Layer21 Transformer | 0.5852 | 0.4838 | 0.5152 |
| XLS-R 300M Layer5+21 Transformer | 0.5861 | 0.4768 | 0.5108 |
| XLS-R 1B Layer10 Transformer | 0.6217 | 0.4763 | 0.5225 |
| XLS-R 1B Layer41 Transformer | 0.5615 | 0.4646 | 0.4946 |
| XLS-R 1B Layer10+41 Transformer | 0.6024 | 0.4624 | 0.5068 |
| XLS-R 2B Layer10 Transformer | 0.5227 | 0.4447 | 0.4686 |
| XLS-R 2B Layer41 Transformer | 0.5295 | 0.4926 | 0.5035 |
| XLS-R 2B Layer10+41 Transformer | 0.5191 | 0.4573 | 0.4760 |
| Human | 0.6738 | 0.6573 | 0.6629 |

[1] Tamm, B., Balabin, H., Vandenberghe, R., Van hamme, H. (2022) Pre-trained Speech Representations as Feature Extractors for Speech Quality Assessment in Online Conferencing Applications. Proc. Interspeech 2022, 4083-4087, doi: 10.21437/Interspeech.2022-10147

[2] Reddy, C. K. A., Gopal, V., Cutler, R. (2021) DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors. Proc. ICASSP 2021, 6493-6497, doi: 10.1109/ICASSP39728.2021.9414878
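For readers unfamiliar with the metric, the following is a minimal sketch of how an RMSE between predicted and ground-truth MOS values can be computed. The `rmse` helper and the toy numbers are illustrative only. Treating the combined "Unseen" score as the RMSE over the pooled utterances of both corpora is an assumption about how the tables were produced, not something this README states.

```python
import math
from typing import Sequence


def rmse(predictions: Sequence[float], targets: Sequence[float]) -> float:
    """Root-mean-square error between two equally long sequences."""
    if len(predictions) != len(targets) or len(predictions) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up MOS predictions and labels for two small "corpora".
corpus_a_pred, corpus_a_true = [3.0, 4.0], [2.0, 5.0]
corpus_b_pred, corpus_b_true = [1.5, 2.5, 3.5, 4.5], [1.5, 2.5, 3.5, 4.5]

rmse_a = rmse(corpus_a_pred, corpus_a_true)
rmse_b = rmse(corpus_b_pred, corpus_b_true)

# Assumption: the combined score pools all utterances before averaging,
# so larger corpora contribute more to the result.
rmse_pooled = rmse(corpus_a_pred + corpus_b_pred, corpus_a_true + corpus_b_true)

print(rmse_a, rmse_b, rmse_pooled)
```

Under this pooling assumption the combined value is not the plain average of the per-corpus values; it is weighted by the number of utterances in each corpus.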

1.2. Visualization of MOS Predictions

MOS predictions on two unseen datasets: NISQA (top) and IU Bloomington (bottom). Our proposed model based on embeddings extracted from the 10th layer of the pre-trained XLS-R 2B outperforms DNSMOS and the MFCC baseline. The human ACRs are also visualized for the IUB corpus.

1.3. Example Audio Segments

Excellent (MOS = 4.808)

| Model | Prediction | Error |
| --- | --- | --- |
| DNSMOS | 3.699 | -1.109 |
| MFCC Transformer | 3.497 | -1.311 |
| XLS-R 2B Layer10 Transformer | 3.935 | -0.873 |

Good (MOS = 4.104)

| Model | Prediction | Error |
| --- | --- | --- |
| DNSMOS | 3.269 | -0.835 |
| MFCC Transformer | 2.498 | -1.606 |
| XLS-R 2B Layer10 Transformer | 3.793 | -0.311 |

Fair (MOS = 3.168)

| Model | Prediction | Error |
| --- | --- | --- |
| DNSMOS | 3.309 | +0.141 |
| MFCC Transformer | 3.931 | +0.763 |
| XLS-R 2B Layer10 Transformer | 3.080 | -0.088 |

Poor (MOS = 2.240)

| Model | Prediction | Error |
| --- | --- | --- |
| DNSMOS | 2.704 | +0.464 |
| MFCC Transformer | 1.927 | -0.313 |
| XLS-R 2B Layer10 Transformer | 2.284 | +0.044 |

Bad (MOS = 1.416)

| Model | Prediction | Error |
| --- | --- | --- |
| DNSMOS | 2.553 | +1.137 |
| MFCC Transformer | 1.806 | +0.390 |
| XLS-R 2B Layer10 Transformer | 2.312 | +0.896 |
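The Error values in these examples are consistent with the signed difference between the prediction and the ground-truth MOS, so a negative error means the model underestimated the quality. A small sketch reproducing the "Excellent" example (the `signed_error` helper is a name introduced here for illustration):

```python
def signed_error(prediction: float, mos: float) -> float:
    """Prediction minus ground-truth MOS, rounded to three decimals."""
    return round(prediction - mos, 3)


# Values copied from the "Excellent (MOS = 4.808)" example.
mos = 4.808
predictions = {
    "DNSMOS": 3.699,
    "MFCC Transformer": 3.497,
    "XLS-R 2B Layer10 Transformer": 3.935,
}

errors = {name: signed_error(pred, mos) for name, pred in predictions.items()}

for name, err in errors.items():
    print(f"{name}: {err:+.3f}")
```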

2. Installation

Option A: Install via pip (Recommended)

pip install xls-r-sqa

Option B: Install From Source

First, clone the repository.

git clone https://github.com/lcn-kul/xls-r-analysis-sqa.git

Next, install the requirements to a virtual environment of your choice.

cd xls-r-analysis-sqa/
pip3 install -r requirements.txt

3. Truncated XLS-R Models

This code uses truncated XLS-R models. By default, the code will attempt to auto-download the required truncated XLS-R model from Hugging Face whenever you create an E2EModel that uses XLS-R. For example:

from xls_r_sqa.config import XLSR_2B_TRANSFORMER_32DEEP_CONFIG
from xls_r_sqa.e2e_model import E2EModel

model = E2EModel(
    config=XLSR_2B_TRANSFORMER_32DEEP_CONFIG,
    xlsr_layers=10,
    auto_download=True  # <-- default is True
)

If you do not wish to auto-download, or if you would like to choose your own save location, there are two manual approaches:

  1. Download Truncated Models: Clone the truncated XLS-R repositories from Hugging Face (using Git LFS). Follow the instructions in xls_r_sqa/models/xls-r-trunc/README.md.

  2. Truncate Full XLS-R Yourself: Download the full pre-trained XLS-R models (see the instructions in xls_r_sqa/models/xls-r/README.md) and then run truncate_w2v2.py to create the truncated versions locally.

Warning: The combined size of all truncated XLS-R repos is approximately 15 GB (plus .git overhead, effectively doubling the storage needed). Make sure you have sufficient disk space before downloading or truncating them yourself.
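One way to act on this warning is to check the free space on the target drive before cloning. The sketch below uses only the Python standard library. The 30 GB default is derived from the figures above (about 15 GB of models, roughly doubled by .git overhead), and the helper names are introduced here for illustration and are not part of xls_r_sqa.

```python
import shutil


def free_space_gb(path: str = ".") -> float:
    """Free space, in GB (10**9 bytes), on the filesystem containing `path`."""
    return shutil.disk_usage(path).free / 1e9


def has_enough_space(path: str = ".", required_gb: float = 30.0) -> bool:
    """True if at least `required_gb` GB are free at `path`."""
    return free_space_gb(path) >= required_gb


if __name__ == "__main__":
    target = "."  # replace with the directory where the models will be stored
    if has_enough_space(target):
        print(f"OK: {free_space_gb(target):.1f} GB free")
    else:
        print(f"Only {free_space_gb(target):.1f} GB free; about 30 GB recommended")
```

If only a single truncated model is needed, the requirement is smaller than this combined figure, so the threshold can be lowered accordingly.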

4. Usage

A working example is provided in test_e2e_sqa.py.

5. Citation

@INPROCEEDINGS{10248049,
  author={Tamm, Bastiaan and Vandenberghe, Rik and Van Hamme, Hugo},
  booktitle={2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)}, 
  title={Analysis of XLS-R for Speech Quality Assessment}, 
  year={2023},
  volume={},
  number={},
  pages={1-5},
  doi={10.1109/WASPAA58266.2023.10248049}
}
