Models for the paper 'Analysis of XLS-R for Speech Quality Assessment'.
xls-r-analysis-sqa
1. Overview
This repository hosts the models for the paper "Analysis of XLS-R for Speech Quality Assessment".
1.1. Performance On Unseen Datasets
Comparison of model performance on each unseen corpus individually (NISQA, IUB) and on both corpora combined (Unseen). The metric is RMSE; lower is better.
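As a reminder of what the tables report, RMSE is the square root of the mean squared difference between predicted and ground-truth MOS. The sketch below is a minimal illustration, assuming that a combined score is obtained by pooling the utterances of both corpora and computing a single RMSE (rather than averaging the two per-corpus scores). The function name `rmse` and all numbers are placeholders for illustration, not data from NISQA or IUB.

```python
import math

def rmse(predicted, target):
    """Root-mean-square error between predicted and ground-truth MOS."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return math.sqrt(sum(squared) / len(squared))

# Placeholder values for illustration only (not taken from any corpus).
corpus_a_pred, corpus_a_mos = [3.0, 4.0], [4.0, 5.0]
corpus_b_pred, corpus_b_mos = [2.0, 2.0, 4.0, 1.0], [2.0, 3.0, 4.0, 1.0]

rmse_a = rmse(corpus_a_pred, corpus_a_mos)
rmse_b = rmse(corpus_b_pred, corpus_b_mos)

# Pooled score: concatenate the utterances, then compute one RMSE.
rmse_pooled = rmse(corpus_a_pred + corpus_b_pred, corpus_a_mos + corpus_b_mos)

print(f"A: {rmse_a:.4f}  B: {rmse_b:.4f}  pooled: {rmse_pooled:.4f}")
```

Under this assumption the pooled value is weighted by corpus size, so it generally differs from the plain average of the two per-corpus values.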
V1 Results
| Model | NISQA | IUB | Unseen |
|---|---|---|---|
| XLS-R 300M Layer24 Bi-LSTM [1] | 0.5907 | 0.5067 | 0.5323 |
| DNSMOS [2] | 0.8718 | 0.5452 | 0.6565 |
| MFCC Transformer | 0.8280 | 0.7775 | 0.7924 |
| XLS-R 300M Layer5 Transformer | 0.6256 | 0.5049 | 0.5425 |
| XLS-R 300M Layer21 Transformer | 0.5694 | 0.5025 | 0.5227 |
| XLS-R 300M Layer5+21 Transformer | 0.5683 | 0.4886 | 0.5129 |
| XLS-R 1B Layer10 Transformer | 0.5456 | 0.5815 | 0.5713 |
| XLS-R 1B Layer41 Transformer | 0.5657 | 0.4656 | 0.4966 |
| XLS-R 1B Layer10+41 Transformer | 0.5748 | 0.5288 | 0.5425 |
| XLS-R 2B Layer10 Transformer | 0.6277 | 0.4899 | 0.5334 |
| XLS-R 2B Layer41 Transformer | 0.5724 | 0.4897 | 0.5150 |
| XLS-R 2B Layer10+41 Transformer | 0.6036 | 0.4743 | 0.5150 |
| Human | 0.6738 | 0.6573 | 0.6629 |
V2 Results
UPDATE: The code has been updated to use version 2 of the models. Version 1 used the final model checkpoint by mistake; version 2 uses the checkpoint with the minimum validation loss.
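The difference between the two selection rules can be sketched as follows. This is a minimal illustration, not the training code of this repository: the `Checkpoint` record, the file names, and the loss values are all hypothetical.

```python
from typing import List, NamedTuple

class Checkpoint(NamedTuple):
    path: str        # hypothetical file name
    val_loss: float  # validation loss recorded when the checkpoint was saved

def select_final(checkpoints: List[Checkpoint]) -> Checkpoint:
    """Version 1 behaviour: take whatever was saved last."""
    return checkpoints[-1]

def select_best(checkpoints: List[Checkpoint]) -> Checkpoint:
    """Version 2 behaviour: take the checkpoint with the lowest validation loss."""
    return min(checkpoints, key=lambda c: c.val_loss)

# Made-up training history in which the loss rises again after epoch 2.
history = [
    Checkpoint("epoch_01.pt", 0.412),
    Checkpoint("epoch_02.pt", 0.335),
    Checkpoint("epoch_03.pt", 0.351),
]

print("final:", select_final(history).path)
print("best: ", select_best(history).path)
```

When the validation loss increases late in training, the two rules pick different checkpoints, which is why the V1 and V2 numbers differ for the models trained in this work.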
| Model | NISQA | IUB | Unseen |
|---|---|---|---|
| XLS-R 300M Layer24 Bi-LSTM [1] | 0.5907 | 0.5067 | 0.5323 |
| DNSMOS [2] | 0.8718 | 0.5452 | 0.6565 |
| MFCC Transformer | 0.9291 | 0.7415 | 0.8003 |
| XLS-R 300M Layer5 Transformer | 0.6494 | 0.5117 | 0.5550 |
| XLS-R 300M Layer21 Transformer | 0.5852 | 0.4838 | 0.5152 |
| XLS-R 300M Layer5+21 Transformer | 0.5861 | 0.4768 | 0.5108 |
| XLS-R 1B Layer10 Transformer | 0.6217 | 0.4763 | 0.5225 |
| XLS-R 1B Layer41 Transformer | 0.5615 | 0.4646 | 0.4946 |
| XLS-R 1B Layer10+41 Transformer | 0.6024 | 0.4624 | 0.5068 |
| XLS-R 2B Layer10 Transformer | 0.5227 | 0.4447 | 0.4686 |
| XLS-R 2B Layer41 Transformer | 0.5295 | 0.4926 | 0.5035 |
| XLS-R 2B Layer10+41 Transformer | 0.5191 | 0.4573 | 0.4760 |
| Human | 0.6738 | 0.6573 | 0.6629 |
[1] Tamm, B., Balabin, H., Vandenberghe, R., Van hamme, H. (2022) Pre-trained Speech Representations as Feature Extractors for Speech Quality Assessment in Online Conferencing Applications. Proc. Interspeech 2022, 4083-4087, doi: 10.21437/Interspeech.2022-10147
[2] C. K. A. Reddy, V. Gopal and R. Cutler, "DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors," ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 2021, pp. 6493-6497, doi: 10.1109/ICASSP39728.2021.9414878.
1.2. Visualization of MOS Predictions
MOS predictions on two unseen datasets: NISQA (top) and IU Bloomington (bottom). Our proposed model based on embeddings extracted from the 10th layer of the pre-trained XLS-R 2B outperforms DNSMOS and the MFCC baseline. The human ACRs are also visualized for the IUB corpus.
1.3. Example Audio Segments
Excellent (MOS = 4.808)
| Model | Prediction | Error |
|---|---|---|
| DNSMOS | 3.699 | -1.109 |
| MFCC Transformer | 3.497 | -1.311 |
| XLS-R 2B Layer10 Transformer | 3.935 | -0.873 |
Good (MOS = 4.104)
| Model | Prediction | Error |
|---|---|---|
| DNSMOS | 3.269 | -0.835 |
| MFCC Transformer | 2.498 | -1.606 |
| XLS-R 2B Layer10 Transformer | 3.793 | -0.311 |
Fair (MOS = 3.168)
| Model | Prediction | Error |
|---|---|---|
| DNSMOS | 3.309 | +0.141 |
| MFCC Transformer | 3.931 | +0.763 |
| XLS-R 2B Layer10 Transformer | 3.080 | -0.088 |
Poor (MOS = 2.240)
| Model | Prediction | Error |
|---|---|---|
| DNSMOS | 2.704 | +0.464 |
| MFCC Transformer | 1.927 | -0.313 |
| XLS-R 2B Layer10 Transformer | 2.284 | +0.044 |
Bad (MOS = 1.416)
| Model | Prediction | Error |
|---|---|---|
| DNSMOS | 2.553 | +1.137 |
| MFCC Transformer | 1.806 | +0.390 |
| XLS-R 2B Layer10 Transformer | 2.312 | +0.896 |
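The Error column in these examples is the signed difference between the prediction and the ground-truth MOS, so a negative value means the model underestimates quality. The short sketch below reproduces the errors for the "Fair" segment from the values listed above; the variable names are chosen for illustration only.

```python
# Ground-truth MOS and predictions for the "Fair" example segment.
mos = 3.168
predictions = {
    "DNSMOS": 3.309,
    "MFCC Transformer": 3.931,
    "XLS-R 2B Layer10 Transformer": 3.080,
}

# Signed error: prediction minus ground truth, rounded to three decimals.
errors = {name: round(pred - mos, 3) for name, pred in predictions.items()}

# The model with the smallest absolute error is closest to the human rating.
closest = min(errors, key=lambda name: abs(errors[name]))

for name, err in errors.items():
    print(f"{name}: {err:+.3f}")
print("closest:", closest)
```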
2. Installation
Option A: Install via pip (Recommended)
pip install xls-r-sqa
Option B: Install From Source
First, clone the repository.
git clone https://github.com/lcn-kul/xls-r-analysis-sqa.git
Next, install the requirements to a virtual environment of your choice.
cd xls-r-analysis-sqa/
pip3 install -r requirements.txt
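After either option, a quick way to confirm that the package is visible to the active interpreter is to look it up without importing it. This is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper, and the module name `xls_r_sqa` is taken from the import statements used elsewhere in this README. For Option B, it assumes the check is run from the repository root.

```python
import importlib.util

def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None

if is_installed("xls_r_sqa"):
    print("xls_r_sqa is importable")
else:
    print("xls_r_sqa not found; check that the right environment is active")
```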
3. Truncated XLS-R Models
This code uses truncated XLS-R models. By default, the code will attempt to auto-download the required truncated XLS-R model from Hugging Face whenever you create an E2EModel that uses XLS-R. For example:
from xls_r_sqa.config import XLSR_2B_TRANSFORMER_32DEEP_CONFIG
from xls_r_sqa.e2e_model import E2EModel
model = E2EModel(
    config=XLSR_2B_TRANSFORMER_32DEEP_CONFIG,
    xlsr_layers=10,
    auto_download=True,  # <-- default is True
)
If you do not wish to auto-download, or if you would like to choose your own save location, there are two manual approaches:
- Download Truncated Models: Clone the truncated XLS-R repositories from Hugging Face (using Git LFS). Follow the instructions in xls_r_sqa/models/xls-r-trunc/README.md.
- Truncate Full XLS-R Yourself: Download the full pre-trained XLS-R models (see the instructions in xls_r_sqa/models/xls-r/README.md) and then run `truncate_w2v2.py` to create the truncated versions locally.
Warning: The combined size of all truncated XLS-R repos is approximately 15 GB (plus `.git` overhead, effectively doubling the storage needed). Make sure you have sufficient disk space before downloading or truncating them yourself.
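One way to check the available space before cloning is sketched below, using only the standard library. The 15 GB figure and the doubling factor come from the warning above; the helper name `has_enough_space` and the choice of the current directory as the target are assumptions for illustration.

```python
import shutil

TRUNCATED_MODELS_GB = 15  # approximate combined size quoted in the warning
GIT_OVERHEAD_FACTOR = 2   # .git overhead roughly doubles the footprint

def has_enough_space(path=".", required_gb=TRUNCATED_MODELS_GB * GIT_OVERHEAD_FACTOR):
    """Return True if the filesystem holding `path` has at least `required_gb` free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb

if has_enough_space("."):
    print("Enough free space for all truncated XLS-R repos.")
else:
    print("Less than ~30 GB free; consider another location or fewer models.")
```

If only a single model size is needed, a smaller `required_gb` can be passed instead of the default.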
4. Usage
A working example is provided in test_e2e_sqa.py.
5. Citation
@INPROCEEDINGS{10248049,
author={Tamm, Bastiaan and Vandenberghe, Rik and Van Hamme, Hugo},
booktitle={2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
title={Analysis of XLS-R for Speech Quality Assessment},
year={2023},
volume={},
number={},
pages={1-5},
doi={10.1109/WASPAA58266.2023.10248049}
}