audio-data-quality-toolkit

Lint your audio datasets before training. 13 automated checks for TTS, ASR, and voice-cloning pipelines. Zero GPU required.

License: MIT. Requires Python 3.9+.

from audio_qa import check_directory

report = check_directory("./my-dataset/")
print(report.summary())
report.export_clean_manifest("clean_files.txt")  # ready for training

Why this exists

Existing tools answer the wrong question for dataset builders:

| Tool | Question it answers | GPU | Output |
| --- | --- | --- | --- |
| NISQA (300+ stars) | "How does this audio sound?" (MOS 1-5) | Yes | Quality score |
| PESQ (600+ stars) | "How degraded is this vs reference?" | No | Quality score |
| DataSpeech (390+ stars) | "Describe this audio for Parler-TTS" | Yes | NL descriptions |
| audio-qa | "Should this file be in my training set?" | No | Score (0-10) + pass/fail + clean manifest |

DataSpeech annotates datasets for conditioned TTS training. NISQA predicts perceptual quality. Neither tells you that 12% of your training set has clipping, 8% is upsampled from 8kHz, and 3% has misaligned transcripts -- the data engineering problems that silently degrade your model.

Quick start

pip install audio-data-quality-toolkit   # or, from a source checkout: pip install -e .
audio-qa ./my-dataset/ --csv report.csv --manifest clean_files.txt

Quality score (0-10)

Every file gets a composite quality score on a 0-10 scale, computed from the signal-level checks. No GPU, no ML model -- just weighted signal metrics.

from audio_qa import check_file

result = check_file("sample.wav")
print(result["quality_score"])   # 8.3
print(result["grade"])           # "B"
print(result["score_components"])
# {'snr': 9.1, 'clipping': 10.0, 'silence': 9.5, 'loudness': 7.2,
#  'metallic': 9.8, 'upsampling': 10.0, 'channel': 10.0, 'duration': 10.0}

| Score | Grade | Meaning |
| --- | --- | --- |
| 9-10 | A | Studio quality, ready for any pipeline |
| 7-9 | B | Good, suitable for most TTS/ASR training |
| 5-7 | C | Acceptable with caveats |
| 3-5 | D | Poor, likely to degrade model quality |
| 0-3 | F | Bad, exclude from training |
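
As a rough illustration of what "weighted signal metrics" can mean, here is a minimal sketch of a weighted average over per-check component scores plus a grade lookup. The weights in `HYPOTHETICAL_WEIGHTS` are invented for this example and are not the toolkit's actual weights (they do not reproduce the 8.3 shown above), and the grade boundaries assume that a score exactly on a boundary takes the higher grade, which the table does not specify.

```python
# Hypothetical weights for illustration only; the toolkit's real weights may differ.
HYPOTHETICAL_WEIGHTS = {
    "snr": 0.25, "clipping": 0.20, "silence": 0.10, "loudness": 0.10,
    "metallic": 0.10, "upsampling": 0.10, "channel": 0.05, "duration": 0.10,
}

# Assumed boundaries: lower bound inclusive (a score of 9.0 is graded "A").
GRADE_FLOORS = [(9.0, "A"), (7.0, "B"), (5.0, "C"), (3.0, "D")]


def composite_score(components, weights=HYPOTHETICAL_WEIGHTS):
    """Weighted average of 0-10 component scores, using only known components."""
    used = {name: w for name, w in weights.items() if name in components}
    total_weight = sum(used.values())
    if total_weight == 0:
        raise ValueError("no weighted components present")
    return sum(components[name] * w for name, w in used.items()) / total_weight


def grade_for(score):
    """Map a 0-10 score to a letter grade using the assumed boundaries."""
    for floor, grade in GRADE_FLOORS:
        if score >= floor:
            return grade
    return "F"


components = {"snr": 9.1, "clipping": 10.0, "silence": 9.5, "loudness": 7.2,
              "metallic": 9.8, "upsampling": 10.0, "channel": 10.0, "duration": 10.0}
score = composite_score(components)
print(round(score, 1), grade_for(score))
```

A plain weighted mean lets a single bad component hide behind several good ones; a real scorer might cap the result or apply penalties instead, which could explain the lower score in the sample output above.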

Directory reports show average score and grade distribution:

report = check_directory("./data")
print(report.summary())
# Total files:  1000
# Clean files:  847 (85%)
# Quality Score: 7.8 / 10  (avg across 1000 files)
# Grade distribution: A=312, B=401, C=134, D=98, F=55

Optional: perceptual MOS (NISQA / UTMOS / PESQ)

For ML-based perceptual scores alongside the signal checks:

from audio_qa.checks.perceptual import check_nisqa, check_utmos, check_pesq

# NISQA: no-reference MOS (1-5 scale), needs PyTorch
result = check_nisqa("sample.wav")        # {"mos": 3.8, "noisiness": 4.1, ...}

# UTMOS: no-reference MOS, needs PyTorch
result = check_utmos("sample.wav")        # {"mos": 4.2, ...}

# PESQ: reference-based (-0.5 to 4.5), needs clean reference
result = check_pesq("degraded.wav", "clean_reference.wav")

Install with: pip install "audio-data-quality-toolkit[perceptual]" (the quotes keep shells such as zsh from expanding the brackets)

HuggingFace integration

from datasets import load_dataset
from audio_qa import audit_hf_dataset

# LibriTTS-R -- cleaned audiobooks
ds = load_dataset("blabble-io/libritts_r", "clean", split="train.clean.100", streaming=True)
report = audit_hf_dataset(ds, max_samples=500)
print(report.summary())

# MLS English -- large-scale multilingual speech
ds = load_dataset("parler-tts/mls_eng", split="train", streaming=True)
report = audit_hf_dataset(ds, max_samples=500)
print(report.summary())

# LibriSpeech ASR -- classic ASR benchmark
ds = load_dataset("openslr/librispeech_asr", "clean", split="validation", streaming=True)
report = audit_hf_dataset(ds, max_samples=500)
print(report.summary())

# LJSpeech -- script-based dataset: needs datasets<4.0 and trust_remote_code=True
# (datasets v4.0+ no longer runs dataset loading scripts)
# ds = load_dataset("keithito/lj_speech", split="train",
#                   streaming=True, trust_remote_code=True)

# Common Voice -- requires accepting terms + HF auth token
# ds = load_dataset("mozilla-foundation/common_voice_13_0", "en",
#                   split="train", streaming=True, token="hf_...")

# Filter to clean samples only
clean_ds = report.filter_hf_dataset(ds)

# Export for review
report.to_csv("qa_report.csv")

Setup for HuggingFace audio datasets:

# datasets v4.0+ requires torchcodec for audio decoding
pip install torchcodec datasets huggingface-hub

# OR pin to datasets v3.x to avoid the torchcodec dependency
pip install "datasets>=2.14,<4.0"

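If you get RuntimeError: Dataset scripts are no longer supported, the dataset relies on a loading script, which datasets v4.0+ no longer runs. Pin to datasets v3.x as shown above and pass trust_remote_code=True to load_dataset(), or look for a script-free (Parquet) version of the dataset on the Hub.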

13 checks

| # | Check | What it catches | GPU |
| --- | --- | --- | --- |
| 1 | SNR estimation | Background noise, hum, hiss | No |
| 2 | Clipping detection | Consecutive samples at max amplitude | No |
| 3 | Silence analysis | Excessive leading/trailing/internal silence | No |
| 4 | Sample rate | Non-standard or mismatched rates | No |
| 5 | Duration bounds | Too short or too long for training | No |
| 6 | Loudness (LUFS) | Audio far from target loudness | No |
| 7 | Metallic artifacts | Robotic/metallic TTS artifacts via spectral flatness | No |
| 8 | Repetition | Word/phrase loops via autocorrelation | No |
| 9 | Channel issues | Stereo, silent channel, phase inversion, dual mono | No |
| 10 | Upsampling detection | Files claiming 22kHz but upsampled from 8kHz | No |
| 11 | Transcript ratio | Misaligned transcripts (chars-per-second) | No |
| 12 | Duplicates | Near-duplicate files via chromagram fingerprinting | No |
| 13 | Transcript alignment | Audio vs expected text (optional, Whisper) | Optional |

Checks 1-12 run on CPU with numpy/scipy/librosa. Check 13 requires pip install "audio-data-quality-toolkit[transcript]".
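
To make check 2 concrete, here is a minimal sketch of detecting "consecutive samples at max amplitude". It is not the toolkit's implementation: the function names, the 0.999 threshold, and the minimum run length of 3 are assumptions for illustration, and samples are assumed to be floats normalized to [-1, 1].

```python
def find_clipped_runs(samples, threshold=0.999, min_run=3):
    """Return (start_index, length) for each run of at least `min_run`
    consecutive samples whose absolute value is >= `threshold`."""
    runs = []
    run_start = None
    for i, x in enumerate(samples):
        if abs(x) >= threshold:
            if run_start is None:
                run_start = i          # a new candidate run begins here
        else:
            if run_start is not None and i - run_start >= min_run:
                runs.append((run_start, i - run_start))
            run_start = None
    # Handle a run that extends to the end of the signal.
    if run_start is not None and len(samples) - run_start >= min_run:
        runs.append((run_start, len(samples) - run_start))
    return runs


def clipping_ratio(samples, threshold=0.999, min_run=3):
    """Fraction of samples that belong to a clipped run."""
    if len(samples) == 0:
        return 0.0
    clipped = sum(length for _, length in find_clipped_runs(samples, threshold, min_run))
    return clipped / len(samples)


signal = [0.1, 1.0, 1.0, 1.0, 0.2, -1.0, -1.0, 0.0, 1.0, 1.0, 1.0, 1.0]
print(find_clipped_runs(signal))
```

Requiring a run rather than a single full-scale sample avoids flagging legitimate peaks that touch the maximum once.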

Python API

Single file

from audio_qa import check_file

result = check_file("sample.wav", expected_sr=22050)
for check in result["checks"]:
    if not check["passed"]:
        print(check["check"], check.get("severity"))
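
When auditing many files it can help to see which checks fail most often. The sketch below relies only on the result shape shown above (a "checks" list whose entries have "check", "passed", and an optional "severity"); the helper names and the sample check names are hypothetical.

```python
from collections import Counter


def failed_checks(result):
    """Return (check_name, severity) pairs for every failed check in one result."""
    return [(c["check"], c.get("severity"))
            for c in result["checks"] if not c["passed"]]


def failure_counts(results):
    """Count how many results failed each check."""
    counts = Counter()
    for result in results:
        for name, _severity in failed_checks(result):
            counts[name] += 1
    return counts


# Synthetic results standing in for check_file() output (names are made up).
results = [
    {"checks": [{"check": "clipping", "passed": False, "severity": "error"},
                {"check": "snr", "passed": True}]},
    {"checks": [{"check": "clipping", "passed": False, "severity": "error"},
                {"check": "snr", "passed": False, "severity": "warning"}]},
]
print(failure_counts(results).most_common())
```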

Directory with exports

from audio_qa import check_directory

report = check_directory("./data", workers=8)
print(report.summary())

report.to_csv("qa_report.csv")         # spreadsheet
report.to_json("qa_report.json")       # structured
report.export_clean_manifest("clean.txt")  # filepaths that passed all checks
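
The clean manifest is a plain text file with one filepath per line, so consuming it from a training script needs nothing beyond the standard library. A minimal sketch follows; `read_manifest` and `split_existing` are hypothetical helper names, not part of audio_qa.

```python
from pathlib import Path


def read_manifest(manifest_path):
    """Read a one-path-per-line manifest, skipping blank lines."""
    lines = Path(manifest_path).read_text(encoding="utf-8").splitlines()
    return [Path(line.strip()) for line in lines if line.strip()]


def split_existing(paths):
    """Separate paths that still exist on disk from ones that have gone missing."""
    existing = [p for p in paths if p.exists()]
    missing = [p for p in paths if not p.exists()]
    return existing, missing


# Usage (assuming clean.txt was written by export_clean_manifest):
# existing, missing = split_existing(read_manifest("clean.txt"))
```

Checking for missing files guards against a manifest that was generated before the dataset was moved or pruned.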

HuggingFace dataset

from datasets import load_dataset
from audio_qa import audit_hf_dataset

# LJSpeech is script-based: needs datasets<4.0 and trust_remote_code=True
ds = load_dataset("keithito/lj_speech", split="train", streaming=True,
                  trust_remote_code=True)
report = audit_hf_dataset(ds, max_samples=500)
print(report.summary())

Transcript alignment (no ML)

from audio_qa.checks.transcript_ratio import check_transcript_ratio

result = check_transcript_ratio(duration_s=4.2, transcript="Hello world test.")
# result["cps"] -> characters per second
# result["passed"] -> True if within normal speech range (5-25 CPS)
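
The characters-per-second idea is simple enough to sketch directly. This is an illustration, not the library's code: it assumes every character of the stripped transcript counts (spaces and punctuation included), which may differ from how check_transcript_ratio counts, and it reuses the 5-25 CPS range quoted above.

```python
def chars_per_second(transcript, duration_s):
    """Characters of (stripped) transcript per second of audio."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return len(transcript.strip()) / duration_s


def cps_check(transcript, duration_s, min_cps=5.0, max_cps=25.0):
    """Return the CPS value and whether it falls inside [min_cps, max_cps]."""
    cps = chars_per_second(transcript, duration_s)
    return {"cps": cps, "passed": min_cps <= cps <= max_cps}


slow = cps_check("Hello world test.", duration_s=4.2)   # 17 chars over 4.2 s
normal = cps_check("Hello world test.", duration_s=1.7)  # 17 chars over 1.7 s
print(slow, normal)
```

Under this counting assumption, 17 characters over 4.2 seconds comes to about 4 CPS, below the 5 CPS floor, which is the kind of mismatch that suggests a truncated transcript or a long stretch of silence.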

CLI

audio-qa <file_or_directory> [options]

Options:
  --report PATH          Save full JSON report
  --csv PATH             Save per-file CSV summary
  --manifest PATH        Save clean file list (one per line)
  --expected-sr INT      Expected sample rate
  --min-duration FLOAT   Min duration in seconds (default: 0.5)
  --max-duration FLOAT   Max duration in seconds (default: 30.0)
  --snr-threshold FLOAT  Min SNR in dB (default: 20.0)
  --target-lufs FLOAT    Target LUFS (default: -18.0)
  --workers INT          Parallel workers (default: 4)
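
If the CLI needs to run from a Python pipeline step, one option is to assemble the argument list and hand it to subprocess. The sketch below uses only flags listed above; `build_audio_qa_command` and `run_audio_qa` are hypothetical helpers, and it assumes the audio-qa executable is on PATH.

```python
import subprocess


def build_audio_qa_command(target, report=None, csv=None, manifest=None,
                           expected_sr=None, workers=None):
    """Build an argv list for the audio-qa CLI, including only options that were set."""
    cmd = ["audio-qa", str(target)]
    options = [
        ("--report", report),
        ("--csv", csv),
        ("--manifest", manifest),
        ("--expected-sr", expected_sr),
        ("--workers", workers),
    ]
    for flag, value in options:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


def run_audio_qa(target, **options):
    """Run the CLI and raise CalledProcessError if it exits with a non-zero status."""
    return subprocess.run(build_audio_qa_command(target, **options), check=True)


cmd = build_audio_qa_command("./my-dataset/", csv="report.csv",
                             manifest="clean_files.txt")
print(" ".join(cmd))
```

Passing a list rather than a shell string avoids quoting problems with dataset paths that contain spaces.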

Optional extras

pip install -e ".[perceptual]"    # NISQA MOS scoring (PyTorch)
pip install -e ".[transcript]"    # Whisper transcript alignment
pip install -e ".[hf]"            # HuggingFace datasets
pip install -e ".[demo]"          # Gradio demo
pip install -e ".[all]"           # Everything

HuggingFace Space

Try it without installing: upload audio files to the hosted Space and get instant quality reports. To run the same Gradio app locally:

pip install -e ".[demo]"
python demo/hf_space_app.py

Project structure

audio_qa/
  pipeline.py           # check_file, check_directory, audit_hf_dataset
  report.py             # Report class: to_csv, to_json, export_clean_manifest
  cli.py                # CLI entry point
  checks/
    quality_score.py    # Composite 0-10 score from signal checks
    noise.py            # SNR estimation (silence-gated)
    clipping.py         # Consecutive peak detection
    silence.py          # Leading/trailing/internal silence
    sample_rate.py      # Standard rate validation
    duration.py         # Min/max bounds
    loudness.py         # Simplified LUFS
    tts_artifacts.py    # Metallic + repetition detection
    channel.py          # Mono/stereo, phase, silent channel
    upsampling.py       # Fake SR detection via FFT
    transcript_ratio.py # CPS sanity check
    duplicates.py       # Chromagram fingerprinting
    transcript.py       # Whisper alignment (optional)
    perceptual.py       # NISQA, UTMOS, PESQ wrappers (optional)
demo/
  hf_space_app.py       # Gradio app for HuggingFace Spaces
  app.py                # Streamlit demo
  generate_sample_data.py
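
The tree describes upsampling.py as "Fake SR detection via FFT". One way such a check could work, sketched here under stated assumptions, is to estimate the frequency below which nearly all spectral energy lies and compare it with the Nyquist frequency implied by the declared sample rate. The function names, the 99.9% energy fraction, and the 0.6 ratio are all illustrative choices, not the toolkit's values.

```python
import numpy as np


def effective_bandwidth_hz(samples, sample_rate, energy_fraction=0.999):
    """Frequency below which `energy_fraction` of the spectral energy lies."""
    power = np.abs(np.fft.rfft(samples)) ** 2
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    total = power.sum()
    if total == 0:
        return 0.0
    cumulative = np.cumsum(power) / total
    idx = min(int(np.searchsorted(cumulative, energy_fraction)), len(freqs) - 1)
    return float(freqs[idx])


def looks_upsampled(samples, sample_rate, max_ratio=0.6):
    """Flag audio whose effective bandwidth is well below its Nyquist frequency."""
    nyquist = sample_rate / 2.0
    return effective_bandwidth_hz(samples, sample_rate) < max_ratio * nyquist


# Synthetic demo: full-band noise vs. the same noise band-limited to 4 kHz,
# which is what content upsampled from 8 kHz would contain at most.
rng = np.random.default_rng(0)
sr = 22050
full_band = rng.standard_normal(sr)
spectrum = np.fft.rfft(full_band)
spectrum[np.fft.rfftfreq(len(full_band), d=1.0 / sr) > 4000] = 0
band_limited = np.fft.irfft(spectrum, n=len(full_band))

print(looks_upsampled(full_band, sr), looks_upsampled(band_limited, sr))
```

Real speech concentrates its energy at low frequencies, so a raw energy fraction like this would over-flag genuine recordings; a production check would more likely look for a sharp spectral cliff on a log-magnitude scale.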

License

MIT

