The model hub for audio intelligence: timm for audio classification
🎧 audiotimm
The Model Hub for Audio Intelligence
timm for audio: one registry, every architecture, one clean API.
What is audiotimm?
audiotimm is a standalone Python library that lets you classify, tag, detect events in, and extract embeddings from audio in one line, using state-of-the-art pretrained models. It follows the philosophy of timm: a unified registry where every model family (PANNs, AST, BEATs, HTS-AT, CLAP, Wav2Vec2, WavLM, Whisper, ...) is accessible through a single, stable API.
from audiotimm import Classifier
clf = Classifier.load() # default: panns-cnn14
result = clf.predict("dog.wav")
result.top(5) # [(label, score), ...]
result.label # "Dog"
result.scores # {"Dog": 0.94, "Animal": 0.72, ...}
Highlights
| Feature | Details |
|---|---|
| One line to classify | Classifier.load().predict("x.wav").top(3); weights download and cache automatically |
| Every major architecture | PANNs, YAMNet, AST, BEATs, HTS-AT, AudioMAE, CLAP, Wav2Vec2, HuBERT, WavLM, Whisper |
| Lean core | Zero heavy deps at import time; torch + torchaudio only for the default model |
| Rich result object | .top(k), .above(thresh), .label, .scores, .as_dict(), .embed() |
| Extensible | @register_model decorator to plug in custom architectures |
| CLI included | audiotimm predict dog.wav --top 5 |
Installation
# Core (PANNs CNN-family, Wave M0)
pip install audiotimm
# + Transformer taggers: AST, BEATs, HTS-AT, AudioMAE (Wave M1)
pip install audiotimm[transformers]
# + Zero-shot classification via CLAP (Wave M2)
pip install audiotimm[clap]
# + Speech SSL backbones: Wav2Vec2, HuBERT, WavLM (Wave M3)
pip install audiotimm[speech]
# + Whisper ASR + encoder embeddings (Wave M4)
pip install audiotimm[whisper]
# + Training utilities
pip install audiotimm[train]
# + ONNX edge export
pip install audiotimm[onnx]
# Everything
pip install audiotimm[transformers,clap,speech,whisper,train,onnx]
Quick Start
Classify a file
from audiotimm import Classifier
clf = Classifier.load() # panns-cnn14 by default
result = clf.predict("siren.wav")
print(result.top(5))
# [("Siren", 0.93), ("Emergency vehicle", 0.81), ("Vehicle", 0.74), ...]
print(result.label) # "Siren"
print(result.score) # 0.93
Batch classification
results = clf.predict(["a.wav", "b.wav", "c.wav"])
print(results.labels()) # ["Dog", "Car horn", "Rain"]
Only results above a threshold
result.above(0.5)
# [("Siren", 0.93), ("Emergency vehicle", 0.81), ("Vehicle", 0.74)]
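As a rough illustration of what `.top(k)` and `.above(thresh)` return, the sketch below reimplements both over a plain `{label: score}` dict. The helper names `top_k` and `above` are hypothetical and not part of audiotimm; the library's own ordering, tie-breaking, and threshold comparison (assumed here to be `>=`) may differ.

```python
from typing import Dict, List, Tuple


def top_k(scores: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (label, score) pairs, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


def above(scores: Dict[str, float], threshold: float) -> List[Tuple[str, float]]:
    """Return every (label, score) pair with score >= threshold, best first."""
    return [(label, s) for label, s in top_k(scores, len(scores)) if s >= threshold]


# Hypothetical scores, mirroring the siren example above.
scores = {"Vehicle": 0.74, "Siren": 0.93, "Speech": 0.02, "Emergency vehicle": 0.81}
print(top_k(scores, 2))
print(above(scores, 0.5))
```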
Get embeddings
emb = clf.embed("dog.wav") # np.ndarray shape (2048,) for panns-cnn14
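A common use of such embeddings is comparing two clips by cosine similarity. The following is a minimal sketch of that comparison, not an audiotimm API: `cosine_similarity` is a hypothetical helper, and the 3-dimensional vectors are toy stand-ins for real embeddings. It works on any sequence of floats, so a 1-D array returned by `clf.embed()` could be passed in the same way.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings of two dog barks and one rain clip.
bark_1 = [0.9, 0.1, 0.0]
bark_2 = [0.8, 0.2, 0.1]
rain = [0.0, 0.1, 0.9]
print(cosine_similarity(bark_1, bark_2) > cosine_similarity(bark_1, rain))
```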
Switch models
# High-accuracy transformer (requires pip install audiotimm[transformers])
clf = Classifier.load("ast-10-10")
# Lightweight 16 kHz variant of PANNs
clf = Classifier.load("panns-cnn14-16k")
CLI
# ---- predict ----
# Basic classification
audiotimm predict siren.wav
# Top-10 results
audiotimm predict siren.wav --top 10
# Show only labels above a confidence threshold
audiotimm predict siren.wav --threshold 0.3
# Use a specific model
audiotimm predict siren.wav --model ast-10-10
# Batch: processes all files, shows per-file results
audiotimm predict audio/*.wav --model panns-cnn14
# JSON output (single file or batch)
audiotimm predict siren.wav --json
audiotimm predict audio/*.wav --json --output results.jsonl
# Run on GPU
audiotimm predict siren.wav --model beats-iter3plus-as2m-cpt2 --device cuda
# ---- embed ----
# Print embedding stats to stdout
audiotimm embed dog.wav
# Save single embedding as .npy
audiotimm embed dog.wav --output dog.npy
# Save batch as compressed .npz (keys = file stems)
audiotimm embed audio/*.wav --output embeddings.npz
# Save as CSV (filename, dim_0, dim_1, ...)
audiotimm embed audio/*.wav --output embeddings.csv
# ---- list / info ----
# List all models
audiotimm list
# Filter by wave, task, or family
audiotimm list --wave M1
audiotimm list --task tagging
audiotimm list --family beats
# Machine-readable JSON
audiotimm list --json
# Detailed card for one model
audiotimm info beats-iter3plus-as2m-cpt2
audiotimm info ast-10-10
# ---- benchmark ----
# Time 20 inference runs and print mean/median/min/max/std
audiotimm benchmark siren.wav --model panns-cnn14 --runs 20
audiotimm benchmark siren.wav --model ast-10-10 --device cuda
# ---- version ----
audiotimm --version
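The `benchmark` subcommand reports mean/median/min/max/std over repeated runs. A comparable measurement can be sketched in plain Python as below. This is an illustration of the timing technique under stated assumptions, not the CLI's actual implementation: the `benchmark` function, its `warmup` parameter, and the stand-in workload are all hypothetical.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(fn: Callable[[], object], runs: int = 20, warmup: int = 1) -> Dict[str, float]:
    """Call fn() repeatedly and summarise wall-clock latency in milliseconds."""
    for _ in range(warmup):
        fn()  # untimed warm-up call (e.g. lazy weight loading, cache fill)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean": statistics.mean(timings),
        "median": statistics.median(timings),
        "min": min(timings),
        "max": max(timings),
        "std": statistics.stdev(timings) if len(timings) > 1 else 0.0,
    }


# Stand-in workload; with audiotimm this could be lambda: clf.predict("siren.wav").
stats = benchmark(lambda: sum(i * i for i in range(10_000)), runs=20)
print({name: round(value, 3) for name, value in stats.items()})
```

The warm-up call keeps one-off costs such as the first-call weight download out of the reported numbers.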
Available Models
Wave M0: CNN Taggers (core, no extras)
| Zoo ID | Architecture | SR | Classes | mAP | Notes |
|---|---|---|---|---|---|
| panns-cnn14 ⭐ | CNN14 | 32 kHz | 527 | 0.431 | Default model |
| panns-cnn14-16k | CNN14 | 16 kHz | 527 | 0.438 | Slightly higher mAP |
| yamnet | MobileNetV1 | 16 kHz | 521 | n/a | PyTorch path coming in v0.2 |
Wave M1: Transformer Taggers (pip install audiotimm[transformers])
| Zoo ID | Architecture | SR | Classes | mAP | Notes |
|---|---|---|---|---|---|
| ast-10-10 ⭐ | Audio Spectrogram Transformer | 16 kHz | 527 | 0.459 | Default AST |
| ast-16-16 | AST (larger patches) | 16 kHz | 527 | 0.442 | Faster |
| ast-speechcommands | AST | 16 kHz | 35 | n/a | Keyword spotting |
| htsat-audioset | HTS-AT (Swin-style) | 32 kHz | 527 | 0.471 | Also CLAP encoder |
| htsat-desed | HTS-AT | 32 kHz | n/a | n/a | Sound event detection |
| audiomae-base-ft | AudioMAE (ViT-Base) | 16 kHz | 527 | 0.473 | Facebook MAE |
| beats-iter3plus-as2m-cpt2 | BEATs | 16 kHz | 527 | 0.486 | SOTA mAP |
Wave M2: Zero-Shot CLAP (pip install audiotimm[clap])
| Zoo ID | Variant | SR | Notes |
|---|---|---|---|
| clap-laion-fused ⭐ | LAION HTSAT + feature fusion | 48 kHz | Handles long audio |
| clap-laion-unfused | LAION HTSAT | 48 kHz | |
| clap-laion-music-audioset | Music + AudioSet trained | 48 kHz | ESC-50 ≈ 90.1% |
| clap-ms-2023 ⭐ | MS-CLAP HTSAT + GPT-2 | 44.1 kHz | Stronger text encoder |
| clap-ms-2022 | MS-CLAP CNN14 + BERT | 44.1 kHz | |
| clap-ms-clapcap | MS-CLAP + captioning head | 44.1 kHz | Audio-to-text captions |
Wave M3: Speech SSL Backbones (pip install audiotimm[speech])
| Zoo ID | Architecture | SR | Output / Notes |
|---|---|---|---|
| wav2vec2-base | Wav2Vec2 Base | 16 kHz | Frame embeddings |
| wav2vec2-large-xlsr | XLS-R 300M (128 languages) | 16 kHz | Multilingual |
| hubert-large-ll60k | HuBERT Large | 16 kHz | Strong SER backbone |
| wavlm-large ⭐ | WavLM Large | 16 kHz | Best for speaker tasks |
| wavlm-base-plus-sv | WavLM + SV head | 16 kHz | Speaker verification |
Wave M4: Whisper (pip install audiotimm[whisper])
| Zoo ID | Size | Languages | Notes |
|---|---|---|---|
| whisper-base | Base | 99 | Fast, general |
| whisper-large-v3 ⭐ | Large v3 | 99 | Best accuracy |
| whisper-large-v3-turbo | Large v3 Turbo | 99 | Fast + accurate |
| whisper-distil-large-v3 | Distil Large v3 | 1 (EN) | ~2× faster |
Zero-Shot Classification (Wave M2)
Classify audio into any labels you define, with no training needed:
from audiotimm import ZeroShotClassifier  # coming in Phase 3 (Wave M2)
zs = ZeroShotClassifier.load("clap-laion-fused")
result = zs.classify(
    "clip.wav",
    labels=["dog barking", "car horn", "rain", "crowd applause"],
)
# -> [("rain", 0.81), ("crowd applause", 0.10), ...]
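CLAP-style zero-shot classification generally works by embedding the audio and each text label into a shared space and ranking labels by similarity. The sketch below shows that scoring step only, using toy 2-dimensional vectors in place of real encoder outputs. The function `zero_shot_scores`, the `temperature` value, and the softmax normalisation are assumptions made for illustration; they do not describe how `ZeroShotClassifier` is implemented.

```python
import math
from typing import Dict, List, Sequence


def _normalize(v: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_scores(
    audio_emb: Sequence[float],
    text_embs: Dict[str, Sequence[float]],
    temperature: float = 0.1,
) -> Dict[str, float]:
    """Softmax over cosine similarities between one audio embedding and each label embedding."""
    a = _normalize(audio_emb)
    sims = {
        label: sum(x * y for x, y in zip(a, _normalize(t)))
        for label, t in text_embs.items()
    }
    peak = max(sims.values())  # subtract the max for numerical stability
    exps = {label: math.exp((s - peak) / temperature) for label, s in sims.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Toy embeddings: the audio vector points mostly along the "rain" direction.
audio = [0.1, 0.9]
labels = {"rain": [0.0, 1.0], "dog barking": [1.0, 0.0]}
zs_scores = zero_shot_scores(audio, labels)
print(max(zs_scores, key=zs_scores.get))
```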
Plugin API โ Register Custom Models
from audiotimm import register_model
from audiotimm.models._base import ModelAdapter
from audiotimm.core.registry import ModelSpec

@register_model("my-bird-net")
class BirdNet(ModelAdapter):
    @classmethod
    def spec(cls):
        return ModelSpec(
            name="",  # filled by decorator
            family="custom",
            adapter_factory=cls,
            checkpoint="./weights/birdnet.pt",
            sample_rate=22050,
            n_classes=500,
            embed_dim=512,
            task="tagging",
            wave="M0",
        )

    def predict(self, waveform):
        ...  # return {label: score} dict

# Now available everywhere
from audiotimm import Classifier
clf = Classifier.load("my-bird-net")
Project Roadmap
- Phase 1 ✅ · Core engine + PANNs CNN family (Wave M0)
- Phase 2 ✅ · Wave M1: AST, AudioMAE, HTS-AT, BEATs (transformer taggers)
- Phase 3 · Wave M2: CLAP zero-shot (LAION + MS)
- Phase 4 · Embeddings & similarity search
- Phase 5 · Sound Event Detection timeline
- Phase 6 · Wave M3: Wav2Vec2, HuBERT, WavLM speech SSL
- Phase 7 · Training & fine-tuning (Trainer API)
- Phase 8 · Wave M4: Whisper ASR + encoder embeddings
- Phase 9 · Evaluation & explainability (Grad-CAM on mel-spectrogram)
- Phase 10 · Domain packs (bioacoustics, security, health, music, speech)
- Phase 11 · Streaming / real-time inference
- Phase 12 · ONNX / TFLite edge export
- Phase 13 · XenAudio integration + plugin API
Architecture
audiotimm/
├── core/
│   ├── classifier.py   # Classifier.load(), predict(), embed()
│   ├── result.py       # PredictionResult, BatchResult
│   └── registry.py     # ModelRegistry singleton + @register_model
├── models/
│   ├── _base.py        # ModelAdapter ABC
│   ├── panns.py        # Wave M0: CNN14 family
│   ├── yamnet.py       # Wave M0: YAMNet (stub)
│   ├── ast.py          # Wave M1: AST (coming)
│   ├── beats.py        # Wave M1: BEATs (coming)
│   ├── htsat.py        # Wave M1+M2: HTS-AT (coming)
│   ├── audiomae.py     # Wave M1: AudioMAE (coming)
│   ├── clap.py         # Wave M2: LAION + MS-CLAP (coming)
│   ├── wav2vec2.py     # Wave M3 (coming)
│   ├── hubert.py       # Wave M3 (coming)
│   ├── wavlm.py        # Wave M3 (coming)
│   └── whisper.py      # Wave M4 (coming)
├── utils/
│   ├── audio.py        # load_audio(), pad_or_trim()
│   └── download.py     # cached downloader (~/.cache/audiotimm/)
└── cli.py              # `audiotimm predict` / `audiotimm list`
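The tree lists a `pad_or_trim()` helper in `utils/audio.py` without giving its signature. A minimal sketch of what such a function typically does is shown below. The signature, the choice to pad on the right with zeros, and the choice to trim from the end are assumptions for illustration, not the library's documented behavior.

```python
from typing import List, Sequence


def pad_or_trim(samples: Sequence[float], target_len: int) -> List[float]:
    """Return exactly target_len samples: cut the tail or right-pad with zeros."""
    if target_len < 0:
        raise ValueError("target_len must be non-negative")
    out = list(samples[:target_len])             # trim if too long
    out.extend([0.0] * (target_len - len(out)))  # pad if too short
    return out


print(pad_or_trim([0.1, 0.2, 0.3], 5))  # [0.1, 0.2, 0.3, 0.0, 0.0]
print(pad_or_trim([0.1, 0.2, 0.3], 2))  # [0.1, 0.2]
```

Fixing the input length this way lets a model that expects a fixed-duration clip accept audio of any duration.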
Design Principles
- Lazy everything: weights download on first predict(), not on import.
- One result type: PredictionResult everywhere; switching models never breaks your code.
- Lean core: torch + torchaudio + numpy only for the default model; every heavy dep is behind an optional extra.
- Registry-first: every model is a registry entry; custom models slot in with @register_model.
- Immutable results: PredictionResult is read-only; safe to cache and pass around.
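Two of these principles, lazy loading and immutable results, can be illustrated with a small self-contained sketch. The classes `Result` and `LazyClassifier` below are hypothetical stand-ins, not audiotimm's `PredictionResult` or `Classifier`; the "weights" are a hard-coded dict standing in for a real download.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping, Optional


@dataclass(frozen=True)
class Result:
    """Read-only prediction: attributes cannot be reassigned after creation."""
    scores: Mapping[str, float]

    @property
    def label(self) -> str:
        return max(self.scores, key=self.scores.get)


class LazyClassifier:
    """Defers the (simulated) weight load until the first predict() call."""

    def __init__(self) -> None:
        self._weights: Optional[dict] = None
        self.load_count = 0

    def _ensure_loaded(self) -> None:
        if self._weights is None:
            self.load_count += 1
            self._weights = {"Dog": 0.94, "Animal": 0.72}  # stand-in for a download

    def predict(self, path: str) -> Result:
        self._ensure_loaded()
        # Wrap a copy in a read-only view so callers cannot mutate the scores.
        return Result(MappingProxyType(dict(self._weights)))


clf = LazyClassifier()
print(clf.load_count)  # 0: nothing is loaded at construction time
result = clf.predict("dog.wav")
clf.predict("dog.wav")
print(clf.load_count, result.label)  # 1 Dog
```

Freezing the dataclass blocks attribute reassignment, and the `MappingProxyType` view blocks in-place edits to the scores, so a cached result cannot be changed by accident.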
Contributing
git clone https://github.com/shubham10divakar/audiotimm
cd audiotimm
pip install -e ".[dev]"
pytest tests/
License
Apache 2.0. Model weights are subject to their respective upstream licenses; see PLAN.md Appendix A for per-checkpoint license notes.