The model hub for audio intelligence — timm for audio classification

🎧 audiotimm

The Model Hub for Audio Intelligence

timm for audio — one registry, every architecture, one clean API.

What is audiotimm?

audiotimm is a standalone Python library for classifying, tagging, detecting events in, and extracting embeddings from audio in one line, using state-of-the-art pretrained models. It follows the philosophy of timm: a unified registry where every model family (PANNs, AST, BEATs, HTS-AT, CLAP, Wav2Vec2, WavLM, Whisper, …) is accessible through a single, stable API.

from audiotimm import Classifier

clf = Classifier.load()                    # default: panns-cnn14
result = clf.predict("dog.wav")

result.top(5)       # [(label, score), ...]
result.label        # "Dog"
result.scores       # {"Dog": 0.94, "Animal": 0.72, ...}

Highlights


One line to classify	`Classifier.load().predict("x.wav").top(3)` — weights download and cache automatically
Every major architecture	PANNs, YAMNet, AST, BEATs, HTS-AT, AudioMAE, CLAP, Wav2Vec2, HuBERT, WavLM, Whisper
Lean core	No heavy dependencies at import time; the default model needs only torch, torchaudio, and numpy
Rich result object	`.top(k)`, `.above(thresh)`, `.label`, `.scores`, `.as_dict()`, `.embed()`
Extensible	`@register_model` decorator to plug in custom architectures
CLI included	`audiotimm predict dog.wav --top 5`

Installation

# Core (PANNs CNN-family, Wave M0)
pip install audiotimm

# Extras are quoted so that shells such as zsh do not treat [...] as a glob pattern.

# + Transformer taggers: AST, BEATs, HTS-AT, AudioMAE (Wave M1)
pip install "audiotimm[transformers]"

# + Zero-shot classification via CLAP (Wave M2)
pip install "audiotimm[clap]"

# + Speech SSL backbones: Wav2Vec2, HuBERT, WavLM (Wave M3)
pip install "audiotimm[speech]"

# + Whisper ASR + encoder embeddings (Wave M4)
pip install "audiotimm[whisper]"

# + Training utilities
pip install "audiotimm[train]"

# + ONNX edge export
pip install "audiotimm[onnx]"

# Everything
pip install "audiotimm[transformers,clap,speech,whisper,train,onnx]"

Quick Start

Classify a file

from audiotimm import Classifier

clf = Classifier.load()            # panns-cnn14 by default
result = clf.predict("siren.wav")

print(result.top(5))
# [("Siren", 0.93), ("Emergency vehicle", 0.81), ("Vehicle", 0.74), ...]

print(result.label)    # "Siren"
print(result.score)    # 0.93

Batch classification

results = clf.predict(["a.wav", "b.wav", "c.wav"])
print(results.labels())   # ["Dog", "Car horn", "Rain"]

Only results above a threshold

result.above(0.5)
# [("Siren", 0.93), ("Emergency vehicle", 0.81), ("Vehicle", 0.74)]
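To make the semantics of `.top(k)`, `.above(thresh)`, and `.label` concrete, here is a minimal standalone sketch of a read-only result object. It is not audiotimm's actual `PredictionResult`: the class name `SketchResult`, its internals, and the choice of an inclusive threshold (`>=`) are assumptions made for illustration.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping


@dataclass(frozen=True)
class SketchResult:
    """Hypothetical stand-in for a prediction result (not the library's class)."""

    scores: Mapping[str, float]

    def __post_init__(self):
        # Wrap a copy in a read-only view so callers cannot mutate the scores.
        object.__setattr__(self, "scores", MappingProxyType(dict(self.scores)))

    def top(self, k=5):
        """Return the k highest-scoring (label, score) pairs, best first."""
        ranked = sorted(self.scores.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:k]

    def above(self, thresh):
        """Return all (label, score) pairs with score >= thresh, best first."""
        return [(lab, s) for lab, s in self.top(len(self.scores)) if s >= thresh]

    @property
    def label(self):
        """Label with the highest score."""
        return self.top(1)[0][0]


result = SketchResult(
    {"Siren": 0.93, "Emergency vehicle": 0.81, "Vehicle": 0.74, "Speech": 0.02}
)
print(result.above(0.5))
print(result.label)
```

For multi-label taggers such as AudioSet models, a threshold is often more useful than a fixed `k`, because the number of sounds present varies from clip to clip.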

Get embeddings

emb = clf.embed("dog.wav")   # np.ndarray shape (2048,) for panns-cnn14
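Embeddings like the one above are usually compared with cosine similarity. The sketch below uses only the standard library and toy 4-dimensional vectors in place of real 2048-dimensional embeddings; `cosine_similarity` is a helper defined here, not an audiotimm function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (values are made up).
dog_a = [0.9, 0.1, 0.0, 0.2]
dog_b = [0.8, 0.2, 0.1, 0.1]
rain = [0.0, 0.1, 0.9, 0.3]

# The two "dog" vectors point in nearly the same direction; "rain" does not.
print(cosine_similarity(dog_a, dog_b) > cosine_similarity(dog_a, rain))
```

The same function accepts any sequence of floats, so it should also work on a 1-D NumPy array, although a vectorized NumPy expression would be faster for large batches.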

Switch models

# High-accuracy transformer (requires pip install "audiotimm[transformers]")
clf = Classifier.load("ast-10-10")

# Lightweight 16 kHz variant of PANNs
clf = Classifier.load("panns-cnn14-16k")

CLI

# ── predict ──────────────────────────────────────────────────────────────
# Basic classification
audiotimm predict siren.wav

# Top-10 results
audiotimm predict siren.wav --top 10

# Show only labels above a confidence threshold
audiotimm predict siren.wav --threshold 0.3

# Use a specific model
audiotimm predict siren.wav --model ast-10-10

# Batch — processes all files, shows per-file results
audiotimm predict audio/*.wav --model panns-cnn14

# JSON output (single file or batch)
audiotimm predict siren.wav --json
audiotimm predict audio/*.wav --json --output results.jsonl

# Run on GPU
audiotimm predict siren.wav --model beats-iter3plus-as2m-cpt2 --device cuda

# ── embed ─────────────────────────────────────────────────────────────────
# Print embedding stats to stdout
audiotimm embed dog.wav

# Save single embedding as .npy
audiotimm embed dog.wav --output dog.npy

# Save batch as compressed .npz  (keys = file stems)
audiotimm embed audio/*.wav --output embeddings.npz

# Save as CSV (filename, dim_0, dim_1, …)
audiotimm embed audio/*.wav --output embeddings.csv

# ── list / info ───────────────────────────────────────────────────────────
# List all models
audiotimm list

# Filter by wave, task, or family
audiotimm list --wave M1
audiotimm list --task tagging
audiotimm list --family beats

# Machine-readable JSON
audiotimm list --json

# Detailed card for one model
audiotimm info beats-iter3plus-as2m-cpt2
audiotimm info ast-10-10

# ── benchmark ─────────────────────────────────────────────────────────────
# Time 20 inference runs and print mean/median/min/max/std
audiotimm benchmark siren.wav --model panns-cnn14 --runs 20
audiotimm benchmark siren.wav --model ast-10-10 --device cuda

# ── version ───────────────────────────────────────────────────────────────
audiotimm --version
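The `embed` command above is described as writing CSV rows of the form (filename, dim_0, dim_1, …). A sketch of reading such a file back with the standard library follows. It assumes the first row is a header, which has not been checked against the real CLI output, and `load_embeddings_csv` is a hypothetical helper, not part of audiotimm.

```python
import csv
import os
import tempfile


def load_embeddings_csv(path):
    """Read rows of (filename, dim_0, dim_1, ...) into {filename: [floats]}.

    Assumes the first row is a header; adjust if the real file has none.
    """
    embeddings = {}
    with open(path, newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        next(reader, None)  # skip the assumed header row
        for row in reader:
            if not row:
                continue  # tolerate blank lines
            embeddings[row[0]] = [float(value) for value in row[1:]]
    return embeddings


# Round-trip demo with a tiny hand-written file (values are made up).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "embeddings.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["filename", "dim_0", "dim_1"])
        writer.writerow(["dog.wav", "0.25", "-1.5"])
        writer.writerow(["rain.wav", "0.0", "2.0"])
    loaded = load_embeddings_csv(demo_path)

print(loaded["dog.wav"])
```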

Available Models

Wave M0 — CNN Taggers `(core, no extras)`

Zoo ID	Architecture	SR	Classes	mAP	Notes
`panns-cnn14` ⭐	CNN14	32 kHz	527	0.431	Default model
`panns-cnn14-16k`	CNN14	16 kHz	527	0.438	Slightly higher mAP
`yamnet`	MobileNetV1	16 kHz	521	—	PyTorch path coming in v0.2

Wave M1 — Transformer Taggers `pip install audiotimm[transformers]`

Zoo ID	Architecture	SR	Classes	mAP	Notes
`ast-10-10` ⭐	Audio Spectrogram Transformer	16 kHz	527	0.459	Default AST
`ast-16-16`	AST (stride 16, no patch overlap)	16 kHz	527	0.442	Faster
`ast-speechcommands`	AST	16 kHz	35	—	Keyword spotting
`htsat-audioset`	HTS-AT (Swin-style)	32 kHz	527	0.471	Also CLAP encoder
`htsat-desed`	HTS-AT	32 kHz	—	—	Sound event detection
`audiomae-base-ft`	AudioMAE (ViT-Base)	16 kHz	527	0.473	Facebook MAE
`beats-iter3plus-as2m-cpt2`	BEATs	16 kHz	527	0.486	SOTA mAP

Wave M2 — Zero-Shot CLAP `pip install audiotimm[clap]`

Zoo ID	Variant	SR	Notes
`clap-laion-fused` ⭐	LAION HTSAT + feature fusion	48 kHz	Handles long audio
`clap-laion-unfused`	LAION HTSAT	48 kHz
`clap-laion-music-audioset`	Music + AudioSet trained	48 kHz	ESC-50 ≈ 90.1%
`clap-ms-2023` ⭐	MS-CLAP HTSAT + GPT-2	44.1 kHz	Stronger text encoder
`clap-ms-2022`	MS-CLAP CNN14 + BERT	44.1 kHz
`clap-ms-clapcap`	MS-CLAP + captioning head	44.1 kHz	Audio → text captions

Wave M3 — Speech SSL Backbones `pip install audiotimm[speech]`

Zoo ID	Architecture	SR	Notes
`wav2vec2-base`	Wav2Vec2 Base	16 kHz	Frame embeddings
`wav2vec2-large-xlsr`	XLS-R 300M (128 languages)	16 kHz	Multilingual
`hubert-large-ll60k`	HuBERT Large	16 kHz	Strong SER backbone
`wavlm-large` ⭐	WavLM Large	16 kHz	Best for speaker tasks
`wavlm-base-plus-sv`	WavLM + SV head	16 kHz	Speaker verification

Wave M4 — Whisper `pip install audiotimm[whisper]`

Zoo ID	Size	Languages	Notes
`whisper-base`	Base	99	Fast, general
`whisper-large-v3` ⭐	Large v3	99	Best accuracy
`whisper-large-v3-turbo`	Large v3 Turbo	99	Fast + accurate
`whisper-distil-large-v3`	Distil Large v3	1 (EN)	~2× faster

Zero-Shot Classification (Wave M2)

Classify audio into any labels you define — no training needed:

from audiotimm import ZeroShotClassifier   # coming in Phase 3 (Wave M2)

zs = ZeroShotClassifier.load("clap-laion-fused")
result = zs.classify(
    "clip.wav",
    labels=["dog barking", "car horn", "rain", "crowd applause"]
)
# -> [("rain", 0.81), ("crowd applause", 0.10), ...]
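CLAP-style zero-shot classification generally works by embedding the audio clip and each candidate label into a shared space, then applying a softmax over scaled cosine similarities. The sketch below illustrates only that scoring step with toy vectors. The function name `zero_shot_scores`, the vectors, and the `logit_scale` value of 10.0 are assumptions for illustration; real models use learned encoders and a learned scale.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_scores(audio_emb, text_embs, logit_scale=10.0):
    """Softmax over scaled cosine similarities between one audio embedding
    and one text embedding per label. Returns (label, prob) pairs, best first."""
    audio = _normalize(audio_emb)
    logits = {}
    for label, emb in text_embs.items():
        text = _normalize(emb)
        logits[label] = logit_scale * sum(a * t for a, t in zip(audio, text))
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(v - peak) for label, v in logits.items()}
    total = sum(exps.values())
    probs = {label: e / total for label, e in exps.items()}
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)


# Toy 3-d vectors; real embeddings are much higher-dimensional.
text_embs = {
    "dog barking": [1.0, 0.0, 0.0],
    "car horn": [0.0, 1.0, 0.0],
    "rain": [0.0, 0.0, 1.0],
}
audio_emb = [0.1, 0.2, 0.9]  # closest in direction to the "rain" vector

ranked = zero_shot_scores(audio_emb, text_embs)
print(ranked[0][0])
```

Because the scores come from a softmax, they sum to 1 across the labels you supply, so adding or removing a label changes every score.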

Plugin API — Register Custom Models

from audiotimm import register_model
from audiotimm.models._base import ModelAdapter
from audiotimm.core.registry import ModelSpec

@register_model("my-bird-net")
class BirdNet(ModelAdapter):

    @classmethod
    def spec(cls):
        return ModelSpec(
            name="",           # filled by decorator
            family="custom",
            adapter_factory=cls,
            checkpoint="./weights/birdnet.pt",
            sample_rate=22050,
            n_classes=500,
            embed_dim=512,
            task="tagging",
            wave="M0",
        )

    def predict(self, waveform):
        ...  # return {label: score} dict

# Now available everywhere
from audiotimm import Classifier
clf = Classifier.load("my-bird-net")

Project Roadmap

Phase 1  ✅  Core engine + PANNs CNN family (Wave M0)
Phase 2  ✅  Wave M1 — AST, AudioMAE, HTS-AT, BEATs (transformer taggers)
Phase 3  ·   Wave M2 — CLAP zero-shot (LAION + MS)
Phase 4  ·   Embeddings & similarity search
Phase 5  ·   Sound Event Detection timeline
Phase 6  ·   Wave M3 — Wav2Vec2, HuBERT, WavLM speech SSL
Phase 7  ·   Training & fine-tuning (Trainer API)
Phase 8  ·   Wave M4 — Whisper ASR + encoder embeddings
Phase 9  ·   Evaluation & explainability (Grad-CAM on mel-spectrogram)
Phase 10 ·   Domain packs (bioacoustics, security, health, music, speech)
Phase 11 ·   Streaming / real-time inference
Phase 12 ·   ONNX / TFLite edge export
Phase 13 ·   XenAudio integration + plugin API

Architecture

audiotimm/
├── core/
│   ├── classifier.py    # Classifier.load(), predict(), embed()
│   ├── result.py        # PredictionResult, BatchResult
│   └── registry.py      # ModelRegistry singleton + @register_model
├── models/
│   ├── _base.py         # ModelAdapter ABC
│   ├── panns.py         # Wave M0 — CNN14 family
│   ├── yamnet.py        # Wave M0 — YAMNet (stub)
│   ├── ast.py           # Wave M1 — AST
│   ├── beats.py         # Wave M1 — BEATs
│   ├── htsat.py         # Wave M1+M2 — HTS-AT (M1 tagger; M2 coming)
│   ├── audiomae.py      # Wave M1 — AudioMAE
│   ├── clap.py          # Wave M2 — LAION + MS-CLAP (coming)
│   ├── wav2vec2.py      # Wave M3 (coming)
│   ├── hubert.py        # Wave M3 (coming)
│   ├── wavlm.py         # Wave M3 (coming)
│   └── whisper.py       # Wave M4 (coming)
├── utils/
│   ├── audio.py         # load_audio(), pad_or_trim()
│   └── download.py      # cached downloader (~/.cache/audiotimm/)
└── cli.py               # `audiotimm` CLI: predict, embed, list, info, benchmark

Design Principles

Lazy everything — weights download on first predict(), not on import.
One result type — PredictionResult everywhere; switching models never breaks your code.
Lean core — torch + torchaudio + numpy only for the default model; every heavy dep is behind an optional extra.
Registry-first — every model is a registry entry; custom models slot in with @register_model.
Immutable results — PredictionResult is read-only; safe to cache and pass around.
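The "lazy everything" principle can be illustrated with `functools.cached_property`: constructing the object is cheap, and the expensive step runs once on first use. This is a sketch of the general technique, not audiotimm's implementation; `LazyModel`, its `weights` attribute, and the `load_count` counter are hypothetical.

```python
from functools import cached_property


class LazyModel:
    """Hypothetical wrapper: construction is cheap, weights load on first use."""

    def __init__(self, name):
        self.name = name
        self.load_count = 0  # only here so the example can show when loading happens

    @cached_property
    def weights(self):
        # Stand-in for an expensive download and deserialization step.
        self.load_count += 1
        return {"bias": 0.5}

    def predict(self, x):
        # First access to self.weights triggers the load; later calls reuse it.
        return x + self.weights["bias"]


model = LazyModel("toy")
loads_before = model.load_count  # nothing has been loaded yet
first = model.predict(1.0)
second = model.predict(2.0)
loads_after = model.load_count  # loaded exactly once across both calls
print(loads_before, loads_after)
```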

Contributing

git clone https://github.com/shubham10divakar/audiotimm
cd audiotimm
pip install -e ".[dev]"
pytest tests/

License

Apache 2.0. Model weights are subject to their respective upstream licenses — see PLAN.md Appendix A for per-checkpoint license notes.

Built with ❤️ · audiotimm — Teach Machines to Listen.

Release history

1.0.0 (this version), Jun 12, 2026

