The model hub for audio intelligence: timm for audio classification

🎧 audiotimm

The Model Hub for Audio Intelligence

timm for audio: one registry, every architecture, one clean API.


What is audiotimm?

audiotimm is a standalone Python library that lets you classify, tag, detect events in, and extract embeddings from audio in one line, using state-of-the-art pretrained models. It follows the philosophy of timm: a unified registry where every model family (PANNs, AST, BEATs, HTS-AT, CLAP, Wav2Vec2, WavLM, Whisper, …) is accessible through a single, stable API.

from audiotimm import Classifier

clf = Classifier.load()                    # default: panns-cnn14
result = clf.predict("dog.wav")

result.top(5)       # [(label, score), ...]
result.label        # "Dog"
result.scores       # {"Dog": 0.94, "Animal": 0.72, ...}

Highlights

  • One line to classify: Classifier.load().predict("x.wav").top(3); weights download and cache automatically
  • Every major architecture: PANNs, YAMNet, AST, BEATs, HTS-AT, AudioMAE, CLAP, Wav2Vec2, HuBERT, WavLM, Whisper
  • Lean core: no heavy dependencies at import time; only torch + torchaudio for the default model
  • Rich result object: .top(k), .above(thresh), .label, .scores, .as_dict(), .embed()
  • Extensible: @register_model decorator to plug in custom architectures
  • CLI included: audiotimm predict dog.wav --top 5

Installation

# Core (PANNs CNN-family, Wave M0)
pip install audiotimm

# + Transformer taggers: AST, BEATs, HTS-AT, AudioMAE (Wave M1)
pip install "audiotimm[transformers]"

# + Zero-shot classification via CLAP (Wave M2)
pip install "audiotimm[clap]"

# + Speech SSL backbones: Wav2Vec2, HuBERT, WavLM (Wave M3)
pip install "audiotimm[speech]"

# + Whisper ASR + encoder embeddings (Wave M4)
pip install "audiotimm[whisper]"

# + Training utilities
pip install "audiotimm[train]"

# + ONNX edge export
pip install "audiotimm[onnx]"

# Everything (quotes keep shells such as zsh from expanding the brackets)
pip install "audiotimm[transformers,clap,speech,whisper,train,onnx]"
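
The extras above keep heavy dependencies out of the core install. A library organized this way typically checks for the optional package at the point of use and tells the user which extra to install. The sketch below shows that pattern using only the standard library. `require_extra` and the module-to-extra pairing are hypothetical names chosen for illustration, not audiotimm's actual internals.

```python
import importlib.util


def require_extra(module_name: str, extra: str) -> None:
    """Raise a helpful ImportError if an optional dependency is missing.

    Hypothetical helper: the name and the message text are illustrative only.
    """
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"'{module_name}' is not installed. "
            f'Try: pip install "audiotimm[{extra}]"'
        )


# A standard-library module is always present, so this passes silently.
require_extra("json", "core")

# A missing module yields an actionable message instead of a bare traceback.
try:
    require_extra("surely_missing_module_xyz", "transformers")
except ImportError as exc:
    message = str(exc)
    print(message)
```

Checking at the point of use, rather than at import time, is what lets the core package import quickly even when no extras are installed.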

Quick Start

Classify a file

from audiotimm import Classifier

clf = Classifier.load()            # panns-cnn14 by default
result = clf.predict("siren.wav")

print(result.top(5))
# [("Siren", 0.93), ("Emergency vehicle", 0.81), ("Vehicle", 0.74), ...]

print(result.label)    # "Siren"
print(result.score)    # 0.93

Batch classification

results = clf.predict(["a.wav", "b.wav", "c.wav"])
print(results.labels())   # ["Dog", "Car horn", "Rain"]

Only results above a threshold

result.above(0.5)
# [("Siren", 0.93), ("Emergency vehicle", 0.81), ("Vehicle", 0.74)]
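
To make the relationship between `.scores`, `.top(k)`, and `.above(thresh)` concrete, here is a minimal sketch of how such helpers could be derived from a plain label-to-score dictionary. It is not audiotimm's implementation. The free functions `top` and `above` are illustrative stand-ins, and treating the threshold as inclusive is an assumption.

```python
from typing import Dict, List, Tuple

Scores = Dict[str, float]


def top(scores: Scores, k: int) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (label, score) pairs, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


def above(scores: Scores, threshold: float) -> List[Tuple[str, float]]:
    """Return every pair scoring at or above the threshold, best first.

    Assumption: the comparison is inclusive (>=); the library may differ.
    """
    return [(label, s) for label, s in top(scores, len(scores)) if s >= threshold]


scores = {"Vehicle": 0.74, "Siren": 0.93, "Speech": 0.02, "Emergency vehicle": 0.81}
print(top(scores, 2))
print(above(scores, 0.5))
```

Because audio tagging is multi-label, the scores need not sum to 1, which is why a threshold filter is often more useful than a fixed top-k.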

Get embeddings

emb = clf.embed("dog.wav")   # np.ndarray shape (2048,) for panns-cnn14
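
A common use of embeddings is comparing two clips. The sketch below computes cosine similarity with the standard library only. The short toy vectors stand in for real embeddings (which would be 2048-dimensional for panns-cnn14), and `cosine_similarity` is an illustrative helper, not part of audiotimm.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 4-dimensional vectors standing in for real embeddings.
dog_a = [0.9, 0.1, 0.0, 0.2]
dog_b = [0.8, 0.2, 0.1, 0.1]
rain = [0.0, 0.1, 0.9, 0.3]

print(cosine_similarity(dog_a, dog_b))  # close to 1: similar clips
print(cosine_similarity(dog_a, rain))   # close to 0: unrelated clips
```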

Switch models

# High accuracy transformer (requires pip install audiotimm[transformers])
clf = Classifier.load("ast-10-10")

# Lightweight 16 kHz variant of PANNs
clf = Classifier.load("panns-cnn14-16k")

CLI

# ── predict ──────────────────────────────────────────────────────────
# Basic classification
audiotimm predict siren.wav

# Top-10 results
audiotimm predict siren.wav --top 10

# Show only labels above a confidence threshold
audiotimm predict siren.wav --threshold 0.3

# Use a specific model
audiotimm predict siren.wav --model ast-10-10

# Batch: processes all files, shows per-file results
audiotimm predict audio/*.wav --model panns-cnn14

# JSON output (single file or batch)
audiotimm predict siren.wav --json
audiotimm predict audio/*.wav --json --output results.jsonl

# Run on GPU
audiotimm predict siren.wav --model beats-iter3plus-as2m-cpt2 --device cuda

# ── embed ────────────────────────────────────────────────────────────
# Print embedding stats to stdout
audiotimm embed dog.wav

# Save single embedding as .npy
audiotimm embed dog.wav --output dog.npy

# Save batch as compressed .npz  (keys = file stems)
audiotimm embed audio/*.wav --output embeddings.npz

# Save as CSV (filename, dim_0, dim_1, …)
audiotimm embed audio/*.wav --output embeddings.csv

# ── list / info ──────────────────────────────────────────────────────
# List all models
audiotimm list

# Filter by wave, task, or family
audiotimm list --wave M1
audiotimm list --task tagging
audiotimm list --family beats

# Machine-readable JSON
audiotimm list --json

# Detailed card for one model
audiotimm info beats-iter3plus-as2m-cpt2
audiotimm info ast-10-10

# ── benchmark ────────────────────────────────────────────────────────
# Time 20 inference runs and print mean/median/min/max/std
audiotimm benchmark siren.wav --model panns-cnn14 --runs 20
audiotimm benchmark siren.wav --model ast-10-10 --device cuda

# ── version ──────────────────────────────────────────────────────────
audiotimm --version
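
The `benchmark` subcommand reports mean/median/min/max/std over repeated runs. The following is a rough sketch of how such a timing loop can be written in Python, for timing a model call from your own code. The `benchmark` function, its `warmup` parameter, and the stand-in workload are assumptions for illustration and do not reflect how the CLI is implemented.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(fn: Callable[[], object], runs: int = 20, warmup: int = 2) -> Dict[str, float]:
    """Time repeated calls to fn and summarize the latencies in milliseconds."""
    for _ in range(warmup):  # warm-up calls are executed but not timed
        fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean": statistics.mean(timings),
        "median": statistics.median(timings),
        "min": min(timings),
        "max": max(timings),
        # The sample standard deviation needs at least two data points.
        "std": statistics.stdev(timings) if runs > 1 else 0.0,
    }


# Stand-in workload; in practice this would wrap a call such as clf.predict(path).
stats = benchmark(lambda: sum(i * i for i in range(10_000)), runs=20)
for name, value in stats.items():
    print(f"{name:>6}: {value:.3f} ms")
```

The median is usually the more robust number to compare across models, since the first few runs on a GPU can be dominated by one-time setup costs.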

Available Models

Wave M0: CNN Taggers (core, no extras)

| Zoo ID | Architecture | SR | Classes | mAP | Notes |
|---|---|---|---|---|---|
| panns-cnn14 ⭐ | CNN14 | 32 kHz | 527 | 0.431 | Default model |
| panns-cnn14-16k | CNN14 | 16 kHz | 527 | 0.438 | Slightly higher mAP |
| yamnet | MobileNetV1 | 16 kHz | 521 | - | PyTorch path coming in v0.2 |

Wave M1: Transformer Taggers (pip install "audiotimm[transformers]")

| Zoo ID | Architecture | SR | Classes | mAP | Notes |
|---|---|---|---|---|---|
| ast-10-10 ⭐ | Audio Spectrogram Transformer | 16 kHz | 527 | 0.459 | Default AST |
| ast-16-16 | AST (larger patches) | 16 kHz | 527 | 0.442 | Faster |
| ast-speechcommands | AST | 16 kHz | 35 | - | Keyword spotting |
| htsat-audioset | HTS-AT (Swin-style) | 32 kHz | 527 | 0.471 | Also CLAP encoder |
| htsat-desed | HTS-AT | 32 kHz | - | - | Sound event detection |
| audiomae-base-ft | AudioMAE (ViT-Base) | 16 kHz | 527 | 0.473 | Facebook MAE |
| beats-iter3plus-as2m-cpt2 | BEATs | 16 kHz | 527 | 0.486 | SOTA mAP |

Wave M2: Zero-Shot CLAP (pip install "audiotimm[clap]")

| Zoo ID | Variant | SR | Notes |
|---|---|---|---|
| clap-laion-fused ⭐ | LAION HTSAT + feature fusion | 48 kHz | Handles long audio |
| clap-laion-unfused | LAION HTSAT | 48 kHz | |
| clap-laion-music-audioset | Music + AudioSet trained | 48 kHz | ESC-50 ≈ 90.1% |
| clap-ms-2023 ⭐ | MS-CLAP HTSAT + GPT-2 | 44.1 kHz | Stronger text encoder |
| clap-ms-2022 | MS-CLAP CNN14 + BERT | 44.1 kHz | |
| clap-ms-clapcap | MS-CLAP + captioning head | 44.1 kHz | Audio → text captions |

Wave M3: Speech SSL Backbones (pip install "audiotimm[speech]")

| Zoo ID | Architecture | SR | Output |
|---|---|---|---|
| wav2vec2-base | Wav2Vec2 Base | 16 kHz | Frame embeddings |
| wav2vec2-large-xlsr | XLS-R 300M (128 languages) | 16 kHz | Multilingual |
| hubert-large-ll60k | HuBERT Large | 16 kHz | Strong SER backbone |
| wavlm-large ⭐ | WavLM Large | 16 kHz | Best for speaker tasks |
| wavlm-base-plus-sv | WavLM + SV head | 16 kHz | Speaker verification |

Wave M4: Whisper (pip install "audiotimm[whisper]")

| Zoo ID | Size | Languages | Notes |
|---|---|---|---|
| whisper-base | Base | 99 | Fast, general |
| whisper-large-v3 ⭐ | Large v3 | 99 | Best accuracy |
| whisper-large-v3-turbo | Large v3 Turbo | 99 | Fast + accurate |
| whisper-distil-large-v3 | Distil Large v3 | 1 (EN) | ~2× faster |

Zero-Shot Classification (Wave M2)

Classify audio into any labels you define, with no training needed:

from audiotimm import ZeroShotClassifier   # coming in Phase 3 (see Project Roadmap)

zs = ZeroShotClassifier.load("clap-laion-fused")
result = zs.classify(
    "clip.wav",
    labels=["dog barking", "car horn", "rain", "crowd applause"]
)
# -> [("rain", 0.81), ("crowd applause", 0.10), ...]
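
For intuition, CLAP-style zero-shot classification embeds the audio clip and each text label into a shared space, then ranks labels by similarity. The sketch below illustrates that scoring step with toy 3-dimensional vectors in place of real model outputs. `zero_shot_scores` and the `logit_scale` value are assumptions for illustration (real models learn the scale during training), and this is not audiotimm's code.

```python
import math
from typing import Dict, List, Sequence


def _normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_scores(
    audio_emb: Sequence[float],
    text_embs: Dict[str, Sequence[float]],
    logit_scale: float = 10.0,  # arbitrary illustrative value
) -> Dict[str, float]:
    """Softmax over scaled cosine similarities between audio and label embeddings."""
    audio = _normalize(audio_emb)
    logits = {
        label: logit_scale * sum(a * t for a, t in zip(audio, _normalize(emb)))
        for label, emb in text_embs.items()
    }
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(z - peak) for label, z in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Toy embeddings: the audio vector points mostly along the "rain" direction.
audio = [0.1, 0.9, 0.2]
labels = {
    "dog barking": [1.0, 0.1, 0.0],
    "rain": [0.0, 1.0, 0.1],
    "car horn": [0.2, 0.1, 1.0],
}
probs = zero_shot_scores(audio, labels)
print(sorted(probs.items(), key=lambda item: item[1], reverse=True))
```

Because the softmax is taken over the candidate labels only, the scores depend on which labels you supply: adding or removing a label changes every probability.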

Plugin API: Register Custom Models

from audiotimm import register_model
from audiotimm.models._base import ModelAdapter
from audiotimm.core.registry import ModelSpec

@register_model("my-bird-net")
class BirdNet(ModelAdapter):

    @classmethod
    def spec(cls):
        return ModelSpec(
            name="",           # filled by decorator
            family="custom",
            adapter_factory=cls,
            checkpoint="./weights/birdnet.pt",
            sample_rate=22050,
            n_classes=500,
            embed_dim=512,
            task="tagging",
            wave="M0",
        )

    def predict(self, waveform):
        ...  # return {label: score} dict

# Now available everywhere
from audiotimm import Classifier
clf = Classifier.load("my-bird-net")

Project Roadmap

Phase 1  ✅  Core engine + PANNs CNN family (Wave M0)
Phase 2  ✅  Wave M1: AST, AudioMAE, HTS-AT, BEATs (transformer taggers)
Phase 3  ·   Wave M2: CLAP zero-shot (LAION + MS)
Phase 4  ·   Embeddings & similarity search
Phase 5  ·   Sound Event Detection timeline
Phase 6  ·   Wave M3: Wav2Vec2, HuBERT, WavLM speech SSL
Phase 7  ·   Training & fine-tuning (Trainer API)
Phase 8  ·   Wave M4: Whisper ASR + encoder embeddings
Phase 9  ·   Evaluation & explainability (Grad-CAM on mel-spectrogram)
Phase 10 ·   Domain packs (bioacoustics, security, health, music, speech)
Phase 11 ·   Streaming / real-time inference
Phase 12 ·   ONNX / TFLite edge export
Phase 13 ·   XenAudio integration + plugin API

Architecture

audiotimm/
├── core/
│   ├── classifier.py    # Classifier.load(), predict(), embed()
│   ├── result.py        # PredictionResult, BatchResult
│   └── registry.py      # ModelRegistry singleton + @register_model
├── models/
│   ├── _base.py         # ModelAdapter ABC
│   ├── panns.py         # Wave M0: CNN14 family
│   ├── yamnet.py        # Wave M0: YAMNet (stub)
│   ├── ast.py           # Wave M1: AST (coming)
│   ├── beats.py         # Wave M1: BEATs (coming)
│   ├── htsat.py         # Wave M1+M2: HTS-AT (coming)
│   ├── audiomae.py      # Wave M1: AudioMAE (coming)
│   ├── clap.py          # Wave M2: LAION + MS-CLAP (coming)
│   ├── wav2vec2.py      # Wave M3 (coming)
│   ├── hubert.py        # Wave M3 (coming)
│   ├── wavlm.py         # Wave M3 (coming)
│   └── whisper.py       # Wave M4 (coming)
├── utils/
│   ├── audio.py         # load_audio(), pad_or_trim()
│   └── download.py      # cached downloader (~/.cache/audiotimm/)
└── cli.py               # `audiotimm predict` / `audiotimm list`
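
The tree lists a `pad_or_trim()` utility in `utils/audio.py`. Its signature is not shown here, so the following is only a guess at what a helper of that name typically does: force a waveform to a fixed number of samples by cutting the tail or right-padding with zeros. The parameter names and the list-based representation are assumptions made for the sketch.

```python
from typing import List, Sequence


def pad_or_trim(samples: Sequence[float], target_len: int) -> List[float]:
    """Return exactly target_len samples: truncate the tail or right-pad with zeros."""
    if target_len < 0:
        raise ValueError("target_len must be non-negative")
    clipped = list(samples[:target_len])
    return clipped + [0.0] * (target_len - len(clipped))


sample_rate = 4            # toy rate so the lists stay readable
target = 2 * sample_rate   # a fixed 2-second window

short_clip = [0.1, 0.2, 0.3]
long_clip = [float(i) for i in range(12)]

print(pad_or_trim(short_clip, target))  # padded with zeros up to 8 samples
print(pad_or_trim(long_clip, target))   # cut down to the first 8 samples
```

Fixed-length inputs matter because models with a fixed input window (a set number of spectrogram frames) cannot accept arbitrary clip durations directly.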

Design Principles

  • Lazy everything: weights download on first predict(), not on import.
  • One result type: PredictionResult everywhere; switching models never breaks your code.
  • Lean core: torch + torchaudio + numpy only for the default model; every heavy dep is behind an optional extra.
  • Registry-first: every model is a registry entry; custom models slot in with @register_model.
  • Immutable results: PredictionResult is read-only; safe to cache and pass around.
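
Two of the principles above, lazy loading and immutable results, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not audiotimm's code: `Result`, `LazyModel`, and `fake_loader` are hypothetical names, and the loader stands in for downloading and deserializing weights.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Callable, Mapping, Optional


@dataclass(frozen=True)
class Result:
    """Read-only prediction container (illustrative, not audiotimm's class)."""
    scores: Mapping[str, float]

    def __post_init__(self):
        # Wrap a copy in a read-only view so callers cannot mutate the scores.
        object.__setattr__(self, "scores", MappingProxyType(dict(self.scores)))

    @property
    def label(self) -> str:
        return max(self.scores, key=self.scores.get)


class LazyModel:
    """Defers the expensive load step until the first predict() call."""

    def __init__(self, loader: Callable[[], Callable[[str], Mapping[str, float]]]):
        self._loader = loader
        self._model: Optional[Callable[[str], Mapping[str, float]]] = None
        self.load_count = 0

    def predict(self, path: str) -> Result:
        if self._model is None:
            self._model = self._loader()
            self.load_count += 1
        return Result(self._model(path))


def fake_loader():
    # Stand-in for downloading and deserializing weights.
    return lambda path: {"Dog": 0.94, "Animal": 0.72}


clf = LazyModel(fake_loader)      # nothing is loaded yet
first = clf.predict("dog.wav")    # triggers the one-time load
second = clf.predict("dog.wav")   # reuses the already loaded model
print(first.label, clf.load_count)

try:
    first.scores["Dog"] = 0.0
except TypeError:
    print("scores are read-only")
```

Freezing the dataclass blocks attribute reassignment, while the `MappingProxyType` wrapper blocks in-place edits to the dictionary; both are needed for a result that is safe to cache and share.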

Contributing

git clone https://github.com/shubham10divakar/audiotimm
cd audiotimm
pip install -e ".[dev]"
pytest tests/

License

Apache 2.0. Model weights are subject to their respective upstream licenses; see PLAN.md Appendix A for per-checkpoint license notes.


Built with ❤️ · audiotimm: Teach Machines to Listen.
