DeepAudio-X: Self-supervised audio toolkit for audio classification and beyond.
DeepAudioX
DeepAudioX is a PyTorch-based library that provides simple, flexible pipelines for audio classification using pretrained audio foundation models as feature extractors.
It is designed to let users train, evaluate, and run inference on custom audio datasets with only a few lines of code, while still allowing advanced customization when needed.
Key Features
- 🔊 Pretrained audio backbones for feature extraction
- 🧠 Modular pooling strategies (e.g. mean, attentive, learnable pooling)
- 🧩 Custom classifier heads for downstream audio classification
- 🚀 High-level training, evaluation, and inference APIs
- 🔁 Fully PyTorch-native and extensible
- 📦 Clean integration with existing PyTorch workflows
Installation
pip install deepaudio-x
Or install from source:
git clone git@github.com:magcil/deepaudio-x.git
cd deepaudio-x
pip install -e .
Quick Start
Creating an Audio Classification Dataset
DeepAudioX provides flexible dataset creation methods for audio classification tasks. Here are the main approaches:
Method 1: From Directory Structure
If your audio files are organized in subdirectories where each subdirectory name is a class label:
data/
├── speech/
│ ├── audio1.wav
│ ├── audio2.wav
│ └── ...
├── music/
│ ├── audio3.wav
│ ├── audio4.wav
│ └── ...
└── noise/
├── audio5.wav
└── ...
You can load the dataset as follows:
from deepaudiox.datasets.audio_classification_dataset import audio_classification_dataset_from_dir
from deepaudiox.utils.training_utils import get_class_mapping_from_dir
# Define a class mapping
class_mapping = get_class_mapping_from_dir(root_dir="path/to/data")
dataset = audio_classification_dataset_from_dir(
root_dir="path/to/data",
sample_rate=16_000, # sampling rate in Hz
class_mapping=class_mapping
)
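Conceptually, `get_class_mapping_from_dir` assigns an integer ID to each class subdirectory. A minimal stdlib sketch of that idea (illustrative only; this is not the library's actual implementation):

```python
from pathlib import Path

def class_mapping_from_tree(root_dir: str) -> dict:
    """Map each immediate subdirectory name (a class label) to an
    integer ID, sorted alphabetically so the result is deterministic."""
    labels = sorted(p.name for p in Path(root_dir).iterdir() if p.is_dir())
    return {label: idx for idx, label in enumerate(labels)}
```

For the directory tree above, this would yield `{"music": 0, "noise": 1, "speech": 2}` (the library's actual ID assignment order may differ).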
Method 2: From Custom File-to-Class Mapping
If your audio files aren't organized in subdirectories, or you need custom mappings, you can create a dictionary mapping file paths to class labels:
from deepaudiox.datasets.audio_classification_dataset import audio_classification_dataset_from_dictionary
from deepaudiox.utils.training_utils import get_class_mapping
# Create a file-to-class mapping
file_to_class_mapping = {
"path/to/audio1.wav": "speech",
"path/to/audio2.wav": "speech",
"path/to/audio3.wav": "music",
# ... more mappings
}
# Create a class-to-id mapping
class_mapping = {"speech": 0, "music": 1, "noise": 2}
# Initialize the dataset
dataset = audio_classification_dataset_from_dictionary(
file_to_class_mapping=file_to_class_mapping,
sample_rate=16_000,
class_mapping=class_mapping
)
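Before training, it can help to verify that every label used in `file_to_class_mapping` also has an entry in `class_mapping`, so typos are caught early. A small helper for that check (a hypothetical utility, not part of the library's API):

```python
def missing_labels(file_to_class_mapping: dict, class_mapping: dict) -> set:
    """Return labels that appear in the file mapping but have no
    class ID; an empty set means the two mappings are consistent."""
    return set(file_to_class_mapping.values()) - set(class_mapping)
```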
Audio Segmentation
To split long audio files into fixed-duration segments, use the segment_duration parameter:
# Create dataset with 2-second audio segments
dataset = audio_classification_dataset_from_dir(
root_dir="path/to/data",
sample_rate=16_000,
segment_duration=2.0,  # Duration in seconds
class_mapping=class_mapping
)
When segment_duration is specified, each audio file is divided into non-overlapping segments of the given duration. Each segment is treated as an independent sample in the dataset, with the same class label as the original audio file. The segment_idx field in the dataset output indicates which segment a sample corresponds to.
Example: A 10-second audio file with segment_duration=2.0 will produce 5 separate samples, each 2 seconds long, all with the same class label.
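The segment count in that example follows from simple floor division; assuming trailing audio shorter than `segment_duration` is dropped (check the library's documentation for the exact trailing-segment policy), the arithmetic looks like:

```python
def num_segments(total_duration: float, segment_duration: float) -> int:
    """Number of complete, non-overlapping segments in an audio file."""
    return int(total_duration // segment_duration)

num_segments(10.0, 2.0)  # -> 5, as in the example above
```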
Both methods return an AudioClassificationDataset object that can be used with PyTorch's DataLoader for training and evaluation.
Dataset Output Format
Each item returned by the dataset is a dictionary containing:
{
"path": str, # File path of the audio
"y_true": int, # Integer class ID
"class_name": str, # String class label
"segment_idx": int, # Segment index (for segmented audio)
"feature": np.ndarray # Audio waveform as numpy array
}
Example usage:
from torch.utils.data import DataLoader
dataloader = DataLoader(dataset, batch_size=32)
for batch in dataloader:
paths = batch["path"] # File paths
class_ids = batch["y_true"] # Shape: (batch_size,)
class_names = batch["class_name"] # Class names
segment_indices = batch["segment_idx"] # Segment indices
waveforms = batch["feature"] # Shape: (batch_size, num_samples)
Create an Audio Classifier with Pretrained Backbone
DeepAudioX simplifies the creation of audio classifiers by combining pretrained audio backbones with custom classifier heads. Here's how to build and configure a classifier:
Basic Setup
from deepaudiox.modules.audio_classifier_constructor import AudioClassifierConstructor
# Initialize classifier with pretrained BEATs backbone
classifier = AudioClassifierConstructor(
num_classes=10, # Number of output classes
backbone="beats", # Pretrained backbone (e.g., "beats")
sample_rate=16_000, # Audio sample rate
pretrained=True, # Use pretrained weights
freeze_backbone=True # Freeze backbone; train only the classifier head
)
Note: When pretrained=True, the BEATs model will be automatically downloaded and cached in your OS-specific cache directory (e.g., ~/.cache on Linux). The library does not contain pretrained model files (.pt files), keeping the repository lightweight. Subsequent uses will load the model from the cache.
Available Backbones
- BEATs: BEATs: Audio Pre-Training with Acoustic Tokenizers (https://arxiv.org/abs/2212.09058)
Key Parameters
- num_classes: Number of output classification classes
- sample_rate: Audio sampling rate (Hz); must match your dataset
- pretrained: Whether to use pretrained weights (recommended)
- freeze_backbone: Freeze backbone parameters during training (reduces the number of parameters to fine-tune)
Optional: Custom Pooling Strategies
You can customize the pooling strategy used to aggregate audio features:
classifier = AudioClassifierConstructor(
num_classes=10,
backbone="beats",
sample_rate=16_000,
pretrained=True,
freeze_backbone=True,
pooling="gap"
)
Available pooling strategies include:
- GAP: Simple average pooling
- SimPool: As presented in "Keep It SimPool: Who Said Supervised Transformers Suffer from Attention Deficit?" (https://arxiv.org/pdf/2309.06891)
- EP: As presented in "Attention, Please! Revisiting Attentive Probing Through the Lens of Efficiency" (https://arxiv.org/abs/2506.10178)
Attentive pooling methods such as ep and simpool typically outperform Global Average Pooling (GAP).
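To make the distinction concrete, here is a NumPy sketch of GAP versus a generic attentive pooling over frame-level features (a conceptual stand-in, not the library's SimPool or EP implementations; the 768-dim feature size is only an example):

```python
import numpy as np

def gap(features: np.ndarray) -> np.ndarray:
    """Global average pooling: (num_frames, dim) -> (dim,)."""
    return features.mean(axis=0)

def attentive_pool(features: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Softmax-weighted average of frames, with weights given by each
    frame's similarity to a (normally learned) query vector."""
    scores = features @ query              # (num_frames,)
    scores = scores - scores.max()         # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ features              # (dim,)

frames = np.random.randn(50, 768)  # e.g. 50 frames of 768-dim features
query = np.random.randn(768)       # in practice a trained parameter
```

Unlike GAP, attentive pooling can emphasize informative frames and down-weight silence or noise, which is one intuition for why it often performs better.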
The classifier is now ready for training or inference.
Training
Train your audio classifier with a few lines of code using the built-in Trainer class:
Minimal Example (Recommended)
from deepaudiox.loops.trainer import Trainer
# Initialize trainer with defaults
trainer = Trainer(
train_dset=train_dataset,
model=classifier,
validation_dset=val_dataset, # Optional
batch_size=32,
epochs=100,
num_workers=4,
patience=20
)
# Start training
trainer.train()
By default, the trainer uses:
- Optimizer: Adam with learning rate 1e-3
- Scheduler: ReduceLROnPlateau with patience 10
Advanced: Custom Optimizer and Scheduler
For more control, you can provide custom optimizer and learning rate scheduler:
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR
from deepaudiox.loops.trainer import Trainer
optimizer = Adam(classifier.parameters(), lr=1e-2)
lr_scheduler = CosineAnnealingLR(optimizer=optimizer, T_max=100, eta_min=1e-6)
trainer = Trainer(
train_dset=train_dataset,
model=classifier,
validation_dset=val_dataset,
optimizer=optimizer,
lr_scheduler=lr_scheduler,
batch_size=32,
epochs=100,
num_workers=4,
patience=20,
path_to_checkpoint="checkpoint.pt"
)
trainer.train()
Trainer Parameters
- train_dset: Training dataset (AudioClassificationDataset)
- model: Audio classifier model to train
- validation_dset: Optional validation dataset for monitoring (if None, a validation split is taken from train_dset)
- optimizer: Optional custom PyTorch optimizer (default: Adam with lr=1e-3)
- lr_scheduler: Optional custom learning rate scheduler (default: ReduceLROnPlateau with patience=10)
- batch_size: Number of samples per batch (default: 16)
- epochs: Maximum number of training epochs (default: 100)
- patience: Number of epochs with no improvement before early stopping (default: 15)
- num_workers: Number of workers for data loading (default: 4)
- path_to_checkpoint: Path to save the best model checkpoint (default: "checkpoint.pt")
Features
- Automatic Checkpointing: Saves the best model based on validation loss
- Early Stopping: Stops training when validation loss plateaus
- Progress Tracking: Displays training progress with loss metrics
- Device Agnostic: Automatically detects and uses GPU if available
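The early-stopping behavior can be sketched in plain Python (a conceptual stand-in for the Trainer's internal logic, which may differ in detail):

```python
class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without
    improvement in validation loss."""

    def __init__(self, patience: int = 15):
        self.patience = patience
        self.best_loss = float("inf")
        self.counter = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.counter = 0       # improvement: reset the counter
        else:
            self.counter += 1      # no improvement this epoch
        return self.counter >= self.patience
```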
Evaluate
Evaluate your trained classifier on a test dataset using the Evaluator class:
import torch
from deepaudiox.loops.evaluator import Evaluator
# Initialize evaluator
evaluator = Evaluator(
test_dset=test_dataset,
model=classifier,
class_mapping=class_mapping,
batch_size=32,
num_workers=4
)
# Load model
classifier.load_state_dict(torch.load("checkpoint.pt"))
# Run evaluation
evaluator.evaluate()
# Access evaluation results
y_true = evaluator.state.y_true # True labels
y_pred = evaluator.state.y_pred # Predicted labels
posteriors = evaluator.state.posteriors # Prediction probabilities
Evaluator Parameters
- test_dset: Test dataset (AudioClassificationDataset)
- model: Trained audio classifier model
- class_mapping: Dictionary mapping class names to IDs
- batch_size: Number of samples per batch (default: 16)
- num_workers: Number of workers for data loading (default: 4)
- device_index: GPU device index to use (optional; auto-detects by default)
Evaluation Results
The evaluator stores predictions in its state:
- y_true: Ground-truth labels as a NumPy array
- y_pred: Predicted class IDs as a NumPy array
- posteriors: Class probability distributions as a NumPy array
You can use these results to compute metrics like accuracy, precision, recall, F1-score, etc.:
from sklearn.metrics import accuracy_score, classification_report
accuracy = accuracy_score(evaluator.state.y_true, evaluator.state.y_pred)
print(f"Accuracy: {accuracy:.4f}")
print(classification_report(evaluator.state.y_true, evaluator.state.y_pred))
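As a sanity check, the predicted class IDs should correspond to the argmax of the posteriors along the class axis. With some hypothetical posteriors:

```python
import numpy as np

# Hypothetical posteriors for 4 samples over 3 classes (rows sum to 1)
posteriors = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.20, 0.30, 0.50],
    [0.90, 0.05, 0.05],
])
y_pred = posteriors.argmax(axis=1)  # -> array([0, 1, 2, 0])
```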
Customization
Advanced users can:
- Plug in custom backbones - Implement your own audio feature extractors
- Implement new pooling layers - Create custom aggregation strategies for sequence features
- Define custom classifier heads - Design specialized classification architectures
- Override training loops - Customize the training process while keeping the pipeline structure
The library is designed to scale from quick experiments to research and production use.
Project Status
🚧 This project is under active development.
APIs may evolve, but backward compatibility will be considered once a stable release is reached.
Attribution
This project is developed at MagCIL and is primarily maintained by:
- Christos Nikou (@ChrisNick92)
- Stefanos Vlachos (@stefanos-vlachos)
- Ellie Vakalaki (@ellievak)
Citation
If you use this library in academic work, please cite:
@software{DeepAudioX,
author = {Nikou, Christos and Vlachos, Stefanos and Vakalaki, Ellie},
title = {DeepAudioX: A PyTorch-based audio classification framework},
year = {2026},
url = {https://github.com/magcil/deepaudio-x}
}
Contributing
Contributions are welcome!
Please open an issue to discuss major changes before submitting a pull request.
File details
Details for the file deepaudio_x-0.1.5.tar.gz.
File metadata
- Download URL: deepaudio_x-0.1.5.tar.gz
- Upload date:
- Size: 6.7 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.26 (macOS)
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `f6de7c23e96dcd97896f38a84f7977ad5c49fe54c6b461ebe048d7939379d415` |
| MD5 | `a56ce2a76e8f8309ef583c25ba35aece` |
| BLAKE2b-256 | `70d12e54266fad8d3a1ac9b56aea3e55979f482d109202ef141b7a54b6951fa1` |
File details
Details for the file deepaudio_x-0.1.5-py3-none-any.whl.
File metadata
- Download URL: deepaudio_x-0.1.5-py3-none-any.whl
- Upload date:
- Size: 50.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.26 (macOS)
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `8191af987d4d5e57c5c50f92e7a64a68d811c6658af902770c93038a30e04681` |
| MD5 | `8b1173e27225c43603a40cc716de4d08` |
| BLAKE2b-256 | `3f7a588b399bce06a3414c6f95eeb3f5f79df86afa8ca8b4bf5d2b32b89c3796` |