Data tools for ESP projects
alp-data
alp-data gives you unified access to dozens of bioacoustic datasets, including recordings from birds, marine mammals, primates, insects, anurans, and multi-taxon benchmarks. Every built-in dataset shares a common Dataset interface, with streaming, configurable transforms, and consistent loading regardless of source format.
Why alp-data
Bioacoustic recordings live in many places: Zenodo, OSF, GBIF, institutional repositories. Each dataset arrives with its own format, manifest schema, audio organization, sampling rate, and licensing posture. Researchers wanting to listen across datasets first have to write a custom loader per dataset, then a custom mixer for combining them.
alp-data removes that scaffolding. Every built-in dataset surfaces the same interface (for sample in ds, ds[i], len(ds)), with audio and sample_rate keys returned for every sample. Dataset-specific keys carry labels, annotations, and other metadata. Sample-rate harmonization, label derivation, and split-aware loading are wired in.
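As a rough illustration of what a shared interface buys you, the sketch below defines a hypothetical ToyDataset (not part of alp-data, and not how alp-data is implemented) that follows the conventions described above: iteration, indexing, len(), and audio / sample_rate keys on every sample. Code written against those conventions, such as the hypothetical total_duration_seconds helper, should work unchanged on any dataset that honors them.

```python
from typing import Any, Dict, Iterator, List


class ToyDataset:
    """Hypothetical stand-in that mimics the shared conventions (not alp-data code)."""

    def __init__(self, clips: List[List[float]], sample_rate: int, label: str) -> None:
        self._clips = clips
        self._sample_rate = sample_rate
        self._label = label

    def __len__(self) -> int:
        return len(self._clips)

    def __getitem__(self, index: int) -> Dict[str, Any]:
        # "audio" and "sample_rate" are always present; "label" is dataset-specific.
        return {
            "audio": self._clips[index],
            "sample_rate": self._sample_rate,
            "label": self._label,
        }

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        for i in range(len(self)):
            yield self[i]


def total_duration_seconds(dataset) -> float:
    """Sum clip durations using only the two keys every sample is assumed to carry."""
    return sum(len(s["audio"]) / s["sample_rate"] for s in dataset)


birds = ToyDataset([[0.0] * 16000, [0.0] * 8000], sample_rate=16000, label="bird")
frogs = ToyDataset([[0.0] * 22050], sample_rate=22050, label="frog")

# The same helper runs on both datasets without any per-dataset loader code.
for ds in (birds, frogs):
    print(len(ds), ds[0]["sample_rate"], total_duration_seconds(ds))
```

The helper never touches the label key, which is why it stays portable: only the keys guaranteed for every sample are relied on.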
You can stand up a multi-dataset benchmark, compare models across taxa, or stream species-specific audio for transfer learning without first becoming a data-engineering specialist.
What's included
alp-data ships with 30+ built-in datasets across:
- Birds — large benchmarks including BirdSet (6,800+ training hours, 10,000 species) and WABAD (1,192 species, 72 sites); aggregator corpora like Xeno-Canto; and site- or species-specific recordings spanning arctic species, Hawaiian soundscapes, the Powdermill dawn chorus, and individual-ID datasets for chiffchaff, little owl, and tree pipit.
- Marine mammals — the Watkins Marine Mammal Sound Database (~13,700 clips across ~50 cetacean and pinniped species), DCLDE 2026 killer whale annotations, and dolphin whistle and click corpora.
- Primates — gelada vocal sequences, gibbon solos, infant marmoset vocalizations, and macaque coo calls.
- Insects, anurans, and other mammals — InsectSet459 (459 Orthoptera and Cicadidae species), AnuraSetStrong (42 frog species, 27 hours of expert annotations), and giant otter vocalization types.
- Multi-taxon benchmarks and aggregators — BEANS and BeansZero (the canonical bioacoustic and zero-shot benchmarks), AnimalSpeak (1M+ audio–caption pairs), AnimalSoundArchive, iNaturalist audio, AudioSet / AudioSetStrong, and Voxaboxen (overlapping vocalization detection).
Most datasets are openly licensed (CC-BY, CC-BY-NC, CC0, public domain). License and source metadata are available per-dataset via Dataset.info — provenance matters.
Quickstart
from alp_data import Beans

# Load the 'train' split of the BEANS bioacoustic benchmark at 16 kHz.
# Resampling is done on the fly via librosa.
beans = Beans(split="train", sample_rate=16000)
print(len(beans))

# Iterate
for sample in beans:
    print(sample["audio"].shape)
    break

# Indexed access
sample = beans[0]
print(sample["audio"].shape)

# Streaming mode (lower memory; len() not available)
beans_streaming = Beans(split="train", streaming=True)
for sample in beans_streaming:
    print(sample["audio"].shape)
    break
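Because len() is not available in streaming mode, a common way to peek at a fixed number of samples is itertools.islice from the standard library. The sketch below uses a hypothetical fake_stream generator in place of a real streaming dataset, so it runs without alp-data installed; it assumes only that a streaming dataset behaves like an ordinary Python iterable.

```python
from itertools import islice
from typing import Dict, Iterator, List


def fake_stream(n_samples: int) -> Iterator[Dict[str, object]]:
    """Hypothetical generator standing in for a streaming dataset."""
    for i in range(n_samples):
        # Each sample carries the two keys every sample is described as having.
        yield {"audio": [0.0] * (1000 * (i + 1)), "sample_rate": 16000}


def take(stream, n: int) -> List[Dict[str, object]]:
    """Materialize at most n samples without consuming the rest of the stream."""
    return list(islice(stream, n))


# Only three samples are ever generated, even though the stream is nominally huge.
first_three = take(fake_stream(1_000_000), 3)
print([len(s["audio"]) for s in first_three])
```

The same pattern replaces the for/break idiom whenever more than one sample is needed.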
⚠️ Warning: When using a PyTorch DataLoader with num_workers > 0, you must set the multiprocessing start method to "spawn" (not the default "fork" on Linux). alp-data datasets hold cloud-backed I/O handles (fsspec / gcsfs / s3fs) that are not safe to inherit across a fork; using "fork" can deadlock workers or corrupt audio reads. Either call torch.multiprocessing.set_start_method("spawn", force=True) at the top of your program, or pass a "spawn" context to the DataLoader, e.g.:
import torch.multiprocessing as mp
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,
    num_workers=4,
    multiprocessing_context=mp.get_context("spawn"),
)
Datasets and transforms can also be loaded from a YAML config:
# config.yaml
dataset:
  dataset_name: beans
  split: train
  sample_rate: 16000
  transformations:
    - type: filter
      property: source_dataset
      mode: include
      values: ["watkins"]
from alp_data import dataset_from_config
ds, transform_metadata = dataset_from_config("config.yaml")
# ds now only contains "watkins" samples from the BEANS train split, resampled to 16kHz.
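To make the effect of the filter transformation concrete, here is a minimal sketch of include-style filtering over plain dictionaries. The apply_filter function is hypothetical and is not alp-data's implementation; its parameter names simply mirror the YAML keys above, and the "exclude" mode is an assumption added for symmetry.

```python
from typing import Any, Dict, Iterable, List


def apply_filter(
    rows: Iterable[Dict[str, Any]],
    prop: str,
    mode: str,
    values: Iterable[Any],
) -> List[Dict[str, Any]]:
    """Keep or drop rows based on whether row[prop] is in values (illustrative only)."""
    allowed = set(values)
    if mode == "include":
        return [row for row in rows if row.get(prop) in allowed]
    if mode == "exclude":  # assumed counterpart of "include"
        return [row for row in rows if row.get(prop) not in allowed]
    raise ValueError(f"unknown mode: {mode!r}")


# Made-up manifest rows; the source_dataset values are for illustration only.
manifest = [
    {"path": "a.wav", "source_dataset": "watkins"},
    {"path": "b.wav", "source_dataset": "other"},
    {"path": "c.wav", "source_dataset": "watkins"},
]

kept = apply_filter(manifest, prop="source_dataset", mode="include", values=["watkins"])
print([row["path"] for row in kept])
```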
Installation
git clone https://github.com/earthspecies/alp-data.git
cd alp-data
pip install -e . # or uv sync
Development
This repository uses uv for dependency management.
# Install all dependencies including dev tools
uv sync --dev
# Set up pre-commit hooks (uses prek; pass --overwrite to replace existing pre-commit lib hooks)
uv run prek install
# Run tests (note: some tests require authentication via Google or Cloudflare; see the test files for details)
uv run pytest
Documentation
Build documentation locally with:
make serve-local-docs
# Hosts docs at http://localhost:8000
Highlights
- Source-flexible loading — built-in datasets stream from a public Cloudflare R2 bucket; the I/O module can also read from local paths, GCS, S3, and other R2 buckets, with CSV / JSON-Lines / Parquet manifests supported at the backend layer.
- Iterate or random-access indexing — every Dataset supports for sample in ds and ds[i].
- Streaming mode — process datasets larger than memory with streaming=True.
- Configurable transforms — filter rows, select columns, derive labels, deduplicate, balance, upsample long tails, subsample by ratio.
- Combine datasets — ConcatenatedDataset merges multiple datasets with configurable column-merge strategies; ChainedDataset iterates over them sequentially with streaming support.
- Pluggable backends — pandas or polars under the hood, selectable per-Dataset.
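The difference between concatenating and chaining datasets can be sketched with plain Python. Nothing below uses alp-data's ConcatenatedDataset or ChainedDataset; the chained and concatenate_rows helpers and the "union" / "intersection" strategy names are assumptions chosen for illustration, and the actual column-merge options may differ.

```python
from itertools import chain
from typing import Any, Dict, Iterable, Iterator, List


def chained(*datasets: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Lazily iterate datasets one after another; works with one-shot streams."""
    return chain.from_iterable(datasets)


def concatenate_rows(
    datasets: List[List[Dict[str, Any]]], strategy: str = "union"
) -> List[Dict[str, Any]]:
    """Merge rows into one table, reconciling columns by a hypothetical strategy."""
    rows = [row for ds in datasets for row in ds]
    if not rows:
        return []
    key_sets = [set(row) for row in rows]
    if strategy == "union":
        columns = sorted(set().union(*key_sets))  # keep every column, fill gaps with None
    elif strategy == "intersection":
        columns = sorted(set.intersection(*key_sets))  # keep only shared columns
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [{col: row.get(col) for col in columns} for row in rows]


# Made-up rows with one shared column and one dataset-specific column each.
birds = [{"audio": "b1", "species": "chiffchaff"}]
whales = [{"audio": "w1", "call_type": "click"}]

print(list(chained(iter(birds), iter(whales))))
print(concatenate_rows([birds, whales], strategy="union"))
print(concatenate_rows([birds, whales], strategy="intersection"))
```

Chaining never needs the full data in memory, which is why it pairs naturally with streaming; concatenation has to see every row's columns before it can reconcile them.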
License
alp-data is released under the MIT License. See LICENSE for the full text.
The datasets accessed through alp-data are governed by their own licenses, set by their original creators — independent of alp-data's code license. Most are openly licensed (CC-BY, CC-BY-NC, CC0, public domain), but terms vary per dataset and may include attribution, share-alike, or non-commercial restrictions. Per-dataset license and source metadata are available via Dataset.info.
Contributing
For improvements, bug reports, or proposals, open an issue or pull request.