Data tools for ESP projects

Project description

alp-data

alp-data gives you unified access to dozens of bioacoustic datasets, including recordings from birds, marine mammals, primates, insects, anurans, and multi-taxon benchmarks. Every built-in dataset shares a common Dataset interface, with streaming, configurable transforms, and consistent loading regardless of source format.

Why alp-data

Bioacoustic recordings live in many places: Zenodo, OSF, GBIF, institutional repositories. Each dataset arrives with its own format, manifest schema, audio organization, sampling rate, and licensing posture. Researchers wanting to listen across datasets first have to write a custom loader per dataset, then a custom mixer for combining them.

alp-data removes that scaffolding. Every built-in dataset surfaces the same interface (for sample in ds, ds[i], len(ds)), with audio and sample_rate keys returned for every sample. Dataset-specific keys carry labels, annotations, and other metadata. Sample-rate harmonization, label derivation, and split-aware loading are wired in.
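To make that contract concrete, here is a minimal sketch of what "the same interface" means in practice. `ToyDataset` and `total_duration_seconds` are hypothetical names invented for this illustration and are not part of alp-data; only the `audio` and `sample_rate` keys and the `for sample in ds` / `ds[i]` / `len(ds)` access patterns come from the description above.

```python
from typing import Any, Dict, Iterator, List

Sample = Dict[str, Any]


class ToyDataset:
    """Hypothetical in-memory stand-in for the shared dataset contract."""

    def __init__(self, clips: List[List[float]], sample_rate: int, labels: List[str]) -> None:
        if len(clips) != len(labels):
            raise ValueError("clips and labels must have the same length")
        self._clips = clips
        self._labels = labels
        self._sample_rate = sample_rate

    def __len__(self) -> int:
        return len(self._clips)

    def __getitem__(self, index: int) -> Sample:
        # "audio" and "sample_rate" are the shared keys; "label" stands in
        # for a dataset-specific key.
        return {
            "audio": self._clips[index],
            "sample_rate": self._sample_rate,
            "label": self._labels[index],
        }

    def __iter__(self) -> Iterator[Sample]:
        for index in range(len(self)):
            yield self[index]


def total_duration_seconds(ds) -> float:
    """Sum clip durations using only the two shared keys."""
    return sum(len(s["audio"]) / s["sample_rate"] for s in ds)


toy = ToyDataset(
    clips=[[0.0] * 16000, [0.1] * 8000],
    sample_rate=16000,
    labels=["orca", "humpback"],
)
print(len(toy), toy[1]["label"], total_duration_seconds(toy))
```

Because `total_duration_seconds` touches only the shared keys, the same helper would work unchanged on any dataset that honours the contract.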

You can stand up a multi-dataset benchmark, compare models across taxa, or stream species-specific audio for transfer learning without first becoming a data-engineering specialist.

What's included

alp-data ships with 30+ built-in datasets across:

Birds: large benchmarks including BirdSet (6,800+ training hours, 10,000 species) and WABAD (1,192 species, 72 sites); aggregator corpora like Xeno-Canto; and site- or species-specific recordings spanning arctic species, Hawaiian soundscapes, the Powdermill dawn chorus, and individual-ID datasets for chiffchaff, little owl, and tree pipit.
Marine mammals: the Watkins Marine Mammal Sound Database (~13,700 clips across ~50 cetacean and pinniped species), DCLDE 2026 killer whale annotations, and dolphin whistle and click corpora.
Primates: gelada vocal sequences, gibbon solos, infant marmoset vocalizations, and macaque coo calls.
Insects, anurans, and other mammals: InsectSet459 (459 Orthoptera and Cicadidae species), AnuraSetStrong (42 frog species, 27 hours of expert annotations), and giant otter vocalization types.
Multi-taxon benchmarks and aggregators: BEANS and BeansZero (the canonical bioacoustic and zero-shot benchmarks), AnimalSpeak (1M+ audio–caption pairs), AnimalSoundArchive, iNaturalist audio, AudioSet / AudioSetStrong, and Voxaboxen (overlapping vocalization detection).

Most datasets are openly licensed (CC-BY, CC-BY-NC, CC0, public domain). License and source metadata are available per dataset via Dataset.info, so provenance can be checked before use.

Quickstart

from alp_data import Beans

# Load the 'train' split of the BEANS bioacoustic benchmark at 16 kHz.
# Resampling is done on the fly via librosa.
beans = Beans(split="train", sample_rate=16000)

print(len(beans))

# Iterate
for sample in beans:
    print(sample["audio"].shape)
    break

# Indexed access
sample = beans[0]
print(sample["audio"].shape)

# Streaming mode (lower memory; len() not available)
beans_streaming = Beans(split="train", streaming=True)
for sample in beans_streaming:
    print(sample["audio"].shape)
    break

⚠️ Warning: When using a PyTorch DataLoader with num_workers > 0, you must set the multiprocessing start method to "spawn" (not the default "fork" on Linux). alp-data datasets hold cloud-backed I/O handles (fsspec / gcsfs / s3fs) that are not safe to inherit across a fork; using "fork" can deadlock workers or corrupt audio reads. Either call torch.multiprocessing.set_start_method("spawn", force=True) at the top of your program, or pass a "spawn" context to the DataLoader, e.g.:
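import torch.multiprocessing as mp
from torch.utils.data import DataLoader

from alp_data import Beans

# Any alp-data dataset works here; BEANS is used as in the Quickstart.
dataset = Beans(split="train", sample_rate=16000)

loader = DataLoader(
    dataset,
    num_workers=4,
    # Start workers with "spawn" so they do not inherit forked I/O handles.
    multiprocessing_context=mp.get_context("spawn"),
)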
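The difference between the two options in the warning is scope: `set_start_method` changes the process-wide default, while a context object affects only the code it is passed to. The sketch below uses the standard-library `multiprocessing` module, whose `get_context` and `get_start_method` calls `torch.multiprocessing` re-exports; `spawn_context` is a hypothetical helper name used only for this illustration.

```python
import multiprocessing as mp


def spawn_context():
    """Return a "spawn" context without changing the process-wide default."""
    return mp.get_context("spawn")


ctx = spawn_context()

# allow_none=True reports the global default without fixing it as a side
# effect; it returns None if no default has been set yet.
print("global start method:", mp.get_start_method(allow_none=True))

# The context always reports the method it was created with.
print("context start method:", ctx.get_start_method())
```

Passing a context keeps the choice local to one DataLoader, which may be preferable in libraries or notebooks where other code relies on the global default.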

Datasets and transforms can also be loaded from a YAML config:

# config.yaml
dataset:
  dataset_name: beans
  split: train
  sample_rate: 16000
  transformations:
    - type: filter
      property: source_dataset
      mode: include
      values: ["watkins"]

from alp_data import dataset_from_config

ds, transform_metadata = dataset_from_config("config.yaml")

# ds now contains only "watkins" samples from the BEANS train split, resampled to 16 kHz.
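To illustrate what the filter entry in the config expresses, here is a small pure-Python sketch of include-style filtering over manifest rows. It is not alp-data's implementation: `apply_filter`, the toy manifest, its `path` column, and the `"other_source"` value are made up for this example, and the `"exclude"` mode is assumed as the natural counterpart of `"include"` rather than taken from the config above.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def apply_filter(rows: Iterable[Row], prop: str, mode: str, values: Iterable[Any]) -> List[Row]:
    """Keep or drop rows depending on whether row[prop] is in values."""
    wanted = set(values)
    if mode == "include":
        return [row for row in rows if row.get(prop) in wanted]
    if mode == "exclude":  # assumed counterpart, not shown in the config
        return [row for row in rows if row.get(prop) not in wanted]
    raise ValueError(f"unknown filter mode: {mode!r}")


# Mirrors the single transformation entry from config.yaml.
config = {
    "type": "filter",
    "property": "source_dataset",
    "mode": "include",
    "values": ["watkins"],
}

# Toy manifest rows; real manifests carry more columns.
manifest = [
    {"path": "a.wav", "source_dataset": "watkins"},
    {"path": "b.wav", "source_dataset": "other_source"},
    {"path": "c.wav", "source_dataset": "watkins"},
]

kept = apply_filter(manifest, config["property"], config["mode"], config["values"])
print([row["path"] for row in kept])
```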

Installation

git clone https://github.com/earthspecies/alp-data.git
cd alp-data
pip install -e .  # or uv sync

Development

This repository uses uv for dependency management.

# Install all dependencies including dev tools
uv sync --dev

# Set up pre-commit hooks (uses prek; pass --overwrite to replace existing pre-commit lib hooks)
uv run prek install

# Run tests (note: some tests require authentication via Google or Cloudflare; see the test files for details)
uv run pytest

Documentation

Build documentation locally with:

make serve-local-docs
# Hosts docs at http://localhost:8000

Highlights

Source-flexible loading: built-in datasets stream from a public Cloudflare R2 bucket; the I/O module can also read from local paths, GCS, S3, and other R2 buckets, with CSV / JSON-Lines / Parquet manifests supported at the backend layer.
Iteration and random-access indexing: every Dataset supports for sample in ds and ds[i].
Streaming mode: process datasets larger than memory with streaming=True.
Configurable transforms: filter rows, select columns, derive labels, deduplicate, balance, upsample long tails, subsample by ratio.
Combine datasets: ConcatenatedDataset merges multiple datasets with configurable column-merge strategies; ChainedDataset iterates over them sequentially with streaming support.
Pluggable backends: pandas or polars under the hood, selectable per Dataset.
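The two ways of combining datasets listed above differ mainly in what access they need. As a rough sketch of the idea only (not the ConcatenatedDataset or ChainedDataset implementations, and ignoring column-merge strategies entirely), sequential chaining needs nothing but iteration, while a random-access merge has to map a global index onto the right underlying dataset. `chained` and `ToyConcat` are hypothetical names for this illustration.

```python
from bisect import bisect_right
from itertools import accumulate, chain
from typing import Any, Iterable, Iterator, List, Sequence


def chained(*datasets: Iterable[Any]) -> Iterator[Any]:
    """Yield samples from each dataset in turn; works for streams without len()."""
    return chain.from_iterable(datasets)


class ToyConcat:
    """Random-access view over several sized, indexable datasets."""

    def __init__(self, *datasets: Sequence[Any]) -> None:
        self._datasets = datasets
        # Cumulative end offsets, e.g. lengths (2, 3) -> [2, 5].
        self._ends: List[int] = list(accumulate(len(d) for d in datasets))

    def __len__(self) -> int:
        return self._ends[-1] if self._ends else 0

    def __getitem__(self, index: int) -> Any:
        if not 0 <= index < len(self):
            raise IndexError(index)
        # First dataset whose end offset is strictly greater than index.
        which = bisect_right(self._ends, index)
        start = self._ends[which - 1] if which else 0
        return self._datasets[which][index - start]


birds = [{"label": "chiffchaff"}, {"label": "little owl"}]
whales = [{"label": "orca"}, {"label": "humpback"}, {"label": "beluga"}]

merged = ToyConcat(birds, whales)
print(len(merged), merged[2]["label"])
print([s["label"] for s in chained(iter(birds), iter(whales))])
```

The chained variant never calls `len()` or indexes into its inputs, which is why that style is compatible with streaming sources.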

License

alp-data is released under the MIT License. See LICENSE for the full text.

The datasets accessed through alp-data are governed by their own licenses, set by their original creators — independent of alp-data's code license. Most are openly licensed (CC-BY, CC-BY-NC, CC0, public domain), but terms vary per dataset and may include attribution, share-alike, or non-commercial restrictions. Per-dataset license and source metadata are available via Dataset.info.

Contributing

For improvements, bug reports, or proposals, open an issue or pull request.

Project details

This version: 1.10.0 (Jun 10, 2026)
