Data tools for ESP projects

alp-data

License: MIT

alp-data gives you unified access to dozens of bioacoustic datasets, including recordings from birds, marine mammals, primates, insects, anurans, and multi-taxon benchmarks. Every built-in dataset shares a common Dataset interface, with streaming, configurable transforms, and consistent loading regardless of source format.

Why alp-data

Bioacoustic recordings live in many places: Zenodo, OSF, GBIF, institutional repositories. Each dataset arrives with its own format, manifest schema, audio organization, sampling rate, and license terms. Researchers who want to work across datasets first have to write a custom loader for each one, then a custom mixer to combine them.

alp-data removes that scaffolding. Every built-in dataset surfaces the same interface (for sample in ds, ds[i], len(ds)), with audio and sample_rate keys returned for every sample. Dataset-specific keys carry labels, annotations, and other metadata. Sample-rate harmonization, label derivation, and split-aware loading are wired in.
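
Because every sample exposes audio and sample_rate, dataset-agnostic utilities can be written once. The sketch below is an illustration only and is not part of alp-data: total_duration_seconds is a hypothetical helper, the toy list stands in for a real dataset, and the code assumes mono audio stored as a 1-D sequence of samples.

```python
from typing import Any, Iterable, Mapping


def total_duration_seconds(ds: Iterable[Mapping[str, Any]]) -> float:
    """Sum clip durations using only the shared "audio" / "sample_rate" keys.

    Assumes mono audio stored as a 1-D sequence, so len(audio) is the
    number of samples in the clip.
    """
    total = 0.0
    for sample in ds:
        total += len(sample["audio"]) / sample["sample_rate"]
    return total


# Toy stand-in for a dataset: two silent clips at different sample rates.
toy_ds = [
    {"audio": [0.0] * 16000, "sample_rate": 16000},  # 1.0 s
    {"audio": [0.0] * 11025, "sample_rate": 22050},  # 0.5 s
]

print(total_duration_seconds(toy_ds))
```

The helper never inspects dataset-specific keys, so the same function should apply to any object that yields samples with these two keys.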

You can stand up a multi-dataset benchmark, compare models across taxa, or stream species-specific audio for transfer learning without first becoming a data-engineering specialist.

What's included

alp-data ships with 30+ built-in datasets across:

  • Birds — large benchmarks including BirdSet (6,800+ training hours, 10,000 species) and WABAD (1,192 species, 72 sites); aggregator corpora like Xeno-Canto; and site- or species-specific recordings spanning arctic species, Hawaiian soundscapes, the Powdermill dawn chorus, and individual-ID datasets for chiffchaff, little owl, and tree pipit.
  • Marine mammals — the Watkins Marine Mammal Sound Database (~13,700 clips across ~50 cetacean and pinniped species), DCLDE 2026 killer whale annotations, and dolphin whistle and click corpora.
  • Primates — gelada vocal sequences, gibbon solos, infant marmoset vocalizations, and macaque coo calls.
  • Insects, anurans, and other mammals — InsectSet459 (459 Orthoptera and Cicadidae species), AnuraSetStrong (42 frog species, 27 hours of expert annotations), and giant otter vocalization types.
  • Multi-taxon benchmarks and aggregators — BEANS and BeansZero (the canonical bioacoustic and zero-shot benchmarks), AnimalSpeak (1M+ audio–caption pairs), AnimalSoundArchive, iNaturalist audio, AudioSet / AudioSetStrong, and Voxaboxen (overlapping vocalization detection).

Most datasets are openly licensed (CC-BY, CC-BY-NC, CC0, public domain). License and source metadata are available per-dataset via Dataset.info — provenance matters.

Quickstart

from alp_data import Beans

# Load the 'train' split of the BEANS bioacoustic benchmark at 16 kHz.
# Resampling is done on the fly via librosa.
beans = Beans(split="train", sample_rate=16000)

print(len(beans))

# Iterate
for sample in beans:
    print(sample["audio"].shape)
    break

# Indexed access
sample = beans[0]
print(sample["audio"].shape)

# Streaming mode (lower memory; len() not available)
beans_streaming = Beans(split="train", streaming=True)
for sample in beans_streaming:
    print(sample["audio"].shape)
    break

⚠️ Warning: When using a PyTorch DataLoader with num_workers > 0, set the multiprocessing start method to "spawn". Do not use "fork", which is the default on Linux for Python versions before 3.14. alp-data datasets hold cloud-backed I/O handles (fsspec / gcsfs / s3fs) that are not safe to inherit across a fork; using "fork" can deadlock workers or corrupt audio reads. Either call torch.multiprocessing.set_start_method("spawn", force=True) at the top of your program, or pass a "spawn" context to the DataLoader, for example:

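import torch.multiprocessing as mp
from torch.utils.data import DataLoader

# `dataset` is any alp-data dataset, e.g. the `beans` object created above.
loader = DataLoader(
    dataset,
    num_workers=4,
    # Workers are started with "spawn" instead of inheriting state via fork.
    multiprocessing_context=mp.get_context("spawn"),
)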
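
With "spawn", each worker starts a fresh interpreter and re-imports the main module, so code that builds a DataLoader should sit behind an if __name__ == "__main__": guard. The following is a minimal sketch using only the standard library. get_spawn_context is a hypothetical helper name, and the DataLoader call is left as a comment because it depends on your dataset.

```python
import multiprocessing as mp


def get_spawn_context():
    """Return a "spawn" context without changing the process-wide default."""
    if "spawn" not in mp.get_all_start_methods():
        raise RuntimeError('the "spawn" start method is not available here')
    return mp.get_context("spawn")


def main() -> None:
    ctx = get_spawn_context()
    print(ctx.get_start_method())
    # Build the loader here, inside main(), for example:
    # loader = DataLoader(dataset, num_workers=4, multiprocessing_context=ctx)


# Required with "spawn": workers re-import this module, and the guard keeps
# them from re-running the program's entry point.
if __name__ == "__main__":
    main()
```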

Datasets and transforms can also be loaded from a YAML config:

# config.yaml
dataset:
  dataset_name: beans
  split: train
  sample_rate: 16000
  transformations:
    - type: filter
      property: source_dataset
      mode: include
      values: ["watkins"]

Then load the config from Python:

from alp_data import dataset_from_config

ds, transform_metadata = dataset_from_config("config.yaml")

# ds now contains only "watkins" samples from the BEANS train split, resampled to 16 kHz.
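
To make the semantics of the filter transformation concrete, here is a plain-Python sketch of what an include filter on source_dataset amounts to. This is not alp-data's implementation: filter_rows is a hypothetical function, the rows are made up for illustration, and the "exclude" mode is an assumption (the config above only shows include).

```python
from typing import Any, Iterable, Iterator, Mapping, Sequence


def filter_rows(
    rows: Iterable[Mapping[str, Any]],
    prop: str,
    mode: str,
    values: Sequence[Any],
) -> Iterator[Mapping[str, Any]]:
    """Yield rows whose `prop` value is (include) or is not (exclude) in `values`."""
    if mode not in ("include", "exclude"):
        raise ValueError(f"unknown mode: {mode!r}")
    allowed = set(values)
    for row in rows:
        matches = row.get(prop) in allowed
        # Keep matching rows for "include", non-matching rows for "exclude".
        if matches == (mode == "include"):
            yield row


# Made-up manifest rows; only the source_dataset key mirrors the config above.
rows = [
    {"source_dataset": "watkins", "label": "a"},
    {"source_dataset": "other", "label": "b"},
    {"source_dataset": "watkins", "label": "c"},
]

kept = list(filter_rows(rows, "source_dataset", "include", ["watkins"]))
print([row["label"] for row in kept])
```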

Installation

git clone https://github.com/earthspecies/alp-data.git
cd alp-data
pip install -e .  # or uv sync

Development

This repository uses uv for dependency management.

# Install all dependencies including dev tools
uv sync --dev

# Set up pre-commit hooks (uses prek; pass --overwrite to replace existing pre-commit lib hooks)
uv run prek install

# Run tests (note: some tests require authentication via Google or Cloudflare; see test files for details)
uv run pytest

Documentation

Build documentation locally with:

make serve-local-docs
# Hosts docs at http://localhost:8000

Highlights

  • Source-flexible loading — built-in datasets stream from a public Cloudflare R2 bucket; the I/O module can also read from local paths, GCS, S3, and other R2 buckets, with CSV / JSON-Lines / Parquet manifests supported at the backend layer.
  • Iteration and random access — every Dataset supports for sample in ds and ds[i].
  • Streaming mode — process datasets larger than memory with streaming=True.
  • Configurable transforms — filter rows, select columns, derive labels, deduplicate, balance, upsample long tails, subsample by ratio.
  • Combine datasets — ConcatenatedDataset merges multiple datasets with configurable column-merge strategies; ChainedDataset iterates over them sequentially with streaming support.
  • Pluggable backends — pandas or polars under the hood, selectable per-Dataset.
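
The difference between concatenating and chaining can be illustrated with a toy sketch. The names below are hypothetical stand-ins, not alp-data's ConcatenatedDataset or ChainedDataset, and column merging is left out entirely. ToyConcat keeps random access by mapping a global index to a (dataset, local index) pair, while toy_chain only iterates, which is why chaining suits streaming sources that have no len().

```python
from bisect import bisect_right
from itertools import accumulate, chain


class ToyConcat:
    """Random access over several indexable datasets laid end to end."""

    def __init__(self, datasets):
        self.datasets = list(datasets)
        # offsets[k] is the total length of datasets[0..k].
        self.offsets = list(accumulate(len(d) for d in self.datasets))

    def __len__(self):
        return self.offsets[-1] if self.offsets else 0

    def __getitem__(self, i):
        if i < 0:
            i += len(self)
        if not 0 <= i < len(self):
            raise IndexError(i)
        k = bisect_right(self.offsets, i)  # which dataset holds index i
        start = self.offsets[k - 1] if k > 0 else 0
        return self.datasets[k][i - start]


def toy_chain(*datasets):
    """Iterate datasets one after another; needs neither len() nor indexing."""
    return chain.from_iterable(datasets)


a = ["a0", "a1"]
b = ["b0", "b1", "b2"]

combined = ToyConcat([a, b])
print(len(combined), combined[2], combined[-1])

# Generators stand in for streaming datasets here.
print(list(toy_chain(iter(a), iter(b))))
```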

License

alp-data is released under the MIT License. See LICENSE for the full text.

The datasets accessed through alp-data are governed by their own licenses, set by their original creators — independent of alp-data's code license. Most are openly licensed (CC-BY, CC-BY-NC, CC0, public domain), but terms vary per dataset and may include attribution, share-alike, or non-commercial restrictions. Per-dataset license and source metadata are available via Dataset.info.

Contributing

For improvements, bug reports, or proposals, open an issue or pull request.
