bioscan-dataset

PyTorch torchvision-style datasets for BIOSCAN-1M and BIOSCAN-5M.

In this package, we provide PyTorch/torchvision style dataset classes to load the BIOSCAN-1M and BIOSCAN-5M datasets.

BIOSCAN-1M and BIOSCAN-5M are large multimodal datasets for insect biodiversity monitoring, containing over 1 million and 5 million specimens, respectively. The datasets comprise RGB microscopy images, DNA barcodes, and fine-grained, hierarchical taxonomic labels. Every sample has both an image and a DNA barcode, but the taxonomic labels are incomplete and extend all the way to the species level for only around 9% of the specimens. For more details about the datasets, please see the BIOSCAN-1M paper and BIOSCAN-5M paper, respectively.

Documentation about this package, including the full API details, is available online at readthedocs.

Installation

The bioscan-dataset package is available on PyPI, and the latest release can be installed into your current environment using pip.

To install the package, run:

pip install bioscan-dataset

The package source code is available on GitHub. If you can’t wait for the next PyPI release, the latest (unstable) version can be installed with:

pip install git+https://github.com/bioscan-ml/dataset.git
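
After installing, it can be useful to confirm which release ended up in the environment. The following is a minimal sketch using only the Python standard library; the helper name installed_version is made up for this example and is not part of the package.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="bioscan-dataset"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints the version string, or None when the package is not installed
print(installed_version())
```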

Usage

The datasets can be used in the same way as PyTorch’s torchvision datasets. For example, to load the BIOSCAN-1M dataset:

from bioscan_dataset import BIOSCAN1M

dataset = BIOSCAN1M(root="~/Datasets/bioscan/")

for image, dna_barcode, label in dataset:
    # Do something with the image, dna_barcode, and label
    pass

To load the BIOSCAN-5M dataset:

from bioscan_dataset import BIOSCAN5M

dataset = BIOSCAN5M(root="~/Datasets/bioscan/")

for image, dna_barcode, label in dataset:
    # Do something with the image, dna_barcode, and label
    pass

Note that although BIOSCAN-5M is a superset of BIOSCAN-1M, the repeated data samples are not identical between the two due to data cleaning and processing differences. Additionally, note that the splits are incompatible between the two datasets. For details, see the BIOSCAN-5M paper.

For these reasons, we recommend new projects use the BIOSCAN-5M dataset over BIOSCAN-1M.

Dataset download

For BIOSCAN-5M, the dataset class supports automatically downloading the cropped_256 image package (which is the default package). This can be performed by setting the argument download=True:

dataset = BIOSCAN5M(root="~/Datasets/bioscan/", download=True)

To use a different image package, follow the download instructions given in the BIOSCAN-5M repository, then set the argument image_package to the desired package name, e.g.

# Manually download original_full from
# https://drive.google.com/drive/u/1/folders/1Jc57eKkeiYrnUBc9WlIp-ZS_L1bVlT-0
# and unzip the 5 zip files into ~/Datasets/bioscan/bioscan5m/images/original_full/
# Then load the dataset as follows:
dataset = BIOSCAN5M(root="~/Datasets/bioscan/", image_package="original_full")
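
Before loading a manually downloaded package, it may help to check that the files are where the dataset class will look for them. This is a sketch only: the helper image_package_dir is hypothetical, and the root/bioscan5m/images/package layout is inferred from the unzip instructions in the comment above rather than from the package source.

```python
from pathlib import Path


def image_package_dir(root, image_package="cropped_256"):
    """Build the directory where an image package is expected to live.

    Assumption: the <root>/bioscan5m/images/<image_package>/ layout is
    inferred from the manual download instructions, not from the package.
    """
    return Path(root).expanduser() / "bioscan5m" / "images" / image_package


package_dir = image_package_dir("~/Datasets/bioscan/", "original_full")
if not package_dir.is_dir():
    print(f"Image package not found; expected files under {package_dir}")
```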

For BIOSCAN-1M, automatic download is not supported, so the dataset must be downloaded manually. See the BIOSCAN-1M repository for download instructions.

Partition/split selection

The dataset class can be used to load different dataset splits. By default, the dataset class will load the training split (train).

For example, to load the validation split:

dataset = BIOSCAN5M(root="~/Datasets/bioscan/", split="val")

BIOSCAN-5M is partitioned into train, val, and test splits for closed-world tasks (seen species), and key_unseen, val_unseen, and test_unseen splits for open-world tasks (unseen species). These partitions only use samples labelled to species level.

The pretrain split, which contains 90% of the data, is available for self- and semi-supervised training. Note that these samples may include species in the unseen partition, since we don’t know what species these specimens are.

Additionally, there is an other_heldout split, which contains more unseen species with either too few samples to use for testing, or a genus label which does not appear in the seen set. This partition can be used for training a novelty detector, without exposing the detector to the species in the unseen species set.

Species set	Split	Purpose	# Samples	# Barcodes	# Species
unknown	pretrain	self- and semi-sup. training	4,677,756	2,284,232	—
seen	train	supervision; retrieval keys	289,203	118,051	11,846
	val	model dev; retrieval queries	14,757	6,588	3,378
	test	final eval; retrieval queries	39,373	18,362	3,483
unseen	key_unseen	retrieval keys	36,465	12,166	914
	val_unseen	model dev; retrieval queries	8,819	2,442	903
	test_unseen	final eval; retrieval queries	7,887	3,401	880
heldout	other_heldout	novelty detector training	76,590	41,250	9,862

For more details about the BIOSCAN-5M partitioning, please see the BIOSCAN-5M paper.
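
As a quick sanity check on the table above, the split names and sample counts can be written out as plain data. This sketch does not use the package; SPLIT_STATS and splits_for are illustrative names, and the numbers are copied from the table.

```python
# Species set and sample count per split, copied from the table above
SPLIT_STATS = {
    "pretrain": {"species_set": "unknown", "samples": 4_677_756},
    "train": {"species_set": "seen", "samples": 289_203},
    "val": {"species_set": "seen", "samples": 14_757},
    "test": {"species_set": "seen", "samples": 39_373},
    "key_unseen": {"species_set": "unseen", "samples": 36_465},
    "val_unseen": {"species_set": "unseen", "samples": 8_819},
    "test_unseen": {"species_set": "unseen", "samples": 7_887},
    "other_heldout": {"species_set": "heldout", "samples": 76_590},
}


def splits_for(species_set):
    """List the split names that belong to one species set."""
    return [
        name
        for name, stats in SPLIT_STATS.items()
        if stats["species_set"] == species_set
    ]


total = sum(stats["samples"] for stats in SPLIT_STATS.values())
pretrain_fraction = SPLIT_STATS["pretrain"]["samples"] / total

print(splits_for("unseen"))
print(f"{total:,} samples in total; pretrain share {pretrain_fraction:.1%}")
```

The split names returned by splits_for are the values to pass as the split argument.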

Input modality selection

By default, the dataset class will load both the image and DNA barcode as inputs for each sample.

This can be changed by setting the argument modality to either "image":

dataset = BIOSCAN5M(root="~/Datasets/bioscan/", modality="image")

or "dna":

dataset = BIOSCAN5M(root="~/Datasets/bioscan/", modality="dna")

Additionally, any column names from the metadata can be used as input modalities. For example, to load the latitude and longitude coordinates as inputs:

dataset = BIOSCAN5M(root="~/Datasets/bioscan/", modality=("coord-lat", "coord-lon"))

or to load the size of the insect (in pixels) in addition to the DNA barcode:

dataset = BIOSCAN5M(
    root="~/Datasets/bioscan/", modality=("dna", "image_measurement_value")
)

Multiple modalities can be selected by passing a list of column names. Each item in the dataset will have the inputs in the same order as specified in the modality argument.

All samples have an image and a DNA barcode, but other fields may be incomplete. Any missing values will be replaced with NaN.
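
Because missing metadata values come back as NaN, downstream code usually needs to filter or mask them. The sketch below shows one way to do this in plain Python. It assumes the missing values arrive as Python floats, and the sample items are made up for illustration, shaped like (coord-lat, coord-lon, label).

```python
import math


def has_missing(inputs):
    """Return True if any float input is NaN."""
    return any(isinstance(x, float) and math.isnan(x) for x in inputs)


# Hypothetical items shaped like (coord-lat, coord-lon, label); the values
# are invented for this example and do not come from the dataset.
items = [
    (43.53, -80.23, 12),
    (float("nan"), float("nan"), 7),
    (10.37, -84.01, 3),
]

# Keep only the items whose inputs (everything except the label) are complete
complete = [item for item in items if not has_missing(item[:-1])]
print(len(complete))
```

If the inputs are returned as tensors or NumPy scalars instead, the check would need the corresponding isnan function.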

Target selection

The target label can be selected by setting the argument target_type to be either a taxonomic label or dna_bin. The DNA BIN is similar in granularity to subspecies, but was generated by clustering the DNA barcodes instead of morphology. The default target is "family" for BIOSCAN1M and "species" for BIOSCAN5M.

The target can be a single label, e.g.

dataset = BIOSCAN5M(root="~/Datasets/bioscan/", target_type="genus")

or a list of labels, e.g.

dataset = BIOSCAN5M(
    root="~/Datasets/bioscan/", target_type=["genus", "species", "dna_bin"]
)

By default, the target values will be provided as integer indices that map to the labels for that taxonomic rank (with value -1 used for missing labels), appropriate for training a classification model with cross-entropy. This format can be controlled with the target_format argument, which takes values of either "index" or "text". If this is set to target_format="text", the output will instead be the raw label string:

# Default target format is "index"
dataset = BIOSCAN5M(
    root="~/Datasets/bioscan/", target_type="species", target_format="index"
)
assert dataset[0][-1] == 240

# Using target format "text"
dataset = BIOSCAN5M(
    root="~/Datasets/bioscan/", target_type="species", target_format="text"
)
assert dataset[0][-1] == "Gnamptogenys sulcata"

The default setting is target_format="index". Note that if multiple target types are given, each label will be returned in the same format.

To map target indices back to text labels, the dataset class provides the index2label method. Similarly, the label2index method can be used to map text labels to indices.
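
To make the index format concrete, here is a small standalone illustration of mapping label strings to integer indices, with -1 for missing labels. It is not the package's implementation: the function names are hypothetical, the labels are example values, and assigning indices in sorted label order is an assumption of this sketch.

```python
def build_label_index(labels):
    """Map each distinct non-missing label to an integer index.

    Assumption: indices follow sorted label order; the package may order
    its labels differently.
    """
    distinct = sorted({label for label in labels if label})
    return {label: i for i, label in enumerate(distinct)}


def encode(labels, label_to_index, missing=-1):
    """Convert labels to indices, using `missing` for unlabelled samples."""
    return [label_to_index.get(label, missing) for label in labels]


# Example labels; None marks a sample with no species label
raw = ["Apis mellifera", None, "Bombus impatiens", "Apis mellifera"]
label_to_index = build_label_index(raw)
index_to_label = {i: label for label, i in label_to_index.items()}
encoded = encode(raw, label_to_index)
print(encoded)
```

When training with cross-entropy, samples with index -1 have no valid class and need to be excluded or masked from the loss.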

Data transforms

The dataset class supports the use of data transforms for the image and DNA barcode inputs.

import torch
from torchvision.transforms import v2 as transforms
from bioscan_dataset import BIOSCAN5M
from bioscan_dataset.bioscan5m import RGB_MEAN, RGB_STDEV

# Create an image transform, standardizing image size and normalizing pixel values
image_transform = transforms.Compose(
    [
        transforms.CenterCrop(256),
        transforms.ToImage(),
        transforms.ToDtype(torch.float32, scale=True),
        transforms.Normalize(mean=RGB_MEAN, std=RGB_STDEV),
    ]
)
# Create a DNA transform, mapping from characters to integers and padding to a fixed length
charmap = {"P": 0, "A": 1, "C": 2, "G": 3, "T": 4, "N": 5}
dna_transform = lambda seq: torch.tensor(
    [charmap[char] for char in seq] + [0] * (660 - len(seq)), dtype=torch.long
)
# Load the dataset with the transforms applied for each sample
ds_train = BIOSCAN5M(
    root="~/Datasets/bioscan/",
    split="train",
    transform=image_transform,
    dna_transform=dna_transform,
)
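
The DNA transform above raises a KeyError if a barcode contains a character outside the character map, and it does not truncate sequences longer than 660. A more defensive variant might look like the sketch below. Mapping unexpected characters to N and truncating to a fixed length are choices made for this example, not documented package behaviour, and encode_dna is a hypothetical helper name.

```python
CHARMAP = {"P": 0, "A": 1, "C": 2, "G": 3, "T": 4, "N": 5}
NUCLEOTIDES = "ACGTN"


def encode_dna(seq, length=660):
    """Encode a barcode as `length` integers, truncating or padding as needed.

    Characters outside A/C/G/T/N are mapped to N; this fallback is an
    assumption of this sketch.
    """
    tokens = [
        CHARMAP[char] if char in NUCLEOTIDES else CHARMAP["N"]
        for char in seq.upper()[:length]
    ]
    # Right-pad with the padding token so every output has the same length
    return tokens + [CHARMAP["P"]] * (length - len(tokens))


print(encode_dna("ACGTX-", length=8))
```

To use this as a dna_transform, the returned list can be wrapped with torch.tensor(..., dtype=torch.long).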

Other resources

Read the BIOSCAN-1M paper and BIOSCAN-5M paper.
The dataset can be explored through a web interface using our BIOSCAN Browser.
Read more about the International Barcode of Life (iBOL) and BIOSCAN initiatives.
See the code for the cropping tool that was applied to the images to create the cropped image package.
Examine the code for the experiments described in the BIOSCAN-1M paper.
Examine the code for the experiments described in the BIOSCAN-5M paper.

Citation

If you make use of the BIOSCAN-1M or BIOSCAN-5M datasets in your research, please cite the following papers as appropriate.

BIOSCAN-5M:

@inproceedings{bioscan5m,
   title={{BIOSCAN-5M}: A Multimodal Dataset for Insect Biodiversity},
   booktitle={Advances in Neural Information Processing Systems},
   author={Zahra Gharaee and Scott C. Lowe and ZeMing Gong and Pablo Millan Arias
      and Nicholas Pellegrino and Austin T. Wang and Joakim Bruslund Haurum
      and Iuliia Zarubiieva and Lila Kari and Dirk Steinke and Graham W. Taylor
      and Paul Fieguth and Angel X. Chang
   },
   editor={A. Globerson and L. Mackey and D. Belgrave and A. Fan and U. Paquet and J. Tomczak and C. Zhang},
   pages={36285--36313},
   publisher={Curran Associates, Inc.},
   year={2024},
   volume={37},
   url={https://proceedings.neurips.cc/paper_files/paper/2024/file/3fdbb472813041c9ecef04c20c2b1e5a-Paper-Datasets_and_Benchmarks_Track.pdf},
}

BIOSCAN-1M:

@inproceedings{bioscan1m,
   title={A Step Towards Worldwide Biodiversity Assessment: The {BIOSCAN-1M} Insect Dataset},
   booktitle={Advances in Neural Information Processing Systems},
   author={Gharaee, Z. and Gong, Z. and Pellegrino, N. and Zarubiieva, I.
      and Haurum, J. B. and Lowe, S. C. and McKeown, J. T. A. and Ho, C. Y.
      and McLeod, J. and Wei, Y. C. and Agda, J. and Ratnasingham, S.
      and Steinke, D. and Chang, A. X. and Taylor, G. W. and Fieguth, P.
   },
   editor={A. Oh and T. Neumann and A. Globerson and K. Saenko and M. Hardt and S. Levine},
   pages={43593--43619},
   publisher={Curran Associates, Inc.},
   year={2023},
   volume={36},
   url={https://proceedings.neurips.cc/paper_files/paper/2023/file/87dbbdc3a685a97ad28489a1d57c45c1-Paper-Datasets_and_Benchmarks.pdf},
}

Release history

1.3.0: Apr 19, 2025
1.2.1: Apr 11, 2025
1.2.0: Apr 3, 2025
1.1.0: Mar 27, 2025 (this version)
1.0.1: Dec 7, 2024
1.0.0: Dec 4, 2024 (yanked; reason: minimum requirement version numbers not specified)
