dataset-builder-inat

A modular Python toolkit for building, analyzing, and visualizing fine-grained classification datasets from iNaturalist. Includes tools for web crawling, species filtering, class balancing, and manifest generation.

Project description

A modular dataset preparation toolkit for species-based classification pipelines.

A flexible and fast dataset builder for iNaturalist-style species classification.

About The Project

dataset_builder is a modular toolkit designed to streamline the process of preparing image classification datasets, especially for biodiversity and species-based research projects. It provides flexible CLI tools and Python APIs to help you:

Organize images by species into training and validation folders.
Apply filtering rules based on dominant species.
Export dataset manifests in plain text or Parquet formats.
Handle restricted dataset creation, cross-referencing, and species-level analysis.
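
As an illustration of what filtering by dominant species could look like, the sketch below keeps the most frequent species until their cumulative share of images reaches a threshold. The helper name `dominant_species` and the cumulative-share interpretation are assumptions made for this example, not the package's documented behavior.

```python
from collections import Counter


def dominant_species(image_counts: dict[str, int], threshold: float) -> list[str]:
    """Return the most frequent species whose cumulative share of images
    first reaches `threshold`. Hypothetical helper, not the package API."""
    total = sum(image_counts.values())
    if total == 0:
        return []
    kept: list[str] = []
    running = 0
    # Walk species from most to least frequent.
    for species, count in Counter(image_counts).most_common():
        if running / total >= threshold:
            break  # the species kept so far already cover the threshold
        kept.append(species)
        running += count
    return kept


# Made-up counts, for demonstration only.
counts = {"Apis mellifera": 60, "Parus major": 30, "Vespa crabro": 8, "Pica pica": 2}
print(dominant_species(counts, threshold=0.9))
```

With these made-up counts, the two most frequent species already account for 90% of the images, so the rarer ones are left out.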

This package was designed with the iNaturalist 2017 dataset in mind. However, it should still help if you want to build a similar iNaturalist-style dataset or your own species classifier.

The project follows the DRY principle and is designed with modularity and pipeline automation in mind.

You can use the CLI to quickly build datasets, or integrate it directly into your own ML pipeline.

Built With

This package is written entirely in Python so that it runs easily on multiple platforms. It uses the following packages to enable its high-level features.

Getting Started

This project helps you build custom fine-tuning datasets from the iNaturalist collection with minimal effort. It supports tasks such as filtering species, copying matched images, generating manifests, and preparing training/validation splits, all with configurable YAML pipelines. Whether you are training a deep learning model or simply exploring biodiversity data, this toolkit gets your dataset in shape.
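
To make the manifest step concrete, here is a minimal sketch of a plain-text manifest built from a species-per-folder layout. The directory layout (`<dataset>/<species>/<image>`), the tab-separated line format, and the helper names `build_manifest` and `write_manifest` are all assumptions for illustration; the package's own manifest format may differ.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def build_manifest(dataset_dir: Path) -> list[tuple[str, str]]:
    """Collect (relative image path, species) pairs from a directory laid out
    as <dataset_dir>/<species>/<image>. The layout is an assumption."""
    entries = []
    for species_dir in sorted(p for p in dataset_dir.iterdir() if p.is_dir()):
        for image in sorted(species_dir.iterdir()):
            if image.suffix.lower() in IMAGE_SUFFIXES:
                rel_path = image.relative_to(dataset_dir).as_posix()
                entries.append((rel_path, species_dir.name))
    return entries


def write_manifest(entries: list[tuple[str, str]], manifest_path: Path) -> None:
    """Write one 'path<TAB>species' line per image (format is an assumption)."""
    lines = [f"{path}\t{species}" for path, species in entries]
    manifest_path.write_text("\n".join(lines) + "\n", encoding="utf-8")


# Demo on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "dataset"
    layout = {"Apis_mellifera": ["a.jpg", "b.jpg"], "Parus_major": ["c.png"]}
    for species, names in layout.items():
        (root / species).mkdir(parents=True)
        for name in names:
            (root / species / name).touch()

    entries = build_manifest(root)
    write_manifest(entries, Path(tmp) / "manifest.txt")
    manifest_text = (Path(tmp) / "manifest.txt").read_text(encoding="utf-8")
    print(manifest_text, end="")
```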

Prerequisites

Python >= 3.11
Git

Installation

pip install dataset-builder-inat

For more details, check out the wiki here

Usage

This package is designed to be used through its high-level Python APIs. The typical workflow is defined in a central Python script such as main.py (see below), which loads a config file and runs multiple dataset preparation stages.

Step 1: Create a YAML config file (config.yaml). You can check out the details in the wiki here.

global:
  included_classes: ["Aves", "Insecta"]
  verbose: false
  overwrite: false

paths:
  src_dataset: "iNaturelist_2017"
  dst_dataset: "haute_garonne"
  web_crawl_output_json: "./output/haute_garonne.json"
  output_dir: "./output"

web_crawl:
  total_pages: 104
  base_url: "https://www.inaturalist.org/check_lists/32961-Haute-Garonne-Check-List?page="
  delay_between_requests: 1

train_val_split:
  train_size: 0.8
  random_state: 42
  dominant_threshold: 0.9
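
As a rough sketch of how the `web_crawl` and `train_val_split` values above might be consumed, the example below builds the list of page URLs and performs a seeded train/validation split. The helper names `page_urls` and `split_images`, the 1-based page numbering, and the shuffle-then-cut strategy are assumptions for illustration. No network requests are made; a real crawler would presumably pause `delay_between_requests` seconds between fetches.

```python
import random

# Values copied from the example config, written as a plain dict.
config = {
    "web_crawl": {
        "total_pages": 104,
        "base_url": "https://www.inaturalist.org/check_lists/32961-Haute-Garonne-Check-List?page=",
        "delay_between_requests": 1,
    },
    "train_val_split": {"train_size": 0.8, "random_state": 42},
}


def page_urls(web_crawl: dict) -> list[str]:
    """Append page numbers 1..total_pages to base_url (1-based numbering assumed)."""
    last_page = web_crawl["total_pages"]
    return [f"{web_crawl['base_url']}{page}" for page in range(1, last_page + 1)]


def split_images(images: list[str], train_size: float, random_state: int):
    """Shuffle with a fixed seed, then cut the list at the train_size fraction."""
    shuffled = sorted(images)  # sort first so the input order does not matter
    random.Random(random_state).shuffle(shuffled)
    cut = int(len(shuffled) * train_size)
    return shuffled[:cut], shuffled[cut:]


urls = page_urls(config["web_crawl"])
print(len(urls), "pages, first:", urls[0])

images = [f"img_{i:02d}.jpg" for i in range(10)]  # made-up file names
train, val = split_images(images, **config["train_val_split"])
print(len(train), "training images,", len(val), "validation images")
```

Because the shuffle is seeded with `random_state`, repeated runs on the same file list produce the same split.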

Step 2: Create main.py, or use the dataset_orchestration.py provided in the release. For more details, you can check out the wiki here.
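
The general shape of such a script could look like the skeleton below: a config mapping is passed through a list of stages in order. This is only a structural sketch. The stage functions `crawl_species_list` and `split_train_val` and the runner `run_pipeline` are hypothetical placeholders defined here, not functions exported by the package; consult the wiki or dataset_orchestration.py for the real API. In an actual script the config would be parsed from config.yaml (for example with PyYAML's `yaml.safe_load`) rather than written inline.

```python
from typing import Callable

Config = dict
Stage = Callable[[Config], None]


def crawl_species_list(config: Config) -> None:
    # Hypothetical placeholder: a real stage would fetch the checklist pages.
    print("would write species list to", config["paths"]["web_crawl_output_json"])


def split_train_val(config: Config) -> None:
    # Hypothetical placeholder: a real stage would copy images into train/val folders.
    print("would split with train_size =", config["train_val_split"]["train_size"])


def run_pipeline(config: Config, stages: list[Stage]) -> list[str]:
    """Run each stage in order and return the names of the stages that ran."""
    completed = []
    for stage in stages:
        if config["global"]["verbose"]:
            print("running", stage.__name__)
        stage(config)
        completed.append(stage.__name__)
    return completed


# Inline subset of the example config, standing in for the parsed YAML file.
config = {
    "global": {"verbose": True, "overwrite": False},
    "paths": {"web_crawl_output_json": "./output/haute_garonne.json"},
    "train_val_split": {"train_size": 0.8, "random_state": 42},
}

completed = run_pipeline(config, [crawl_species_list, split_train_val])
```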

Roadmap

Simplify config.yaml structure: group related options, add environment variable support, introduce profiles (e.g., dev/prod).
Add advanced options to train_val_split: support stratified splitting, per-class balancing, and deterministic sampling for reproducibility.
Auto-generate config.yaml step-by-step from terminal prompts.
Built-in summary report: after the pipeline finishes, output a Markdown or HTML report (species count, splits, coverage, etc.) and show logs after each run.
Add support for exporting all manifests in Parquet format regardless of path format.

Contributing

Contributions are welcome!

If you have suggestions for improvements or spot any issues, feel free to open an issue or submit a pull request.
Please follow the existing project structure and naming conventions when contributing.

To get started:

Fork the repo
Clone your fork locally:
git clone https://github.com/HoangPham6337/iNaturelist_dataset_builder
Create a new branch:
git checkout -b feature/your-feature-name
Make your changes and commit
Push to your fork:
git push origin feature/your-feature-name
Open a Pull Request

For major changes, please open an issue first to discuss what you’d like to change.

License

Distributed under the MIT License. See MIT License for more information.

Contact

Pham Xuan Hoang – LinkedIn – hoangphamat0407@gmail.com

Release history

1.0.8 (Sep 9, 2025)
1.0.5 (Jun 30, 2025)
1.0.4 (May 19, 2025)
1.0.3 (May 7, 2025)
1.0.2 (May 7, 2025, this version)

Download files

Source Distribution

dataset_builder_inat-1.0.2.tar.gz (32.2 kB view details)

Uploaded May 7, 2025 Source

Built Distribution

dataset_builder_inat-1.0.2-py3-none-any.whl (42.4 kB view details)

Uploaded May 7, 2025 Python 3

File details

Details for the file dataset_builder_inat-1.0.2.tar.gz.

File metadata

Download URL: dataset_builder_inat-1.0.2.tar.gz
Upload date: May 7, 2025
Size: 32.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes

Hashes for dataset_builder_inat-1.0.2.tar.gz
Algorithm	Hash digest
SHA256	`24b2fbbe1a54571e74c144c1d69f84cd5d5e3737662f0c90a9f7243f1f9eb4e2`
MD5	`058347b84f79b0552ccbb08f7b5df5a1`
BLAKE2b-256	`7308739611b0998f301278f04c35a8862337b8c7d190d97d13b50cc24b75f74e`

File details

Details for the file dataset_builder_inat-1.0.2-py3-none-any.whl.

File metadata

Download URL: dataset_builder_inat-1.0.2-py3-none-any.whl
Upload date: May 7, 2025
Size: 42.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes

Hashes for dataset_builder_inat-1.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9bca54df4e77a86a3c0bf2156a5749c2aa92f453cbf426e75884ee2ccf70645c`
MD5	`205afa5b6749d7dced81cb2fa5d0591a`
BLAKE2b-256	`0975eba47442a0401fdfef035d590be1eae69be491ef38578b42b1ec29ed8de5`
