A simple method for creating AI models for biodiversity, with a pipeline for collecting and preparing data

B++ repository

This project provides a complete, end-to-end pipeline for building a custom insect classification system. The framework is not tied to a fixed set of taxa: you can train a detection and classification model for any insect species by providing a list of scientific names.

Using the Bplusplus library, this pipeline automates the entire machine learning workflow, from data collection to video inference.

Key Features

  • Automated Data Collection: Downloads hundreds of images for any species from the GBIF database.
  • Intelligent Data Preparation: Uses a pre-trained model to automatically find, crop, and resize insects from raw images, ensuring high-quality training data.
  • Hierarchical Classification: Trains a model to identify insects at three taxonomic levels: family, genus, and species.
  • Video Inference & Tracking: Processes video files to detect, classify, and track individual insects over time, providing aggregated predictions.

Pipeline Overview

The process is broken down into five main steps, all detailed in the full_pipeline.ipynb notebook:

  1. Collect Data: Select your target species and fetch raw insect images from the web.
  2. Prepare Data: Filter, clean, and prepare images for training.
  3. Train Model: Train the hierarchical classification model.
  4. Validate Model: Evaluate the performance of the trained model.
  5. Run Inference: Run the full pipeline on a video file for real-world application.

How to Use

Prerequisites

  • Python 3.10+

Setup

  1. Create and activate a virtual environment:

    python3 -m venv venv
    source venv/bin/activate
    
  2. Install the required packages:

    pip install bplusplus
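
Before running the pipeline, it can help to confirm that the interpreter meets the Python 3.10+ prerequisite and that the package is visible in the active environment. The sketch below uses only the standard library; the helper names python_ok and installed_version are illustrative and are not part of bplusplus.

```python
import sys
from importlib import metadata


def python_ok(minimum=(3, 10)):
    """Return True if the running interpreter is at least `minimum`."""
    return sys.version_info[:2] >= minimum


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python 3.10+:", python_ok())
    print("bplusplus version:", installed_version("bplusplus"))
```

If installed_version("bplusplus") returns None, the virtual environment is probably not activated or the install step did not complete.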
    

Running the Pipeline

The pipeline can be run step-by-step using the functions from the bplusplus library. While the full_pipeline.ipynb notebook provides a complete, executable workflow, the core functions are described below.

Step 1: Collect Data

Download images for your target species from the GBIF database. You'll need to provide a list of scientific names.

import bplusplus
from pathlib import Path

# Define species and directories
names = ["Vespa crabro", "Vespula vulgaris", "Dolichovespula media"]
GBIF_DATA_DIR = Path("./GBIF_data")

# Define search parameters
search = {"scientificName": names}

# Run collection
bplusplus.collect(
    group_by_key=bplusplus.Group.scientificName,
    search_parameters=search,
    images_per_group=200,  # Recommended to download more than needed
    output_directory=GBIF_DATA_DIR,
    num_threads=5
)
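
After collection, a per-species image count is a quick way to see whether enough images were downloaded. The sketch below assumes that collect writes one subfolder per group (here, per scientific name) under the output directory; that layout is an assumption, and count_images is a hypothetical helper, not a bplusplus function.

```python
from pathlib import Path

# Extensions listed as supported by the preparation step.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images(root):
    """Map each immediate subfolder of `root` to its number of image files."""
    counts = {}
    for folder in sorted(Path(root).iterdir()):
        if folder.is_dir():
            counts[folder.name] = sum(
                1
                for f in folder.rglob("*")
                if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts
```

Calling count_images("./GBIF_data") would return a dictionary keyed by folder name. A species that falls well below images_per_group may simply have few records with images in GBIF.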

Step 2: Prepare Data

Process the raw images to extract, crop, and resize insects. This step uses a pre-trained model to ensure only high-quality images are used for training.

PREPARED_DATA_DIR = Path("./prepared_data")

bplusplus.prepare(
    input_directory=GBIF_DATA_DIR,
    output_directory=PREPARED_DATA_DIR,
    img_size=640,        # Target image size for training
    conf=0.6,            # Detection confidence threshold (0-1)
    valid=0.1,           # Validation split ratio (0-1), set to 0 for no validation
    blur=None,           # Gaussian blur as fraction of image size (0-1), None = disabled
)

Note: The blur parameter applies Gaussian blur before resizing, which can help reduce noise. Values are relative to image size (e.g., blur=0.01 means 1% of the smallest dimension). Supported image formats: JPG, JPEG, and PNG.
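
To get a feel for the scale of a relative blur value, the sketch below converts it to pixels for a given image size. It illustrates the arithmetic only: whether bplusplus interprets the result as a kernel size or as a sigma is not stated here, and both function names are hypothetical.

```python
def blur_in_pixels(width, height, blur):
    """Convert a relative blur value to pixels of the smallest image dimension."""
    if blur is None:
        return 0.0  # blur disabled
    if not 0 <= blur <= 1:
        raise ValueError("blur must be between 0 and 1")
    return blur * min(width, height)


def odd_kernel_size(pixels):
    """Round to an integer >= 1 and bump even values to the next odd number."""
    size = max(1, round(pixels))
    return size if size % 2 == 1 else size + 1


if __name__ == "__main__":
    px = blur_in_pixels(1920, 1080, 0.01)
    print(px, odd_kernel_size(px))
```

For a 1920×1080 image, blur=0.01 corresponds to roughly 10.8 px, which would round to an 11 px kernel under the odd-size convention used by common Gaussian blur implementations.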

Step 3: Train Model

Train the hierarchical classification model on your prepared data. The model learns to identify family, genus, and species.

TRAINED_MODEL_DIR = Path("./trained_model")

bplusplus.train(
    batch_size=4,
    epochs=30,
    patience=3,
    img_size=640,
    data_dir=PREPARED_DATA_DIR,
    output_dir=TRAINED_MODEL_DIR,
    species_list=names,
    backbone="resnet50",  # Choose: "resnet18", "resnet50", or "resnet101"
    # num_workers=0,      # Optional: force single-process loading (most stable)
    # train_transforms=custom_transforms,  # Optional: custom torchvision transforms
)

Note: The num_workers parameter controls DataLoader multiprocessing (defaults to 0 for stability). The backbone parameter selects the ResNet architecture: resnet18 trains faster, while resnet101 may give better accuracy.
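
To make the three label levels concrete, the sketch below derives them for the example species. The genus is the first word of a binomial name; the family cannot be read from the name and has to come from a taxonomic source. The lookup table here is hand-written for illustration, taxonomy_labels is a hypothetical helper, and how bplusplus resolves taxonomy internally is not shown.

```python
# Hand-written family lookup for illustration only; all three example
# species are social wasps in the family Vespidae.
FAMILY_BY_GENUS = {
    "Vespa": "Vespidae",
    "Vespula": "Vespidae",
    "Dolichovespula": "Vespidae",
}


def taxonomy_labels(scientific_name):
    """Return (family, genus, species) for a binomial scientific name."""
    genus = scientific_name.split()[0]
    family = FAMILY_BY_GENUS.get(genus, "unknown")
    return family, genus, scientific_name


if __name__ == "__main__":
    for name in ["Vespa crabro", "Vespula vulgaris", "Dolichovespula media"]:
        print(taxonomy_labels(name))
```

With these three species the family level has a single class, so the genus and species levels carry all of the discriminative signal.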

Step 4: Validate Model

Evaluate the trained model on a held-out validation set. This calculates precision, recall, and F1-score at all taxonomic levels.

HIERARCHICAL_MODEL_PATH = TRAINED_MODEL_DIR / "best_multitask.pt"

results = bplusplus.validate(
    species_list=names,
    validation_dir=PREPARED_DATA_DIR / "valid",
    hierarchical_weights=HIERARCHICAL_MODEL_PATH,
    img_size=640,           # Must match training
    batch_size=32,
    backbone="resnet50",    # Must match training
)
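
For reference, the three metrics can be computed per class from true and predicted labels as in the sketch below. This is a generic illustration of the standard definitions; how validate aggregates the scores across classes and taxonomic levels is not specified here, and precision_recall_f1 is a hypothetical helper.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Compute precision, recall, and F1 for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


if __name__ == "__main__":
    y_true = ["Vespa", "Vespa", "Vespula", "Vespula"]
    y_pred = ["Vespa", "Vespula", "Vespula", "Vespula"]
    print(precision_recall_f1(y_true, y_pred, "Vespula"))
```

In this toy example, Vespula has two true positives, one false positive, and no false negatives, giving a precision of 2/3, a recall of 1, and an F1 of 0.8.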

Step 5: Run Inference on Video

Process a video through a multi-phase pipeline: motion-based detection (GMM), Hungarian tracking, path topology confirmation, and hierarchical classification. Detection and tracking are powered by BugSpot, a lightweight core that runs on any platform, including edge devices.

The species list is automatically loaded from the model checkpoint.

HIERARCHICAL_MODEL_PATH = TRAINED_MODEL_DIR / "best_multitask.pt"

results = bplusplus.inference(
    video_path="my_video.mp4",
    output_dir="./output",
    hierarchical_model_path=HIERARCHICAL_MODEL_PATH,
    backbone="resnet50",        # Must match training
    img_size=640,               # Must match training
    # --- Optional ---
    # species_list=names,       # Override species from checkpoint
    # fps=None,                 # None = all frames, or set target FPS
    # config="config.yaml",     # Custom detection parameters (YAML/JSON)
    # classify=False,           # Detection only, NaN for classification
    # save_video=True,          # Annotated + debug videos
    # crops=False,              # Save crop per detection per track
    # track_composites=False,   # Composite image per track (temporal trail)
)

print(f"Confirmed: {results['confirmed_tracks']} / {results['tracks']} tracks")

Output files:

| File | Description | Flag |
|---|---|---|
| {video}_results.csv | Aggregated results per confirmed track | Always |
| {video}_detections.csv | Frame-by-frame detections | Always |
| {video}_annotated.mp4 | Video with detection boxes and paths | save_video=True |
| {video}_debug.mp4 | Side-by-side with GMM motion mask | save_video=True |
| {video}_crops/ | Crop images per track | crops=True |
| {video}_composites/ | Composite images per track | track_composites=True |
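
The naming scheme for the output files can be sketched as a small helper that lists the paths to expect for a given set of flags. It assumes that {video} stands for the video file name without its extension; that reading, the default flag values, and the helper name expected_outputs are illustrative assumptions rather than documented behavior.

```python
from pathlib import Path


def expected_outputs(video_path, output_dir, save_video=False,
                     crops=False, track_composites=False):
    """List the output paths implied by the naming scheme for the given flags."""
    stem = Path(video_path).stem  # assumption: {video} is the name without extension
    out = Path(output_dir)
    # The two CSV files are always written.
    paths = [out / f"{stem}_results.csv", out / f"{stem}_detections.csv"]
    if save_video:
        paths += [out / f"{stem}_annotated.mp4", out / f"{stem}_debug.mp4"]
    if crops:
        paths.append(out / f"{stem}_crops")
    if track_composites:
        paths.append(out / f"{stem}_composites")
    return paths


if __name__ == "__main__":
    for path in expected_outputs("my_video.mp4", "./output", save_video=True):
        print(path)
```

A check such as all(p.exists() for p in expected_outputs(...)) after inference is one way to confirm that a run finished and wrote everything it was asked to.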

Detection configuration can be customized via a YAML/JSON file passed as config=. Download a template from the releases page.
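
As a rough illustration, a JSON config could be generated from Python as shown below. The sketch assumes the file is a flat mapping from parameter names (as listed in the configuration table) to values, and that omitted keys fall back to their defaults; both points are assumptions, so the template from the releases page remains the authoritative reference for the layout. The override values and the helper write_config are hypothetical.

```python
import json
from pathlib import Path

# Illustrative overrides only; parameter names follow the configuration table.
overrides = {
    "gmm_history": 300,
    "min_area": 0.0005,
    "max_lost_frames": 30,
}


def write_config(path, params):
    """Write detection parameters to a JSON file and return its path."""
    path = Path(path)
    path.write_text(json.dumps(params, indent=2))
    return path


if __name__ == "__main__":
    print(write_config("config.json", overrides).read_text())
```

The resulting path would then be passed to inference as config="config.json".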

Resolution-independent units. Scene-scale pixel parameters are fractions of image dimensions, not absolute pixels, so one config works across resolutions. Lengths are fractions of the image width W; areas are fractions of W * H. They are resolved to absolute pixels at runtime once the frame size is known. The 1080 px wide column shows the resolved value for a 1080×1080 frame for intuition. morph_kernel_size is an exception — it stays in absolute NxN pixels since it targets sensor-level noise.
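
The conversion described above can be sketched as follows: lengths are multiplied by the image width and areas by width times height. The grouping of parameters into lengths and areas follows the descriptions in the configuration table; the function resolve_to_pixels is a hypothetical illustration and not BugSpot's actual code.

```python
LENGTH_PARAMS = {"min_displacement", "max_frame_jump", "revisit_radius"}
AREA_PARAMS = {"min_area", "max_area"}


def resolve_to_pixels(params, width, height):
    """Convert fractional scene-scale parameters to absolute pixel units."""
    resolved = {}
    for name, value in params.items():
        if name in LENGTH_PARAMS:
            resolved[name] = value * width           # fraction of image width
        elif name in AREA_PARAMS:
            resolved[name] = value * width * height  # fraction of image area
        else:
            resolved[name] = value                   # unitless or already absolute
    return resolved


defaults = {
    "min_area": 0.0002,
    "max_area": 0.035,
    "min_displacement": 0.05,
    "max_frame_jump": 0.1,
    "revisit_radius": 0.05,
    "morph_kernel_size": 3,
}

if __name__ == "__main__":
    print(resolve_to_pixels(defaults, 1080, 1080))
```

For a 1080×1080 frame this gives about 233 px² and 40 824 px² for the area limits and 54 px and 108 px for the length limits, matching the "1080 px wide" column, while morph_kernel_size passes through unchanged.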

Full Configuration Parameters

| Parameter | Default | 1080 px wide | Description |
|---|---|---|---|
| GMM Background Subtractor | | | Motion detection model |
| gmm_history | 500 | | Frames to build background model |
| gmm_var_threshold | 16 | | Variance threshold for foreground detection |
| Morphological Filtering | | | Noise removal |
| morph_kernel_size | 3 | 3 | Kernel size (NxN), absolute pixels |
| Cohesiveness | | | Filters scattered motion (plants) vs compact motion (insects) |
| min_largest_blob_ratio | 0.80 | | Min ratio of largest blob to total motion |
| max_num_blobs | 5 | | Max separate blobs allowed in detection |
| min_motion_ratio | 0.15 | | Min ratio of motion pixels to bbox area |
| Shape | | | Filters by contour properties |
| min_area | 0.0002 | ~233 px² | Min detection area, fraction of image area |
| max_area | 0.035 | ~40 824 px² | Max detection area, fraction of image area |
| min_density | 3.0 | | Min area/perimeter ratio (unitless) |
| min_solidity | 0.55 | | Min convex hull fill ratio |
| Tracking | | | Controls track behavior |
| min_displacement | 0.05 | 54 px | Min net movement for confirmation, fraction of image width |
| min_path_points | 10 | | Min points before path analysis |
| max_frame_jump | 0.1 | 108 px | Max jump between frames, fraction of image width |
| max_lost_frames | 45 | | Frames before lost track deleted (e.g., 45 @ 30fps = 1.5s) |
| max_area_change_ratio | 3.0 | | Max area change ratio between frames |
| Tracker Matching | | | Hungarian algorithm cost function |
| tracker_w_dist | 0.6 | | Weight for distance cost (0-1) |
| tracker_w_area | 0.4 | | Weight for area cost (0-1) |
| tracker_cost_threshold | 0.3 | | Max cost for valid match (0-1) |
| Path Topology | | | Confirms insect-like movement patterns |
| max_revisit_ratio | 0.30 | | Max ratio of revisited positions |
| min_progression_ratio | 0.70 | | Min forward progression |
| max_directional_variance | 0.90 | | Max heading variance |
| revisit_radius | 0.05 | 54 px | Revisit radius, fraction of image width |
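
To illustrate how min_displacement and min_path_points might interact, the sketch below checks whether a track has moved far enough to be confirmed. It takes "net movement" to mean the straight-line distance between the first and last points of the path, which is an assumption; BugSpot's actual confirmation logic is not shown here, and has_enough_displacement is a hypothetical helper.

```python
import math


def has_enough_displacement(path, image_width,
                            min_displacement=0.05, min_path_points=10):
    """Check whether a track's net movement reaches the configured fraction of width."""
    if len(path) < min_path_points:
        return False  # too few points to analyse the path
    (x0, y0), (x1, y1) = path[0], path[-1]
    net = math.hypot(x1 - x0, y1 - y0)  # straight-line start-to-end distance
    return net >= min_displacement * image_width


if __name__ == "__main__":
    walking = [(i * 10.0, 0.0) for i in range(10)]            # moves 90 px in total
    swaying = [(100.0 + (i % 2) * 2, 50.0) for i in range(10)]  # jitters by 2 px
    print(has_enough_displacement(walking, 1080))
    print(has_enough_displacement(swaying, 1080))
```

On a 1080 px wide frame the threshold resolves to 54 px, so a track that travels 90 px passes while one that only jitters by a couple of pixels, such as a leaf moving in the wind, does not.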

Customization

To train the model on your own set of insect species, you only need to change the names list in Step 1. The pipeline will automatically handle the rest.

# To use your own species, change the names in this list
names = [
    "Vespa crabro",
    "Vespula vulgaris",
    "Dolichovespula media",
    # Add your species here
]

Handling an "Unknown" Class

To train a model that can recognize an "unknown" class for insects that don't belong to your target species, add "unknown" to your species_list. You must also provide a corresponding unknown folder containing images of various other insects in your data directories (e.g., prepared_data/train/unknown).

# Example with an unknown class
names_with_unknown = [
    "Vespa crabro",
    "Vespula vulgaris",
    "unknown"
]
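
Before training with an "unknown" class, it is worth checking that every entry in the species list has a matching folder. The sketch below assumes each class folder is named exactly like its entry in species_list, following the prepared_data/train/unknown example above; for the species folders that naming is an assumption, and missing_class_folders is a hypothetical helper.

```python
from pathlib import Path


def missing_class_folders(data_dir, species_list, splits=("train",)):
    """Return (split, class) pairs whose folder is missing under data_dir."""
    missing = []
    for split in splits:
        for name in species_list:
            if not (Path(data_dir) / split / name).is_dir():
                missing.append((split, name))
    return missing
```

For example, missing_class_folders("./prepared_data", names_with_unknown, splits=("train", "valid")) would list any class that still needs images in either split.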

Directory Structure

The pipeline will create the following directories to store artifacts:

  • GBIF_data/: Stores the raw images downloaded from GBIF.
  • prepared_data/: Contains the cleaned, cropped, and resized images ready for training (train/ and optionally valid/ subdirectories).
  • trained_model/: Saves the trained model weights (best_multitask.pt).
  • output/: Inference results including annotated videos and CSV files.

Citation

All information in this GitHub repository is available under the MIT license, as long as credit is given to the authors.

Venverloo, T., Duarte, F. B++: Towards Real-Time Monitoring of Insect Species. MIT Senseable City Laboratory, AMS Institute.
