CLI tool for creating hyperspectral image datasets for machine learning.

SpectralDatamaker

Python CLI tool designed to facilitate the creation of datasets with hyperspectral images for machine learning.

The dataset structure is organized as follows:

dataset_root/
├── images
│   ├── DATASET-01_image-name_0
│   ├── DATASET-01_image-name_1
│   ├── DATASET-01_image-name_2
│   └── DATASET-01_image-name_3
├── masks
│   ├── RoiMASK_image-name.csv
│   ├── PxMASK_image-name.npy
│   ├── DATASET-01_image-name_0
│   ├── DATASET-01_image-name_1
│   ├── DATASET-01_image-name_2
│   └── DATASET-01_image-name_3
├── source
│   ├── image-name.hdr
│   └── image-name.raw
├── metadata.json
└── splits.csv

This tool provides functionalities for processing the source images, generating region of interest (ROI) masks, pixel masks, labels, and cropping the images based on the generated masks.
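The naming pattern visible in the tree above can be sketched with pathlib. This is an illustrative helper, not part of the package: `expected_paths`, the `prefix` argument (e.g. `DATASET-01`) and the `.npy` extension for crops are assumptions, since the tree does not show how the tool derives them.

```python
from pathlib import Path


def expected_paths(root, source_image, prefix, num_crops, ext=".npy"):
    """Return the paths the layout above implies for one source image.

    `prefix` and `ext` are assumptions of this sketch; the tool may
    derive them differently.
    """
    root = Path(root)
    stem = Path(source_image).stem  # "image-name.hdr" -> "image-name"
    crop_names = [f"{prefix}_{stem}_{i}{ext}" for i in range(num_crops)]
    return {
        "roi_mask": root / "masks" / f"RoiMASK_{stem}.csv",
        "px_mask": root / "masks" / f"PxMASK_{stem}.npy",
        "image_crops": [root / "images" / name for name in crop_names],
        "mask_crops": [root / "masks" / name for name in crop_names],
        "metadata": root / "metadata.json",
        "splits": root / "splits.csv",
    }


paths = expected_paths("dataset_root", "image-name.hdr", "DATASET-01", 4)
print(paths["roi_mask"])
print(paths["image_crops"][0])
```

Cropped images and cropped masks share the same file names and differ only in their parent directory, which is why the sketch builds both lists from one set of names.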

CLI Usage

After installing the package, you can use the console command:

spectral-datamaker --help

You can also invoke the package module directly:

python -m spectral_datamaker --help

The CLI provides the following commands:

Create a complete dataset:

spectral-datamaker create <config.yaml> <output_directory>

Options:

  • --dry-run: Validate configuration without executing
  • --skip-validation: Skip final dataset validation
  • --no-interactive: Skip interactive mask adjustment (not yet implemented)

The pipeline executes these steps sequentially: structure → roi-mask → pixel-mask → crop → metadata → splits. The splits step is skipped automatically unless dev_split is set in the segmentation configuration.

Validate an existing dataset:

spectral-datamaker validate <dataset_directory>

Options:

  • --config <file>: Validate against a specific configuration file

Inspect dataset metadata:

spectral-datamaker inspect <dataset_directory>

Options:

  • --format [json|yaml|table]: Output format (default: table)
  • --show-images: List all processed images

Execute individual pipeline steps:

spectral-datamaker step <step_name> <config.yaml> <dataset_directory>

Available steps: structure, roi-mask, pixel-mask, crop, metadata, splits

Options:

  • --force: Overwrite existing files (passed to the pipeline context for steps to use)

Compose a new dataset from existing ones:

spectral-datamaker compose <compose.yaml> <output_directory>

Options:

  • --dry-run: Validate configuration without copying files

Generate splits for an existing dataset:

spectral-datamaker splits <dataset_directory> --dev-split 0.2

Generates a splits.csv file with balanced dev/test assignments without requiring a config file. Useful for re-splitting already-built datasets.

Capture mask positions for reuse:

spectral-datamaker set-mask-position <config.yaml>

Opens Napari on the first source image so you can position masks. When you close the viewer, positions are saved as a masking-defaults section in the config.

Options:

  • --output <path>: Write to a different file (don't overwrite the original)
  • --image <n>: Use source image at index n as reference (default: 0)
  • --force: Overwrite an existing masking-defaults section

splits.csv

splits.csv is the dataset-level file that records how the cropped images are divided between dev and test. It is created in the root of the dataset, next to metadata.json, and is produced by the splits postprocess step or by the standalone spectral-datamaker splits command.

How it is generated:

  • SpectralDatamaker reads the class assignments from the dataset metadata.
  • For each class, it shuffles the available cropped images.
  • It assigns the first int(len(class_images) * dev_split) images to dev and the rest to test.
  • The file is then written as a CSV with one row per cropped image.
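The steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration of the described procedure, not the package's own code: the function name `sketch_balanced_splits` and the `seed` parameter are assumptions added here for demonstration.

```python
import random


def sketch_balanced_splits(assignments, dev_split, seed=None):
    """Split images per class into dev/test.

    assignments: {class_name: [image_path, ...]}
    Returns a list of (image, class, split) rows.
    `seed` is an assumption of this sketch, used only to make runs repeatable.
    """
    rng = random.Random(seed)
    rows = []
    for class_name, images in assignments.items():
        shuffled = list(images)          # do not mutate the caller's list
        rng.shuffle(shuffled)            # shuffle within the class
        n_dev = int(len(shuffled) * dev_split)
        for i, image in enumerate(shuffled):
            split = "dev" if i < n_dev else "test"
            rows.append((image, class_name, split))
    return rows


demo = {
    "type_A": [f"a_{i}.npy" for i in range(5)],
    "type_B": [f"b_{i}.npy" for i in range(10)],
}
for row in sketch_balanced_splits(demo, dev_split=0.2, seed=0):
    print(row)
```

Because `int()` truncates, a class with fewer than `1 / dev_split` images contributes no images to dev. With `dev_split=0.2`, for example, a class needs at least five images to place one in dev.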

Columns:

  • image: absolute path to the cropped image file.
  • class: class assigned to that cropped image.
  • split: either dev or test.

Example output:

image,class,split
/home/user/datasets/demo/images/DEMO_img_001_0.npy,type_A,dev
/home/user/datasets/demo/images/DEMO_img_001_1.npy,type_A,test
/home/user/datasets/demo/images/DEMO_img_002_0.npy,type_B,dev
/home/user/datasets/demo/images/DEMO_img_002_1.npy,type_B,test

Because the images of each class are shuffled before partitioning, repeated runs can assign the same image to different splits. To get reproducible splits you need to control the randomness yourself, outside the current implementation.
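The three-column layout described above can be read back with the standard csv module. The sketch below groups image paths by split and class. `load_splits` is a hypothetical helper, and the sample text simply repeats the example output from this section.

```python
import csv
import io
from collections import defaultdict

# Same rows as the example output in this section.
SAMPLE = """image,class,split
/home/user/datasets/demo/images/DEMO_img_001_0.npy,type_A,dev
/home/user/datasets/demo/images/DEMO_img_001_1.npy,type_A,test
/home/user/datasets/demo/images/DEMO_img_002_0.npy,type_B,dev
/home/user/datasets/demo/images/DEMO_img_002_1.npy,type_B,test
"""


def load_splits(fileobj):
    """Return {split: {class: [image_path, ...]}} from a splits.csv stream."""
    groups = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(fileobj):
        groups[row["split"]][row["class"]].append(row["image"])
    return groups


splits = load_splits(io.StringIO(SAMPLE))
for split_name, by_class in splits.items():
    for class_name, images in by_class.items():
        print(split_name, class_name, len(images))
```

For a file on disk, pass `open(path, newline="")` instead of the in-memory stream.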

Library usage (Python API)

Besides the CLI, SpectralDatamaker can be used as a Python library. The most useful exports are:

  • load_dataset_config(path): Load a dataset configuration YAML file. Returns DatasetConfig.
  • load_compose_config(path): Load a compose configuration YAML file. Returns ComposeConfig.
  • DatasetStructure: Infers canonical dataset locations (images/, masks/, source/, metadata.json) from a root directory.
  • Filenames: Derives expected filenames and absolute paths for masks, labels, cropped outputs, and metadata.
  • DatasetValidator: Validates an existing dataset either from a config file or from metadata.json.
  • DatasetManager: Provides methods for retrieving dataset information, listing processed images, and accessing metadata details.
  • ComposeConfig / SourceSelection: Dataclasses for compose configuration.
  • ComposeProcessor: Builds a composed dataset programmatically from a ComposeConfig.
  • SplitsStep: Pipeline step that generates balanced dev/test splits (splits.csv).
  • generate_balanced_splits / write_splits_csv: Low-level split utilities.
  • resolve_from_config_dir: Resolves relative paths against a config file directory.

Example usage:

from spectral_datamaker.config import load_dataset_config, DatasetStructure, Filenames
from spectral_datamaker.dataset import DatasetValidator, DatasetManager

# 1) Load configuration
config = load_dataset_config("/path/to/dataset.yaml")
print(config.name, config.segmentation_config.classes)

# 2) Infer dataset structure from root directory
structure = DatasetStructure("/path/to/dataset_root")
print(structure.images_dir)
print(structure.metadata_file)

# 3) Derive expected file paths and names
names = Filenames(structure)
print(names.get_roi_mask("image_1.hdr", abs=True))
print(names.get_px_mask("image_1.hdr", abs=True))
print(names.get_dataset_metadata(abs=True))

# 4) Validate dataset contents
validator = DatasetValidator(structure)
validator.validate_dataset_from_config("/path/to/dataset.yaml")
# Or, if metadata already exists:
# validator.validate_dataset_from_metadata()

# 5) Work with dataset metadata
manager = DatasetManager(structure)
assignments = manager.get_dataset_assignments(abs=True)
for class_name, image_paths in assignments.items():
    print(f"{class_name}: {len(image_paths)} images")
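The export list above mentions resolve_from_config_dir, and the configuration sections state the rule it serves: absolute paths are used as-is, and relative paths are resolved against the directory that contains the YAML file. The standalone function below sketches that rule with pathlib. It is not the library function: the name `resolve_against_config` and the normalization via `resolve()` are assumptions of this example.

```python
from pathlib import Path


def resolve_against_config(config_path, value):
    """Resolve `value` the way the path rules describe.

    Absolute paths pass through unchanged; relative paths are joined to
    the directory of the config file. Normalizing ".." with resolve() is
    a choice of this sketch.
    """
    path = Path(value)
    if path.is_absolute():
        return path
    return (Path(config_path).parent / path).resolve()


config_file = Path.cwd() / "configs" / "dataset.yaml"
print(resolve_against_config(config_file, "../images/source/image_1.hdr"))
```

Resolving against the config file rather than the current working directory means the same YAML file behaves identically no matter where the command is launched from.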

Dataset config file

The dataset configuration file (e.g., dataset.yaml) contains the necessary information for creating a dataset from ENVI images. See docs/config.md for a complete reference of every field, inheritance rules, and execution modes.

Path resolution rules:

  • Absolute paths are used as-is.
  • Relative paths in source-images[].path are resolved relative to the directory that contains the YAML file.

Example configuration:

dataset:
  name: dataset-example
  description: An example dataset created with SpectralDatamaker.

  default-class: type_A         # optional — inherited by images without their own class

  masking-defaults:            # optional — inherited by images without their own masking
    shape: circle
    size: 35
    num: 6
    positions: "120,340;210,145;310,440;180,250"   # set by set-mask-position

  source-images:
    - path: ../images/source/image_1.hdr
      # no masking, no class → inherits all from defaults

    - path: ../images/source/image_2.hdr
      masking:                 # overrides specific masking fields
        size: 20
        num: 4
        positions: "500,200;600,300;700,400;800,500"

    - path: ../images/source/image_n.hdr
      class: type_B            # overrides default-class for this image
      masking:                 # no positions → user positions masks manually
        shape: triangle
        size: 50
        num: 2

  segmentation:
    enabled: true
    classes:
      - type_A
      - type_B
    dev_split: 0.2   # optional — ratio for dev/test splits (e.g. 0.2 = 20% dev, 80% test)

  classification:
    enabled: false

Segmentation mode

When segmentation mode is enabled, SpectralDatamaker will generate a dataset with segmentation masks for each source image. The pipeline runs through the following steps:

  1. structure: Creates the dataset directory layout (images/, masks/, source/).
  2. roi-mask: Creates ROI masks based on the specified shape, size, and number of regions. A napari viewer is launched to allow interactive adjustment. If positions are provided (via masking-defaults or per-image), masks appear pre-positioned.
  3. pixel-mask: Generates pixel masks from the ROI masks, labeling each region with the corresponding class. If the image has a class (or default-class is set), labeling is done automatically — no manual prompts. Combine with --no-interactive and positions for a fully automatic pipeline.
  4. crop: Crops the source images based on the generated masks and saves the cropped images and masks.
  5. metadata: Generates metadata.json with dataset information and processing details.
  6. splits (optional): If dev_split is set in the segmentation config, generates a splits.csv file with balanced dev/test assignments per class.
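The step order and the conditional splits step can be summarized in a few lines. This is only a model of the behaviour described above. `planned_steps` is a hypothetical name, and the real pipeline may decide this differently internally.

```python
# Step names as listed for the `step` command.
STEPS = ["structure", "roi-mask", "pixel-mask", "crop", "metadata", "splits"]


def planned_steps(dev_split=None):
    """Return the steps that would run; splits only when dev_split is set."""
    return [step for step in STEPS if step != "splits" or dev_split is not None]


print(planned_steps())               # no dev_split in the config
print(planned_steps(dev_split=0.2))  # dev_split: 0.2
```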

Classification mode

[!NOTE] Classification mode is currently in development and is not yet available for use. The following description is based on the intended functionality.

When classification mode is enabled, SpectralDatamaker will generate a dataset with class labels for each source image. The steps are as follows:

  1. Creates ROI masks based on the specified shape, size, and number of regions in the configuration file. A napari viewer is launched to allow the user to adjust the generated masks if necessary. Masks are saved when the user closes the viewer.
  2. Asks the user to label each ROI with the corresponding class from the configuration file. Saves the class labels in a CSV file.
  3. Crops the source images based on the generated masks and saves the cropped images in the appropriate directories.

Compose mode

The compose command builds a new dataset by selecting ROI crops from one or more already-processed datasets, without re-annotating anything. It reads the metadata of the source datasets to locate the crops, copies them to the new dataset, remaps the class labels according to the new class list, and generates the metadata.json of the composed dataset.

Compose config file

Path resolution rules:

  • Absolute paths are used as-is.
  • Relative paths in sources[].dataset are resolved relative to the directory that contains the compose YAML file.

Example compose configuration:

compose:
  name: composed-dataset
  description: Dataset composed from multiple source datasets.
  classes:
    - type_A
    - type_B
    - type_C

  sources:
    - dataset: ../datasets/source_dataset_1
      class: type_A

    - dataset: ../datasets/source_dataset_1
      class: type_B
      num: 4        # optional — limit to 4 crops; omit to use all available

    - dataset: ../datasets/source_dataset_2
      class: type_C

Fields:

  • classes: defines the label mapping of the output dataset (classes[0] → label 1, classes[1] → label 2, etc.).
  • sources: each entry selects the ROI crops of a given class from a source dataset, optionally limited to num crops. The source dataset must have been created with spectral-datamaker create and must contain metadata.json.
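The label mapping implied by classes can be written out explicitly. The sketch below builds a label map in the same shape as the label_map field of metadata.json (label 0 reserved for background) and derives a table for translating labels from a source dataset into the composed one. Both function names are hypothetical, and the remapping is only an illustration of the idea, not the package's implementation.

```python
def build_label_map(classes):
    """classes[0] -> label 1, classes[1] -> label 2, ...; 0 is background."""
    label_map = {"0": "background"}
    for index, name in enumerate(classes, start=1):
        label_map[str(index)] = name
    return label_map


def remap_table(source_label_map, target_classes):
    """Map integer labels of a source dataset to labels of the new class list."""
    target = {name: i for i, name in enumerate(target_classes, start=1)}
    table = {0: 0}  # background stays background
    for label, name in source_label_map.items():
        if name in target:
            table[int(label)] = target[name]
    return table


new_classes = ["type_A", "type_B", "type_C"]
print(build_label_map(new_classes))

source_map = {"0": "background", "1": "type_B", "2": "type_A"}
print(remap_table(source_map, new_classes))
```

With three classes the label map has four entries, which matches num_classes in the composed metadata example.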

Composed dataset structure

The output directory follows the same structure as a regular dataset:

output_dir/
├── images/
│   ├── COMPOSED_imageA_type_A_0.npy
│   ├── COMPOSED_imageA_type_A_1.npy
│   └── COMPOSED_imageB_type_C_0.npy
├── masks/
│   ├── COMPOSED_imageA_type_A_0.npy
│   ├── COMPOSED_imageA_type_A_1.npy
│   └── COMPOSED_imageB_type_C_0.npy
├── source/
└── metadata.json

Crops from different source images and varieties are grouped into virtual source image keys of the form <source_image>_<variety>, where the variety is the class selected in the sources entry. Within each group, crops are indexed sequentially from 0.
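The grouping and indexing rule can be sketched as follows. `index_crops` is a hypothetical helper, and the input is simply a list of (source image, variety) pairs in selection order. How the tool orders crops internally is not specified here.

```python
from collections import defaultdict


def index_crops(crops):
    """Assign each crop a (group_key, index) pair.

    crops: iterable of (source_image, variety) in selection order.
    Indices restart at 0 for every <source_image>_<variety> group.
    """
    counters = defaultdict(int)
    indexed = []
    for source_image, variety in crops:
        key = f"{source_image}_{variety}"
        indexed.append((key, counters[key]))
        counters[key] += 1
    return indexed


selected = [("imageA", "type_A"), ("imageA", "type_A"), ("imageB", "type_C")]
for key, index in index_crops(selected):
    print(f"{key}_{index}")
```

The printed names correspond to the file names in the tree above without the dataset prefix and extension.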

Dataset metadata

SpectralDatamaker generates a metadata.json file containing information about the dataset, including the dataset name, description, source images, and the processing steps applied to each image. SpectralDatamaker reads this file back to validate the dataset structure and contents. An example of the metadata.json structure is as follows:

{
    "name": "dataset-03",
    "description": "Dataset created with one hyperspectral image.",
    "last_update": "2026-04-08 13:52:01",
    "source_images": ["/path/to/image_1.hdr"],
    "types": ["segmentation"],
    "segmentation_masking": {
        "image_1": {
            "label_map": {"0": "background", "1": "type_A", "2": "type_B"},
            "num_classes": 3,
            "classes": ["type_A", "type_B"],
            "assignments": {
                "type_A": [0,2,3],
                "type_B": [1,5,4]
            },
            "source_image": "image_1.hdr",
            "source_dataset": "",
            "rois_file": "RoiMASK_image_1.csv",
            "mask_file": "PxMASK_image_1.npy",
            "created": "2026-04-08T13:51:33.931524",
            "format": "npy"
        }
    }
}

For composed datasets, source_images contains virtual group keys (one per source image × variety combination) and each segmentation_masking entry includes a source_dataset field pointing to the origin dataset:

{
    "name": "composed-dataset",
    "description": "Dataset composed from multiple source datasets.",
    "last_update": "2026-05-18 10:00:00",
    "source_images": ["image_1_type_A", "image_2_type_C"],
    "types": ["segmentation"],
    "segmentation_masking": {
        "image_1_type_A": {
            "label_map": {"0": "background", "1": "type_A", "2": "type_B", "3": "type_C"},
            "num_classes": 4,
            "classes": ["type_A", "type_B", "type_C"],
            "assignments": {
                "type_A": [0, 1, 2],
                "type_B": [],
                "type_C": []
            },
            "source_image": "image_1_type_A",
            "source_dataset": "/path/to/source_dataset_1",
            "rois_file": "",
            "mask_file": "",
            "created": "2026-05-18T10:00:00.000000",
            "format": "npy"
        }
    }
}
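Since metadata.json is plain JSON, it can be inspected with the standard json module. The sketch below counts crops per class and inverts the assignments into a lookup from index to class. It assumes the integers in assignments are ROI or crop indices, and both helper names are hypothetical. The sample is a trimmed copy of the first example above.

```python
import json

# Trimmed copy of the first metadata example.
SAMPLE = """
{
    "name": "dataset-03",
    "segmentation_masking": {
        "image_1": {
            "label_map": {"0": "background", "1": "type_A", "2": "type_B"},
            "num_classes": 3,
            "classes": ["type_A", "type_B"],
            "assignments": {"type_A": [0, 2, 3], "type_B": [1, 5, 4]}
        }
    }
}
"""


def crops_per_class(metadata):
    """Total number of assigned indices per class across all entries."""
    totals = {}
    for entry in metadata["segmentation_masking"].values():
        for class_name, indices in entry["assignments"].items():
            totals[class_name] = totals.get(class_name, 0) + len(indices)
    return totals


def index_to_class(entry):
    """Invert one entry's assignments: {index: class_name}."""
    return {
        index: class_name
        for class_name, indices in entry["assignments"].items()
        for index in indices
    }


metadata = json.loads(SAMPLE)
print(crops_per_class(metadata))
print(index_to_class(metadata["segmentation_masking"]["image_1"]))
```

For a dataset on disk, replace `json.loads(SAMPLE)` with `json.load` on the opened metadata.json file.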

Validations

SpectralDatamaker includes validation checks allowing users to verify the generated dataset structure and contents, as well as validate existing datasets. The validation includes checks for the presence of required directories and expected files.
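A presence check of this kind can be approximated with pathlib. The sketch below is not the package's validator: `missing_entries` is a hypothetical helper, and the required entries are taken from the dataset layout shown earlier. splits.csv is treated as optional here because the splits step is optional.

```python
import tempfile
from pathlib import Path

# Assumed from the documented layout; the real validator may check more.
REQUIRED_DIRS = ("images", "masks", "source")
REQUIRED_FILES = ("metadata.json",)


def missing_entries(root):
    """Return the required directories and files absent under `root`."""
    root = Path(root)
    missing = [f"{name}/" for name in REQUIRED_DIRS if not (root / name).is_dir()]
    missing += [name for name in REQUIRED_FILES if not (root / name).is_file()]
    return missing


with tempfile.TemporaryDirectory() as tmp:
    print(missing_entries(tmp))          # everything missing in an empty directory
    for name in REQUIRED_DIRS:
        (Path(tmp) / name).mkdir()
    (Path(tmp) / "metadata.json").write_text("{}")
    print(missing_entries(tmp))          # nothing missing now
```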

Extending

See EXTENDING.md for a guide on adding new pipeline steps, CLI commands, and configuration options.
