Lightweight utilities for organizing image datasets: split, merge, label-based sorting, and directory structure creation

folderops

A lightweight Python package for organizing image datasets in machine learning workflows.

It focuses on the most common and repetitive tasks: splitting datasets, merging folders, creating directory structures, and sorting images into class folders from labels. Everything is designed for direct use inside notebooks and research pipelines with minimal friction.


Why folderops

If you’ve worked with vision datasets, you’ve probably rewritten the same scripts over and over:

  • splitting train/val/test
  • merging datasets from different sources
  • reorganizing files from CSV labels
  • creating directory structures manually

This package removes that overhead and gives you reliable, reusable utilities.


Features

  • Split datasets into train / validation / test sets
  • Merge files from nested directories into a single folder
  • Organize images into class folders using CSV labels
  • Create directory structures from lists or nested dictionaries
  • Supports common image formats:
    .jpg, .jpeg, .png, .bmp, .gif, .tif, .tiff, .webp
  • Works consistently in terminal, VS Code, and Jupyter notebooks

Installation

pip install folderops

For development, install in editable mode from a clone of the repository:

pip install -e .

Quick Start

from folderops import split_dataset, merge_folders, organize_by_labels, create_structure

split_dataset(
    source="images",
    output="dataset",
    train_ratio=0.7,
    val_ratio=0.15,
    test_ratio=0.15,
    seed=42,
)

merge_folders(
    source="dataset/images",
    output="merged_images",
)

organize_by_labels(
    image_dir="images",
    label_file="labels.csv",
    output="organized_dataset",
)

structure = {
    "dataset": {
        "train": {},
        "val": {},
        "test": {}
    }
}

create_structure(structure)

API Reference

split_dataset

Split a dataset organized by class folders into train, validation, and test sets.

Expected input structure

source/
    class1/
        img1.jpg
        img2.jpg
    class2/
        img3.jpg

Output structure

output/
    train/
        class1/
        class2/
    val/
        class1/
        class2/
    test/
        class1/
        class2/

Usage

split_dataset(
    source="images",
    output="dataset",
    train_ratio=0.7,
    val_ratio=0.15,
    test_ratio=0.15,
    seed=42,
    mode="copy",
    extensions=(".jpg", ".png"),
)

Key behavior

  • Splits per class, not globally
  • Shuffles files before splitting
  • Supports deterministic splits via seed
  • Supports both copy and move
  • Validates that ratios sum to 1.0
  • Displays progress cleanly in both terminal and notebooks
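
The per-class, seeded behavior listed above can be illustrated with a small standard-library sketch. This is not folderops's actual implementation: the helper name `split_class_files`, the sort-before-shuffle step, and the rule that rounding leftovers go to the test split are all assumptions made for illustration.

```python
import random


def split_class_files(files, train_ratio=0.7, val_ratio=0.15, test_ratio=0.15, seed=None):
    """Shuffle one class's file names and cut them into train/val/test lists.

    Hypothetical helper for illustration; not part of the folderops API.
    """
    total = train_ratio + val_ratio + test_ratio
    if abs(total - 1.0) > 1e-6:
        raise ValueError(f"ratios must sum to 1.0, got {total}")

    # Sort first so the result does not depend on directory listing order.
    shuffled = sorted(files)
    # A private RNG keeps the global random state untouched.
    random.Random(seed).shuffle(shuffled)

    n_train = int(len(shuffled) * train_ratio)
    n_val = int(len(shuffled) * val_ratio)
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        # Assumption: whatever remains after rounding goes to test, so no file is dropped.
        "test": shuffled[n_train + n_val:],
    }


classes = {
    "class1": [f"img{i}.jpg" for i in range(20)],
    "class2": [f"img{i}.jpg" for i in range(20, 30)],
}

# Split each class on its own so every split keeps roughly the same class balance.
splits = {name: split_class_files(files, seed=42) for name, files in classes.items()}
for name, parts in splits.items():
    print(name, {split: len(items) for split, items in parts.items()})
```

Splitting inside each class rather than over the pooled file list is what keeps small classes from vanishing from the validation or test set.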

merge_folders

Merge all files from a directory (including subfolders) into a single folder.

Example

source/
    cats/
        a.jpg
    dogs/
        a.jpg
        b.jpg

Result

merged/
    a.jpg
    a_1.jpg
    b.jpg

Usage

merge_folders(
    source="source",
    output="merged",
    mode="copy",
)

Key behavior

  • Recursively scans all subfolders
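  • Never overwrites: a file whose name is already taken gets a numeric suffix (a.jpg becomes a_1.jpg)
  • Preserves all files, so every source file ends up in the output folder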
  • Supports extension filtering
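
The renaming rule can be sketched with the standard library alone. The helpers `unique_destination` and `merge_flat` below are hypothetical names, and the exact suffix scheme is inferred from the `a_1.jpg` example above rather than taken from the package source.

```python
import shutil
import tempfile
from pathlib import Path


def unique_destination(output_dir, filename):
    """Return a path in output_dir that does not clash with an existing file."""
    output_dir = Path(output_dir)
    candidate = output_dir / filename
    stem, suffix = candidate.stem, candidate.suffix
    counter = 1
    while candidate.exists():
        # Assumed scheme: name_1.ext, name_2.ext, ... until a free name is found.
        candidate = output_dir / f"{stem}_{counter}{suffix}"
        counter += 1
    return candidate


def merge_flat(source, output):
    """Copy every file under source (recursively) into one flat output folder."""
    source, output = Path(source), Path(output)
    output.mkdir(parents=True, exist_ok=True)
    # Sorting makes the renaming order reproducible across runs.
    for path in sorted(source.rglob("*")):
        if path.is_file():
            shutil.copy2(path, unique_destination(output, path.name))


# Rebuild the example tree from this section in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src, out = Path(tmp, "source"), Path(tmp, "merged")
    for rel in ("cats/a.jpg", "dogs/a.jpg", "dogs/b.jpg"):
        file = src / rel
        file.parent.mkdir(parents=True, exist_ok=True)
        file.write_text(rel)
    merge_flat(src, out)
    merged_names = sorted(p.name for p in out.iterdir())
    print(merged_names)
```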

organize_by_labels

Organize images into class folders using a CSV file.

CSV format

path,class
img1.jpg,cats
img2.jpg,dogs

Usage

organize_by_labels(
    image_dir="images",
    label_file="labels.csv",
    output="organized",
    mode="copy",
)

Result

organized/
    cats/
        img1.jpg
    dogs/
        img2.jpg

Key behavior

  • Validates every file exists before transfer
  • Raises clear errors for missing or invalid entries
  • Supports custom delimiters
  • Optional strict extension filtering
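
A validate-first pass like the one described above might look as follows. This is a sketch under assumptions: `read_label_rows` is a hypothetical helper, the column names `path` and `class` come from the CSV format shown earlier, and the exception types are a guess at what "clear errors" could mean.

```python
import csv
import io
import shutil
import tempfile
from pathlib import Path


def read_label_rows(label_file, image_dir, delimiter=","):
    """Return (source_path, class_name) pairs, failing before any file is touched."""
    image_dir = Path(image_dir)
    rows, missing = [], []
    reader = csv.DictReader(label_file, delimiter=delimiter)
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        path = (row.get("path") or "").strip()
        label = (row.get("class") or "").strip()
        if not path or not label:
            raise ValueError(f"line {line_no}: expected non-empty 'path' and 'class'")
        source = image_dir / path
        if not source.is_file():
            missing.append(path)
        rows.append((source, label))
    if missing:
        # Report all missing files at once instead of stopping halfway through a copy.
        raise FileNotFoundError(f"{len(missing)} listed file(s) not found: {missing[:5]}")
    return rows


with tempfile.TemporaryDirectory() as tmp:
    images, out = Path(tmp, "images"), Path(tmp, "organized")
    images.mkdir()
    for name in ("img1.jpg", "img2.jpg"):
        (images / name).write_text("placeholder")

    labels = io.StringIO("path,class\nimg1.jpg,cats\nimg2.jpg,dogs\n")
    # Only start copying once every row has been validated.
    for source, label in read_label_rows(labels, images):
        target_dir = out / label
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target_dir / source.name)

    layout = sorted(p.relative_to(out).as_posix() for p in out.rglob("*") if p.is_file())
    print(layout)
```

Checking every row before the first copy means a typo in the CSV cannot leave a half-organized output folder behind.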

create_structure

Create directory structures from a list or nested dictionary.

List-based usage

paths = ["train/cats", "train/dogs", "val/cats"]
create_structure(paths, root="dataset")

Dictionary-based usage

structure = {
    "dataset": {
        "train": {},
        "val": {},
        "test": {}
    }
}

create_structure(structure)

Result

dataset/
    train/
    val/
    test/

Key behavior

  • Accepts both flat lists and nested dictionaries
  • Automatically creates missing directories
  • Safe to run multiple times
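
The "safe to run multiple times" property typically comes from `os.makedirs(..., exist_ok=True)`. The sketch below shows one way to support both input shapes; `flatten_structure` and `make_dirs` are hypothetical helpers, not the package's internals.

```python
import os
import tempfile
from pathlib import Path


def flatten_structure(structure, prefix=""):
    """Turn a nested dict (or a flat list of relative paths) into directory paths."""
    if isinstance(structure, dict):
        paths = []
        for name, children in structure.items():
            current = os.path.join(prefix, name)
            paths.append(current)
            # An empty dict (or None) marks a leaf directory.
            paths.extend(flatten_structure(children or {}, current))
        return paths
    return [os.path.join(prefix, item) for item in structure]


def make_dirs(structure, root="."):
    """Create every directory; existing ones are left alone."""
    for rel in flatten_structure(structure):
        os.makedirs(os.path.join(root, rel), exist_ok=True)


structure = {"dataset": {"train": {}, "val": {}, "test": {}}}

with tempfile.TemporaryDirectory() as tmp:
    make_dirs(structure, root=tmp)
    make_dirs(structure, root=tmp)  # second call is a no-op, not an error
    created = sorted(p.relative_to(tmp).as_posix() for p in Path(tmp).rglob("*"))
    print(created)
```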

Project Structure

folderops/
├── folderops/
│   ├── __init__.py
│   ├── merger.py
│   ├── organizer.py
│   ├── splitter.py
│   ├── structure.py
│   └── utils.py
├── LICENSE
├── pyproject.toml
└── README.md

Design Principles

  • Minimal dependencies
  • Explicit, readable APIs
  • Safe file operations
  • Notebook-friendly behavior
  • Reproducible dataset handling

Build and Publish

python -m build
twine upload dist/*

Requirements

  • Python 3.8+
  • tqdm

License

MIT License
