# split-folders

Split folders with files (e.g. images) into train, validation and test (dataset) folders.

The input folder should have the following format:

    input/
        class1/
            img1.jpg
            img2.jpg
            ...
        class2/
            imgWhatever.jpg
            ...
        ...


In order to give you this:

    output/
        train/
            class1/
                img1.jpg
                ...
            class2/
                imga.jpg
                ...
        val/
            class1/
                img2.jpg
                ...
            class2/
                imgb.jpg
                ...
        test/
            class1/
                img3.jpg
                ...
            class2/
                imgc.jpg
                ...
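To check that a split produced the layout shown above, you can count the files per split and class. This is a minimal sketch using only the standard library; `count_split_files` is a hypothetical helper and not part of split-folders.

```python
from pathlib import Path


def count_split_files(output_dir):
    """Return {split: {class_name: file_count}} for an output/<split>/<class>/ layout."""
    counts = {}
    for split_dir in sorted(Path(output_dir).iterdir()):
        if not split_dir.is_dir():
            continue  # ignore stray files next to train/, val/, test/
        counts[split_dir.name] = {
            class_dir.name: sum(1 for f in class_dir.iterdir() if f.is_file())
            for class_dir in sorted(split_dir.iterdir())
            if class_dir.is_dir()
        }
    return counts
```

Calling `count_split_files("output")` returns a nested dictionary keyed first by split folder (`train`, `val`, `test`) and then by class name.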


This should get you started doing some serious deep learning on your data. Read here why it's a good idea to split your data into three different sets.

• Split files into a training set and a validation set (and optionally a test set).
• Works on any file type.
• The files get shuffled.
• A seed makes splits reproducible.
• Allows randomized oversampling for imbalanced datasets.
• Optionally group files by prefix.
• (Should) work on all operating systems.
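To illustrate what "shuffled" and "reproducible with a seed" mean in practice, here is a rough sketch of a seeded ratio split over one class's file names. It is an assumption about the general technique, not the package's actual implementation; `split_by_ratio` is a hypothetical name.

```python
import random


def split_by_ratio(files, ratio=(0.8, 0.1, 0.1), seed=1337):
    """Shuffle `files` with a fixed seed and cut them into len(ratio) consecutive parts."""
    files = sorted(files)                # start from a stable order
    random.Random(seed).shuffle(files)   # same seed -> same shuffle
    parts, start, cumulative = [], 0, 0.0
    for r in ratio[:-1]:
        cumulative += r
        end = round(cumulative * len(files))  # cut point for this part
        parts.append(files[start:end])
        start = end
    parts.append(files[start:])          # the last part takes the remainder
    return parts
```

With ten file names and the default ratio, the three parts contain 8, 1 and 1 files, and repeating the call with the same seed yields the same assignment.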

## Install

This package is Python only and there are no external dependencies.

    pip install split-folders


Optionally, you may install tqdm to get a progress bar when moving files.

    pip install split-folders[full]


## Usage

You can use split-folders as a Python module or as a command line interface (CLI).

If your dataset is balanced (each class has the same number of samples), choose ratio; otherwise, choose fixed. NB: Oversampling is turned off by default. It is only applied to the train folder, since having duplicates in val or test would be considered cheating.
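The README does not spell out how oversampling is performed. A common approach, sketched here purely as an assumption, is to pad each smaller class in the training set with randomly chosen duplicates until it matches the largest class. `oversample` is a hypothetical helper and expects every class to contain at least one file.

```python
import random


def oversample(train_files_by_class, seed=1337):
    """Pad every class with random duplicates until it matches the largest class."""
    rng = random.Random(seed)
    target = max(len(files) for files in train_files_by_class.values())
    balanced = {}
    for name, files in train_files_by_class.items():
        # Draw duplicates (with replacement) only for the missing amount.
        extra = [rng.choice(files) for _ in range(target - len(files))]
        balanced[name] = list(files) + extra
    return balanced
```

Only the training files are passed in, so validation and test sets stay free of duplicates.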

### Module

    import splitfolders

    # Split with a ratio.
    # To split into training and validation sets only, pass a 2-tuple as ratio, e.g. (.8, .2).
    splitfolders.ratio("input_folder", output="output",
        seed=1337, ratio=(.8, .1, .1), group_prefix=None, move=False) # default values

    # Split val/test with a fixed number of items, e.g. (100, 100), for each set.
    # To split into training and validation sets only, pass a single number as fixed, e.g. 10.
    # Set 3 values, e.g. (300, 100, 100), to limit the number of training values.
    splitfolders.fixed("input_folder", output="output",
        seed=1337, fixed=(100, 100), oversample=False, group_prefix=None, move=False) # default values


Occasionally, a sample may comprise more than a single file (e.g. picture (.png) + annotation (.txt)). splitfolders lets you split files into equally sized groups based on their prefix. Set group_prefix to the size of each group (e.g. 2). Note that all files must then be part of a group.
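As a rough illustration of prefix grouping, the sketch below bundles files that share a name without its extension and rejects groups of the wrong size. It assumes that "prefix" means the file stem, which the README does not confirm; `group_by_stem` is a hypothetical helper, not the package's API.

```python
from collections import defaultdict
from pathlib import Path


def group_by_stem(filenames, group_size=2):
    """Bundle files sharing a stem (e.g. img1.png + img1.txt) into tuples."""
    groups = defaultdict(list)
    for name in sorted(filenames):
        groups[Path(name).stem].append(name)
    # Every file must belong to a complete group of the expected size.
    incomplete = {s: m for s, m in groups.items() if len(m) != group_size}
    if incomplete:
        raise ValueError(f"groups with unexpected size: {incomplete}")
    return [tuple(members) for members in groups.values()]
```

Each resulting tuple can then be shuffled and assigned to a split as a single unit, so a picture and its annotation never end up in different sets.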

Set move=True if you want to move the files instead of copying.

### CLI

    Usage:
        splitfolders [--output] [--ratio] [--fixed] [--seed] [--oversample] [--group_prefix] [--move] folder_with_images
    Options:
        --output        path to the output folder. defaults to output. Gets created if non-existent.
        --ratio         the ratio to split. e.g. for train/val/test .8 .1 .1 -- or for train/val .8 .2 --.
        --fixed         set the absolute number of items per validation/test set. The remaining items constitute
                        the training set. e.g. for train/val/test 100 100 or for train/val 100.
                        Set 3 values, e.g. 300 100 100, to limit the number of training values.
        --seed          set seed value for shuffling the items. defaults to 1337.
        --oversample    enable oversampling of imbalanced datasets, works only with --fixed.
        --group_prefix  split files into equally-sized groups based on their prefix
        --move          move the files instead of copying
    Example:
        splitfolders --ratio .8 .1 .1 -- folder_with_images


Because of some Python quirks, you have to add -- after the values passed to --ratio.
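The quirk can be reproduced with a small stand-alone parser. This sketch assumes the CLI is built on `argparse` with a multi-value option followed by a positional argument, which this README does not confirm; `build_parser` and the `demo` program name are made up for illustration.

```python
import argparse


def build_parser():
    """A toy parser: one multi-value option plus one positional argument."""
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("--ratio", nargs="+", type=float)
    parser.add_argument("input")
    return parser


# With "--", the option stops collecting values and the folder is parsed as `input`.
args = build_parser().parse_args(["--ratio", ".8", ".1", ".1", "--", "folder_with_images"])
print(args.ratio, args.input)
```

Without the `--`, the greedy `nargs="+"` option also swallows `folder_with_images`, tries to convert it to a float, and the parser exits with an error.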

Instead of the command splitfolders you can also use split_folders or split-folders.

## Development

Install and use poetry.

## Contributing

If you have a question, found a bug or want to propose a new feature, have a look at the issues page.

Pull requests are especially welcome when they fix bugs or improve the code quality.

## License

MIT
