Project description

Split folders with files (e.g. images) into train, validation and test (dataset) folders.

The input folder should have the following format:

input/
    class1/
        img1.jpg
        img2.jpg
        ...
    class2/
        imgWhatever.jpg
        ...
    ...

In order to give you this:

output/
    train/
        class1/
            img1.jpg
            ...
        class2/
            imga.jpg
            ...
    val/
        class1/
            img2.jpg
            ...
        class2/
            imgb.jpg
            ...
    test/
        class1/
            img3.jpg
            ...
        class2/
            imgc.jpg
            ...

This should get you started to do some serious deep learning on your data. Read here why it's a good idea to split your data into three different sets.

  • Split files into a training set and a validation set (and optionally a test set).
  • Works on any file type.
  • The files get shuffled.
  • A seed makes splits reproducible.
  • Allows randomized oversampling for imbalanced datasets.
  • Optionally group files by prefix.
  • (Should) work on all operating systems.

Install

pip install split-folders

If you are working with a large number of files, you may want a progress bar. Install tqdm in order to get visual updates while copying files.

pip install split-folders tqdm

Usage

You can use split-folders as a Python module or via the Command Line Interface (CLI).

If your dataset is balanced (each class has the same number of samples), choose `ratio`; otherwise, use `fixed`. NB: Oversampling is turned off by default.

Module

import splitfolders  # or import split_folders

# Split with a ratio.
# To split into only a training and validation set, set `ratio` to a 2-tuple, e.g. `(.8, .2)`.
splitfolders.ratio("input_folder", output="output", seed=1337, ratio=(.8, .1, .1), group_prefix=None) # default values

# Split val/test with a fixed number of items e.g. 100 for each set.
# To split into only a training and validation set, set `fixed` to a single number, e.g. `10`.
splitfolders.fixed("input_folder", output="output", seed=1337, fixed=(100, 100), oversample=False, group_prefix=None) # default values
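
For example, the two-way split mentioned in the comments above could look like this (a minimal sketch; the folder names are placeholders):

import splitfolders

# 80/20 train/val split, no test set.
splitfolders.ratio("input_folder", output="output", seed=1337, ratio=(.8, .2))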

Occasionally you may have items that consist of more than a single file, e.g. a picture (.png) plus its annotation (.txt). splitfolders lets you split files into equally-sized groups based on their prefix. Set `group_prefix` to the size of the group (e.g. `2`). Note that in this case every file must belong to a group; see the sketch below.
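
As an illustration, here is a minimal, hedged sketch of grouped splitting; the folder name and the paired .png/.txt layout are assumptions for the example, not something prescribed by the package:

import splitfolders

# Hypothetical input layout, where every image is paired with an annotation:
# input/
#     class1/
#         img1.png
#         img1.txt
#         img2.png
#         img2.txt
#         ...
#
# With group_prefix=2, files sharing a prefix (img1.png + img1.txt) form a
# group of two and always end up in the same output split.
splitfolders.ratio("input", output="output", seed=1337, ratio=(.8, .1, .1), group_prefix=2)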

CLI

Usage:
    splitfolders [--output] [--ratio] [--fixed] [--seed] [--oversample] [--group_prefix] folder_with_images
Options:
    --output        path to the output folder. Defaults to `output`. It is created if it does not exist.
    --ratio         the ratio to split, e.g. `.8 .1 .1` for train/val/test or `.8 .2` for train/val.
    --fixed         set the absolute number of items per validation/test set. The remaining items constitute
                    the training set, e.g. `100 100` for train/val/test or `100` for train/val.
    --seed          set the seed value for shuffling the items. Defaults to 1337.
    --oversample    enable oversampling of imbalanced datasets; works only together with --fixed.
    --group_prefix  split files into equally-sized groups based on their prefix.
Example:
    splitfolders --ratio .8 .1 .1 folder_with_images
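
For instance, an imbalanced dataset could be split with a fixed number of validation/test items and oversampling of the training set enabled (a hedged example; the folder name is a placeholder):

splitfolders --fixed 100 100 --oversample --seed 1337 --output output folder_with_images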

Instead of the command `splitfolders`, you can also use `split_folders` or `split-folders`.

Development

Install and use poetry.
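
A minimal sketch of a typical setup, assuming the standard poetry workflow (the commands below are illustrative, not taken from the project's docs):

pip install poetry               # install poetry itself
poetry install                   # create a virtualenv and install the project's dependencies
poetry run splitfolders --help   # run the CLI from inside that environment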

Contributing

If you have a question, have found a bug, or want to propose a new feature, take a look at the issues page.

Pull requests are especially welcome when they fix bugs or improve the code quality.

License

MIT

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

split_folders-0.4.1.tar.gz (6.7 kB view details)

Uploaded Source

Built Distribution

split_folders-0.4.1-py3-none-any.whl (7.3 kB view details)

Uploaded Python 3

File details

Details for the file split_folders-0.4.1.tar.gz.

File metadata

  • Download URL: split_folders-0.4.1.tar.gz
  • Upload date:
  • Size: 6.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.0.9 CPython/3.8.4 Darwin/19.6.0

File hashes

Hashes for split_folders-0.4.1.tar.gz
Algorithm    Hash digest
SHA256       3c078902834aad1dd0b8e9e6b1c524dd77766ae74f0c8cbbdf2d4ca556dc234e
MD5          2564c386eadacae845a42c8e96b4b05b
BLAKE2b-256  8217f6b10299b0bdab9aaf64f6aac0addc7e4aa5f4cbe13db4aa9d9059cb17ef

See more details on using hashes here.

File details

Details for the file split_folders-0.4.1-py3-none-any.whl.

File metadata

  • Download URL: split_folders-0.4.1-py3-none-any.whl
  • Upload date:
  • Size: 7.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.0.9 CPython/3.8.4 Darwin/19.6.0

File hashes

Hashes for split_folders-0.4.1-py3-none-any.whl
Algorithm    Hash digest
SHA256       16f76155982fa64268c3c0529e858440551e97344a0e803744951f9ba987dc41
MD5          b9a1df6a253fc0912451e0543d156454
BLAKE2b-256  b29a794e8e820fff21119aab513e3d27dae0b1528aec17b5bac9687dec0dfd4d

See more details on using hashes here.
