
🗂 Split folders with files (e.g. images) into training, validation and test (dataset) folders.


Split Folders

Split folders with files (e.g. images) into train, validation and test (dataset) folders.

The input folder should have the following format:

input/
    class1/
        img1.jpg
        img2.jpg
        ...
    class2/
        imgWhatever.jpg
        ...
    ...

The tool then produces this layout:

output/
    train/
        class1/
            img1.jpg
            ...
        class2/
            imga.jpg
            ...
    val/
        class1/
            img2.jpg
            ...
        class2/
            imgb.jpg
            ...
    test/
        class1/
            img3.jpg
            ...
        class2/
            imgc.jpg
            ...
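
To check a finished split against the layout above, a small helper can count the files per set and class. This is only a sketch that relies on the folder structure shown here; `summarize_splits` is a hypothetical name and not part of split_folders.

```python
from pathlib import Path


def summarize_splits(output_dir):
    """Return {split: {class_name: file_count}} for an output folder."""
    summary = {}
    for split_dir in sorted(Path(output_dir).iterdir()):
        if not split_dir.is_dir():
            continue  # ignore stray files next to train/, val/ and test/
        summary[split_dir.name] = {
            class_dir.name: sum(1 for f in class_dir.iterdir() if f.is_file())
            for class_dir in sorted(split_dir.iterdir())
            if class_dir.is_dir()
        }
    return summary
```

Calling `summarize_splits("output")` after a split returns one entry per set, which makes it easy to spot a class that ended up empty.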

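This should get you started doing some serious deep learning on your data. It is a good idea to split your data into three different sets: the training set fits the model, the validation set guides tuning, and the test set gives a final, unbiased evaluation.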

  • You can also split into just a training and a validation set.
  • The data gets shuffled before it gets split.
  • A seed lets you reproduce the splits.
  • Works on any file types.
  • Allows randomized oversampling for imbalanced datasets.
  • (Should) work on all operating systems.
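
The role of the seed can be illustrated with a few lines of plain Python. This is a minimal sketch, not the library's actual code: it assumes the file list of one class is shuffled with a seeded generator and then cut according to the ratio, and `split_by_ratio` is a hypothetical helper name.

```python
import random


def split_by_ratio(files, ratio=(0.8, 0.1, 0.1), seed=1337):
    """Shuffle a copy of `files` with a fixed seed, then cut it by `ratio`."""
    if abs(sum(ratio) - 1.0) > 1e-9:
        raise ValueError("ratio must sum to 1")
    shuffled = sorted(files)               # start from a stable order
    random.Random(seed).shuffle(shuffled)  # same seed, same permutation
    parts, start, cumulative = [], 0, 0.0
    for r in ratio[:-1]:
        cumulative += r
        end = round(cumulative * len(shuffled))  # cut point for this set
        parts.append(shuffled[start:end])
        start = end
    parts.append(shuffled[start:])         # the last set takes the remainder
    return parts
```

With ten files and the default ratio this yields sets of 8, 1 and 1 files, and repeating the call with the same seed returns the same sets.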

Install

pip install split-folders

Usage

You can use split_folders as a Python module or through its command-line interface (CLI).

If your dataset is balanced (each class has the same number of samples), choose ratio; otherwise, choose fixed. NB: oversampling is turned off by default.
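
If you are unsure whether your dataset is balanced, counting the files per class settles it. The helpers below are a sketch with hypothetical names (`class_counts`, `is_balanced`); they are not part of split_folders and assume the input layout described above.

```python
from pathlib import Path


def class_counts(input_dir):
    """Count the files directly inside each class folder of `input_dir`."""
    return {
        class_dir.name: sum(1 for f in class_dir.iterdir() if f.is_file())
        for class_dir in sorted(Path(input_dir).iterdir())
        if class_dir.is_dir()
    }


def is_balanced(counts):
    """True when every class holds the same number of files."""
    return len(set(counts.values())) <= 1
```

If `is_balanced(class_counts("input_folder"))` is true, a ratio split keeps the sets comparable across classes; otherwise a fixed split is the safer choice.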

Module

import split_folders

# Split with a ratio.
# To split into only a training and a validation set, pass a two-element tuple to `ratio`, e.g. `(.8, .2)`.
split_folders.ratio('input_folder', output="output", seed=1337, ratio=(.8, .1, .1)) # default values

# Split val/test with a fixed number of items, e.g. 100 for each set.
# To split into only a training and a validation set, pass a single number to `fixed`, e.g. `10`.
split_folders.fixed('input_folder', output="output", seed=1337, fixed=(100, 100), oversample=False) # default values
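
To make the fixed mode more concrete, here is a rough sketch of the idea for a single class: a fixed number of files is reserved for the validation and test sets, and the rest becomes the training set. The oversampling step is an assumption about what "randomized oversampling" means (padding a smaller class with randomly chosen duplicates); the library may do it differently. Both function names are hypothetical.

```python
import random


def split_fixed(files, fixed=(100, 100), seed=1337):
    """Reserve fixed-size val/test sets; the rest is the training set."""
    if isinstance(fixed, int):
        fixed = (fixed,)               # a single number means train/val only
    if sum(fixed) >= len(files):
        raise ValueError("not enough files for the requested fixed sizes")
    shuffled = sorted(files)
    random.Random(seed).shuffle(shuffled)
    held_out, start = [], 0
    for size in fixed:
        held_out.append(shuffled[start:start + size])
        start += size
    return [shuffled[start:]] + held_out   # [train, val] or [train, val, test]


def oversample(train_files, target_size, seed=1337):
    """Pad a training list to `target_size` with randomly drawn duplicates."""
    rng = random.Random(seed)
    extra = max(0, target_size - len(train_files))
    return list(train_files) + [rng.choice(train_files) for _ in range(extra)]
```

In this sketch the held-out sets have the same size for every class, so only the training sets differ in size, and padding the smaller ones up to the largest evens them out.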

CLI

Usage:
    split_folders folder_with_images [--output] [--ratio] [--fixed] [--seed] [--oversample]
Options:
    --output     path to the output folder. Defaults to `output`. Created if it does not exist.
    --ratio      the ratio to split, e.g. `.8 .1 .1` for train/val/test or `.8 .2` for train/val.
    --fixed      set the absolute number of items per validation/test set. The remaining items constitute
                 the training set, e.g. `100 100` for train/val/test or `100` for train/val.
    --seed       set the seed value for shuffling the items. Defaults to 1337.
    --oversample enable oversampling of imbalanced datasets. Works only with --fixed.
Example:
    split_folders imgs --ratio .8 .1 .1

License

MIT.
