Split Folders
Split folders with files (e.g. images) into train, validation and test (dataset) folders.
The input folder should have the following format:
input/
    class1/
        img1.jpg
        img2.jpg
        ...
    class2/
        imgWhatever.jpg
        ...
    ...
In order to give you this:
output/
    train/
        class1/
            img1.jpg
            ...
        class2/
            imga.jpg
            ...
    val/
        class1/
            img2.jpg
            ...
        class2/
            imgb.jpg
            ...
    test/
        class1/
            img3.jpg
            ...
        class2/
            imgc.jpg
            ...
This should get you started on doing some serious deep learning with your data. Read here why it's a good idea to split your data into three different sets.
- You may also split into only a training and a validation set.
- The files get shuffled before they are split.
- A seed lets you reproduce the splits.
- Works on any file type.
- Allows randomized oversampling for imbalanced datasets.
- (Should) work on all operating systems.
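To make the seed and ratio behaviour listed above concrete, here is a minimal sketch using only the standard library. It is not the package's own implementation: the function name `split_by_ratio` and the rounding of the cut points are assumptions made for illustration.

```python
import random


def split_by_ratio(files, ratio=(0.8, 0.1, 0.1), seed=1337):
    """Shuffle `files` reproducibly and cut the list according to `ratio`.

    Hypothetical helper for illustration; not part of split_folders.
    """
    if abs(sum(ratio) - 1.0) > 1e-9:
        raise ValueError("ratio must sum to 1")
    # Sort first so the result does not depend on directory listing order.
    shuffled = sorted(files)
    # A private Random instance leaves the global random state untouched.
    random.Random(seed).shuffle(shuffled)

    # Turn cumulative ratios into cut points; the last set takes the remainder.
    # The small epsilon guards against float results such as 89.999999.
    cuts, cumulative = [], 0.0
    for part in ratio[:-1]:
        cumulative += part
        cuts.append(int(cumulative * len(shuffled) + 1e-9))
    bounds = [0] + cuts + [len(shuffled)]
    return [shuffled[a:b] for a, b in zip(bounds, bounds[1:])]
```

Calling the function twice with the same seed returns identical lists, which is what makes a split reproducible. Passing a two-element ratio such as `(0.8, 0.2)` yields only a training and a validation list.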
Install
pip install split-folders
Usage
You can use `split_folders` as a Python module or as a Command Line Interface (CLI).
If your dataset is balanced (each class has the same number of samples), choose `ratio`; otherwise choose `fixed`. NB: oversampling is turned off by default.
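If you are unsure whether your dataset is balanced, a quick count of files per class can help you decide between `ratio` and `fixed`. The sketch below assumes the `input/class/file` layout described earlier; `class_counts` and `is_balanced` are hypothetical helper names, not part of the package.

```python
from pathlib import Path


def class_counts(input_dir):
    """Return {class_name: number_of_files} for an input/class/file layout."""
    counts = {}
    for class_dir in sorted(Path(input_dir).iterdir()):
        if class_dir.is_dir():
            # Count only regular files directly inside the class folder.
            counts[class_dir.name] = sum(1 for p in class_dir.iterdir() if p.is_file())
    return counts


def is_balanced(counts):
    """True when every class holds the same number of files."""
    return len(set(counts.values())) <= 1
```

For example, `is_balanced(class_counts("input_folder"))` returning `False` suggests using `fixed`, possibly with oversampling enabled.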
Module
import split_folders
# Split with a ratio.
# To split into only a training and a validation set, pass a two-element tuple as `ratio`, e.g. `(.8, .2)`.
split_folders.ratio('input_folder', output="output", seed=1337, ratio=(.8, .1, .1)) # default values
# Split val/test with a fixed number of items, e.g. 100 for each set.
# To split into only a training and a validation set, pass a single number as `fixed`, e.g. `10`.
split_folders.fixed('input_folder', output="output", seed=1337, fixed=(100, 100), oversample=False) # default values
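The idea behind a fixed split with optional oversampling can be sketched in plain Python. This illustrates the concept under stated assumptions and is not the package's code: `split_fixed` and `oversample` are hypothetical names, and repeating random training files until every class matches the largest one is just one common oversampling strategy.

```python
import random


def split_fixed(files, fixed=(100, 100), seed=1337):
    """Reserve fixed-size val/test sets; the remaining files form the training set."""
    if isinstance(fixed, int):
        fixed = (fixed,)  # a single number means train/val only
    if sum(fixed) >= len(files):
        raise ValueError("not enough files to leave a non-empty training set")
    shuffled = sorted(files)
    random.Random(seed).shuffle(shuffled)

    held_out, start = [], 0
    for size in fixed:
        held_out.append(shuffled[start:start + size])
        start += size
    return [shuffled[start:]] + held_out  # [train, val] or [train, val, test]


def oversample(train_by_class, seed=1337):
    """Randomly repeat training files until every class matches the largest one.

    Assumes each class has at least one training file.
    """
    rng = random.Random(seed)
    target = max(len(items) for items in train_by_class.values())
    return {
        name: items + rng.choices(items, k=target - len(items))
        for name, items in train_by_class.items()
    }
```

In this sketch only the training lists are oversampled, so the validation and test sets keep their original files.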
CLI
Usage:
split_folders folder_with_images [--output] [--ratio] [--fixed] [--seed] [--oversample]
Options:
    --output        path to the output folder. Defaults to `output`. Created if it does not exist.
    --ratio         the ratio to split, e.g. `.8 .1 .1` for train/val/test or `.8 .2` for train/val.
    --fixed         set the absolute number of items per validation/test set. The remaining items
                    constitute the training set, e.g. `100 100` for train/val/test or `100` for train/val.
    --seed          set the seed value for shuffling the items. Defaults to 1337.
    --oversample    enable oversampling of imbalanced datasets. Works only with --fixed.
Example:
split_folders imgs --ratio .8 .1 .1
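After running a split, you may want to check how many files landed in each set. The following sketch walks an `output/split/class/file` tree like the one shown earlier; `summarize_output` is a hypothetical helper, not something the package provides.

```python
from pathlib import Path


def summarize_output(output_dir):
    """Return {split: {class_name: file_count}} for an output/split/class/file tree."""
    summary = {}
    for split_dir in sorted(Path(output_dir).iterdir()):
        if not split_dir.is_dir():
            continue
        summary[split_dir.name] = {
            class_dir.name: sum(1 for p in class_dir.iterdir() if p.is_file())
            for class_dir in sorted(split_dir.iterdir())
            if class_dir.is_dir()
        }
    return summary
```

Calling `summarize_output("output")` after the example above returns a nested dictionary keyed by `train`, `val` and `test`.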
License
MIT.