
simpledataset

Utility tools for simple vision image dataset format.

Features

  • See a summary of a dataset
  • Convert from/to various dataset formats
  • Web UI to browse a dataset
  • CLI tools to split and concatenate datasets
  • CLI tools to modify labels

Install

pip install simpledataset

Usage

# Show summary
dataset_summary <input_dataset>

# For Classification dataset, extract only the images that have the specified labels.
# For Detection dataset, extract only the boxes that have the specified labels.
dataset_filter <input_dataset> <output_dataset> [--include_class <class_id> [<class_id> ...]] [--exclude_class <class_id> [<class_id> ...]]

# Update class labels
dataset_map <input_dataset> <output_dataset> --map <src_class_id> <dst_class_id> [--map <src_class_id> <dst_class_id> [--map...]]

dataset_split # NYI

# Concatenate multiple datasets into one dataset.
dataset_concat <input_txt_filepath> <input_txt_filepath2> [<input_txt_filepath>, ...] <output_txt_filepath>

dataset_shuffle # NYI

dataset_sample # NYI

# Re-package images and labels into new zip files.
dataset_pack <input_txt_filepath> <output_filepath> [--images_directory=<images_directory>] [--keep_empty_images]

# Remove labels with no actual data.
dataset_defrag <input_txt_filepath> <output_txt_filepath>

# Draw bounding boxes into images.
dataset_draw <input_txt_filepath> <output_dir>

# Convert from/to other dataset types. 
dataset_convert_from {coco|openimages_od|openimages_vr} ... <output_filepath>
dataset_convert_to <input_dataset> {coco|image_classification|object_detection} <output_filepath>

Examples

Please see CONVERT.md for the dataset conversion examples.

Change class ids

For example, if you would like to turn MNIST into an odd/even classification dataset, you can use the dataset_map command. In this example, we use class_id=0 for even numbers and class_id=1 for odd numbers.

dataset_map mnist.txt new_dataset.txt --map 2 0 --map 3 1 --map 4 0 --map 5 1 --map 6 0 --map 7 1 --map 8 0 --map 9 1
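Conceptually, the --map option rewrites each class id in a classification txt file through a lookup table. The helper below is a hypothetical illustration of that idea, not the actual dataset_map implementation.

```python
# Sketch of the class-id remapping that dataset_map performs on one line
# of a classification txt file. `remap_line` is a hypothetical helper,
# not part of the simpledataset API.

def remap_line(line, id_map):
    """Rewrite the class ids of one '<image_filepath> <labels>' line."""
    image_filepath, labels = line.split(' ', 1)
    # Ids absent from the map (here 0 and 1) are kept unchanged.
    new_ids = [str(id_map.get(int(c), int(c))) for c in labels.split(',')]
    return image_filepath + ' ' + ','.join(new_ids)

# The odd/even mapping from the command above.
odd_even = {2: 0, 3: 1, 4: 0, 5: 1, 6: 0, 7: 1, 8: 0, 9: 1}
print(remap_line('mnist_images.zip@42.png 7', odd_even))  # → mnist_images.zip@42.png 1
```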

Concatenate two datasets

For example, if you have two datasets (mnist_subset and mnist_subset2) and want to combine them, you can use the dataset_concat command.

dataset_concat mnist_subset/images.txt mnist_subset2/images.txt new_combined.txt

# new_combined.txt has 20 classes at this point. Let's merge them into 10 classes.
dataset_map new_combined.txt new_mapped_10.txt --map 10 0 --map 11 1 --map 12 2 --map 13 3 --map 14 4 --map 15 5 --map 16 6 --map 17 7 --map 18 8 --map 19 9
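The combined file ends up with 20 classes because the second dataset's class ids are kept distinct from the first dataset's, which implies an offset by the first dataset's class count. A rough sketch of that behavior, with a hypothetical helper (not the dataset_concat implementation):

```python
# Sketch of the id-offsetting implied above: concatenating two 10-class
# datasets yields 20 distinct classes because the second dataset's ids
# are shifted by the first dataset's class count.

def concat_classification_lines(lines_a, num_classes_a, lines_b):
    """Concatenate two classification datasets, shifting the second's ids."""
    def shift(line):
        path, labels = line.split(' ', 1)
        shifted = [str(int(c) + num_classes_a) for c in labels.split(',')]
        return path + ' ' + ','.join(shifted)
    return lines_a + [shift(line) for line in lines_b]
```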

SIMPLE Dataset format

Currently there are two dataset formats: Image Classification and Object Detection. Both consist of a single txt file, image files, and an optional list of label names (labels.txt). In addition, Object Detection datasets have label files that contain bounding-box information.

Image Classification

The main txt format is:

<file> ::= <txt_line> ('\n' <txt_line>)*
<txt_line> ::= <image_filepath> ' ' <labels>
<image_filepath> ::= <filepath> | <zip_filepath> '@' <entry_name>
<labels> ::= <class_id> (',' <class_id>)*

Here is an example txt file.

train_images.zip@0.jpg 0
train_images2.zip@1.jpg 1
image.png 0,1
image2.bmp 0,1,2,3
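The grammar above can be parsed in a few lines. This is a minimal sketch under the format definition given here; `parse_classification_txt` is an illustrative name, not part of the simpledataset API.

```python
# Minimal parser for the Image Classification txt format described above.

def parse_classification_txt(text):
    """Return a list of (image_filepath, [class_ids]) tuples."""
    entries = []
    for line in text.strip().splitlines():
        # The filepath (possibly 'archive.zip@entry') is separated from the
        # comma-delimited class ids by a single space.
        image_filepath, labels = line.split(' ', 1)
        class_ids = [int(c) for c in labels.split(',')]
        entries.append((image_filepath, class_ids))
    return entries

sample = "train_images.zip@0.jpg 0\nimage2.bmp 0,1,2,3"
print(parse_classification_txt(sample))
```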

Object Detection

The main txt format is:

<file> ::= <txt_line> ('\n' <txt_line>)*
<txt_line> ::= <image_filepath> ' ' <label_filepath>
<image_filepath> ::= <filepath> | <zip_filepath> '@' <entry_name>
<label_filepath> ::= <filepath> | <zip_filepath> '@' <entry_name>

The format of a label file is:

<file> ::= <label_line> ('\n' <label_line>)*
<label_line> ::= <class_id> ' ' <bbox_x_min> ' ' <bbox_y_min> ' ' <bbox_x_max> ' ' <bbox_y_max>
<class_id> ::= <int>
<bbox_x_min> ::= <int>      ; 0 <= <bbox_x_min> < <bbox_x_max> <= <image_width>
<bbox_y_min> ::= <int>      ; 0 <= <bbox_y_min> < <bbox_y_max> <= <image_height>
<bbox_x_max> ::= <int>
<bbox_y_max> ::= <int>
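A label file in this grammar can be parsed and validated against the coordinate constraints as sketched below. The function name is illustrative, not the simpledataset implementation.

```python
# Parse an Object Detection label file and check the bbox constraints
# noted in the grammar: 0 <= x_min < x_max <= image_width (and likewise
# for y).

def parse_label_file(text, image_width, image_height):
    """Return a list of (class_id, x_min, y_min, x_max, y_max) tuples."""
    boxes = []
    for line in text.strip().splitlines():
        class_id, x_min, y_min, x_max, y_max = (int(v) for v in line.split())
        if not (0 <= x_min < x_max <= image_width):
            raise ValueError(f'bad x range: {line!r}')
        if not (0 <= y_min < y_max <= image_height):
            raise ValueError(f'bad y range: {line!r}')
        boxes.append((class_id, x_min, y_min, x_max, y_max))
    return boxes

print(parse_label_file("3 10 20 110 220", image_width=640, image_height=480))
```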

Visual Relationship

The main txt format is the same as for Object Detection.

The format of a label file is:

<file> ::= <label_line> ('\n' <label_line>)*
<label_line> ::= <subject_id> ' ' <subject_bbox_x_min> ' ' <subject_bbox_y_min> ' ' <subject_bbox_x_max> ' ' <subject_bbox_y_max> ' ' <object_id> ' ' <object_bbox_x_min> ' ' <object_bbox_y_min> ' ' <object_bbox_x_max> ' ' <object_bbox_y_max> ' ' <predicate_id>
<subject_id> ::= <int>
<object_id> ::= <int>
<predicate_id> ::= <int> 
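Each Visual Relationship label line packs eleven integers: a subject box, an object box, and a predicate. A small sketch of splitting one line into those parts (illustrative helper, not the simpledataset API):

```python
# Split one Visual Relationship label line into (subject, object, predicate),
# where subject and object are (id, (x_min, y_min, x_max, y_max)) pairs.

def parse_vr_label_line(line):
    values = [int(v) for v in line.split()]
    subject = (values[0], tuple(values[1:5]))
    obj = (values[5], tuple(values[6:10]))
    predicate_id = values[10]
    return subject, obj, predicate_id

print(parse_vr_label_line("1 10 10 50 50 2 40 40 90 90 0"))
```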

Usage for remote datasets

Not yet implemented. This tool will allow you to use datasets stored on Azure Blob Storage and to update a dataset on the storage efficiently.

# To download a dataset from Azure Blob Storage.
dataset_download <url_with_container_sas> <output_dir>
