Easily convert datasets between different formats for object detection

Detection datasets


Easily load and transform datasets for object detection.



Documentation: https://blinjrm.github.io/detection-datasets/

Source Code: https://github.com/blinjrm/detection-datasets

Datasets on Hugging Face Hub: https://huggingface.co/detection-datasets



detection_datasets aims to make it easier to work with detection datasets. The main features are:

  • Read the dataset:
    • From disk if it has already been downloaded.
    • Directly from the Hugging Face Hub if it already exists.
  • Transform the dataset:
    • Select a subset of data.
    • Remap categories.
    • Create new train-val-test splits.
  • Visualize the annotations.
  • Write the dataset:
    • To disk, selecting the target detection format: COCO, YOLO and more to come.
    • To the Hugging Face Hub, for easy reuse in a different environment and for sharing with the community.

Requirements

Python 3.8+

detection_datasets is built upon the great work of:

Installation

$ pip install detection_datasets

Examples

from detection_datasets import DetectionDataset

1. Read

From local files:

config = {
    'dataset_format': 'coco',                   # the format of the dataset on disk
    'path': 'path/to/data/on/disk',             # where the dataset is located
    'splits': {                                 # how to read the files
        'train': ('train.json', 'train'),
        'test': ('test.json', 'test'),
    },
}

dd = DetectionDataset()
dd.from_disk(**config)

# note that you can use method cascading as well:
# dd = DetectionDataset().from_disk(**config)
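
To make the `'coco'` format above more concrete, here is a minimal sketch of what a COCO-style annotation file contains and how it can be flattened to one record per annotation. It uses only the standard library and is not how detection_datasets reads files internally; the sample data and the `flatten_coco` helper are made up for illustration.

```python
import json

# A tiny, made-up COCO-style annotation file, held in memory for illustration.
coco_json = """
{
  "images": [
    {"id": 1, "file_name": "0001.jpg", "width": 640, "height": 480}
  ],
  "categories": [
    {"id": 7, "name": "shirt"},
    {"id": 9, "name": "shoe"}
  ],
  "annotations": [
    {"id": 100, "image_id": 1, "category_id": 7, "bbox": [10, 20, 200, 300]},
    {"id": 101, "image_id": 1, "category_id": 9, "bbox": [50, 400, 80, 60]}
  ]
}
"""


def flatten_coco(coco: dict) -> list:
    """Hypothetical helper: return one dict per annotation, joined with its image and category."""
    images = {img["id"]: img for img in coco["images"]}
    categories = {cat["id"]: cat["name"] for cat in coco["categories"]}
    rows = []
    for ann in coco["annotations"]:
        img = images[ann["image_id"]]
        rows.append({
            "image_id": img["id"],
            "file_name": img["file_name"],
            "category": categories[ann["category_id"]],
            "bbox": ann["bbox"],  # COCO boxes are [x_min, y_min, width, height] in pixels
        })
    return rows


rows = flatten_coco(json.loads(coco_json))
for row in rows:
    print(row)
```

The three top-level keys (`images`, `categories`, `annotations`) are linked by ids, which is why the sketch builds two lookup dictionaries before walking the annotations.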

From the Hugging Face Hub:

dd = DetectionDataset().from_hub(name='fashionpedia')

Currently supported formats for reading datasets are:

  • COCO
  • more to come

The list of datasets available from the Hub is given by:

DetectionDataset().available_in_hub()       # Search in the "detection-datasets" repository on the Hub.
DetectionDataset().available_in_hub(repo_name='MY_REPO_OR_ORGANISATION')

2. Transform

Here we select a subset of 10,000 images and create new train-val-test splits, overwriting the splits from the original dataset:

dd = DetectionDataset()\
    .from_hub(name='fashionpedia')\
    .select(n_images=10000)\
    .split(splits=[0.8, 0.1, 0.1])
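
The idea behind selecting a subset and re-splitting it can be sketched in plain Python. This is only an illustration of the concept, assuming random sampling without replacement and contiguous slices for each split; the library's actual sampling strategy may differ, and `select_and_split` is a hypothetical helper, not part of its API.

```python
import random


def select_and_split(image_ids, n_images, proportions, seed=0):
    """Hypothetical helper: sample n_images ids, then assign them to train/val/test."""
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    selected = rng.sample(list(image_ids), n_images)
    n_train = round(n_images * proportions[0])
    n_val = round(n_images * proportions[1])
    return {
        "train": selected[:n_train],
        "val": selected[n_train:n_train + n_val],
        "test": selected[n_train + n_val:],  # the test split takes the remainder
    }


# Pretend the original dataset has 50,000 images with ids 0..49999.
splits = select_and_split(range(50_000), n_images=10_000, proportions=[0.8, 0.1, 0.1])
sizes = {name: len(ids) for name, ids in splits.items()}
print(sizes)  # {'train': 8000, 'val': 1000, 'test': 1000}
```

Splitting at the image level, rather than the annotation level, keeps all boxes of one image in the same split.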

3. Visualize

DetectionDataset objects expose several properties to analyze your data:

dd.data                     # This is equivalent to calling `dd.get_data('image')`,
                            # and returns a DataFrame with 1 row per image

dd.get_data('bbox')         # Returns a DataFrame with 1 row per annotation

dd.n_images                 # Number of images

dd.n_bbox                   # Number of annotations

dd.splits                   # List of split names

dd.split_proportions        # DataFrame with the % of images in each split

dd.categories               # DataFrame with the categories and their ids

dd.category_names           # List of categories

dd.n_categories             # Number of categories
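
The difference between the image-level and annotation-level views, and how counts such as the split proportions relate to them, can be illustrated without the library. The records below are invented, and plain dictionaries stand in for the DataFrames that detection_datasets returns.

```python
from collections import Counter

# Hypothetical annotation-level records: one entry per bounding box.
bbox_rows = [
    {"image_id": 1, "split": "train", "category": "shirt"},
    {"image_id": 1, "split": "train", "category": "shoe"},
    {"image_id": 2, "split": "train", "category": "shirt"},
    {"image_id": 3, "split": "val", "category": "hat"},
    {"image_id": 4, "split": "test", "category": "shoe"},
]

# Image-level view: collapse the boxes so that each image appears once.
image_split = {row["image_id"]: row["split"] for row in bbox_rows}

n_bbox = len(bbox_rows)      # number of annotations
n_images = len(image_split)  # number of distinct images
category_names = sorted({row["category"] for row in bbox_rows})

# Share of images (not boxes) that fall in each split.
counts = Counter(image_split.values())
split_proportions = {name: count / n_images for name, count in counts.items()}

print(n_images, n_bbox, category_names)
print(split_proportions)
```

Here image 1 carries two boxes, so the dataset has 5 annotations but only 4 images, and the proportions are computed over the 4 images.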

You can also visualize an image with its annotations in a notebook:

dd.show()                   # Shows a random image from the dataset
dd.show(image_id=42)        # Shows the selected image based on image_id

4. Write

Once the dataset is ready, you can write it to the local filesystem in a given format:

dd.to_disk(
    dataset_format='yolo',
    name='MY_DATASET_NAME',
    path='DIRECTORY_TO_WRITE_TO',
)

Currently supported formats for writing datasets are:

  • YOLO
  • MMDET
  • more to come
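
Converting between formats mostly comes down to changing the bounding-box convention. As a rough sketch, assuming the widely used YOLO label layout (`class x_center y_center width height`, all coordinates normalized to the image size) and COCO boxes given as `[x_min, y_min, width, height]` in pixels, the conversion looks like this. Both helper functions are hypothetical and not part of the detection_datasets API; the exact files the library writes may differ.

```python
def coco_to_yolo(bbox, image_width, image_height):
    """Convert a COCO box [x_min, y_min, w, h] in pixels to normalized YOLO (xc, yc, w, h)."""
    x_min, y_min, width, height = bbox
    x_center = (x_min + width / 2) / image_width
    y_center = (y_min + height / 2) / image_height
    return (x_center, y_center, width / image_width, height / image_height)


def yolo_label_line(class_index, bbox, image_width, image_height):
    """Format one annotation as a line of a YOLO label text file."""
    values = coco_to_yolo(bbox, image_width, image_height)
    return " ".join([str(class_index)] + [f"{v:.6f}" for v in values])


line = yolo_label_line(0, [100, 120, 200, 240], image_width=640, image_height=480)
print(line)  # 0 0.312500 0.500000 0.312500 0.500000
```

The image dimensions are required for this conversion, which is one reason the image-level metadata has to travel with the annotations.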

The dataset can also be easily uploaded to the Hugging Face Hub, for reuse later on or in a different environment:

dd.to_hub(dataset_name='MY_DATASET_NAME', repo_name='MY_REPO_OR_ORGANISATION')

The dataset viewer on the Hub will work out of the box, and we encourage you to update the README in your new repo to make it easier for the community to use the dataset.

