# xt-cvdata

## Description
This repo contains utilities for building and working with computer vision datasets, developed by Xtract AI.
So far, APIs for the following open-source datasets are included:
- COCO 2017 (detection and segmentation): `xt_cvdata.apis.COCO`
- Open Images V5 (detection and segmentation): `xt_cvdata.apis.OpenImages`
- Visual Object Tagging Tool (VoTT) CSV output (detection): `xt_cvdata.apis.VoTTCSV`
More to come.
## Installation

From PyPI:

```sh
pip install xt-cvdata
```

From source:

```sh
git clone https://github.com/XtractTech/xt-cvdata.git
pip install ./xt-cvdata
```
## Usage

See specific help on a dataset class using `help`. E.g., `help(xt_cvdata.apis.COCO)`.
### Building a dataset
```python
from xt_cvdata.apis import COCO, OpenImages

# Build an object populated with the COCO image list, categories, and annotations
coco = COCO('/nasty/data/common/COCO_2017')
print(coco)
print(coco.class_distribution)

# Same for Open Images
oi = OpenImages('/nasty/data/common/open_images_v5')
print(oi)
print(oi.class_distribution)

# Get just the person classes
coco.subset(['person'])
oi.subset(['Person']).rename({'Person': 'person'})

# Merge and build
merged = coco.merge(oi)
merged.build('./data/new_dataset_dir')
```
This package follows PyTorch-style method chaining: methods that operate on an object modify it in place, but also return the modified object so calls can be chained. The exception is the `merge()` method, which does not modify in place and instead returns a new merged object. Hence, the above operations can also be completed using:
```python
from xt_cvdata.apis import COCO, OpenImages

merged = (
    COCO('/nasty/data/common/COCO_2017')
        .subset(['person'])
        .merge(
            OpenImages('/nasty/data/common/open_images_v5')
                .subset(['Person'])
                .rename({'Person': 'person'})
        )
)

merged.build('./data/new_dataset_dir')
```
In practice, a style somewhere between these two approaches will probably be most readable.
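The chaining convention can be illustrated with a minimal stand-in class (hypothetical; not part of xt-cvdata): in-place methods mutate the object and return `self`, while a merge-style method leaves its inputs untouched and returns a new object.

```python
class Dataset:
    """Toy stand-in for a chainable dataset builder (not the real xt_cvdata API)."""

    def __init__(self, classes):
        self.classes = set(classes)

    def subset(self, keep):
        # In-place: mutate self, then return self so calls can be chained
        self.classes &= set(keep)
        return self

    def rename(self, mapping):
        # In-place: apply a class-name mapping, then return self
        self.classes = {mapping.get(c, c) for c in self.classes}
        return self

    def merge(self, other):
        # Not in-place: build and return a new object, leaving both inputs unchanged
        return Dataset(self.classes | other.classes)


coco = Dataset(['person', 'car']).subset(['person'])
oi = Dataset(['Person']).rename({'Person': 'person'})
merged = coco.merge(oi)

print(sorted(merged.classes))  # → ['person']
print(sorted(coco.classes))    # → ['person'] (unchanged by merge)
```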
The current set of dataset operations is:

- `analyze`: recalculate dataset statistics (e.g., class distributions, train/val split)
- `verify_schema`: check if class attributes follow the required schema
- `subset`: remove all but a subset of classes from the dataset
- `rename`: rename/combine dataset classes
- `sample`: sample a specified number of images from the train and validation sets
- `split`: define the proportion of data in the validation set
- `merge`: merge two datasets together, returning the merged dataset
- `build`: create the currently defined dataset using either symlinks or by copying images
### Implementing a new dataset type

New dataset types should inherit from the base `xt_cvdata.Builder` class. See the `Builder`, `COCO`, and `OpenImages` classes as a guide. Specifically, the class initializer should define `info`, `licenses`, `categories`, `annotations`, and `images` attributes such that `self.verify_schema()` runs without error. This ensures that all of the methods defined in the `Builder` class will operate correctly on the inheriting class.
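The kind of check `verify_schema` performs can be sketched as follows. The attribute names match those listed above, but the expected types here are assumptions for illustration, not the library's actual schema.

```python
# Hypothetical sketch of a schema check; not the real xt_cvdata implementation.
REQUIRED_ATTRS = {
    'info': dict,
    'licenses': list,
    'categories': list,
    'annotations': list,
    'images': list,
}

def verify_schema(obj):
    """Check that obj defines each required attribute with a plausible type."""
    for name, expected_type in REQUIRED_ATTRS.items():
        if not hasattr(obj, name):
            raise AttributeError(f"missing required attribute: {name}")
        if not isinstance(getattr(obj, name), expected_type):
            raise TypeError(f"{name} must be a {expected_type.__name__}")

class MinimalDataset:
    """Smallest object that satisfies the sketched schema."""
    def __init__(self):
        self.info = {'description': 'example'}
        self.licenses = []
        self.categories = []
        self.annotations = []
        self.images = []

verify_schema(MinimalDataset())  # passes silently when the schema holds
```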
## Data Sources

[descriptions and links to data]

## Dependencies/Licensing

[list of dependencies and their licenses, including data]

## References

[list of references]