datadings is a collection of tools to prepare datasets for machine learning. It's easy to use, space-efficient, and blazingly fast.
It is based on two simple principles:
- Datasets are collections of individual data samples.
- Each sample is a dictionary with descriptive keys.
For supervised training with images, samples are dictionaries like this:

    {"key": unique_key, "image": imagedata, "label": label}
msgpack is used as an efficient storage format for most supported datasets.
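To illustrate the storage format, here is a minimal sketch of how such a sample round-trips through msgpack, using the msgpack package directly; the sample contents are made up for illustration, and datadings handles this serialization for you when writing and reading dataset files:

    import msgpack

    # a sample as described above: a dict with descriptive keys
    sample = {"key": "example_0001", "image": b"<raw image bytes>", "label": 3}

    packed = msgpack.packb(sample)      # serialize to compact bytes
    restored = msgpack.unpackb(packed)  # deserialize back to a dict
    assert restored == sample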
Check out the documentation for more details.
Supported datasets
Short descriptions of the supported datasets:

- Scene Parsing, Segmentation
- own Eye-Tracking dataset (Jalpa)
- Motion-based Segmentation
- MIT Saliency
- 32x32 color image classification with 10/100 classes
- Segmentation, Semantic understanding of urban street scenes
- Eye-Tracking, Saliency
- Eye-Tracking, Saliency
- Imagenet Large Scale Visual Recognition Challenge
- A superset of ILSVRC2012 with 11M images for 10450 classes
- Inria Aerial Image Labeling Dataset (Buildings), Segmentation, Remote Sensing
- Eye-Tracking, Saliency, Learning to predict where humans look
- Eye-Tracking, Saliency
- MIT Places, Scene Recognition
- MIT Places365, Scene Recognition
- High-Res Multispectral Semantic Segmentation, Remote Sensing
- Saliency in Context, Eye-Tracking
- Saliency in Context, Eye-Tracking
- Pascal Visual Object Classes Challenge
- Remote Sensing, Semantic Object Classification, Segmentation
- Yahoo Flickr Creative Commons 100M pictures
Command line tools
- datadings-write creates new dataset files.
- datadings-cat prints the (abbreviated) contents of a dataset file.
- datadings-shuffle shuffles an existing dataset file.
- datadings-merge merges two or more dataset files.
- datadings-split splits a dataset file into two or more subsets.
- datadings-bench runs some basic read performance benchmarks.
Basic usage
Each dataset defines modules to read and write in the datadings.sets package. For most datasets the reading module only contains additional metadata like class labels and distributions.
Let’s consider the MIT1003 dataset as an example.
MIT1003_write is an executable that creates dataset files. It can be called directly or through datadings-write. Three files will be written:
- MIT1003.msgpack contains the sample data
- MIT1003.msgpack.index contains the index for random access
- MIT1003.msgpack.md5 contains MD5 hashes of both files, which can be used to verify them (see the sketch below)
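The hash file makes it possible to detect corrupted files. Below is a minimal verification sketch with hashlib; it assumes the .md5 file uses the common md5sum format (hex digest, whitespace, filename per line), which is an assumption for illustration, not a documented guarantee:

    import hashlib

    # parse the .md5 file, assuming md5sum-style lines: "<digest> <filename>"
    expected = {}
    with open('MIT1003.msgpack.md5') as f:
        for line in f:
            digest, name = line.split()
            expected[name] = digest

    # hash the data file in chunks and compare against the stored digest
    md5 = hashlib.md5()
    with open('MIT1003.msgpack', 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            md5.update(chunk)
    assert md5.hexdigest() == expected['MIT1003.msgpack']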
Reading all samples sequentially, using a MsgpackReader as a context manager:
    from datadings.reader import MsgpackReader

    with MsgpackReader('MIT1003.msgpack') as reader:
        for sample in reader:
            ...  # do dataset things
This standard iterator returns dictionaries. Use the rawiter() method to get samples as msgpack-encoded bytes instead.
Reading specific samples:
    reader.seek_key('i14020903.jpeg')
    print(reader.next()['key'])
    reader.seek_index(100)
    print(reader.next()['key'])
Reading samples as raw bytes:
    raw = reader.rawnext()
    for raw in reader.rawiter():
        print(type(raw), len(raw))
Number of samples:
    print(len(reader))
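len() and seek_index() can be combined for strided access. For example, to look at every 100th sample (a usage sketch built from the methods shown above, not a dedicated API):

    with MsgpackReader('MIT1003.msgpack') as reader:
        for i in range(0, len(reader), 100):
            reader.seek_index(i)  # jump to sample i via the index
            sample = reader.next()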
You can also change the order and selection of iterated samples with augments. For example, to randomize the order of samples, wrap the reader in a Shuffler:
    from datadings.reader import Shuffler

    with Shuffler(MsgpackReader('MIT1003.msgpack')) as reader:
        for sample in reader:
            ...  # do dataset things, but in random order!
A common use case is to iterate over the whole dataset multiple times. This can be done with the Cycler:
    from datadings.reader import Cycler

    with Cycler(MsgpackReader('MIT1003.msgpack')) as reader:
        for sample in reader:
            ...  # do dataset things, but FOREVER!
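Since augments wrap readers and behave like readers themselves, they can be nested. A sketch, assuming the two augments compose as their interfaces suggest:

    from datadings.reader import Cycler, Shuffler

    # iterate forever, in random order
    with Cycler(Shuffler(MsgpackReader('MIT1003.msgpack'))) as reader:
        for sample in reader:
            ...  # do dataset things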