
datadings is a collection of tools to prepare datasets for machine learning. It's easy to use, space-efficient, and blazingly fast. It is based on two simple principles:

  • Datasets are collections of individual data samples.

  • Each sample is a dictionary with descriptive keys.

For supervised training with images, samples are dictionaries like this:

{"key": unique_key, "image": imagedata, "label": label}

msgpack is used as an efficient storage format for most supported datasets.
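
For illustration, here is how a sample like the one above round-trips through the msgpack Python package; the field values below are made up, and in practice datadings handles the packing and unpacking for you:

import msgpack

# A made-up sample in the datadings format: a dict with descriptive keys.
sample = {"key": "example_0001", "image": b"<raw image bytes>", "label": 3}

# Serialize to compact msgpack bytes and decode back into a dictionary.
packed = msgpack.packb(sample)
restored = msgpack.unpackb(packed, raw=False)
print(len(packed), restored["key"], restored["label"])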

Check out the documentation for more details.

Supported datasets

Dataset        | Short Description
ADE20k         | Scene Parsing, Segmentation
ANP460         | Our own eye-tracking dataset (Jalpa)
CAMVID         | Motion-based Segmentation
CAT2000        | MIT Saliency
CIFAR          | 32x32 color image classification with 10/100 classes
Cityscapes     | Segmentation, Semantic understanding of urban street scenes
Coutrot1       | Eye-Tracking, Saliency
FIGRIMFixation | Eye-Tracking, Saliency
ILSVRC2012     | ImageNet Large Scale Visual Recognition Challenge
ImageNet21k    | A superset of ILSVRC2012 with 11 M images for 10450 classes
InriaBuildings | Inria Aerial Image Labeling Dataset (buildings), Segmentation, Remote Sensing
MIT1003        | Eye-Tracking, Saliency, Learning to predict where humans look
MIT300         | Eye-Tracking, Saliency
Places2017     | MIT Places, Scene Recognition
Places365      | MIT Places365, Scene Recognition
RIT18          | High-res Multispectral Semantic Segmentation, Remote Sensing
SALICON2015    | Saliency in Context, Eye-Tracking
SALICON2017    | Saliency in Context, Eye-Tracking
VOC2012        | Pascal Visual Object Classes Challenge
Vaihingen      | Remote Sensing, Semantic Object Classification, Segmentation
YFCC100m       | Yahoo Flickr Creative Commons, 100 million pictures

Command line tools

  • datadings-write creates new dataset files.

  • datadings-cat prints the (abbreviated) contents of a dataset file.

  • datadings-shuffle shuffles an existing dataset file.

  • datadings-merge merges two or more dataset files.

  • datadings-split splits a dataset file into two or more subsets.

  • datadings-bench runs some basic read performance benchmarks.

Basic usage

Each dataset defines modules to read and write in the datadings.sets package. For most datasets, the reading module contains only additional metadata like class labels and distributions.

Let’s consider the MIT1003 dataset as an example.

MIT1003_write is an executable that creates dataset files. It can be called directly or through datadings-write. Three files will be written:

  • MIT1003.msgpack contains the sample data

  • MIT1003.msgpack.index contains the index for random access

  • MIT1003.msgpack.md5 contains MD5 hashes of both files
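
As a quick sanity check after writing, you can recompute the MD5 hashes of the first two files with the standard library and compare them by eye against the values stored in MIT1003.msgpack.md5 (a minimal sketch, assuming the files sit in the current directory):

import hashlib

def md5sum(path, chunk_size=1 << 20):
    # Read the file in chunks so large datasets do not need to fit in memory.
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

for name in ('MIT1003.msgpack', 'MIT1003.msgpack.index'):
    print(name, md5sum(name))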

Reading all samples sequentially, using a MsgpackReader as a context manager:

from datadings.reader import MsgpackReader
with MsgpackReader('MIT1003.msgpack') as reader:
    for sample in reader:
        # do dataset things
        print(sample['key'])

This standard iterator returns dictionaries. Use the rawiter() method to get samples as msgpack-encoded bytes instead.

Reading specific samples:

reader.seek_key('i14020903.jpeg')
print(reader.next()['key'])
reader.seek_index(100)
print(reader.next()['key'])

Reading samples as raw bytes:

raw = reader.rawnext()
for raw in reader.rawiter():
    print(type(raw), len(raw))
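
Because samples are stored as msgpack (see above), raw bytes can be decoded back into the usual dictionaries with the msgpack package. The standard iterator already does this decoding for you; this sketch only shows what the bytes contain, assuming the default msgpack encoding:

import msgpack
from datadings.reader import MsgpackReader

with MsgpackReader('MIT1003.msgpack') as reader:
    raw = reader.rawnext()
    # Decode the msgpack bytes back into a sample dictionary.
    sample = msgpack.unpackb(raw, raw=False)
    print(sorted(sample.keys()))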

Number of samples:

print(len(reader))

You can also change the order and selection of iterated samples with augments. For example, to randomize the order of samples, wrap the reader in a Shuffler:

from datadings.reader import MsgpackReader, Shuffler
with Shuffler(MsgpackReader('MIT1003.msgpack')) as reader:
    for sample in reader:
        # do dataset things, but in random order!
        print(sample['key'])

A common use case is to iterate over the whole dataset multiple times. This can be done with the Cycler:

from datadings.reader import Cycler, MsgpackReader
with Cycler(MsgpackReader('MIT1003.msgpack')) as reader:
    for sample in reader:
        # do dataset things, but FOREVER!
        print(sample['key'])
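
Augments take a reader and behave like one themselves, so they can presumably be nested. A sketch that combines the two wrappers above to iterate endlessly over shuffled samples (the break is only there to end the example):

from datadings.reader import Cycler, MsgpackReader, Shuffler

with Cycler(Shuffler(MsgpackReader('MIT1003.msgpack'))) as reader:
    for step, sample in enumerate(reader):
        print(step, sample['key'])
        if step >= 9:
            # stop the endless iteration after a handful of samples
            break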
