
Data Pipeline 101


Installation

pip install pipelime

Basic Usage

Underfolder Format

The Underfolder format is one of the pipelime dataset formats: a flexible way to model and store a generic dataset on the filesystem.

(Figure: Underfolder structure)

An Underfolder dataset is a collection of samples. A sample is a collection of items. An item is a unitary block of data: it can be a generic tensor (e.g. a multi-channel image or a plain matrix), a dictionary, or more.

Underfolder datasets must contain a subfolder named data that actually contains the samples and items. Optionally, you can store items directly in the root folder; they act as “global” items injected into every sample.

Naming Convention

Items are named according to the following convention:

$ID_$ITEM.$EXT

Where:

  • $ID is the sample identifier; it must be a unique string for each sample.

  • $ITEM is the item name.

  • $EXT is the item extension. Currently supported extensions are:

    • The most common image formats like PNG, JPEG, BMP, and many others…

    • YAML and JSON for dictionary-like objects.

    • TXT for 2D numpy matrices in plain-text notation.

    • NPY and NPZ for numpy arrays.

    • PKL for generic picklable Python objects.

Root files follow the same convention, but lack the sample identifier (see the example layout below):

$ITEM.$EXT
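
For example, a minimal Underfolder layout might look like this (all file and item names are purely illustrative):

underfolder_dataset/
├── cfg.yml                  # root file: a “global” item injected into every sample
└── data/
    ├── 000_image.png        # sample “000”, item “image”
    ├── 000_metadata.json    # sample “000”, item “metadata”
    ├── 001_image.png
    └── 001_metadata.json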

Reading an Underfolder Dataset

Pipelime provides an intuitive interface to read, manipulate, and write Underfolder datasets. You don’t have to memorize complex signatures, instantiate weird object iterators, or write tens of lines of boilerplate code. It all boils down to a reader, a writer, and objects that behave like built-in Python types such as lists and dictionaries.

from pipelime.sequences.readers.filesystem import UnderfolderReader

# Read an underfolder dataset with a single line of code
dataset = UnderfolderReader('tests/sample_data/datasets/underfolder_minimnist')

# A dataset behaves like a Sequence
len(dataset) # The number of samples (20)
sample = dataset[4] # Get a Sample from the dataset

# A Sample is a MutableMapping
len(sample) # The number of items (10)
set(sample.keys()) # The set of all the item names {'cfg', 'image', 'image_mask', ...}
item = sample['image'] # Get an item from the sample

# An item can be any python object, depending on which extension is used to store it.
type(item) # numpy.ndarray
item.shape # (28, 28, 3)
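
Because a dataset is a plain Sequence and each sample a MutableMapping, standard Python iteration works out of the box. A minimal sketch, assuming the same demo dataset as above:

from pipelime.sequences.readers.filesystem import UnderfolderReader

# Read the demo dataset and loop over its samples
dataset = UnderfolderReader('tests/sample_data/datasets/underfolder_minimnist')
for sample in dataset:
    image = sample['image']  # numpy.ndarray, decoded from the stored image file
    print(image.shape)       # e.g. (28, 28, 3)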

Writing an Underfolder Dataset

You can write a dataset by creating a writer object and calling it on the dataset.

from pipelime.sequences.writers.filesystem import UnderfolderWriter

# Create the writer object from a destination path
writer = UnderfolderWriter('/tmp/my_output_dataset')
# Write the dataset to file system
writer(dataset)

By default, UnderfolderWriter saves every item with the extension it was originally read with. If for any reason it cannot retrieve the original extension, it falls back to pickle to serialize the object.

If you don’t want to fall back to pickle, you can choose a custom extension for each item name. You can also choose which items are saved as root files (provided the contained data is the same for all samples).

from pipelime.sequences.writers.filesystem import UnderfolderWriter

# These items are going to be saved as root files
root_files = ['cfg', 'numbers', 'pose']

# Associate a custom extension to each item name
extensions = {
    'image': 'jpg',
    'image_mask': 'png',
    'image_maskinv': 'png',
    'label': 'txt',
    'metadata': 'json',
    'metadatay': 'yml',
    'points': 'txt',
    'numbers': 'txt',
    'pose': 'txt',
    'cfg': 'yml',
}

# Create a customized writer object
writer = UnderfolderWriter(
    '/tmp/my_output_dataset',
    root_files_keys=root_files,
    extensions_map=extensions,
)
# Write the dataset to file system
writer(dataset)
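
After writing, you can sanity-check what actually landed on disk using nothing but the standard library. A quick sketch (the printed names are illustrative; the exact ID format depends on the dataset):

from pathlib import Path

# List a few of the files produced by the writer
for path in sorted(Path('/tmp/my_output_dataset/data').iterdir())[:5]:
    print(path.name)  # e.g. '0_image.jpg', '0_label.txt', ...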
