pipelime

data pipeline 101

Development Status
- 2 - Pre-Alpha
Intended Audience
- Developers
License
- OSI Approved :: GNU General Public License v3 (GPLv3)
Natural Language
- English
Programming Language

Project description

Data Pipeline 101

Installation

pip install pipelime

Basic Usage

Underfolder Format

The Underfolder format is one of the pipelime dataset formats: a flexible way to model and store a generic dataset on the filesystem.

An Underfolder dataset is a collection of samples. A sample is a collection of items. An item is a unitary block of data: it can be a generic tensor (e.g. a multi-channel image or a plain matrix), a dictionary, or more.

An Underfolder dataset must contain a subfolder named data, which holds the items of every sample. Optionally, you can store items directly in the root folder: they act as “global” items that are injected into each sample.
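As a rough illustration of the layout described above, the sketch below uses only the Python standard library to separate root-level ("global") files from the files inside the data subfolder. The helper name list_underfolder and the example file names are hypothetical, and this is not how pipelime itself scans a dataset.

```python
import tempfile
from pathlib import Path


def list_underfolder(root):
    """Return (global_files, sample_files) for an Underfolder-style root.

    Hypothetical helper for illustration only, not part of pipelime.
    """
    root = Path(root)
    data_dir = root / "data"
    # The format requires a 'data' subfolder; fail early if it is missing.
    if not data_dir.is_dir():
        raise FileNotFoundError(f"{root} has no 'data' subfolder")
    global_files = sorted(p.name for p in root.iterdir() if p.is_file())
    sample_files = sorted(p.name for p in data_dir.iterdir() if p.is_file())
    return global_files, sample_files


# Build a tiny throwaway layout to exercise the helper.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    (demo_root / "data").mkdir()
    (demo_root / "cfg.yml").write_text("name: demo\n")
    for name in ("000_label.txt", "001_label.txt"):
        (demo_root / "data" / name).write_text("1\n")
    global_files, sample_files = list_underfolder(demo_root)

print(global_files)
print(sample_files)
```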

Items are named using the following naming convention:

$ID_$ITEM.$EXT

Where:

$ID is the sample identifier; it must be a unique string for each sample.
$ITEM is the item name.
$EXT is the item extension. Currently supported extensions are:
- The most common image formats, such as PNG, JPEG, BMP, and many others.
- YAML and JSON for dictionary-like objects.
- TXT for numpy 2D matrices in plain-text notation.
- NPY and NPZ for numpy arrays.
- PKL for generic picklable python objects.

Root files follow the same convention but they lack the sample identifier part:

$ITEM.$EXT
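The two naming conventions can be sketched with a small standard-library parser. This is only an illustration: the names ItemName, parse_item_filename and group_items are hypothetical, and splitting on the first underscore is an assumption (it implies that sample identifiers contain no underscore, while item names such as image_mask may). Pipelime's own parsing may differ.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable, NamedTuple, Optional


class ItemName(NamedTuple):
    sample_id: Optional[str]  # None for root ("global") files
    item: str
    ext: str


def parse_item_filename(filename: str, is_root: bool = False) -> ItemName:
    """Split '$ID_$ITEM.$EXT' (or '$ITEM.$EXT' for root files) into parts."""
    path = PurePosixPath(filename)
    stem, ext = path.stem, path.suffix.lstrip(".")
    if not stem or not ext:
        raise ValueError(f"missing name or extension: {filename!r}")
    if is_root:
        return ItemName(None, stem, ext)
    # Assumption: the sample id ends at the FIRST underscore.
    sample_id, sep, item = stem.partition("_")
    if not (sep and sample_id and item):
        raise ValueError(f"not in $ID_$ITEM.$EXT form: {filename!r}")
    return ItemName(sample_id, item, ext)


def group_items(
    data_files: Iterable[str], root_files: Iterable[str] = ()
) -> Dict[str, Dict[str, str]]:
    """Map sample id -> {item name -> file name}, injecting root items."""
    global_items = {parse_item_filename(f, is_root=True).item: f for f in root_files}
    samples: Dict[str, Dict[str, str]] = {}
    for f in data_files:
        parsed = parse_item_filename(f)
        # Each new sample starts from its own copy of the global items.
        samples.setdefault(parsed.sample_id, dict(global_items))[parsed.item] = f
    return samples


samples = group_items(
    ["000_image.png", "000_image_mask.png", "001_image.png"],
    root_files=["cfg.yml"],
)
print(samples["000"])
```

Here the root file cfg.yml shows up in every sample under the item name cfg, which mirrors the "global" item behavior of root files.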

Reading an Underfolder Dataset

Pipelime provides an intuitive interface to read, manipulate and write Underfolder Datasets. You don’t have to memorize complex signatures, instantiate weird object iterators, or write tens of lines of boilerplate code. It all boils down to a reader, a writer and objects that behave like built-in python types such as lists and dictionaries.

from pipelime.sequences.readers.filesystem import UnderfolderReader

# Read an underfolder dataset with a single line of code
dataset = UnderfolderReader('tests/sample_data/datasets/underfolder_minimnist')

# A dataset behaves like a Sequence
len(dataset) # The number of samples (20)
sample = dataset[4] # Get a Sample from the dataset

# A Sample is a MutableMapping
len(sample) # The number of items (10)
set(sample.keys()) # The set of all the item names {'cfg', 'image', 'image_mask', ...}
item = sample['image'] # Get an item from the sample

# An item can be any python object, depending on which extension is used to store it.
type(item) # numpy.ndarray
item.shape # (28, 28, 3)
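Because a dataset behaves like a Sequence and a sample like a MutableMapping, helper code can be written against those generic interfaces alone. The sketch below counts how often each item name occurs; count_item_names is a hypothetical helper, and a plain list of dictionaries stands in for a real dataset so that the snippet runs without pipelime or any sample data. Passing an UnderfolderReader instead should work under the assumption that it iterates like any other Sequence.

```python
from collections import Counter
from collections.abc import Sequence


def count_item_names(dataset: Sequence) -> Counter:
    """Count how many samples contain each item name."""
    counts = Counter()
    for sample in dataset:  # Sequence protocol: iterate over the samples
        counts.update(sample.keys())  # Mapping protocol: item names
    return counts


# Stand-in for a real dataset: a list of dict-like samples.
fake_dataset = [
    {"image": [[0, 1], [1, 0]], "label": 3},
    {"image": [[1, 1], [0, 0]], "label": 7, "cfg": {"lr": 0.1}},
]

counts = count_item_names(fake_dataset)
print(counts)
```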

Writing an Underfolder Dataset

You can write a dataset by simply creating and running a writer object.

from pipelime.sequences.writers.filesystem import UnderfolderWriter

# Create the writer object from a destination path
writer = UnderfolderWriter('/tmp/my_output_dataset')
# Write the dataset to file system
writer(dataset)

By default, UnderfolderWriter saves every item with the extension it was originally read with. If for any reason it cannot retrieve the original extension, it falls back to pickle to serialize the object.

If you don’t want to use pickle, you can choose a custom extension for each item name. You can also choose which items are going to be saved as root files (if the contained data is the same for all samples).

from pipelime.sequences.writers.filesystem import UnderfolderWriter

# These items are going to be saved as root files
root_files = ['cfg', 'numbers', 'pose']

# Associate a custom extension to each item name
extensions = {
    'image': 'jpg',
    'image_mask': 'png',
    'image_maskinv': 'png',
    'label': 'txt',
    'metadata': 'json',
    'metadatay': 'yml',
    'points': 'txt',
    'numbers': 'txt',
    'pose': 'txt',
    'cfg': 'yml',
}

# Create a customized writer object
writer = UnderfolderWriter(
    '/tmp/my_output_dataset',
    root_files_keys=root_files,
    extensions_map=extensions,
)
# Write the dataset to file system
writer(dataset)
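The extension choice described in this section can be summarized as a small decision function. This is a sketch, not UnderfolderWriter's actual code: resolve_extension and its parameters are hypothetical, and giving the custom map priority over the original extension is an assumption based on the description above.

```python
from typing import Mapping, Optional


def resolve_extension(
    item_name: str,
    extensions_map: Mapping[str, str],
    original_ext: Optional[str] = None,
    fallback: str = "pkl",
) -> str:
    """Pick the extension used to store an item (illustrative only)."""
    # 1. Assumed priority: an explicit per-item choice wins.
    if item_name in extensions_map:
        return extensions_map[item_name]
    # 2. Otherwise reuse the extension the item was read with, if known.
    if original_ext:
        return original_ext
    # 3. Last resort: serialize the object with pickle.
    return fallback


custom = {"image": "jpg", "cfg": "yml"}
print(resolve_extension("image", custom, original_ext="png"))
print(resolve_extension("label", custom, original_ext="txt"))
print(resolve_extension("points", custom))
```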

Release history

- 0.1.8 (this version): May 29, 2022
- 0.1.7: Apr 12, 2022
- 0.1.6: Apr 1, 2022
- 0.1.5: Mar 15, 2022
- 0.1.4: Mar 3, 2022
- 0.1.3: Feb 21, 2022
- 0.1.2: Nov 26, 2021
- 0.1.1: May 4, 2021
- 0.1.0: Mar 19, 2021
- 0.0.1: Mar 19, 2021

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pipelime-0.1.8.tar.gz (237.7 kB), uploaded May 29, 2022 (Source)

Built Distribution

pipelime-0.1.8-py2.py3-none-any.whl (125.8 kB), uploaded May 29, 2022 (Python 2, Python 3)

Hashes for pipelime-0.1.8.tar.gz

Algorithm	Hash digest
SHA256	`fe7feb1df3634869f09436f51335f24d462b59bda68a6cf6a6db779d84e08a14`
MD5	`e2f7452c589521799ff35a26542b2d33`
BLAKE2b-256	`6cbd27f12472c12fa3a21d3fb3d028f6abde6476580dbf092f692e2f7a1d5add`

Hashes for pipelime-0.1.8-py2.py3-none-any.whl

Algorithm	Hash digest
SHA256	`22cbab7a5bf88f03031e6de56b97302a8cc8a3de71a69b45cf5ba3f1cb36739b`
MD5	`4746f4306ba25b63ed24c515a6c19e9e`
BLAKE2b-256	`7fbc4d02204b667f52c2ae32fb0de2b592d0dcfb32ea3db0ab41a1ff07e57701`