simple library to manage a dataset of shards to train machine learning models

iden

Overview

iden is a simple Python library for managing a dataset of shards when training a machine learning model. iden loads shard data lazily: a shard's data is read only when it is requested, so shards can be created, organized, and passed around without loading their data into memory. iden supports several formats for storing shards on disk.

Key Features

  • Lazy Loading: Shards are loaded only when needed, enabling efficient memory management
  • Multiple Formats: Support for JSON, YAML, Pickle, PyTorch, safetensors, and more
  • Flexible Dataset Management: Organize data into splits (train/val/test) with associated assets
  • URI-based Identification: Each shard has a unique URI for easy persistence and loading
  • Caching Support: Optional in-memory caching for frequently accessed shards
  • Extensible: Easy to add custom shard types and loaders

Quick Example

import tempfile
from pathlib import Path
from iden.dataset import create_vanilla_dataset
from iden.shard import create_json_shard, create_shard_dict, create_shard_tuple

# Create a simple dataset
with tempfile.TemporaryDirectory() as tmpdir:
    # Create shards
    train_tuple = create_shard_tuple(
        [
            create_json_shard(
                [1, 2, 3], uri=Path(tmpdir).joinpath("train1.json").as_uri()
            ),
            create_json_shard(
                [4, 5, 6], uri=Path(tmpdir).joinpath("train2.json").as_uri()
            ),
        ],
        uri=Path(tmpdir).joinpath("train_tuple").as_uri(),
    )
    val_tuple = create_shard_tuple(
        [create_json_shard([7, 8, 9], uri=Path(tmpdir).joinpath("val1.json").as_uri())],
        uri=Path(tmpdir).joinpath("val_tuple").as_uri(),
    )

    # Organize shards into splits
    shards = create_shard_dict(
        shards={"train": train_tuple, "val": val_tuple},
        uri=Path(tmpdir).joinpath("shards").as_uri(),
    )
    assets = create_shard_dict(shards={}, uri=Path(tmpdir).joinpath("assets").as_uri())

    # Create dataset
    dataset = create_vanilla_dataset(
        shards=shards,
        assets=assets,
        uri=Path(tmpdir).joinpath("my_dataset").as_uri(),
    )

    # Access data
    train_shards = dataset.get_shards("train")
    print(train_shards[0].get_data())  # Output: [1, 2, 3]

Installation

We highly recommend installing iden in a virtual environment. iden can be installed from PyPI with uv using the following command:

uv pip install iden

To keep the installation as slim as possible, only the packages required to use iden are installed by default. To also install all the optional dependencies, use the following command:

uv pip install "iden[all]"

Please check the get started page to see how to install only specific dependencies, or for alternative ways to install the library.
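
Because the optional dependencies are not installed by default, it can help to check which of them are importable in the current environment. The snippet below is a minimal sketch that uses only the standard library. The module list is an assumption taken from the optional dependencies marked with * in the compatibility table of this page (pyyaml is imported as `yaml`), and `available_modules` is a hypothetical helper, not part of iden.

```python
from importlib.util import find_spec

# Import names of the optional dependencies (assumption: pyyaml is imported as "yaml").
OPTIONAL_MODULES = ("cloudpickle", "joblib", "numpy", "safetensors", "torch", "yaml")


def available_modules(names: tuple[str, ...] = OPTIONAL_MODULES) -> dict[str, bool]:
    """Map each top-level module name to whether it can be found for import."""
    return {name: find_spec(name) is not None for name in names}


# Report the status of each optional dependency.
for name, found in available_modules().items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

`find_spec` only locates a top-level module without importing it, so the check stays cheap even for large packages such as torch.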

Documentation

Basic Usage

Working with Shards

from iden.shard import create_json_shard

# Create a shard
shard = create_json_shard(data={"key": "value"}, uri="file:///path/to/data.json")

# Get data from shard
data = shard.get_data()

# Cache data for faster access
data = shard.get_data(cache=True)
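
To illustrate what lazy loading and caching mean in practice, here is a minimal standard-library sketch. `LazyJsonShard` and its `num_reads` counter are hypothetical names used only for illustration. This is not how iden implements shards, only an example of the general pattern under the assumption that a shard is backed by a single JSON file.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


class LazyJsonShard:
    """Hypothetical stand-in for a shard backed by a JSON file (not part of iden)."""

    def __init__(self, path: Path) -> None:
        self._path = path
        self._cached: Any = None
        self._is_cached = False
        self.num_reads = 0  # number of times the file was actually read

    def get_data(self, cache: bool = False) -> Any:
        # Serve the data from memory if an earlier call cached it.
        if self._is_cached:
            return self._cached
        self.num_reads += 1
        with self._path.open() as file:
            data = json.load(file)
        if cache:
            self._cached = data
            self._is_cached = True
        return data


with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir) / "data.json"
    path.write_text(json.dumps({"key": "value"}))

    shard = LazyJsonShard(path)  # creating the shard does not read the file
    reads_after_init = shard.num_reads
    first = shard.get_data(cache=True)  # reads the file and keeps the data in memory
    second = shard.get_data()  # served from the in-memory copy
    reads_after_access = shard.num_reads
```

In this sketch the file is read once even though the data is requested twice, which is the trade-off caching offers: faster repeated access in exchange for keeping the data in memory.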

Managing Datasets

from iden.dataset import create_vanilla_dataset
from iden.shard import create_json_shard, create_shard_dict, create_shard_tuple

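# shard1, ..., shard5 are placeholders for shards created beforehand,
# for example with create_json_shard
# Create a dataset with train/val splits
train_tuple = create_shard_tuple([shard1, shard2, shard3], uri="file:///train_tuple")
val_tuple = create_shard_tuple([shard4, shard5], uri="file:///val_tuple")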

shards = create_shard_dict(
    shards={"train": train_tuple, "val": val_tuple},
    uri="file:///shards",
)
assets = create_shard_dict(shards={}, uri="file:///assets")

dataset = create_vanilla_dataset(
    shards=shards,
    assets=assets,
    uri="file:///path/to/dataset",
)

# Access shards
train_shards = dataset.get_shards("train")
first_shard_data = train_shards[0].get_data()

The following tables list the iden versions and the dependency versions they were tested with.

iden   coola          objectory   numpy*       pyyaml*     safetensors*  torch*      python
main   >=1.1,<2.0     >=0.3,<1.0  >=1.24,<2.0  >=6.0,<7.0  >=0.6,<1.0    >=2.0,<3.0  >=3.10
0.4.1  >=1.1,<2.0     >=0.3,<1.0  >=1.24,<2.0  >=6.0,<7.0  >=0.6,<1.0    >=2.0,<3.0  >=3.10
0.4.0  >=1.0,<2.0     >=0.3,<1.0  >=1.24,<2.0  >=6.0,<7.0  >=0.6,<1.0    >=2.0,<3.0  >=3.10
0.3.0  >=0.11.0,<1.0  >=0.3,<1.0  >=1.24,<2.0  >=6.0,<7.0  >=0.6,<1.0    >=2.0,<3.0  >=3.10
0.2.0  >=0.8.4,<1.0   >=0.2,<1.0  >=1.22,<2.0  >=6.0,<7.0  >=0.4,<1.0    >=2.0,<3.0  >=3.9,<3.14
0.1.0  >=0.8.4,<1.0   >=0.2,<1.0  >=1.22,<2.0  >=6.0,<7.0  >=0.4,<1.0    >=2.0,<3.0  >=3.9,<3.14

iden   cloudpickle*  joblib*
main   >=3.0,<4.0    >=1.3,<2.0
0.4.0  >=3.0,<4.0    >=1.3,<2.0
0.3.0  >=3.0,<4.0    >=1.3,<2.0

* indicates an optional dependency

Older versions

iden   coola       objectory   numpy*       pyyaml*     safetensors*  torch*      python
0.0.4  >=0.3,<1.0  >=0.1,<1.0  >=1.22,<2.0  >=6.0,<7.0  >=0.4,<1.0    >=2.0,<3.0  >=3.9,<3.13
0.0.3  >=0.3,<1.0  >=0.1,<1.0  >=1.22,<2.0  >=6.0,<7.0  >=0.4,<1.0    >=2.0,<3.0  >=3.9,<3.12
0.0.2  >=0.4,<1.0  >=0.1,<1.0  >=1.22,<2.0  >=6.0,<7.0  >=0.4,<1.0    >=2.0,<2.1  >=3.9,<3.12
0.0.1  >=0.4,<1.0  >=0.1,<1.0  >=1.22,<2.0  >=6.0,<7.0  >=0.4,<1.0    >=2.0,<2.1  >=3.9,<3.12
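
The ranges listed above can be compared against the versions installed in an environment. The following is a rough sketch, not an official tool: `parse_version` and `in_range` are hypothetical helpers that only understand the `>=` and `<` clauses used in these tables, and they ignore pre-release and local version suffixes. For anything more involved, a full version-specifier parser would be a better fit. The `TESTED` mapping copies the required dependencies from the 0.4.1 row.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_version(text: str) -> tuple[int, ...]:
    """Keep the leading numeric part, e.g. '2.1.0+cu118' -> (2, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"cannot parse version: {text!r}")
    parts = [int(part) for part in match.group().split(".")]
    # Drop trailing zeros so that '1.1' and '1.1.0' compare as equal.
    while parts and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def in_range(installed: str, spec: str) -> bool:
    """Check a version against a spec made only of '>=' and '<' clauses."""
    current = parse_version(installed)
    for clause in spec.split(","):
        clause = clause.strip()
        if clause.startswith(">="):
            if current < parse_version(clause[2:]):
                return False
        elif clause.startswith("<"):
            if current >= parse_version(clause[1:]):
                return False
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True


# Required dependencies from the 0.4.1 row of the table.
TESTED = {"coola": ">=1.1,<2.0", "objectory": ">=0.3,<1.0"}

for name, spec in TESTED.items():
    try:
        installed = version(name)
    except PackageNotFoundError:
        print(f"{name}: not installed")
        continue
    status = "ok" if in_range(installed, spec) else "outside tested range"
    print(f"{name} {installed}: {status} ({spec})")
```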

Contributing

Please check the instructions in CONTRIBUTING.md.

Suggestions and Communication

Everyone is welcome to contribute to the community. For any questions or suggestions, please open an issue on GitHub Issues. All issues will be addressed as soon as possible.

API stability

Warning: While iden is in the development stage, no API is guaranteed to be stable from one release to the next. In fact, it is very likely that the API will change multiple times before a stable 1.0.0 release. In practice, this means that upgrading iden to a new version may break code that was written against an older version.

License

iden is licensed under the BSD 3-Clause "New" or "Revised" license, available in the LICENSE file.
