simple library to manage a dataset of shards to train machine learning models
iden
Overview
iden is a simple Python library for managing a dataset of shards when training a machine learning
model.
iden loads each shard's data lazily, so shards can be organized and referenced without
loading their data into memory.
iden supports several formats for storing shards on disk.
Key Features
- Lazy Loading: Shards are loaded only when needed, enabling efficient memory management
- Multiple Formats: Support for JSON, YAML, Pickle, PyTorch, safetensors, and more
- Flexible Dataset Management: Organize data into splits (train/val/test) with associated assets
- URI-based Identification: Each shard has a unique URI for easy persistence and loading
- Caching Support: Optional in-memory caching for frequently accessed shards
- Extensible: Easy to add custom shard types and loaders
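The lazy-loading and caching bullets can be illustrated with a small standard-library sketch. It does not use iden and is not iden's implementation: `LazyShard`, `load_json`, and `fake_disk` are hypothetical names, and a dictionary of JSON strings stands in for files on disk. Only the `get_data(cache=True)` call shape mirrors the usage shown in this README.

```python
import json
from collections.abc import Callable
from typing import Any


class LazyShard:
    """Hypothetical stand-in for a shard; NOT part of iden's API."""

    def __init__(self, uri: str, loader: Callable[[str], Any]) -> None:
        self.uri = uri  # identifies the shard; no data is read here
        self._loader = loader
        self._data: Any = None
        self._is_cached = False

    def get_data(self, cache: bool = False) -> Any:
        # Serve from memory if an earlier call asked for caching.
        if self._is_cached:
            return self._data
        data = self._loader(self.uri)
        if cache:
            self._data, self._is_cached = data, True
        return data


# A dict of JSON strings stands in for files on disk (illustration only).
fake_disk = {"file:///data/train1.json": "[1, 2, 3]"}
load_calls = []


def load_json(uri: str) -> Any:
    load_calls.append(uri)  # record every simulated disk read
    return json.loads(fake_disk[uri])


shard = LazyShard("file:///data/train1.json", loader=load_json)
calls_after_creation = len(load_calls)  # nothing has been loaded yet
first = shard.get_data(cache=True)  # first access triggers the load
second = shard.get_data()  # served from the in-memory cache
calls_after_access = len(load_calls)
```

In this sketch, creating the shard object touches no data, and the loader runs only on the first `get_data` call.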
Quick Example
import tempfile
from pathlib import Path

from iden.dataset import create_vanilla_dataset
from iden.shard import create_json_shard, create_shard_dict, create_shard_tuple

# Create a simple dataset
with tempfile.TemporaryDirectory() as tmpdir:
    # Create shards
    train_tuple = create_shard_tuple(
        [
            create_json_shard(
                [1, 2, 3], uri=Path(tmpdir).joinpath("train1.json").as_uri()
            ),
            create_json_shard(
                [4, 5, 6], uri=Path(tmpdir).joinpath("train2.json").as_uri()
            ),
        ],
        uri=Path(tmpdir).joinpath("train_tuple").as_uri(),
    )
    val_tuple = create_shard_tuple(
        [create_json_shard([7, 8, 9], uri=Path(tmpdir).joinpath("val1.json").as_uri())],
        uri=Path(tmpdir).joinpath("val_tuple").as_uri(),
    )

    # Organize shards into splits
    shards = create_shard_dict(
        shards={"train": train_tuple, "val": val_tuple},
        uri=Path(tmpdir).joinpath("shards").as_uri(),
    )
    assets = create_shard_dict(shards={}, uri=Path(tmpdir).joinpath("assets").as_uri())

    # Create dataset
    dataset = create_vanilla_dataset(
        shards=shards,
        assets=assets,
        uri=Path(tmpdir).joinpath("my_dataset").as_uri(),
    )

    # Access data
    train_shards = dataset.get_shards("train")
    print(train_shards[0].get_data())  # Output: [1, 2, 3]
Installation
We highly recommend installing iden in
a virtual environment.
iden can be installed from PyPI using the following command:
uv pip install iden
To make the package as slim as possible, only the minimal packages required to use iden are
installed.
To include all the dependencies, the following command can be used:
uv pip install "iden[all]"
See the Get Started page for how to install only specific dependencies and for alternative ways to install the library.
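Because the default install is minimal, it can help to check which optional dependencies are present in the current environment. The following is a minimal sketch using only the standard library. The package names are taken from the dependency tables in this README; `installed_versions` is a hypothetical helper, not part of iden.

```python
from importlib.metadata import PackageNotFoundError, version

# Optional dependencies as named in the dependency tables of this README.
OPTIONAL_DEPENDENCIES = (
    "numpy",
    "pyyaml",
    "safetensors",
    "torch",
    "cloudpickle",
    "joblib",
)


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


report = installed_versions(OPTIONAL_DEPENDENCIES)
for name, ver in report.items():
    print(f"{name}: {ver or 'not installed'}")
```

The helper looks up distribution names (for example `pyyaml`), which can differ from import names (for example `yaml`).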
Documentation
- Get Started: Installation instructions
- User Guide: Learn about shards and datasets
- How-to Guides: Step-by-step guides for common tasks
- API Reference: Complete API documentation
- Examples: Practical code examples
Basic Usage
Working with Shards
from iden.shard import create_json_shard
# Create a shard
shard = create_json_shard(data={"key": "value"}, uri="file:///path/to/data.json")
# Get data from shard
data = shard.get_data()
# Cache data for faster access
data = shard.get_data(cache=True)
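The `uri` argument above is a `file://` URI. The sketch below shows one way to build such a URI from a local path with the standard library, as the Quick Example does with `Path(...).as_uri()`, and to convert it back. `path_to_uri` and `uri_to_path` are hypothetical helpers, not iden functions, and how iden itself resolves URIs is not specified here.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


def path_to_uri(path: Path) -> str:
    """Return a file:// URI for ``path``, made absolute first."""
    # Path.as_uri() raises ValueError on relative paths, hence resolve().
    return path.resolve().as_uri()


def uri_to_path(uri: str) -> Path:
    """Convert a file:// URI back to a local path."""
    parsed = urlparse(uri)
    if parsed.scheme != "file":
        msg = f"expected a file:// URI, got: {uri}"
        raise ValueError(msg)
    # url2pathname undoes percent-encoding and handles platform differences.
    return Path(url2pathname(parsed.path))


uri = path_to_uri(Path("data.json"))
round_trip = uri_to_path(uri)
```

Resolving the path before calling `as_uri()` avoids the error that relative paths would otherwise raise.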
Managing Datasets
from iden.dataset import create_vanilla_dataset
from iden.shard import create_json_shard, create_shard_dict, create_shard_tuple

# shard1 ... shard5 are assumed to be existing shards,
# e.g. created with create_json_shard as shown above.

# Create a dataset with train/val splits
train_tuple = create_shard_tuple([shard1, shard2, shard3], uri="file:///train_tuple")
val_tuple = create_shard_tuple([shard4, shard5], uri="file:///val_tuple")
shards = create_shard_dict(
    shards={"train": train_tuple, "val": val_tuple},
    uri="file:///shards",
)
assets = create_shard_dict(shards={}, uri="file:///assets")
dataset = create_vanilla_dataset(
    shards=shards,
    assets=assets,
    uri="file:///path/to/dataset",
)

# Access shards
train_shards = dataset.get_shards("train")
first_shard_data = train_shards[0].get_data()
The following tables list the iden versions and the dependency versions they were tested with.
| iden | coola | objectory | numpy* | pyyaml* | safetensors* | torch* | python |
|---|---|---|---|---|---|---|---|
| main | >=0.11.0,<1.0 | >=0.3,<1.0 | >=1.24,<2.0 | >=6.0,<7.0 | >=0.6,<1.0 | >=2.0,<3.0 | >=3.10 |
| 0.3.0 | >=0.11.0,<1.0 | >=0.3,<1.0 | >=1.24,<2.0 | >=6.0,<7.0 | >=0.6,<1.0 | >=2.0,<3.0 | >=3.10 |
| 0.2.0 | >=0.8.4,<1.0 | >=0.2,<1.0 | >=1.22,<2.0 | >=6.0,<7.0 | >=0.4,<1.0 | >=2.0,<3.0 | >=3.9,<3.14 |
| 0.1.0 | >=0.8.4,<1.0 | >=0.2,<1.0 | >=1.22,<2.0 | >=6.0,<7.0 | >=0.4,<1.0 | >=2.0,<3.0 | >=3.9,<3.14 |
| iden | cloudpickle* | joblib* |
|---|---|---|
| main | >=3.0,<4.0 | >=1.3,<2.0 |
| 0.3.0 | >=3.0,<4.0 | >=1.3,<2.0 |
* indicates an optional dependency
older versions
| iden | coola | objectory | numpy* | pyyaml* | safetensors* | torch* | python |
|---|---|---|---|---|---|---|---|
| 0.0.4 | >=0.3,<1.0 | >=0.1,<1.0 | >=1.22,<2.0 | >=6.0,<7.0 | >=0.4,<1.0 | >=2.0,<3.0 | >=3.9,<3.13 |
| 0.0.3 | >=0.3,<1.0 | >=0.1,<1.0 | >=1.22,<2.0 | >=6.0,<7.0 | >=0.4,<1.0 | >=2.0,<3.0 | >=3.9,<3.12 |
| 0.0.2 | >=0.4,<1.0 | >=0.1,<1.0 | >=1.22,<2.0 | >=6.0,<7.0 | >=0.4,<1.0 | >=2.0,<2.1 | >=3.9,<3.12 |
| 0.0.1 | >=0.4,<1.0 | >=0.1,<1.0 | >=1.22,<2.0 | >=6.0,<7.0 | >=0.4,<1.0 | >=2.0,<2.1 | >=3.9,<3.12 |
Contributing
Please check the instructions in CONTRIBUTING.md.
Suggestions and Communication
Everyone is welcome to contribute to the community. For questions or suggestions, please open a GitHub issue. All issues will be addressed as soon as possible.
API stability
Warning: While iden is in the development stage, no API is guaranteed to be stable from one
release to the next.
In fact, it is very likely that the API will change multiple times before a stable 1.0.0 release.
In practice, this means that upgrading iden to a new version may break code that
was written against an older version of iden.
License
iden is licensed under the BSD 3-Clause "New" or "Revised" license, available in the LICENSE
file.
File details
Details for the file iden-0.3.0.tar.gz.
File metadata
- Download URL: iden-0.3.0.tar.gz
- Upload date:
- Size: 41.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.21
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | acdf2aec42f97d25edac540dbd8469eae6f4f7d947b82c331884fc33974724bd |
| MD5 | 9af489d1ed3006144f198e2ab790911b |
| BLAKE2b-256 | 7f806f059023a9942c72fee3daea7e0ef327a31c704ec52ffe5cb3117bd11645 |
File details
Details for the file iden-0.3.0-py3-none-any.whl.
File metadata
- Download URL: iden-0.3.0-py3-none-any.whl
- Upload date:
- Size: 93.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.21
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a25872a8850bddcaf3804ae5045bdb6c14c259e83ec87ea173672bd6655b0b0b |
| MD5 | e3e127990f2ef702a6b8af006b9aa28e |
| BLAKE2b-256 | ec98a229a8d921ff4ab58b83e0c9785a29158a9782cd40a5d5cf244b4a5060a3 |
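The published digests can be used to check a downloaded file before installing it. The following is a minimal sketch with the standard library's `hashlib`. `sha256_of_file` and `matches_published_digest` are hypothetical helpers, and the local file path is whatever location the file was downloaded to.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        # Read fixed-size chunks so large files are not held in memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_digest(Path("iden-0.3.0.tar.gz"), "acdf2aec42f97d25edac540dbd8469eae6f4f7d947b82c331884fc33974724bd")` should return `True` for an intact copy of the source distribution listed above.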