High-performance O(1) random access indexer for Parquet datasets in PyTorch
Indexed Parquet Dataset
Indexed Parquet Dataset is a high-performance Python library for O(1) random access to massive datasets in Parquet format.
It is specifically optimized for Deep Learning (PyTorch), consumes minimal memory, and supports advanced features such as Schema Evolution (working with files of different schemas in a single dataset).
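To make the Schema Evolution idea concrete, here is a minimal sketch of one way rows from files with different schemas could be aligned to a single unified schema. The helper `unify_row`, its `renames` and `fill_value` parameters, and the column names are hypothetical illustrations, not part of the library's API.

```python
def unify_row(row, unified_columns, renames=None, fill_value=None):
    """Map a raw row onto the unified schema (illustrative helper only)."""
    renames = renames or {}
    # Apply column renames first (old name -> unified name).
    renamed = {renames.get(name, name): value for name, value in row.items()}
    # Keep only unified columns; missing ones get the fill value.
    return {col: renamed.get(col, fill_value) for col in unified_columns}


unified = ["id", "text", "score"]
old_row = {"id": 1, "body": "hello"}                  # older file: 'body', no 'score'
new_row = {"id": 2, "text": "world", "score": 0.9}    # newer file: full schema

rows = [
    unify_row(old_row, unified, renames={"body": "text"}),
    unify_row(new_row, unified),
]
print(rows)
```

Both rows end up with the same keys, so downstream code can treat the dataset as if every file shared one schema.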
Key Features
- ⚡ O(1) Random Access: Instantly navigate to any row in a multi-gigabyte dataset without scanning files.
- 🔄 Schema Evolution: Work with datasets where files have different schemas, missing columns, or renamed fields.
- 📦 Lazy Loading: Files are opened only when data is requested. Features an efficient LRU handle cache.
- 🔥 PyTorch Integration: Native support for torch.utils.data.Dataset, including adaptive collate_fn generation.
- 🛠️ Fluent API: Method chaining: shuffle (global or locality-aware), filter, alias, split, limit, rename, cast, map.
- 💾 Index Persistence: Save and fast-load the index from a file.
- 🏗️ Materialization: "Bake" all transformations into new Parquet files via clone().
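The lazy-loading feature above relies on an LRU cache of file handles. The following is a rough sketch of how such a cache can work in general, assuming a hypothetical `LRUHandleCache` class and a `FakeHandle` stand-in for a real Parquet file handle; the library's actual cache may be implemented differently.

```python
from collections import OrderedDict


class LRUHandleCache:
    """Keep at most `capacity` open handles, closing the least recently used."""

    def __init__(self, opener, capacity=2):
        self._opener = opener          # callable: path -> object with .close()
        self._capacity = capacity
        self._handles = OrderedDict()  # path -> handle, oldest entry first

    def get(self, path):
        if path in self._handles:
            self._handles.move_to_end(path)   # mark as most recently used
            return self._handles[path]
        handle = self._opener(path)           # lazy: opened on first request
        self._handles[path] = handle
        if len(self._handles) > self._capacity:
            _, evicted = self._handles.popitem(last=False)
            evicted.close()                   # release the oldest handle
        return handle


class FakeHandle:
    """Stand-in for a real file handle, used only for this demo."""

    def __init__(self, path):
        self.path = path
        self.closed = False

    def close(self):
        self.closed = True


cache = LRUHandleCache(FakeHandle, capacity=2)
a = cache.get("part_1.parquet")
b = cache.get("part_2.parquet")
cache.get("part_1.parquet")        # refreshes part_1, so part_2 is now oldest
c = cache.get("part_3.parquet")    # evicts and closes part_2
```

Bounding the number of open handles keeps file-descriptor and memory usage flat even when the dataset spans thousands of files.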
Architecture
The library remains lightweight, storing only metadata and a row map in RAM:
graph TD
subgraph RAM ["Application (RAM - Lightweight)"]
direction TB
subgraph DS ["IndexedParquetDataset"]
Indices["Indices Array [np.ndarray]<br/>(Shuffled/Filtered indices)"]
Meta["Metadata & Schema<br/>(File offsets, column mapping)"]
Cache["File Handle Cache<br/>(Lazy Loading LRU)"]
end
User["User Code / PyTorch DataLoader"] -- "dataset[idx]" --> Indices
Indices -- "Global Index" --> Meta
Meta -- "Find File & Row Offset" --> Cache
end
subgraph Storage ["Storage (HDD/SSD/S3-over-FUSE)"]
F1["data_part_1.parquet"]
F2["data_part_2.parquet"]
FN["data_part_N.parquet"]
end
Cache -- "Lazy Read" --> F1
Cache -- "Lazy Read" --> F2
Cache -- "Lazy Read" --> FN
F1 -. "O(1) Row Retrieval" .-> User
F2 -. "O(1) Row Retrieval" .-> User
FN -. "O(1) Row Retrieval" .-> User
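The "Find File & Row Offset" step in the diagram can be sketched as follows. This is an assumption about one common way to resolve a global row index, using cumulative per-file row counts and a binary search; the row counts and the `locate` function are made up for illustration and do not describe the library's internals.

```python
from bisect import bisect_right
from itertools import accumulate

# Hypothetical row counts, as they might be read from Parquet file footers.
rows_per_file = [1000, 250, 4000]

# file_starts[i] is the global index of the first row in file i.
file_starts = [0] + list(accumulate(rows_per_file))   # [0, 1000, 1250, 5250]


def locate(global_idx):
    """Return (file_id, local_row) for a global row index."""
    if not 0 <= global_idx < file_starts[-1]:
        raise IndexError(global_idx)
    file_id = bisect_right(file_starts, global_idx) - 1
    return file_id, global_idx - file_starts[file_id]


# A shuffled or filtered view is just an extra level of indirection.
indices = [1249, 0, 5249]
resolved = [locate(i) for i in indices]
print(resolved)
```

In this sketch the lookup cost depends only on the number of files (logarithmically), not on the number of rows, and no data pages are scanned to find a row's location.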
Installation
From PyPI:
pip install indexed-parquet-dataset
For PyTorch support:
pip install "indexed-parquet-dataset[torch]"
Quickstart
Basic Initialization
from indexed_parquet_dataset import IndexedParquetDataset
# Scans the folder and builds a global row index
ds = IndexedParquetDataset.from_folder("./path/to/data")
print(f"Total rows: {len(ds)}")
print(f"First row: {ds[0]}") # {'id': 1, 'text': '...', ...}
# Random access to any row is instant
sample = ds[999_999]
Transformations (Fluent API)
ds = (IndexedParquetDataset.from_folder("./data")
.filter(lambda x: x["score"] > 0.5)
.shuffle(seed=42, rg_buffer=32) # Locality-aware shuffle for best I/O performance
.alias("text_len", lambda x: len(x["text"]))
.limit(10000))
# Each row now has a virtual 'text_len' column
print(ds[0]["text_len"])
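The idea behind a locality-aware shuffle can be illustrated with the sketch below. It assumes that `rg_buffer` means "number of row groups mixed together at a time", which is a guess at the parameter's meaning rather than documented behavior; the function `locality_aware_shuffle` is hypothetical.

```python
import random


def locality_aware_shuffle(row_groups, rg_buffer, seed):
    """row_groups: one list of global row indices per Parquet row group."""
    rng = random.Random(seed)
    order = list(range(len(row_groups)))
    rng.shuffle(order)                       # randomize row-group order
    shuffled = []
    for start in range(0, len(order), rg_buffer):
        window = [
            idx
            for group in order[start:start + rg_buffer]
            for idx in row_groups[group]
        ]
        rng.shuffle(window)                  # mix rows only within the window
        shuffled.extend(window)
    return shuffled


# Six row groups of four rows each.
row_groups = [list(range(g * 4, g * 4 + 4)) for g in range(6)]
shuffled = locality_aware_shuffle(row_groups, rg_buffer=2, seed=42)
print(shuffled)
```

Consecutive reads then touch only a few row groups at a time, which tends to be friendlier to disk and page caches than a fully global permutation, at the cost of somewhat weaker randomness.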
Usage with PyTorch
from torch.utils.data import DataLoader
from indexed_parquet_dataset import IndexedParquetDataset
ds = IndexedParquetDataset.from_folder("./data", auto_fill=True)
train_ds, val_ds = ds.train_test_split(test_size=0.1)
loader = DataLoader(
train_ds,
batch_size=32,
shuffle=True,
num_workers=4,
collate_fn=ds.generate_collate_fn(on_none='fill')
)
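For intuition about what a collate function with a "fill" policy for missing values might do, here is a simplified pure-Python sketch. It is not the library's `generate_collate_fn`; the factory `make_collate_fn` and its `fill_values` argument are hypothetical, and a real collate function would typically convert the columns to tensors.

```python
def make_collate_fn(fill_values):
    """Build a collate function that replaces None with per-column defaults."""

    def collate(batch):
        columns = {}
        for key, fill in fill_values.items():
            # Missing keys and explicit None values both get the default.
            columns[key] = [
                fill if row.get(key) is None else row[key] for row in batch
            ]
        return columns

    return collate


collate = make_collate_fn({"id": -1, "score": 0.0})
batch = collate([
    {"id": 1, "score": None},     # None value
    {"id": 2, "score": 0.5},
    {"id": 3},                    # column missing entirely
])
print(batch)
```

Turning a list of row dicts into a dict of equal-length columns is what lets a DataLoader stack each column into a batch without failing on missing values.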
Documentation
Full documentation is available on GitHub Pages.
License