High-performance O(1) random access indexer for Parquet datasets in PyTorch

Indexed Parquet Dataset

Indexed Parquet Dataset is a high-performance Python library for O(1) random access to massive datasets in Parquet format.

It is specifically optimized for Deep Learning (PyTorch), consumes minimal memory, and supports advanced features such as Schema Evolution (working with files of different schemas in a single dataset).
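
To make the Schema Evolution idea concrete, here is a minimal sketch of mapping rows from files with different schemas onto one target schema. The helper `unify_row`, its parameters, and the sample column names are hypothetical and are not part of the library's API; the sketch only illustrates renaming fields and filling missing columns.

```python
from typing import Any, Dict, Iterable, Optional


def unify_row(
    row: Dict[str, Any],
    target_columns: Iterable[str],
    renames: Optional[Dict[str, str]] = None,
    fill_value: Any = None,
) -> Dict[str, Any]:
    """Map one raw row onto a fixed target schema (hypothetical helper)."""
    renames = renames or {}
    # Apply the old-name -> new-name mapping first.
    renamed = {renames.get(key, key): value for key, value in row.items()}
    # Keep only target columns and fill the ones this file does not have.
    return {col: renamed.get(col, fill_value) for col in target_columns}


target = ["id", "text", "score"]
old_file_row = {"id": 1, "body": "hello"}  # older schema: 'body', no 'score'
new_file_row = {"id": 2, "text": "world", "score": 0.9}

rows = [
    unify_row(old_file_row, target, renames={"body": "text"}),
    unify_row(new_file_row, target),
]
print(rows)
# [{'id': 1, 'text': 'hello', 'score': None}, {'id': 2, 'text': 'world', 'score': 0.9}]
```

With this kind of normalization, every row exposes the same keys regardless of which file it came from.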

Key Features

⚡ O(1) Random Access: Instantly navigate to any row in a multi-gigabyte dataset without scanning files.
🔄 Schema Evolution: Work with datasets where files have different schemas, missing columns, or renamed fields.
📦 Lazy Loading: Files are opened only when data is requested, and open handles are kept in an LRU cache.
🔥 PyTorch Integration: Native support for torch.utils.data.Dataset, including adaptive collate_fn generation.
🛠️ Fluent API: Chainable methods for shuffle, filter, alias, split, limit, rename, cast, and map.
💾 Index Persistence: Save the index to a file and reload it quickly.
🏗️ Materialization: "Bake" all transformations into new Parquet files via clone().
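
The lazy loading feature above pairs on-demand opening with an LRU handle cache. The following is a minimal sketch of that pattern using only the standard library. `HandleCache`, `opener`, and `closer` are hypothetical names chosen for illustration, and the "handles" are plain strings so the example runs without any Parquet files; the library's actual cache may work differently.

```python
from collections import OrderedDict
from typing import Callable, Generic, TypeVar

H = TypeVar("H")


class HandleCache(Generic[H]):
    """Tiny LRU cache for lazily opened file handles (hypothetical class)."""

    def __init__(
        self,
        opener: Callable[[str], H],
        closer: Callable[[H], None],
        capacity: int = 2,
    ) -> None:
        self._opener = opener
        self._closer = closer
        self._capacity = capacity
        self._handles: "OrderedDict[str, H]" = OrderedDict()

    def get(self, path: str) -> H:
        if path in self._handles:
            self._handles.move_to_end(path)  # mark as most recently used
            return self._handles[path]
        handle = self._opener(path)  # opened only on first request
        self._handles[path] = handle
        if len(self._handles) > self._capacity:
            # Drop and close the least recently used handle.
            _, evicted = self._handles.popitem(last=False)
            self._closer(evicted)
        return handle


opened, closed = [], []
cache = HandleCache(
    opener=lambda p: opened.append(p) or f"handle:{p}",
    closer=closed.append,
    capacity=2,
)
cache.get("a.parquet")
cache.get("b.parquet")
cache.get("a.parquet")  # cache hit, no reopen
cache.get("c.parquet")  # evicts b.parquet, the least recently used
print(opened)  # ['a.parquet', 'b.parquet', 'c.parquet']
print(closed)  # ['handle:b.parquet']
```

Bounding the number of open handles keeps file descriptor usage predictable when a dataset spans many files.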

Architecture

The library remains lightweight, storing only metadata and a row map in RAM:

graph TD
    subgraph RAM ["Application (RAM - Lightweight)"]
        direction TB
        subgraph DS ["IndexedParquetDataset"]
            Indices["Indices Array [np.ndarray]<br/>(Shuffled/Filtered indices)"]
            Meta["Metadata & Schema<br/>(File offsets, column mapping)"]
            Cache["File Handle Cache<br/>(Lazy Loading LRU)"]
        end
        
        User["User Code / PyTorch DataLoader"] -- "dataset[idx]" --> Indices
        Indices -- "Global Index" --> Meta
        Meta -- "Find File & Row Offset" --> Cache
    end
    
    subgraph Storage ["Storage (HDD/SSD/S3-over-FUSE)"]
        F1["data_part_1.parquet"]
        F2["data_part_2.parquet"]
        FN["data_part_N.parquet"]
    end
    
    Cache -- "Lazy Read" --> F1
    Cache -- "Lazy Read" --> F2
    Cache -- "Lazy Read" --> FN
    
    F1 -. "O(1) Row Retrieval" .-> User
    F2 -. "O(1) Row Retrieval" .-> User
    FN -. "O(1) Row Retrieval" .-> User
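
The "Find File & Row Offset" step in the diagram can be sketched as a lookup over cumulative row counts. This is an illustration under stated assumptions, not the library's implementation: the file names, the row counts, and the `locate` function are hypothetical. The sketch uses a binary search over file start offsets, so its cost depends on the number of files and not on the number of rows; the library may use a different structure to achieve its stated O(1) access.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Tuple

# Hypothetical per-file row counts, e.g. read once from each Parquet footer.
files = ["data_part_1.parquet", "data_part_2.parquet", "data_part_3.parquet"]
row_counts = [100, 250, 50]

# starts[i] is the global index of the first row in files[i].
starts: List[int] = [0] + list(accumulate(row_counts))[:-1]
total_rows = sum(row_counts)


def locate(global_idx: int) -> Tuple[str, int]:
    """Translate a global row index into (file name, row offset in that file)."""
    if not 0 <= global_idx < total_rows:
        raise IndexError(global_idx)
    file_pos = bisect_right(starts, global_idx) - 1
    return files[file_pos], global_idx - starts[file_pos]


print(locate(0))    # ('data_part_1.parquet', 0)
print(locate(100))  # ('data_part_2.parquet', 0)
print(locate(399))  # ('data_part_3.parquet', 49)
```

Only `starts` and the file list need to live in memory, which matches the idea of keeping metadata in RAM and leaving row data on storage.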

Installation

From PyPI:

pip install indexed-parquet-dataset

For PyTorch support:

pip install "indexed-parquet-dataset[torch]"

Quickstart

Basic Initialization

from indexed_parquet_dataset import IndexedParquetDataset

# Scans the folder and builds a global row index
ds = IndexedParquetDataset.from_folder("./path/to/data")

print(f"Total rows: {len(ds)}")
print(f"First row: {ds[0]}") # {'id': 1, 'text': '...', ...}

# Random access by global row index (this assumes the dataset has at least 1,000,000 rows)
sample = ds[999_999]

Transformations (Fluent API)

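from indexed_parquet_dataset import IndexedParquetDataset

# Each step returns a dataset, so the calls can be chained
ds = (IndexedParquetDataset.from_folder("./data")
      .filter(lambda x: x["score"] > 0.5)
      .shuffle(seed=42)
      .alias("text_len", lambda x: len(x["text"]))
      .limit(10000))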

# Each row now has a virtual 'text_len' column
print(ds[0]["text_len"])
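
One way to picture how chained transformations can stay cheap is to apply them to an array of row indices and leave the underlying rows untouched. The sketch below is an assumption about the general technique, not a description of the library's internals: `IndexView` is a hypothetical class, and an in-memory list stands in for the Parquet files.

```python
import random
from typing import Any, Callable, Dict, List, Optional, Sequence

Row = Dict[str, Any]


class IndexView:
    """Hypothetical lazy view: transformations reorder or drop indices only."""

    def __init__(self, rows: Sequence[Row], indices: Optional[List[int]] = None) -> None:
        self._rows = rows
        self._indices = list(range(len(rows))) if indices is None else indices

    def filter(self, predicate: Callable[[Row], bool]) -> "IndexView":
        kept = [i for i in self._indices if predicate(self._rows[i])]
        return IndexView(self._rows, kept)

    def shuffle(self, seed: int) -> "IndexView":
        shuffled = list(self._indices)
        random.Random(seed).shuffle(shuffled)  # seeded, so the order is reproducible
        return IndexView(self._rows, shuffled)

    def limit(self, n: int) -> "IndexView":
        return IndexView(self._rows, self._indices[:n])

    def __len__(self) -> int:
        return len(self._indices)

    def __getitem__(self, idx: int) -> Row:
        # Resolve the view position to the original row position.
        return self._rows[self._indices[idx]]


rows = [{"id": i, "score": i / 10} for i in range(10)]
view = IndexView(rows).filter(lambda r: r["score"] > 0.5).shuffle(seed=42).limit(3)

print(len(view))  # 3
print(all(view[i]["score"] > 0.5 for i in range(len(view))))  # True
```

Each call returns a new view over the same rows, so the source data is never copied or rewritten until something like materialization is requested.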

Usage with PyTorch

from torch.utils.data import DataLoader

from indexed_parquet_dataset import IndexedParquetDataset

ds = IndexedParquetDataset.from_folder("./data", auto_fill=True)
# Hold out 10% of the rows for validation
train_ds, val_ds = ds.train_test_split(test_size=0.1)

loader = DataLoader(
    train_ds,
    batch_size=32,
    shuffle=True,
    num_workers=4,
    collate_fn=ds.generate_collate_fn(on_none='fill'),
)
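
For readers unfamiliar with custom collation, the sketch below shows what a collate function that fills missing values could look like. It is an assumption about the general idea behind `on_none='fill'`, not the library's actual behavior: `make_collate_fn` and `fill_values` are hypothetical names, and the example returns plain Python lists instead of tensors so it runs without PyTorch installed.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def make_collate_fn(fill_values: Dict[str, Any]) -> Callable[[List[Row]], Dict[str, List[Any]]]:
    """Build a collate function that replaces None with a per-column default."""

    def collate(batch: List[Row]) -> Dict[str, List[Any]]:
        # Collect column names in first-seen order across the whole batch.
        columns = list(dict.fromkeys(key for row in batch for key in row))
        collated: Dict[str, List[Any]] = {}
        for col in columns:
            values = []
            for row in batch:
                value = row.get(col)
                # Missing keys and explicit None both fall back to the default.
                values.append(fill_values.get(col) if value is None else value)
            collated[col] = values
        return collated

    return collate


batch = [{"id": 1, "score": 0.7}, {"id": 2, "score": None}, {"id": 3}]
collate = make_collate_fn({"score": 0.0})
print(collate(batch))
# {'id': [1, 2, 3], 'score': [0.7, 0.0, 0.0]}
```

A `DataLoader` calls its `collate_fn` with a list of samples, so a function shaped like `collate` can be passed there; converting the per-column lists to tensors would be the remaining step.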

Documentation

Full documentation is available on GitHub Pages.

License

Apache 2.0 License

Project details

Release history

- 0.3.10.dev0 (pre-release): May 7, 2026
- 0.3.9.dev0 (pre-release): Apr 17, 2026
- 0.3.8.dev0 (pre-release): Apr 17, 2026
- 0.3.7.dev0 (pre-release): Apr 17, 2026
- 0.3.5.dev0 (pre-release): Apr 17, 2026
- 0.3.3.dev0 (pre-release): Apr 11, 2026
- 0.3.2.dev0 (pre-release): Apr 11, 2026
- 0.3.1.dev0 (pre-release): Apr 11, 2026
- 0.2.10.dev0 (pre-release): Apr 11, 2026
- 0.2.9.dev0 (pre-release): Apr 11, 2026
- 0.2.8.dev0 (pre-release, this version): Apr 10, 2026
- 0.2.7.dev0 (pre-release): Apr 9, 2026
- 0.2.6.dev0 (pre-release): Apr 9, 2026
- 0.2.5.dev0 (pre-release): Apr 9, 2026

Download files

Source Distribution

indexed_parquet_dataset-0.2.8.dev0.tar.gz (3.0 MB), uploaded Apr 10, 2026, Source

Built Distribution

indexed_parquet_dataset-0.2.8.dev0-py3-none-any.whl (23.2 kB), uploaded Apr 10, 2026, Python 3

File details

Details for the file indexed_parquet_dataset-0.2.8.dev0.tar.gz.

File metadata

Download URL: indexed_parquet_dataset-0.2.8.dev0.tar.gz
Upload date: Apr 10, 2026
Size: 3.0 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for indexed_parquet_dataset-0.2.8.dev0.tar.gz
Algorithm	Hash digest
SHA256	`ab2d92d066964c4a19fbce2345ca8c6664df9eb4ee30e8f1ea3617075ba1b703`
MD5	`e1f313a6478e418feb9346da4951e0be`
BLAKE2b-256	`281b13f3f1f736b93f5f3205a22aa6bf1fb13c1b76f066e4f1a48ef91be14054`

Provenance

The following attestation bundles were made for indexed_parquet_dataset-0.2.8.dev0.tar.gz:

Publisher: publish.yml on Laeryid/indexed-parquet-dataset

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: indexed_parquet_dataset-0.2.8.dev0.tar.gz
- Subject digest: ab2d92d066964c4a19fbce2345ca8c6664df9eb4ee30e8f1ea3617075ba1b703
- Sigstore transparency entry: 1273754498
- Sigstore integration time: Apr 10, 2026
Source repository:
- Permalink: Laeryid/indexed-parquet-dataset@7d7d8343917e0e9d00d51884cb82ea8ea023de9f
- Branch / Tag: refs/tags/v0.2.7
- Owner: https://github.com/Laeryid
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@7d7d8343917e0e9d00d51884cb82ea8ea023de9f
- Trigger Event: push

File details

Details for the file indexed_parquet_dataset-0.2.8.dev0-py3-none-any.whl.

File metadata

Download URL: indexed_parquet_dataset-0.2.8.dev0-py3-none-any.whl
Upload date: Apr 10, 2026
Size: 23.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for indexed_parquet_dataset-0.2.8.dev0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f47e5a03574c1302f6494196e09be5641c6822995d3268069c970e6daaeea883`
MD5	`df7d42b0c8476e6412d522c7dd883e9a`
BLAKE2b-256	`f9e60b2919e8f9fd6521c7b55c60d9249fb8fdea047156dd56332a0b02462596`

Provenance

The following attestation bundles were made for indexed_parquet_dataset-0.2.8.dev0-py3-none-any.whl:

Publisher: publish.yml on Laeryid/indexed-parquet-dataset

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: indexed_parquet_dataset-0.2.8.dev0-py3-none-any.whl
- Subject digest: f47e5a03574c1302f6494196e09be5641c6822995d3268069c970e6daaeea883
- Sigstore transparency entry: 1273754589
- Sigstore integration time: Apr 10, 2026
Source repository:
- Permalink: Laeryid/indexed-parquet-dataset@7d7d8343917e0e9d00d51884cb82ea8ea023de9f
- Branch / Tag: refs/tags/v0.2.7
- Owner: https://github.com/Laeryid
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@7d7d8343917e0e9d00d51884cb82ea8ea023de9f
- Trigger Event: push
