
Project description


Indexed Parquet Dataset

High-performance O(1) random access indexer for Parquet datasets in PyTorch and Python.

This library provides an efficient way to handle large-scale datasets stored as multiple Parquet files, allowing for constant-time random access to any row without loading the entire dataset into memory.

Architecture

The following diagram illustrates how the IndexedParquetDataset remains lightweight by only keeping metadata in memory, while the actual data stays on disk:

graph TD
    subgraph RAM ["Application RAM (Lightweight)"]
        direction TB
        subgraph DS ["IndexedParquetDataset Object"]
            Indices["Indices Array [np.ndarray]<br/>(Filtered/Shuffled indices)"]
            Meta["Metadata & Schema<br/>(File offsets, column maps)"]
            Cache["File Handle Cache<br/>(Lazy Loading)"]
        end
        
        User["User Code / PyTorch DataLoader"] -- "dataset(idx)" --> Indices
        Indices -- "Global Index" --> Meta
        Meta -- "Find File & Row Offset" --> Cache
    end
    
    subgraph Storage ["Storage (SSD/HDD - Large Files)"]
        F1["data_part_1.parquet"]
        F2["data_part_2.parquet"]
        FN["data_part_N.parquet"]
    end
    
    Cache -- "Lazy Read" --> F1
    Cache -- "Lazy Read" --> F2
    Cache -- "Lazy Read" --> FN
    
    F1 -. "O(1) Row Retrieval" .-> User
    F2 -. "O(1) Row Retrieval" .-> User
    FN -. "O(1) Row Retrieval" .-> User

    style DS fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    style Storage fill:#f5f5f5,stroke:#424242,stroke-width:2px,stroke-dasharray: 5 5
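
The diagram shows a global row index being resolved to a specific file and row offset using only the metadata held in RAM. The package's actual internals are not reproduced here, but the following sketch illustrates one way such a lookup can work, using cumulative per-file row counts and a binary search over file boundaries (all names and numbers below are illustrative assumptions, not the library's real implementation):

import numpy as np

# Hypothetical metadata gathered once while indexing the folder:
# number of rows in each Parquet file, in file order.
rows_per_file = np.array([1_000_000, 850_000, 920_000])

# boundaries[i] is the first global index that belongs to file i + 1.
boundaries = np.cumsum(rows_per_file)

def locate(global_idx: int) -> tuple[int, int]:
    """Map a global row index to (file_number, row_within_file)."""
    file_no = int(np.searchsorted(boundaries, global_idx, side="right"))
    local_row = global_idx - (int(boundaries[file_no - 1]) if file_no > 0 else 0)
    return file_no, local_row

print(locate(0))          # (0, 0)      -> first row of data_part_1.parquet
print(locate(1_200_000))  # (1, 200000) -> row 200,000 of data_part_2.parquet

The binary search runs over the list of files rather than the rows, so resolving an index stays effectively constant-time; the row itself is then read lazily from the matching file.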

Features

  • Fluent API: Chainable methods for data processing including shuffle, filter, alias, split, limit, rename, copy, and concat.
  • Computed Columns: Create new columns or replace existing ones using Python functions (lambdas) via the alias method.
  • Explicit Casting: Change column types on the fly with the cast method.
  • Materialization: Bake all transformations and computations into a real Parquet file via clone(path), eliminating runtime overhead for streaming.
  • Dynamic Schema Evolution: Support for concatenating datasets with different schemas and automatic type alignment.
  • Constant-Time Access: Indexed random access with O(1) complexity regardless of dataset size.
  • Lazy Loading: Concurrency-safe file access with a minimal memory footprint.
  • PyTorch Integration: Fully compatible with torch.utils.data.Dataset.
  • Index Persistence: Save and load dataset indices to skip the indexing phase in future runs.

Installation

You can install the package directly from GitHub using pip:

pip install git+https://github.com/Laeryid/indexed-parquet-dataset

Project Structure

  • src/indexed_parquet/: Core package directory.
    • dataset.py: Implementation of IndexedParquetDataset and transformation logic.
    • indexer.py: Logic for scanning Parquet files and building global row maps.
    • schema.py: Utilities for handling Parquet schemas and data types.
  • tests/: Comprehensive test suite verifying indexing, transformations, and PyTorch compatibility.
  • pyproject.toml: Project metadata and dependency definitions.

Usage

Basic Initialization

Create a dataset from a folder containing Parquet files:

from indexed_parquet import IndexedParquetDataset

# Scans the folder and builds an internal index
dataset = IndexedParquetDataset.from_folder("./path/to/data")

print(f"Total rows: {len(dataset)}")
print(f"First row: {dataset[0]}")

Fluent API (Transformations)

The dataset supports a chainable API for common data preparation tasks:

dataset = (IndexedParquetDataset.from_folder("./data")
           .filter(lambda x: x["split"] == "train")
           .shuffle(seed=42)
           .alias("text_len", lambda x: len(x["text"])) # Computed column
           .cast("text_len", "int")                    # Explicit casting
           .limit(10000))

# Accessing transformed data
sample = dataset[0] # sample["text_len"] is now available

Advanced Manipulation

Copying and Concatenating

Create independent copies or merge multiple datasets with automatic schema alignment:

# Create an independent copy
ds_copy = dataset.copy()

# Vertically concatenate two datasets
# Automatically handles overlapping columns and different aliases
combined_ds = dataset1.concat(dataset2)

Materialization (Cloning)

To avoid performance degradation when using multiple computed columns or heavy filters, you can "bake" the current state into a new Parquet file:

# to_parquet: just save to disk (export)
dataset.to_parquet("baked_data.parquet")

# clone: save to disk AND return a new dataset instance 
# pointing to the new file (zero-overhead)
fast_dataset = dataset.clone("materialized.parquet")

Note: clone() always requires a destination path. It ensures that all Python-based computations (lambdas) are executed once and their results are stored as real values in the new file.

Batch Reading

You can pass a list of indices to retrieve multiple rows efficiently:

batch_indices = [0, 10, 100, 500]
rows = dataset[batch_indices]

Index Persistence

Building an index for millions of rows can take time. You can save the index to a file and load it later:

# Save index
dataset.save_index("my_dataset_index.pkl")

# Load index later (much faster than from_folder)
loaded_dataset = IndexedParquetDataset.load_index("my_dataset_index.pkl")

Technical Requirements

  • PyArrow: Backend for Parquet file operations.
  • NumPy: Efficient array management for indexing.
  • PyTorch: Seamless integration for machine learning pipelines.
  • Pandas: Schema management and metadata utilities.

License

This project is licensed under the Apache 2.0 License.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

indexed_parquet_dataset-0.2.5.dev0.tar.gz (2.8 MB, source)

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

indexed_parquet_dataset-0.2.5.dev0-py3-none-any.whl (23.9 kB, Python 3)

File details

Details for the file indexed_parquet_dataset-0.2.5.dev0.tar.gz.

File metadata

File hashes

Hashes for indexed_parquet_dataset-0.2.5.dev0.tar.gz
Algorithm Hash digest
SHA256 81484bd7511b9f90386e9fde8ff5f36d57b1d456e88e02b5cd272d00c7868ddb
MD5 f210936931fe1c9347d02f1af27448f2
BLAKE2b-256 b09447339d7a1efa54235de5c6d2003ee623110de6880270d46c6f2acd21adcb

See more details on using hashes here.

Provenance

The following attestation bundles were made for indexed_parquet_dataset-0.2.5.dev0.tar.gz:

Publisher: publish.yml on Laeryid/indexed-parquet-dataset

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file indexed_parquet_dataset-0.2.5.dev0-py3-none-any.whl.

File metadata

File hashes

Hashes for indexed_parquet_dataset-0.2.5.dev0-py3-none-any.whl
Algorithm Hash digest
SHA256 c2f92409ff67054930e3e35a327421f5f73dcbaa58dcdf7b18d2d73da69512de
MD5 ff795ffe11a26dbeb7e305ecae9508d4
BLAKE2b-256 904afa3582e8b78241752f49b5353e27cdd15b836475df9e87e51b3eba93b5f5

See more details on using hashes here.

Provenance

The following attestation bundles were made for indexed_parquet_dataset-0.2.5.dev0-py3-none-any.whl:

Publisher: publish.yml on Laeryid/indexed-parquet-dataset

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
