Indexed Parquet Dataset
High-performance O(1) random access indexer for Parquet datasets in PyTorch and Python.
This library provides an efficient way to handle large-scale datasets stored as multiple Parquet files, allowing for constant-time random access to any row without loading the entire dataset into memory.
Architecture
The following diagram illustrates how the IndexedParquetDataset remains lightweight by only keeping metadata in memory, while the actual data stays on disk:
```mermaid
graph TD
    subgraph RAM ["Application RAM (Lightweight)"]
        direction TB
        subgraph DS ["IndexedParquetDataset Object"]
            Indices["Indices Array [np.ndarray]<br/>(Filtered/Shuffled indices)"]
            Meta["Metadata & Schema<br/>(File offsets, column maps)"]
            Cache["File Handle Cache<br/>(Lazy Loading)"]
        end
        User["User Code / PyTorch DataLoader"] -- "dataset(idx)" --> Indices
        Indices -- "Global Index" --> Meta
        Meta -- "Find File & Row Offset" --> Cache
    end
    subgraph Storage ["Storage (SSD/HDD - Large Files)"]
        F1["data_part_1.parquet"]
        F2["data_part_2.parquet"]
        FN["data_part_N.parquet"]
    end
    Cache -- "Lazy Read" --> F1
    Cache -- "Lazy Read" --> F2
    Cache -- "Lazy Read" --> FN
    F1 -. "O(1) Row Retrieval" .-> User
    F2 -. "O(1) Row Retrieval" .-> User
    FN -. "O(1) Row Retrieval" .-> User
    style DS fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    style Storage fill:#f5f5f5,stroke:#424242,stroke-width:2px,stroke-dasharray: 5 5
```
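Conceptually, each lookup translates a global row index into a (file, local row) pair using the per-file row counts gathered during indexing, so only the relevant row is read from disk. The sketch below illustrates that translation step with a simple cumulative-offset layout; the file names and the `locate` helper are illustrative assumptions, not the library's actual internals.

```python
import numpy as np
import pyarrow.parquet as pq

# Illustrative file layout; in practice the indexer scans the dataset folder.
files = ["data_part_1.parquet", "data_part_2.parquet", "data_part_3.parquet"]

# Per-file row counts come from the Parquet footers -- no data pages are read.
rows_per_file = np.array([pq.ParquetFile(f).metadata.num_rows for f in files])
offsets = np.cumsum(rows_per_file)  # e.g. array([1000, 2500, 4000])

def locate(global_idx: int) -> tuple[str, int]:
    """Translate a global row index into (file path, row within that file)."""
    file_pos = int(np.searchsorted(offsets, global_idx, side="right"))
    local_idx = global_idx - (int(offsets[file_pos - 1]) if file_pos else 0)
    return files[file_pos], local_idx
```

Because the translation is just a lookup over a small offsets array held in RAM, its cost does not grow with the total number of rows in the dataset.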
Features
- Fluent API: Chainable methods for data processing including `shuffle`, `filter`, `alias`, `split`, `limit`, `rename`, `copy`, and `concat`.
- Computed Columns: Create new columns or replace existing ones using Python functions (lambdas) via the `alias` method.
- Explicit Casting: Change column types on the fly with the `cast` method.
- Materialization: Bake all transformations and computations into a real Parquet file via `clone(path)`, eliminating runtime overhead for streaming.
- Dynamic Schema Evolution: Support for concatenating datasets with different schemas and automatic type alignment.
- Linear Scalability: Indexed access with O(1) complexity regardless of dataset size.
- Lazy Loading: Concurrency-safe file access with minimal memory footprint.
- PyTorch Integration: Fully compatible with `torch.utils.data.Dataset`.
- Index Persistence: Save and load dataset indices to skip the indexing phase in future runs.
Installation
You can install the package directly from GitHub using pip:
```
pip install git+https://github.com/Laeryid/indexed-parquet-dataset
```
Project Structure
- `src/indexed_parquet/`: Core package directory.
  - `dataset.py`: Implementation of `IndexedParquetDataset` and transformation logic.
  - `indexer.py`: Logic for scanning Parquet files and building global row maps.
  - `schema.py`: Utilities for handling Parquet schemas and data types.
- `tests/`: Comprehensive test suite verifying indexing, transformations, and PyTorch compatibility.
- `pyproject.toml`: Project metadata and dependency definitions.
Usage
Basic Initialization
Create a dataset from a folder containing Parquet files:
```python
from indexed_parquet import IndexedParquetDataset

# Scans the folder and builds an internal index
dataset = IndexedParquetDataset.from_folder("./path/to/data")

print(f"Total rows: {len(dataset)}")
print(f"First row: {dataset[0]}")
```
Fluent API (Transformations)
The dataset supports a chainable API for common data preparation tasks:
```python
dataset = (IndexedParquetDataset.from_folder("./data")
           .filter(lambda x: x["split"] == "train")
           .shuffle(seed=42)
           .alias("text_len", lambda x: len(x["text"]))  # Computed column
           .cast("text_len", "int")                      # Explicit casting
           .limit(10000))

# Accessing transformed data
sample = dataset[0]  # sample["text_len"] is now available
```
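The feature list also mentions `split` and `rename`, which are not demonstrated above. The sketch below shows how they might slot into the same chain; the signatures used here (a split fraction plus seed, and `rename(old, new)`) are assumptions, not documented API.

```python
# Hypothetical usage -- `split` and `rename` exist per the feature list,
# but these signatures are assumptions and may differ from the real API.
train_ds, val_ds = dataset.split(0.9, seed=42)     # assumed: fraction + seed -> two datasets
train_ds = train_ds.rename("text_len", "n_chars")  # assumed: rename(old_name, new_name)
```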
Advanced Manipulation
Copying and Concatenating
Create independent copies or merge multiple datasets with automatic schema alignment:
```python
# Create an independent copy
ds_copy = dataset.copy()

# Vertically concatenate two datasets
# Automatically handles overlapping columns and different aliases
combined_ds = dataset1.concat(dataset2)
```
Materialization (Cloning)
To avoid performance degradation when using multiple computed columns or heavy filters, you can "bake" the current state into a new Parquet file:
```python
# to_parquet: just save to disk (export)
dataset.to_parquet("baked_data.parquet")

# clone: save to disk AND return a new dataset instance
# pointing to the new file (zero-overhead)
fast_dataset = dataset.clone("materialized.parquet")
```
> [!NOTE]
> `clone()` always requires a destination path. It ensures that all Python-based computations (lambdas) are executed once and their results are stored as real values in the new file.
Batch Reading
You can pass a list of indices to retrieve multiple rows efficiently:
```python
batch_indices = [0, 10, 100, 500]
rows = dataset[batch_indices]
```
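Since the dataset is a standard `torch.utils.data.Dataset` that also accepts lists of indices, it can be wrapped directly in a `DataLoader`. Below is a sketch of a typical loader setup; the batch size, worker count, and collate behaviour are illustrative choices, not defaults prescribed by this library.

```python
from torch.utils.data import DataLoader

# Illustrative settings; nothing here is prescribed by the library.
loader = DataLoader(
    dataset,
    batch_size=32,   # rows per batch
    shuffle=True,    # alternatively, shuffle up front with dataset.shuffle(seed=...)
    num_workers=4,   # parallel worker processes
)

for batch in loader:
    # PyTorch's default collate combines the returned rows; supply a custom
    # collate_fn if your rows contain types it cannot handle.
    ...
```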
Index Persistence
Building an index for millions of rows can take time. You can save the index to a file and load it later:
```python
# Save index
dataset.save_index("my_dataset_index.pkl")

# Load index later (much faster than from_folder)
loaded_dataset = IndexedParquetDataset.load_index("my_dataset_index.pkl")
```
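A common pattern is to build the index once and reuse it across runs. The small sketch below falls back to a full folder scan only when no saved index is found; the index file name and data path are arbitrary placeholders.

```python
import os
from indexed_parquet import IndexedParquetDataset

INDEX_PATH = "my_dataset_index.pkl"  # arbitrary file name

# Reuse a previously saved index when available; otherwise scan the folder once.
if os.path.exists(INDEX_PATH):
    dataset = IndexedParquetDataset.load_index(INDEX_PATH)
else:
    dataset = IndexedParquetDataset.from_folder("./path/to/data")
    dataset.save_index(INDEX_PATH)
```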
Technical Requirements
- PyArrow: Backend for Parquet file operations.
- NumPy: Efficient array management for indexing.
- PyTorch: Seamless integration for machine learning pipelines.
- Pandas: Schema management and metadata utilities.
License
This project is licensed under the Apache 2.0 License.