
Project description


Indexed Parquet Dataset

High-performance O(1) random access indexer for Parquet datasets in PyTorch and Python.

This library provides an efficient way to handle large-scale datasets stored as multiple Parquet files, allowing for constant-time random access to any row without loading the entire dataset into memory.

Architecture

The following diagram illustrates how the IndexedParquetDataset remains lightweight by only keeping metadata in memory, while the actual data stays on disk:

graph TD
    subgraph RAM ["Application RAM (Lightweight)"]
        direction TB
        subgraph DS ["IndexedParquetDataset Object"]
            Indices["Indices Array [np.ndarray]<br/>(Filtered/Shuffled indices)"]
            Meta["Metadata & Schema<br/>(File offsets, column maps)"]
            Cache["File Handle Cache<br/>(Lazy Loading)"]
        end
        
        User["User Code / PyTorch DataLoader"] -- "dataset(idx)" --> Indices
        Indices -- "Global Index" --> Meta
        Meta -- "Find File & Row Offset" --> Cache
    end
    
    subgraph Storage ["Storage (SSD/HDD - Large Files)"]
        F1["data_part_1.parquet"]
        F2["data_part_2.parquet"]
        FN["data_part_N.parquet"]
    end
    
    Cache -- "Lazy Read" --> F1
    Cache -- "Lazy Read" --> F2
    Cache -- "Lazy Read" --> FN
    
    F1 -. "O(1) Row Retrieval" .-> User
    F2 -. "O(1) Row Retrieval" .-> User
    FN -. "O(1) Row Retrieval" .-> User

    style DS fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    style Storage fill:#f5f5f5,stroke:#424242,stroke-width:2px,stroke-dasharray: 5 5
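
The diagram shows a global row index being resolved to a specific file and row offset using only the metadata held in RAM. The package's actual internals are not reproduced here, but the following sketch illustrates one way such a lookup can work, using cumulative per-file row counts and a binary search over file boundaries (all names and numbers below are illustrative assumptions, not the library's real implementation):

import numpy as np

# Hypothetical metadata gathered once while indexing the folder:
# number of rows in each Parquet file, in file order.
rows_per_file = np.array([1_000_000, 850_000, 920_000])

# boundaries[i] is the first global index that belongs to file i + 1.
boundaries = np.cumsum(rows_per_file)

def locate(global_idx: int) -> tuple[int, int]:
    """Map a global row index to (file_number, row_within_file)."""
    file_no = int(np.searchsorted(boundaries, global_idx, side="right"))
    local_row = global_idx - (int(boundaries[file_no - 1]) if file_no > 0 else 0)
    return file_no, local_row

print(locate(0))          # (0, 0)      -> first row of data_part_1.parquet
print(locate(1_200_000))  # (1, 200000) -> row 200,000 of data_part_2.parquet

The binary search runs over the list of files rather than the rows, so resolving an index stays effectively constant-time; the row itself is then read lazily from the matching file.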

Features

  • Fluent API: Chainable methods for data processing including shuffle, filter, alias, split, limit, rename, copy, and concat.
  • Computed Columns: Create new columns or replace existing ones using Python functions (lambdas) via the alias method.
  • Explicit Casting: Change column types on the fly with the cast method.
  • Materialization: Bake all transformations and computations into a real Parquet file via clone(path), eliminating runtime overhead for streaming.
  • Dynamic Schema Evolution: Support for concatenating datasets with different schemas and automatic type alignment.
  • Constant-Time Access: Indexed random access with O(1) complexity regardless of dataset size.
  • Lazy Loading: Concurrency-safe file access with a minimal memory footprint.
  • PyTorch Integration: Fully compatible with torch.utils.data.Dataset.
  • Index Persistence: Save and load dataset indices to skip the indexing phase in future runs.

Installation

You can install the package directly from GitHub using pip:

pip install git+https://github.com/Laeryid/indexed-parquet-dataset

Project Structure

  • src/indexed_parquet/: Core package directory.
    • dataset.py: Implementation of IndexedParquetDataset and transformation logic.
    • indexer.py: Logic for scanning Parquet files and building global row maps.
    • schema.py: Utilities for handling Parquet schemas and data types.
  • tests/: Comprehensive test suite verifying indexing, transformations, and PyTorch compatibility.
  • pyproject.toml: Project metadata and dependency definitions.

Usage

Basic Initialization

Create a dataset from a folder containing Parquet files:

from indexed_parquet import IndexedParquetDataset

# Scans the folder and builds an internal index
dataset = IndexedParquetDataset.from_folder("./path/to/data")

print(f"Total rows: {len(dataset)}")
print(f"First row: {dataset[0]}")

Fluent API (Transformations)

The dataset supports a chainable API for common data preparation tasks:

dataset = (IndexedParquetDataset.from_folder("./data")
           .filter(lambda x: x["split"] == "train")
           .shuffle(seed=42)
           .alias("text_len", lambda x: len(x["text"])) # Computed column
           .cast("text_len", "int")                    # Explicit casting
           .limit(10000))

# Accessing transformed data
sample = dataset[0] # sample["text_len"] is now available

Advanced Manipulation

Copying and Concatenating

Create independent copies or merge multiple datasets with automatic schema alignment:

# Create an independent copy
ds_copy = dataset.copy()

# Vertically concatenate two datasets
# Automatically handles overlapping columns and different aliases
combined_ds = dataset1.concat(dataset2)

Materialization (Cloning)

To avoid performance degradation when using multiple computed columns or heavy filters, you can "bake" the current state into a new Parquet file:

# to_parquet: just save to disk (export)
dataset.to_parquet("baked_data.parquet")

# clone: save to disk AND return a new dataset instance 
# pointing to the new file (zero-overhead)
fast_dataset = dataset.clone("materialized.parquet")

Note: clone() always requires a destination path. It ensures that all Python-based computations (lambdas) are executed once and their results are stored as real values in the new file.

Batch Reading

You can pass a list of indices to retrieve multiple rows efficiently:

batch_indices = [0, 10, 100, 500]
rows = dataset[batch_indices]

Index Persistence

Building an index for millions of rows can take time. You can save the index to a file and load it later:

# Save index
dataset.save_index("my_dataset_index.pkl")

# Load index later (much faster than from_folder)
loaded_dataset = IndexedParquetDataset.load_index("my_dataset_index.pkl")

Technical Requirements

  • PyArrow: Backend for Parquet file operations.
  • NumPy: Efficient array management for indexing.
  • PyTorch: Seamless integration for machine learning pipelines.
  • Pandas: Schema management and metadata utilities.

License

This project is licensed under the Apache 2.0 License.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

indexed_parquet_dataset-0.2.5.dev0.tar.gz (2.8 MB, source)

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

indexed_parquet_dataset-0.2.5.dev0-py3-none-any.whl (23.9 kB, Python 3)

File details

Details for the file indexed_parquet_dataset-0.2.5.dev0.tar.gz.

File metadata

File hashes

Hashes for indexed_parquet_dataset-0.2.5.dev0.tar.gz
Algorithm Hash digest
SHA256 81484bd7511b9f90386e9fde8ff5f36d57b1d456e88e02b5cd272d00c7868ddb
MD5 f210936931fe1c9347d02f1af27448f2
BLAKE2b-256 b09447339d7a1efa54235de5c6d2003ee623110de6880270d46c6f2acd21adcb

See more details on using hashes here.

Provenance

The following attestation bundles were made for indexed_parquet_dataset-0.2.5.dev0.tar.gz:

Publisher: publish.yml on Laeryid/indexed-parquet-dataset

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file indexed_parquet_dataset-0.2.5.dev0-py3-none-any.whl.

File metadata

File hashes

Hashes for indexed_parquet_dataset-0.2.5.dev0-py3-none-any.whl
Algorithm Hash digest
SHA256 c2f92409ff67054930e3e35a327421f5f73dcbaa58dcdf7b18d2d73da69512de
MD5 ff795ffe11a26dbeb7e305ecae9508d4
BLAKE2b-256 904afa3582e8b78241752f49b5353e27cdd15b836475df9e87e51b3eba93b5f5

See more details on using hashes here.

Provenance

The following attestation bundles were made for indexed_parquet_dataset-0.2.5.dev0-py3-none-any.whl:

Publisher: publish.yml on Laeryid/indexed-parquet-dataset

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
