# 🧁 Parcake
Parcake makes Parquet workflows a piece of cake to slice, save, and serve. It focuses on ergonomic chunked writing and row-group iteration so you can stream data to and from Parquet without juggling low-level details.
## ✨ Features

Parcake provides high-level helpers to simplify chunked/pieced reading and writing of Parquet files with a focus on efficiency and ease of use:

- 🍰 Chunked writing with schema enforcement via `PieceSaver`
- 🔁 Row-group iteration that yields pandas DataFrames with `PieceReader`
- ⚙️ Parallel row-group processing with `PieceReader.process`
- 💾 Memory-aware operations that avoid loading entire datasets at once
- 🔽 DuckDB-powered sorting for large datasets with `PieceSorter`
- 🧮 Streaming group-by aggregation for massive datasets via `PieceGrouper`
These utilities make it easier to work with large Parquet datasets—ideal for data pipelines, preprocessing, or scalable ETL jobs.
Full documentation lives in the `docs/` directory, ready for Sphinx builds.
## 🛠️ Installation

Install from PyPI:

```shell
pip install parcake
```

Or install from source in editable mode:

```shell
git clone https://github.com/your-org/parcake.git
cd parcake
pip install meson-python meson ninja
pip install -e . --no-build-isolation
```

The `--no-build-isolation` option is required to avoid dependency conflicts when installing from source.
## 🚀 Quick Start

### PieceSaver — write Parquet pieces safely

```python
from pathlib import Path

import pandas as pd

from parcake import PieceSaver

output = Path("./events.parquet")
header = {"user": "str", "timestamp": "datetime64[ns]", "value": "float"}

with PieceSaver(header, output, max_piece_size=10_000) as saver:
    saver.add(user="alice", timestamp=pd.Timestamp.utcnow(), value=1.0)
    saver.add(user="bob", timestamp=pd.Timestamp.utcnow(), value=2.0)
```
`PieceSaver` buffers rows until `max_piece_size` is reached, then automatically flushes a new row group to disk. You can call `add_many()` with an iterable of rows, construct an instance with `from_schema()`, or inspect `rows_written` and the current `buffer_size` for monitoring. Need a specific compression level? Pass `compression_level` when constructing the saver; `None` preserves the Parquet writer default.
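The flush-on-threshold behaviour described above can be illustrated with a tiny stdlib-only sketch. `TinyBufferedWriter` is a toy model for illustration, not Parcake's actual implementation:

```python
class TinyBufferedWriter:
    """Toy model of a piece-based writer: buffer rows, flush in fixed-size pieces."""

    def __init__(self, max_piece_size):
        self.max_piece_size = max_piece_size
        self.buffer = []
        self.pieces = []       # stands in for row groups written to disk
        self.rows_written = 0  # mirrors PieceSaver's monitoring counter

    def add(self, **row):
        self.buffer.append(row)
        if len(self.buffer) >= self.max_piece_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.pieces.append(self.buffer)  # "write" one row group
            self.rows_written += len(self.buffer)
            self.buffer = []

writer = TinyBufferedWriter(max_piece_size=2)
for i in range(5):
    writer.add(value=float(i))
writer.flush()  # flush the final partial piece, as closing the saver would
print(len(writer.pieces), writer.rows_written)  # → 3 5
```

Five rows with a piece size of 2 produce pieces of 2, 2, and 1 rows, which is why a final flush on close matters: without it the last partial piece would stay in memory.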
### PieceReader — iterate row groups efficiently

```python
from parcake import PieceReader

reader = PieceReader("./events.parquet")
for frame in reader:  # sequential iteration
    print(len(frame))
```

Need additional context? Use `iter_with_info()` to obtain the path and row-group index alongside each DataFrame:

```python
for frame, path, rg in PieceReader(["events_a.parquet", "events_b.parquet"]).iter_with_info():
    print(path.name, rg, len(frame))
```
### PieceSorter — reorder Parquet files efficiently

```python
from parcake import PieceSorter

sources = ["events_a.parquet", "events_b.parquet"]
sorter = PieceSorter(sources, columns=["timestamp"])
sorter.sort("./events_sorted.parquet", compression="preserve", threads=4)
```

Provide per-column directions with tuples, for example `("timestamp", True)` for ascending or `("value", False)` for descending. Pass directories or glob patterns to combine many files.
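To make the direction tuples concrete, here is a stdlib-only sketch of how a mixed list of names and `(column, ascending)` tuples could be normalised into a DuckDB-style `ORDER BY` clause. `build_order_by` is a hypothetical helper for illustration, not part of Parcake's API:

```python
def build_order_by(columns):
    """Turn a mix of names and (name, ascending) tuples into an ORDER BY clause."""
    parts = []
    for col in columns:
        if isinstance(col, tuple):
            name, ascending = col
        else:
            name, ascending = col, True  # bare names default to ascending
        parts.append(f'"{name}" {"ASC" if ascending else "DESC"}')
    return "ORDER BY " + ", ".join(parts)

print(build_order_by([("timestamp", True), ("value", False)]))
# → ORDER BY "timestamp" ASC, "value" DESC
```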
### PieceGrouper — stream grouped aggregations

```python
from parcake import PieceGrouper

sources = ["events_a.parquet", "events_b.parquet"]
with PieceGrouper(sources, group_by=["customer_id", "region"]) as grouper:
    for key, batches in grouper:
        for frame in batches:
            print(key, frame.shape)
    summary = grouper.aggregate({"revenue": ["sum", "max"], "order_id": "count"})
    unique_groups = grouper.unique()
```

`PieceGrouper` guarantees consistent grouping even when the input is unsorted, optionally materialises a sorted scratch file for reuse, and exposes helpers like `aggregate`, `unique`, and `sorted_path` to speed up analytics workloads without loading whole datasets into memory.
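The reason a sorted scratch file enables streaming group-by can be shown with the standard library alone. Once rows arrive sorted on the group keys, each key appears exactly once and only one group needs to be in memory at a time; this toy example shows the pattern, not Parcake's internals:

```python
from itertools import groupby
from operator import itemgetter

# Rows as they would arrive from a sorted scratch file, one batch at a time.
rows = [
    {"customer_id": 1, "region": "EU", "revenue": 10.0},
    {"customer_id": 1, "region": "EU", "revenue": 5.0},
    {"customer_id": 2, "region": "US", "revenue": 7.0},
]

# Because the input is sorted on the group keys, groupby emits each key
# exactly once, so a full hash table of all groups is never needed.
key = itemgetter("customer_id", "region")
for group_key, group_rows in groupby(rows, key=key):
    total = sum(r["revenue"] for r in group_rows)
    print(group_key, total)
# → (1, 'EU') 15.0
# → (2, 'US') 7.0
```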
### Parallel processing with PieceReader.process

```python
from concurrent.futures import ThreadPoolExecutor

from parcake import PieceReader

reader = PieceReader("./events.parquet")

def summarise(df, rg, path):
    return path.name, rg, df["value"].sum()

with ThreadPoolExecutor(max_workers=4) as executor:
    for result in reader.process(summarise, executor=executor, keep_order=True):
        print(result)
```

`PieceReader.process` can work with multiprocessing pools, executors, or sequentially when no pool is provided. Set `ncpu=-1` to use all available cores, `keep_order=True` to emit results in file × row-group order, or `unordered=True` to stream results as soon as they are ready.
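The difference between ordered and unordered delivery can be sketched with the standard library alone. This is the generic scheduling pattern such options typically imply, not Parcake code:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def square(n):
    return n * n

tasks = [3, 1, 2]

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(square, n) for n in tasks]

    # keep_order=True style: results come back in submission order,
    # even if a later task happens to finish first.
    ordered = [f.result() for f in futures]

    # unordered=True style: results stream back in completion order,
    # so consumers never wait on a slow early task.
    unordered = [f.result() for f in as_completed(futures)]

print(ordered)  # → [9, 1, 4], always submission order
print(sorted(unordered))  # same values; arrival order is arbitrary
```

Ordered delivery is convenient when results must line up with file × row-group order; unordered delivery minimises latency when any result can be consumed as soon as it is ready.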
## 🧪 Examples

See the Sphinx-style docs in `docs/`, notably `docs/examples/save_example.py`, `docs/examples/load_example.py`, `docs/examples/sort_example.py`, and `docs/examples/group_example.py`, for end-to-end examples that create Parquet files with `PieceSaver`, stream them with `PieceReader`, perform large-file sorting with `PieceSorter`, and run memory-efficient group-by workflows via `PieceGrouper`.
## 📦 API Overview

| Class/Method | Description |
|---|---|
| `PieceSaver` | Buffered writer that saves rows in fixed-size pieces with schema validation |
| `PieceReader` | Iterator that yields Parquet row groups as pandas DataFrames |
| `PieceSorter` | DuckDB-backed external sorting for multi-file Parquet datasets |
| `PieceGrouper` | Streaming group-by iteration and aggregation over Parquet sources |
| `PieceReader.process` | Apply a callable to each row group sequentially or in parallel |
## 🧠 Why Parcake?

- **Memory efficient**: designed for datasets larger than available RAM
- **Simple and composable**: fits naturally into Pythonic data workflows
- **Scalable**: parallel row-group processing with flexible execution backends
- **Lightweight**: built only on essential dependencies (pandas, pyarrow)
## 🔖 Versioning

- Versions come from annotated Git tags such as `v0.2.0`; `setuptools-scm` reads them for builds.
- Use `make release TAG=v0.2.0` to sync metadata, commit, tag, and build the release artifact.
- Run `python scripts/get_version.py` to see the version that will be embedded into packages.
- For manual workflows, run `python scripts/sync_version.py --version v0.2.0` to write the tag into `pyproject.toml` and `meson.build`.
- The package exposes `parcake.__version__`, which resolves to the installed build number (falls back to `0.0.0` in editable installs).
## 📜 License

This project is licensed under the MIT License; see the LICENSE file for details.
## 🧁 About the Name
Parcake = Parquet + Piece of Cake 🍰 It’s a small, sweet library to make your Parquet workflows simpler, chunk by chunk.