
Makes Parquet workflows a piece of cake to slice, save, and serve.

Project description

🧁 Parcake

Parcake logo

Parcake makes Parquet workflows a piece of cake to slice, save, and serve. It focuses on ergonomic chunked writing and row-group iteration so you can stream data to and from Parquet without juggling low-level details.


✨ Features

Parcake provides high-level helpers to simplify chunked/pieced reading and writing of Parquet files with a focus on efficiency and ease of use:

  • 🍰 Chunked writing with schema enforcement via PieceSaver
  • 🔁 Row-group iteration that yields pandas DataFrames with PieceReader
  • ⚙️ Parallel row-group processing with PieceReader.process
  • 💾 Memory-aware operations that avoid loading entire datasets at once
  • 🔽 DuckDB-powered sorting for large datasets with PieceSorter
  • 🧮 Streaming group-by aggregation for massive datasets via PieceGrouper

These utilities make it easier to work with large Parquet datasets—ideal for data pipelines, preprocessing, or scalable ETL jobs.

Full documentation lives in the docs/ directory and is ready for Sphinx builds.


🛠️ Installation

Install from PyPI:

pip install parcake

Or install from source in editable mode:

git clone https://github.com/filipinascimento/parcake.git
cd parcake
pip install meson-python meson ninja
pip install -e . --no-build-isolation

The --no-build-isolation flag makes pip build against the meson-python, meson, and ninja packages installed in the previous step rather than a fresh isolated environment, avoiding dependency conflicts when installing from source.


🚀 Quick Start

PieceSaver — write Parquet pieces safely

from pathlib import Path
import pandas as pd
from parcake import PieceSaver

output = Path("./events.parquet")
header = {"user": "str", "timestamp": "datetime64[ns]", "value": "float"}

with PieceSaver(header, output, max_piece_size=10_000) as saver:
    saver.add(user="alice", timestamp=pd.Timestamp.utcnow(), value=1.0)
    saver.add(user="bob", timestamp=pd.Timestamp.utcnow(), value=2.0)

PieceSaver buffers rows until max_piece_size is reached, then flushes a new row group to disk automatically. You can call add_many() with an iterable of rows, construct an instance via from_schema(), or inspect rows_written and the current buffer_size for monitoring. Need a specific compression level? Pass compression_level when constructing the saver; None preserves the Parquet writer default.
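As a minimal sketch of those helpers (assuming add_many() accepts an iterable of row dicts shaped like the add() keyword arguments; compression_level=5 is purely illustrative):

import pandas as pd
from parcake import PieceSaver

header = {"user": "str", "timestamp": "datetime64[ns]", "value": "float"}
rows = (
    {"user": f"user{i}", "timestamp": pd.Timestamp.utcnow(), "value": float(i)}
    for i in range(25_000)
)

with PieceSaver(header, "./bulk_events.parquet", max_piece_size=10_000,
                compression_level=5) as saver:
    saver.add_many(rows)                    # bulk-add an iterable of rows
    print("written:", saver.rows_written)   # rows already flushed to disk
    print("buffered:", saver.buffer_size)   # rows waiting in the current piece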


PieceReader — iterate row groups efficiently

from parcake import PieceReader

reader = PieceReader("./events.parquet")
for frame in reader:  # sequential iteration
    print(len(frame))

Need additional context? Use iter_with_info() to obtain the path and row-group index alongside each DataFrame:

for frame, path, rg in PieceReader(["events_a.parquet", "events_b.parquet"]).iter_with_info():
    print(path.name, rg, len(frame))

PieceSorter — reorder Parquet files efficiently

from parcake import PieceSorter

sources = ['events_a.parquet', 'events_b.parquet']
sorter = PieceSorter(sources, columns=['timestamp'])
sorter.sort('./events_sorted.parquet', compression='preserve', threads=4)

Provide per-column directions with tuples, for example ('timestamp', True) for ascending or ('value', False) for descending. Pass directories or glob patterns to combine many files.
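For example, a sketch combining both options (the glob pattern and file layout here are hypothetical):

from parcake import PieceSorter

# Ascending by timestamp first, then descending by value; sources given as a glob.
sorter = PieceSorter('data/events_*.parquet',
                     columns=[('timestamp', True), ('value', False)])
sorter.sort('./events_sorted_by_time.parquet', threads=4)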

PieceGrouper — stream grouped aggregations

from parcake import PieceGrouper

sources = ['events_a.parquet', 'events_b.parquet']
with PieceGrouper(sources, group_by=['customer_id', 'region']) as grouper:
    for key, batches in grouper:
        for frame in batches:
            print(key, frame.shape)

    summary = grouper.aggregate({'revenue': ['sum', 'max'], 'order_id': 'count'})
    unique_groups = grouper.unique()

PieceGrouper guarantees consistent grouping even when the input is unsorted. It can optionally materialise a sorted scratch file for reuse, and it exposes helpers like aggregate, unique, and sorted_path to speed up analytics workloads without loading whole datasets into memory.
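A minimal sketch of those helpers, assuming sorted_path is readable as an attribute once the scratch file has been materialised:

from parcake import PieceGrouper

sources = ['events_a.parquet', 'events_b.parquet']
with PieceGrouper(sources, group_by=['customer_id']) as grouper:
    totals = grouper.aggregate({'revenue': 'sum'})  # streaming aggregation
    print(grouper.sorted_path)  # assumed attribute: location of the sorted scratch file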

Parallel processing with PieceReader.process

from concurrent.futures import ThreadPoolExecutor
from parcake import PieceReader

reader = PieceReader("./events.parquet")

def summarise(df, rg, path):
    return path.name, rg, df["value"].sum()

with ThreadPoolExecutor(max_workers=4) as executor:
    for result in reader.process(summarise, executor=executor, keep_order=True):
        print(result)

PieceReader.process works with multiprocessing pools or concurrent.futures executors, or runs sequentially when no pool is provided. Set ncpu=-1 to use all available cores, keep_order=True to emit results in file × row-group order, or unordered=True to stream results as soon as they are ready.
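A sketch of the pool-free form, with parameter names taken from the description above (exact behaviour may vary between releases):

from parcake import PieceReader

def total(df, rg, path):
    # Reduce one row group to a single number.
    return df["value"].sum()

reader = PieceReader("./events.parquet")

# All available cores, results streamed as soon as each row group finishes.
for result in reader.process(total, ncpu=-1, unordered=True):
    print(result)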


🧪 Examples

See the Sphinx-style docs in docs/—notably docs/examples/save_example.py, docs/examples/load_example.py, docs/examples/sort_example.py, and docs/examples/group_example.py for end-to-end examples that create Parquet files with PieceSaver, stream them with PieceReader, perform large-file sorting with PieceSorter, and run memory-efficient group-by workflows via PieceGrouper.


📦 API Overview

  • PieceSaver: Buffered writer that saves rows in fixed-size pieces with schema validation
  • PieceReader: Iterator that yields Parquet row groups as pandas DataFrames
  • PieceSorter: DuckDB-backed external sorting for multi-file Parquet datasets
  • PieceGrouper: Streaming group-by iteration and aggregation over Parquet sources
  • PieceReader.process: Apply a callable to each row group, sequentially or in parallel

🧠 Why Parcake?

  • Memory Efficient: Designed for datasets larger than available RAM
  • Simple and Composable: Fits naturally into Pythonic data workflows
  • Scalable: Parallel row-group processing with flexible execution backends
  • Lightweight: Built only on essential dependencies (pandas, pyarrow)

🔖 Versioning

  • Versions come from annotated Git tags such as v0.2.0; setuptools-scm reads them for builds.
  • Use make release TAG=v0.2.0 to sync metadata, commit, tag, and build the release artifact.
  • Run python scripts/get_version.py to see the version that will be embedded into packages.
  • For manual workflows, run python scripts/sync_version.py --version v0.2.0 to write the tag into pyproject.toml and meson.build.
  • The package exposes parcake.__version__, which resolves to the installed package version and falls back to 0.0.0 in editable installs; see the snippet below.
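For example, checking the version at runtime:

import parcake

# Prints the installed version, or "0.0.0" under an editable install.
print(parcake.__version__)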

📜 License

This project is licensed under the MIT License — see the LICENSE file for details.


🧁 About the Name

Parcake = Parquet + Piece of Cake 🍰. It’s a small, sweet library that makes your Parquet workflows simpler, chunk by chunk.

Project details


Download files

Download the file for your platform.

Source Distribution

parcake-0.4.1.tar.gz (1.9 MB)

Uploaded Source

Built Distribution


parcake-0.4.1-py3-none-any.whl (20.8 kB)

Uploaded Python 3

File details

Details for the file parcake-0.4.1.tar.gz.

File metadata

  • Download URL: parcake-0.4.1.tar.gz
  • Upload date:
  • Size: 1.9 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for parcake-0.4.1.tar.gz

  • SHA256: 16d2ad286df28d014ebcd92301e39a1aba889f039494eb704a48438bb28d4dea
  • MD5: 59b551845efddb79359dac2784a6d3ba
  • BLAKE2b-256: c98c60f420ba35a9b8f602d3c82bb2ddbc0f2a256520dbb1c582eb41536cefdc


Provenance

The following attestation bundles were made for parcake-0.4.1.tar.gz:

Publisher: python-publish.yml on filipinascimento/parcake

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file parcake-0.4.1-py3-none-any.whl.

File metadata

  • Download URL: parcake-0.4.1-py3-none-any.whl
  • Upload date:
  • Size: 20.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for parcake-0.4.1-py3-none-any.whl

  • SHA256: ac960be658461a92e013c91c8701ffa71712eb331229b669e623d93c311fc668
  • MD5: ef59abeaee19083271529f71124d772b
  • BLAKE2b-256: f888963dbbf75d5db06972093e115108b7f17f3a099fb04fe9579035585f7ab6


Provenance

The following attestation bundles were made for parcake-0.4.1-py3-none-any.whl:

Publisher: python-publish.yml on filipinascimento/parcake

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
