pbz2 v0.2.0
Stream and parallel-process .bz2 files via pbzip2 (parallel bzip2).
Reads compressed files through a pbzip2 -dc subprocess for multi-core decompression — no temp files, no full decompression to disk — and falls back to the stdlib bz2 module when the pbzip2 binary is unavailable. Iterate raw lines, newline-aligned text chunks, or parsed JSONL records, or fan chunks out across a process pool for parallel parsing. Includes a CLI for quick inspection and a Python API for custom pipelines. Corrupt or truncated input raises instead of silently yielding partial data.
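The subprocess-with-fallback idea described above can be sketched with the standard library alone. This is an illustration under assumptions, not pbz2's actual implementation: the function name `read_decompressed` and its `block_size` parameter are hypothetical, and the only pbzip2 flags used are the `-dc` ones named above.

```python
import bz2
import shutil
import subprocess
from collections.abc import Iterator


def read_decompressed(path: str, block_size: int = 1 << 20) -> Iterator[bytes]:
    """Yield decompressed bytes block by block (hypothetical helper, not pbz2's API)."""
    exe = shutil.which("pbzip2")
    if exe is None:
        # Fallback: single-threaded stdlib decompression.
        # A truncated file makes fh.read() raise EOFError.
        with bz2.open(path, "rb") as fh:
            while block := fh.read(block_size):
                yield block
        return

    # -d decompresses, -c writes to stdout, so bytes arrive through a pipe
    # and nothing is written to disk.
    proc = subprocess.Popen(
        [exe, "-dc", path],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    assert proc.stdout is not None
    try:
        while block := proc.stdout.read(block_size):
            yield block
    finally:
        proc.stdout.close()
        returncode = proc.wait()
    # Surface decompression failures instead of ending the stream quietly.
    if returncode != 0:
        raise OSError(f"pbzip2 exited with status {returncode} for {path!r}")
```

Checking the exit status after the pipe is drained is one way to turn a failed decompression into an exception; how pbz2 itself detects corrupt input is not shown here.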
Project Structure
pbz2/
├── pbz2/ # Python library
│ ├── reader.py # Streaming readers (open_decompress, iter_*)
│ ├── parallel.py # Process-pool chunk processing
│ └── cli.py # Typer CLI commands
├── tests/ # Test suite
└── pyproject.toml # Project configuration
Installation
uv add pbz2
From source:
git clone https://github.com/gitronald/pbz2.git
cd pbz2
uv sync
From a specific branch:
uv add git+https://github.com/gitronald/pbz2.git@dev
Install the pbzip2 binary for parallel decompression (optional — without it, reads fall back to single-threaded stdlib bz2):
sudo apt install pbzip2 # Debian/Ubuntu
brew install pbzip2 # macOS
Note: the parallel speedup only applies to files that were compressed with pbzip2. pbzip2 writes its output as multiple independent bzip2 streams that can be decompressed concurrently; a file compressed with standard `bzip2` (or Python's `bz2`) is a single stream, which pbzip2 can only decompress on one core. Compress with `pbzip2 data.json` to get parallel decompression later.
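The difference between a single stream and several independent streams can be seen with the stdlib `bz2` module. The sketch below builds a multi-stream payload by hand; it only imitates the general shape of pbzip2 output (pbzip2 chooses its own split points), and `decode_streams` is a hypothetical helper.

```python
import bz2

parts = [b"first block\n", b"second block\n", b"third block\n"]

# One stream: what bzip2 or bz2.compress produces for the whole input.
single_stream = bz2.compress(b"".join(parts))

# Several complete streams laid end to end.
multi_stream = b"".join(bz2.compress(p) for p in parts)


def decode_streams(data: bytes) -> list[bytes]:
    """Decode each bzip2 stream separately and return one payload per stream."""
    payloads = []
    while data:
        decomp = bz2.BZ2Decompressor()
        payloads.append(decomp.decompress(data))
        if not decomp.eof:
            raise ValueError("data ended before the end-of-stream marker")
        # Bytes after the end of this stream belong to the next one.
        data = decomp.unused_data
    return payloads


# Both layouts decode to the same bytes...
assert bz2.decompress(single_stream) == bz2.decompress(multi_stream)
# ...but only the multi-stream layout has pieces that decode on their own.
assert decode_streams(multi_stream) == parts
assert len(decode_streams(single_stream)) == 1
```

Because every stream in the second layout is self-contained, separate workers could decode them at the same time, which is the property the note above relies on.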
CLI Commands
Quick inspection of .bz2 files from the shell:
# Count lines
pbz2 count data.json.bz2
# Print the first N lines
pbz2 head data.json.bz2 -n 5
Python API
Iterate
import pbz2

# Parsed JSON objects from a .json.bz2 file
for obj in pbz2.iter_jsonl("data.json.bz2"):
    ...

# Raw UTF-8 lines
for line in pbz2.iter_lines("data.txt.bz2"):
    ...

# Newline-aligned text chunks (useful for batched processing)
for chunk in pbz2.iter_chunks("data.txt.bz2"):
    ...
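To make "newline-aligned" concrete, here is a minimal sketch of how fixed-size reads from a byte stream can be regrouped into chunks that end on a line boundary. It is not pbz2's code: `newline_aligned_chunks` and `read_size` are hypothetical names, and yielding a final unterminated line as its own chunk is an assumption.

```python
import io
from collections.abc import Iterator
from typing import BinaryIO


def newline_aligned_chunks(stream: BinaryIO, read_size: int = 4 * 1024 * 1024) -> Iterator[str]:
    """Yield UTF-8 text chunks that end on a newline (hypothetical helper)."""
    carry = b""  # bytes after the last newline seen so far
    while block := stream.read(read_size):
        buf = carry + block
        cut = buf.rfind(b"\n")
        if cut == -1:
            # No complete line yet; keep accumulating.
            carry = buf
            continue
        # Cutting at the byte 0x0A is safe: it never occurs inside a
        # multi-byte UTF-8 sequence, so the decode cannot split a character.
        yield buf[: cut + 1].decode("utf-8")
        carry = buf[cut + 1 :]
    if carry:
        # Assumption: a trailing line without a newline is still emitted.
        yield carry.decode("utf-8")


chunks = list(newline_aligned_chunks(io.BytesIO(b"a\nbb\nccc\nd"), read_size=3))
assert chunks == ["a\n", "bb\n", "ccc\n", "d"]
```

Since no record is ever split across two chunks, each chunk can be handed to a worker and parsed without looking at its neighbours.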
Parallel processing
process_parallel streams chunks of newline-terminated records through a worker pool. The worker function receives raw text chunks (so parsing happens in the worker, not the main process), and on_result runs in the main process to handle each result as it completes.
import json
import pbz2


def parse_chunk(chunk: str) -> list[dict]:
    # split on "\n" only -- str.splitlines() also breaks on U+2028/U+2029 etc.,
    # which can appear raw inside records and would shatter them
    return [json.loads(line) for line in chunk.split("\n") if line]


def save(records: list[dict]) -> None:
    ...  # write to db, file, etc.


pbz2.process_parallel(
    "data.json.bz2",
    worker_fn=parse_chunk,
    on_result=save,
    num_processes=8,
)
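The pattern behind this call (submit chunks to a pool, cap the number in flight, hand each result to a callback in the main process) can be sketched with `concurrent.futures`. This is an assumption about the general technique, not a description of how `process_parallel` is implemented; `run_bounded` and `count_lines` are hypothetical names, and the `max_pending` argument only borrows its name from the reference table.

```python
from collections.abc import Callable, Iterable
from concurrent.futures import FIRST_COMPLETED, Executor, Future, wait
from typing import Any


def count_lines(chunk: str) -> int:
    """Example worker: count newline-terminated records in one chunk."""
    return chunk.count("\n")


def run_bounded(
    executor: Executor,
    chunks: Iterable[str],
    worker_fn: Callable[[str], Any],
    on_result: Callable[[Any], None],
    max_pending: int = 4,
) -> None:
    """Run worker_fn over chunks with at most max_pending tasks in flight."""
    pending: set[Future] = set()
    for chunk in chunks:
        pending.add(executor.submit(worker_fn, chunk))
        if len(pending) >= max_pending:
            # Block until at least one task finishes, so the chunk producer
            # cannot run arbitrarily far ahead of the workers.
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                on_result(fut.result())  # runs in the calling process
    # Drain whatever is still running once the input is exhausted.
    for fut in wait(pending).done:
        on_result(fut.result())
```

Passing a `ProcessPoolExecutor` gives process-based parallelism, provided `worker_fn` is a picklable top-level function and the call sits under an `if __name__ == "__main__":` guard on platforms that spawn workers. Results arrive in completion order, not input order.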
Reference
| Function | Description |
|---|---|
| `iter_chunks(path, **opts)` | Yield UTF-8 text chunks ending on a newline boundary. |
| `iter_lines(path, **opts)` | Yield non-empty UTF-8 lines (no trailing newline). |
| `iter_jsonl(path, *, loads=None, **opts)` | Yield parsed JSON objects (uses orjson; pass `loads=` to override). |
| `process_parallel(path, worker_fn, *, on_result=None, worker_args=(), num_processes=None, max_pending=None, ...)` | Run `worker_fn(chunk, *worker_args)` in a process pool, dispatching results to `on_result`. |
| `open_decompress(path, **opts)` | Low-level: open a binary stream of decompressed bytes. |
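As an illustration of what a `loads=` override buys, the sketch below parses JSONL lines with a pluggable decoder. It is a stand-in under assumptions: `parse_jsonl` is a hypothetical helper that takes already-decoded lines, and it defaults to the stdlib `json.loads` where the table says pbz2 defaults to orjson.

```python
import json
from collections.abc import Callable, Iterable, Iterator
from decimal import Decimal
from typing import Any, Optional


def parse_jsonl(
    lines: Iterable[str],
    loads: Optional[Callable[[str], Any]] = None,
) -> Iterator[Any]:
    """Parse one JSON document per non-empty line (hypothetical helper)."""
    decode = json.loads if loads is None else loads
    for line in lines:
        if line.strip():
            yield decode(line)


def loads_decimal(text: str) -> Any:
    """Custom decoder: keep floats exact by parsing them as Decimal."""
    return json.loads(text, parse_float=Decimal)


lines = ['{"price": 2.5}', "", '{"price": 0.1}']
assert list(parse_jsonl(lines)) == [{"price": 2.5}, {"price": 0.1}]
assert list(parse_jsonl(lines, loads=loads_decimal))[1]["price"] == Decimal("0.1")
```

A custom decoder like this is the kind of callable one would presumably pass as `loads=` to `iter_jsonl`; check the library's own documentation for the exact input type it hands to the decoder.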
Common options
- `num_processors`: pbzip2 worker count (default: cpu_count - 1)
- `bufsize_mb`: OS pipe buffer between pbzip2 and Python (default: 32 MB)
- `stream_buffer_mb`: Python-side read chunk size (default: 4 MB)