pbz2
Stream and parallel-process .bz2 files via pbzip2 (parallel bzip2). Falls back to the stdlib bz2 module when the pbzip2 binary is unavailable.
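The fallback described above can be pictured roughly as follows. This is a minimal sketch, not pbz2's actual implementation: `open_bz2_stream` is a hypothetical helper, and detecting the binary with `shutil.which` is an assumption about how such a check might be done.

```python
import bz2
import shutil
import subprocess
from typing import IO


def open_bz2_stream(path: str) -> IO[bytes]:
    """Return a binary stream of decompressed bytes (hypothetical helper)."""
    if shutil.which("pbzip2"):
        # -d decompress, -c write to stdout, -k keep the input file.
        # A real implementation would also wait on the process and
        # check its exit code once the stream is exhausted.
        proc = subprocess.Popen(["pbzip2", "-dck", path], stdout=subprocess.PIPE)
        return proc.stdout
    # pbzip2 is not on PATH: single-threaded decompression via the stdlib.
    return bz2.open(path, "rb")
```

Either branch yields the same bytes; only the decompression throughput differs.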
Install
uv add pbz2
Install pbzip2 for parallel decompression:
sudo apt install pbzip2 # Debian/Ubuntu
brew install pbzip2 # macOS
Usage
Iterate
import pbz2

# Parsed JSON objects from a .json.bz2 file
for obj in pbz2.iter_jsonl("data.json.bz2"):
    ...

# Raw UTF-8 lines
for line in pbz2.iter_lines("data.txt.bz2"):
    ...

# Newline-aligned text chunks (useful for batched processing)
for chunk in pbz2.iter_chunks("data.txt.bz2"):
    ...
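To make "newline-aligned" concrete, here is a minimal sketch of how such chunking can be done over any binary stream. It is an illustration under assumptions, not pbz2's code: `newline_aligned_chunks` and `read_size` are hypothetical names.

```python
import io
from typing import BinaryIO, Iterator


def newline_aligned_chunks(stream: BinaryIO, read_size: int = 4 * 1024 * 1024) -> Iterator[str]:
    """Yield UTF-8 text chunks that end on a newline (hypothetical helper)."""
    remainder = b""
    while True:
        block = stream.read(read_size)
        if not block:
            break
        data = remainder + block
        # Cutting at the byte 0x0A is safe: it never occurs inside a
        # multi-byte UTF-8 sequence, so each piece decodes cleanly.
        cut = data.rfind(b"\n")
        if cut == -1:
            remainder = data  # no complete line yet, keep reading
            continue
        yield data[: cut + 1].decode("utf-8")
        remainder = data[cut + 1 :]
    if remainder:
        # Final record without a trailing newline
        yield remainder.decode("utf-8")


chunks = list(newline_aligned_chunks(io.BytesIO("aé\nbb\nccc\ndd".encode("utf-8")), read_size=4))
```

Because every chunk except possibly the last ends on a record boundary, a chunk can be handed to a worker without any record being split across two workers.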
Parallel processing
process_parallel streams chunks of newline-terminated records through a worker pool. The worker function receives raw text chunks (so parsing happens in the worker, not the main process), and on_result runs in the main process to handle each result as it completes.
import json
import pbz2


def parse_chunk(chunk: str) -> list[dict]:
    # Split on "\n" only: str.splitlines() also breaks on U+2028/U+2029 etc.,
    # which can appear raw inside records and would shatter them.
    return [json.loads(line) for line in chunk.split("\n") if line]


def save(records: list[dict]) -> None:
    ...  # write to db, file, etc.


pbz2.process_parallel(
    "data.json.bz2",
    worker_fn=parse_chunk,
    on_result=save,
    num_processes=8,
)
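The comment in `parse_chunk` about `str.splitlines()` can be checked with the standard library alone. The sample records below are made up for illustration.

```python
import json

# A record whose string value contains a raw U+2028 (LINE SEPARATOR).
# ensure_ascii=False keeps the character unescaped in the output.
record = json.dumps({"text": "first\u2028second"}, ensure_ascii=False)
chunk = record + "\n" + json.dumps({"text": "third"}) + "\n"

# Splitting on "\n" keeps each record whole.
by_newline = [line for line in chunk.split("\n") if line]

# splitlines() also breaks at U+2028, cutting the first record in two.
by_splitlines = chunk.splitlines()
```

Here `by_newline` holds two parseable records, while `by_splitlines` holds three fragments, the first two of which are not valid JSON on their own.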
CLI
pbz2 count data.json.bz2
pbz2 head data.json.bz2 -n 5
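For comparison, similar operations can be written in a few lines of Python over any iterable of lines. This is a sketch under assumptions: the source does not state exactly what `count` counts or what `head` prints, and the `count` and `head` functions below are hypothetical helpers rather than the CLI's implementation.

```python
from itertools import islice
from typing import Iterable, List


def count(lines: Iterable[str]) -> int:
    """Count items without holding them in memory."""
    return sum(1 for _ in lines)


def head(lines: Iterable[str], n: int = 5) -> List[str]:
    """Return the first n items and stop reading after that."""
    return list(islice(lines, n))
```

With pbz2 installed, `pbz2.iter_lines("data.json.bz2")` could be passed as `lines`.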
API
| Function | Description |
|---|---|
| `iter_chunks(path, **opts)` | Yield UTF-8 text chunks ending on a newline boundary. |
| `iter_lines(path, **opts)` | Yield non-empty UTF-8 lines (no trailing newline). |
| `iter_jsonl(path, *, loads=None, **opts)` | Yield parsed JSON objects (uses orjson if installed). |
| `process_parallel(path, worker_fn, *, on_result=None, worker_args=(), num_processes=None, max_pending=None, ...)` | Run `worker_fn(chunk, *worker_args)` in a process pool, dispatching results to `on_result`. |
| `open_decompress(path, **opts)` | Low-level: open a binary stream of decompressed bytes. |
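The `max_pending` parameter in the `process_parallel` signature suggests a cap on in-flight chunks. The sketch below shows one way such back-pressure can work with `concurrent.futures`. It is an assumption about the pattern, not pbz2's implementation, and `run_bounded` is a hypothetical name.

```python
from concurrent.futures import FIRST_COMPLETED, Executor, wait
from typing import Any, Callable, Iterable


def run_bounded(
    chunks: Iterable[str],
    worker_fn: Callable[[str], Any],
    on_result: Callable[[Any], None],
    executor: Executor,
    max_pending: int = 4,
) -> None:
    """Submit chunks to the executor, keeping at most max_pending in flight."""
    pending = set()
    for chunk in chunks:
        pending.add(executor.submit(worker_fn, chunk))
        if len(pending) >= max_pending:
            # Block until at least one task finishes, then handle the
            # finished results in the calling (main) thread.
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                on_result(fut.result())
    # Drain whatever is still running once the input is exhausted.
    for fut in wait(pending).done:
        on_result(fut.result())
```

Passing a `ProcessPoolExecutor` gives process-level parallelism, provided `worker_fn` is a picklable top-level function. Capping the pending set keeps a fast decompressor from filling memory with chunks the workers have not yet consumed.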
Common options
- `num_processors`: pbzip2 worker count (default: cpu_count - 1)
- `bufsize_mb`: OS pipe buffer between pbzip2 and Python (default: 32 MB)
- `stream_buffer_mb`: Python-side read chunk size (default: 4 MB)
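A sketch of how the options listed above might be passed. The keyword names and default values come from that list. Forwarding them as keyword arguments through `**opts` is an assumption based on the signatures in the API table, and `count_lines` is a hypothetical helper.

```python
import os

# Values mirror the documented defaults; clamping the worker count to at
# least 1 is an assumption for single-core machines.
opts = {
    "num_processors": max(1, (os.cpu_count() or 2) - 1),
    "bufsize_mb": 32,
    "stream_buffer_mb": 4,
}


def count_lines(path: str, **opts) -> int:
    """Count non-empty lines in a .bz2 file (hypothetical helper)."""
    import pbz2  # imported lazily so the sketch loads without the package

    return sum(1 for _ in pbz2.iter_lines(path, **opts))
```

Calling `count_lines("data.txt.bz2", **opts)` would then stream the file with those settings.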