Stream and parallel-process .bz2 files via pbzip2.

Maintainers

rer

Project description

pbz2 v0.2.0

Stream and parallel-process .bz2 files via pbzip2 (parallel bzip2).

Reads compressed files through a pbzip2 -dc subprocess for multi-core decompression — no temp files, no full decompression to disk — and falls back to the stdlib bz2 module when the pbzip2 binary is unavailable. Iterate raw lines, newline-aligned text chunks, or parsed JSONL records, or fan chunks out across a process pool for parallel parsing. Includes a CLI for quick inspection and a Python API for custom pipelines. Corrupt or truncated input raises instead of silently yielding partial data.
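The subprocess-with-fallback behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: the name `open_decompressed` is hypothetical, and the sketch assumes the caller reads the stream to the end (stopping early would make pbzip2 exit with a non-zero status).

```python
import bz2
import shutil
import subprocess
from contextlib import contextmanager
from typing import BinaryIO, Iterator


@contextmanager
def open_decompressed(path: str) -> Iterator[BinaryIO]:
    """Yield a binary stream of decompressed bytes (hypothetical helper)."""
    if shutil.which("pbzip2") is None:
        # Fallback: single-threaded decompression with the stdlib bz2 module.
        with bz2.open(path, "rb") as stream:
            yield stream
        return

    # -d decompress, -c write to stdout; nothing is written to disk.
    proc = subprocess.Popen(["pbzip2", "-dc", path], stdout=subprocess.PIPE)
    try:
        yield proc.stdout
    finally:
        proc.stdout.close()
        returncode = proc.wait()
    # Surface decompression failures instead of passing on partial data.
    if returncode != 0:
        raise RuntimeError(f"pbzip2 exited with status {returncode}")
```

Checking the exit status after the stream is consumed is one way to make corrupt or truncated input raise; the stdlib path raises on its own when the stream ends early.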

Project Structure

pbz2/
├── pbz2/                 # Python library
│   ├── reader.py         # Streaming readers (open_decompress, iter_*)
│   ├── parallel.py       # Process-pool chunk processing
│   └── cli.py            # Typer CLI commands
├── tests/                # Test suite
└── pyproject.toml        # Project configuration

Installation

uv add pbz2

From source:

git clone https://github.com/gitronald/pbz2.git
cd pbz2
uv sync

From a specific branch:

uv add git+https://github.com/gitronald/pbz2.git@dev

Install the pbzip2 binary for parallel decompression (optional — without it, reads fall back to single-threaded stdlib bz2):

sudo apt install pbzip2     # Debian/Ubuntu
brew install pbzip2         # macOS

Note: the parallel speedup only applies to files that were compressed with pbzip2. pbzip2 writes its output as multiple independent bzip2 streams that can be decompressed concurrently; a file compressed with standard bzip2 (or Python's bz2) is a single stream, which pbzip2 can only decompress on one core. Compress with pbzip2 data.json to get parallel decompression later.
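The multi-stream layout mentioned in the note can be reproduced with Python's own `bz2` module. The sketch below only illustrates the file format (independently compressed streams placed back to back); it does not call pbzip2, and the sample data is made up.

```python
import bz2

part_a = b'{"id": 1}\n' * 1000
part_b = b'{"id": 2}\n' * 1000

# Two independently compressed streams, concatenated into one file image.
multi_stream = bz2.compress(part_a) + bz2.compress(part_b)

# Each stream decompresses on its own, which is what makes
# concurrent decompression possible.
assert bz2.decompress(bz2.compress(part_b)) == part_b

# Decompressing the concatenation yields the original data in order.
restored = bz2.decompress(multi_stream)
assert restored == part_a + part_b
```

A file produced by a single `bz2.compress` call has only one such stream, so there is nothing to split across cores.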

CLI Commands

Quick inspection of .bz2 files from the shell:

# Count lines
pbz2 count data.json.bz2

# Print the first N lines
pbz2 head data.json.bz2 -n 5

Python API

Iterate

import pbz2

# Parsed JSON objects from a .json.bz2 file
for obj in pbz2.iter_jsonl("data.json.bz2"):
    ...

# Raw UTF-8 lines
for line in pbz2.iter_lines("data.txt.bz2"):
    ...

# Newline-aligned text chunks (useful for batched processing)
for chunk in pbz2.iter_chunks("data.txt.bz2"):
    ...
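To make "newline-aligned" concrete, here is a small sketch of how such chunks can be cut from any binary stream. The function name `newline_aligned_chunks` and its `read_size` parameter are illustrative assumptions, not part of the pbz2 API, and the library may handle edge cases differently.

```python
import io
from typing import BinaryIO, Iterator


def newline_aligned_chunks(
    stream: BinaryIO, read_size: int = 4 * 1024 * 1024
) -> Iterator[str]:
    """Yield UTF-8 text chunks that each end on a newline boundary."""
    tail = b""  # bytes after the last newline seen so far
    while True:
        block = stream.read(read_size)
        if not block:
            break
        data = tail + block
        cut = data.rfind(b"\n")
        if cut == -1:
            tail = data  # no newline yet; keep accumulating
            continue
        # Cutting at a newline byte never splits a multi-byte UTF-8 character.
        yield data[: cut + 1].decode("utf-8")
        tail = data[cut + 1 :]
    if tail:
        # Final record without a trailing newline.
        yield tail.decode("utf-8")
```

Because every chunk except possibly the last ends with `"\n"`, a worker can split a chunk into records without ever seeing half a line.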

Parallel processing

process_parallel streams chunks of newline-terminated records through a worker pool. The worker function receives raw text chunks (so parsing happens in the worker, not the main process), and on_result runs in the main process to handle each result as it completes.

import json
import pbz2

def parse_chunk(chunk: str) -> list[dict]:
    # split on "\n" only -- str.splitlines() also breaks on U+2028/U+2029 etc.,
    # which can appear raw inside records and would shatter them
    return [json.loads(line) for line in chunk.split("\n") if line]

def save(records: list[dict]) -> None:
    ...  # write to db, file, etc.

pbz2.process_parallel(
    "data.json.bz2",
    worker_fn=parse_chunk,
    on_result=save,
    num_processes=8,
)
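The dispatch pattern described above (workers parse, the main process consumes results as they complete) can be sketched with `concurrent.futures`. This is an illustration under assumptions, not pbz2's code: `run_bounded` is a hypothetical name, and capping in-flight work with `max_pending` is only a guess at what the real parameter of that name does.

```python
from concurrent.futures import FIRST_COMPLETED, Executor, as_completed, wait
from typing import Callable, Iterable


def run_bounded(
    chunks: Iterable[str],
    worker_fn: Callable,
    on_result: Callable,
    executor: Executor,
    max_pending: int = 4,
) -> None:
    """Submit chunks to an executor, keeping at most max_pending in flight."""
    pending = set()
    for chunk in chunks:
        pending.add(executor.submit(worker_fn, chunk))
        if len(pending) >= max_pending:
            # Block until at least one task finishes, then hand its
            # result to the callback in the calling process.
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                on_result(future.result())
    # Drain whatever is still running once the input is exhausted.
    for future in as_completed(pending):
        on_result(future.result())
```

Passing a `concurrent.futures.ProcessPoolExecutor` gives process-level parallelism, in which case `worker_fn` must be picklable (a module-level function). Results arrive in completion order, not input order, and the bound keeps memory use flat when workers are slower than decompression.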

Reference

Function	Description
`iter_chunks(path, **opts)`	Yield UTF-8 text chunks ending on a newline boundary.
`iter_lines(path, **opts)`	Yield non-empty UTF-8 lines (no trailing newline).
`iter_jsonl(path, *, loads=None, **opts)`	Yield parsed JSON objects (uses `orjson`; pass `loads=` to override).
`process_parallel(path, worker_fn, *, on_result=None, worker_args=(), num_processes=None, max_pending=None, ...)`	Run `worker_fn(chunk, *worker_args)` in a process pool, dispatching results to `on_result`.
`open_decompress(path, **opts)`	Low-level: open a binary stream of decompressed bytes.
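As an example of what a `loads=` override might look like, the sketch below builds a parser that keeps decimal precision. It uses only the standard library; `loads_decimal` is an illustrative name, and whether the reader hands the callable `str` or `bytes` lines is not stated here, so the example checks both.

```python
import json
from decimal import Decimal
from functools import partial

# Same call shape as a default JSON parser:
# one positional argument in, one Python object out.
loads_decimal = partial(json.loads, parse_float=Decimal)

record = loads_decimal('{"price": 19.99, "qty": 3}')
assert record["price"] == Decimal("19.99")
assert record["qty"] == 3

# json.loads also accepts UTF-8 bytes, so the callable works
# for either input type.
assert loads_decimal(b'{"price": 0.1}')["price"] == Decimal("0.1")
```

Going by the signature in the table, it would be passed as `pbz2.iter_jsonl("data.json.bz2", loads=loads_decimal)`.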

Common options

num_processors — pbzip2 worker count (default: cpu_count - 1)
bufsize_mb — OS pipe buffer between pbzip2 and Python (default: 32 MB)
stream_buffer_mb — Python-side read chunk size (default: 4 MB)
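For orientation, this is roughly how a worker-count option could translate into a pbzip2 invocation. Both helper names are hypothetical and the mapping is an assumption about the library's internals; the only facts relied on are pbzip2's own `-d`, `-c`, and `-p#` flags and the documented `cpu_count - 1` default.

```python
import os
from typing import List, Optional


def default_num_processors() -> int:
    # Documented default is cpu_count - 1; the floor of 1 is an assumption.
    return max(1, (os.cpu_count() or 2) - 1)


def build_pbzip2_command(path: str, num_processors: Optional[int] = None) -> List[str]:
    n = num_processors if num_processors is not None else default_num_processors()
    # -d decompress, -c write to stdout, -p# number of processors
    return ["pbzip2", "-d", "-c", f"-p{n}", path]
```

For instance, `build_pbzip2_command("data.json.bz2", 8)` produces the argument list for `pbzip2 -d -c -p8 data.json.bz2`.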

Release history

0.2.1 (Jun 10, 2026)
0.2.0 (Jun 10, 2026), this version
0.1.3 (May 28, 2026)
0.1.2 (May 25, 2026)
0.1.1 (May 25, 2026)
0.1.0 (May 25, 2026)

Download files

Source Distribution

pbz2-0.2.0.tar.gz (46.2 kB)

Uploaded Jun 10, 2026 Source

Built Distribution

pbz2-0.2.0-py3-none-any.whl (8.3 kB)

Uploaded Jun 10, 2026 Python 3

File details

Details for the file pbz2-0.2.0.tar.gz.

File metadata

Download URL: pbz2-0.2.0.tar.gz
Upload date: Jun 10, 2026
Size: 46.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for pbz2-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`0b355e7c96d176f661f9cc351a159f5eaa2f6c6d829bd650be61db4ad42dbc87`
MD5	`f22b05a8ffba990fd3ef4da51f630ebe`
BLAKE2b-256	`062a06086ae47ab53527307a86e8508b62a366d503089ea1ffb9f5ec8ae71e5e`

Provenance

The following attestation bundles were made for pbz2-0.2.0.tar.gz:

Publisher: publish.yml on gitronald/pbz2

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pbz2-0.2.0.tar.gz
- Subject digest: 0b355e7c96d176f661f9cc351a159f5eaa2f6c6d829bd650be61db4ad42dbc87
- Sigstore transparency entry: 1781717105
- Sigstore integration time: Jun 10, 2026
Source repository:
- Permalink: gitronald/pbz2@8d1a17e492e45d14148dfb07ae4bd6f4c9c7b90a
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/gitronald
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@8d1a17e492e45d14148dfb07ae4bd6f4c9c7b90a
- Trigger Event: push

File details

Details for the file pbz2-0.2.0-py3-none-any.whl.

File metadata

Download URL: pbz2-0.2.0-py3-none-any.whl
Upload date: Jun 10, 2026
Size: 8.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for pbz2-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4c2f4feea67376f35d3aa5e7f184d9acb38548b413c777b9e8bab495fccfa4e6`
MD5	`116d24346233380c892c194018ff80d6`
BLAKE2b-256	`639682140341ef93acc9941cce5a56bdea79a233263b06a793e0ad4db1d0c572`

Provenance

The following attestation bundles were made for pbz2-0.2.0-py3-none-any.whl:

Publisher: publish.yml on gitronald/pbz2

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pbz2-0.2.0-py3-none-any.whl
- Subject digest: 4c2f4feea67376f35d3aa5e7f184d9acb38548b413c777b9e8bab495fccfa4e6
- Sigstore transparency entry: 1781717392
- Sigstore integration time: Jun 10, 2026
Source repository:
- Permalink: gitronald/pbz2@8d1a17e492e45d14148dfb07ae4bd6f4c9c7b90a
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/gitronald
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@8d1a17e492e45d14148dfb07ae4bd6f4c9c7b90a
- Trigger Event: push
