pbz2

Stream and parallel-process .bz2 files via pbzip2 (parallel bzip2). Falls back to the stdlib bz2 module when the pbzip2 binary is unavailable.

Install

uv add pbz2

Install pbzip2 for parallel decompression:

sudo apt install pbzip2     # Debian/Ubuntu
brew install pbzip2         # macOS
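
The fallback described above can be pictured with a short sketch. This is not the library's code: the helper name `open_decompressed` is hypothetical, and it assumes only that the `pbzip2` binary, when present, is invoked with its standard `-d` (decompress) and `-c` (write to stdout) flags.

```python
import bz2
import shutil
import subprocess
from typing import BinaryIO


def open_decompressed(path: str) -> BinaryIO:
    """Hypothetical helper: return a binary stream of decompressed bytes."""
    exe = shutil.which("pbzip2")
    if exe is None:
        # pbzip2 is not on PATH: fall back to the single-threaded stdlib reader.
        return bz2.open(path, "rb")
    # -d decompresses, -c writes the result to stdout so it can be piped.
    proc = subprocess.Popen([exe, "-d", "-c", path], stdout=subprocess.PIPE)
    # A fuller version would also wait on the process and check its exit code.
    return proc.stdout
```

Either branch yields a file-like object with `read()`, so calling code does not need to know which decompressor is in use.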

Usage

Iterate

import pbz2

# Parsed JSON objects, one per line, from a .json.bz2 file
for obj in pbz2.iter_jsonl("data.json.bz2"):
    ...

# Raw UTF-8 lines
for line in pbz2.iter_lines("data.txt.bz2"):
    ...

# Newline-aligned text chunks (useful for batched processing)
for chunk in pbz2.iter_chunks("data.txt.bz2"):
    ...
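
To make "newline-aligned" concrete, here is a minimal sketch of one way to cut a byte stream into chunks that always end on a newline. It is an illustration under assumptions, not the implementation of iter_chunks: the function name `newline_aligned_chunks` and the `read_size` parameter are hypothetical.

```python
import io
from typing import BinaryIO, Iterator


def newline_aligned_chunks(stream: BinaryIO, read_size: int = 4 * 1024 * 1024) -> Iterator[str]:
    """Hypothetical sketch: yield UTF-8 text chunks that end on a newline."""
    carry = b""  # bytes after the last newline seen so far
    while True:
        block = stream.read(read_size)
        if not block:
            break
        data = carry + block
        cut = data.rfind(b"\n")
        if cut == -1:
            # No newline yet: keep accumulating until one shows up.
            carry = data
            continue
        # Cutting at the byte 0x0A is safe in UTF-8, because that byte
        # never occurs inside a multi-byte sequence.
        yield data[: cut + 1].decode("utf-8")
        carry = data[cut + 1 :]
    if carry:
        # Final record without a trailing newline.
        yield carry.decode("utf-8")
```

Because every chunk except possibly the last ends with "\n", a consumer can split it into whole records without tracking partial lines across chunks.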

Parallel processing

process_parallel streams chunks of newline-terminated records through a worker pool. The worker function receives raw text chunks (so parsing happens in the worker, not the main process), and on_result runs in the main process to handle each result as it completes.

import json
import pbz2

def parse_chunk(chunk: str) -> list[dict]:
    # split on "\n" only -- str.splitlines() also breaks on U+2028/U+2029 etc.,
    # which can appear raw inside records and would shatter them
    return [json.loads(line) for line in chunk.split("\n") if line]

def save(records: list[dict]) -> None:
    ...  # write to db, file, etc.

pbz2.process_parallel(
    "data.json.bz2",
    worker_fn=parse_chunk,
    on_result=save,
    num_processes=8,
)
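
The dispatch pattern described above (workers receive chunks, results are handled in the main process as they complete, and the number of in-flight chunks is capped) can be sketched with the standard library. This is an assumption about the general shape, not the library's code: `run_bounded` and `count_lines` are hypothetical names, and the demo uses a thread pool in place of a process pool to stay self-contained. Any `concurrent.futures` executor can be passed in.

```python
from concurrent.futures import FIRST_COMPLETED, Executor, ThreadPoolExecutor, wait
from typing import Any, Callable, Iterable


def count_lines(chunk: str) -> int:
    """Example worker: count non-empty records in a chunk."""
    return sum(1 for line in chunk.split("\n") if line)


def run_bounded(
    chunks: Iterable[str],
    worker_fn: Callable[[str], Any],
    on_result: Callable[[Any], None],
    executor: Executor,
    max_pending: int = 4,
) -> None:
    """Hypothetical sketch: submit chunks, keeping at most max_pending in flight."""
    pending = set()
    for chunk in chunks:
        pending.add(executor.submit(worker_fn, chunk))
        if len(pending) >= max_pending:
            # Block until at least one task finishes, then drain finished ones.
            done, not_done = wait(pending, return_when=FIRST_COMPLETED)
            pending = set(not_done)
            for fut in done:
                on_result(fut.result())  # runs in the calling (main) thread
    if pending:
        done, _ = wait(pending)
        for fut in done:
            on_result(fut.result())


if __name__ == "__main__":
    collected = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        run_bounded(["a\nb\n", "c\n"], count_lines, collected.append, pool, max_pending=2)
    print(sorted(collected))
```

Capping the pending set keeps memory bounded when decompression outpaces the workers, which is presumably the purpose of an option like max_pending.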

CLI

pbz2 count data.json.bz2
pbz2 head data.json.bz2 -n 5

API

| Function | Description |
| --- | --- |
| iter_chunks(path, **opts) | Yield UTF-8 text chunks ending on a newline boundary. |
| iter_lines(path, **opts) | Yield non-empty UTF-8 lines (no trailing newline). |
| iter_jsonl(path, *, loads=None, **opts) | Yield parsed JSON objects (uses orjson if installed). |
| process_parallel(path, worker_fn, *, on_result=None, worker_args=(), num_processes=None, max_pending=None, ...) | Run worker_fn(chunk, *worker_args) in a process pool, dispatching results to on_result. |
| open_decompress(path, **opts) | Low-level: open a binary stream of decompressed bytes. |
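
The "uses orjson if installed" behaviour of iter_jsonl can be illustrated with an optional-import pattern. This is a sketch of the general idea only: the helper name `pick_loads` is hypothetical, and how the library actually resolves its `loads` argument is an assumption.

```python
import json
from typing import Any, Callable, Optional


def pick_loads(loads: Optional[Callable[[str], Any]] = None) -> Callable[[str], Any]:
    """Hypothetical helper: choose a JSON decoder for one line of text."""
    if loads is not None:
        # An explicit decoder from the caller always wins.
        return loads
    try:
        import orjson  # optional third-party dependency
    except ImportError:
        # orjson is not installed: use the stdlib decoder.
        return json.loads
    return orjson.loads
```

Both `json.loads` and `orjson.loads` accept a `str`, so the rest of the pipeline is unaffected by which one is picked.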

Common options

  • num_processors — pbzip2 worker count (default: cpu_count - 1)
  • bufsize_mb — OS pipe buffer between pbzip2 and Python (default: 32 MB)
  • stream_buffer_mb — Python-side read chunk size (default: 4 MB)
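
As a rough illustration of the num_processors option, the sketch below computes the documented default (CPU count minus one) and builds a pbzip2 command line from it. The helper names are hypothetical, the clamp to a minimum of one worker is an assumption, and whether the library passes the count through pbzip2's `-p#` flag is also an assumption, although that flag is how pbzip2 itself takes a processor count.

```python
import os
from typing import List, Optional


def default_num_processors() -> int:
    """Documented default: cpu_count - 1 (the clamp to 1 is an assumption)."""
    return max(1, (os.cpu_count() or 1) - 1)


def pbzip2_command(path: str, num_processors: Optional[int] = None) -> List[str]:
    """Hypothetical helper: build a pbzip2 decompress-to-stdout command."""
    n = num_processors or default_num_processors()
    # -d decompress, -c write to stdout, -p<N> number of processors to use.
    return ["pbzip2", "-d", "-c", f"-p{n}", path]
```

Leaving one core free keeps the Python consumer from competing with the decompressor for CPU time.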
