A multi-threaded implementation of the Python gzip module

👷👷👷 Maintainers Wanted 👷👷👷 See https://github.com/pgzip/pgzip/issues/37

pgzip

pgzip is a multi-threaded gzip implementation for Python that speeds up compression and decompression.

The performance gains come from block indexing: the blocks of a gzip file are indexed so that they can be compressed and decompressed in parallel. Block indexing relies on gzip's FEXTRA field, which records the index of the compressed members. FEXTRA is defined in the gzip file format specification, version 4.3 (RFC 1952). Because FEXTRA is part of the specification, pgzip is compatible with regular gzip files.
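To make the FEXTRA idea concrete, here is a minimal sketch that uses only the standard library. It builds gzip members whose extra field stores the compressed size, then uses that field to jump from member to member without decompressing. The subfield ID `ZZ`, the 4-byte size layout, and the helper names `make_member` and `member_offsets` are assumptions made for illustration; they are not pgzip's actual on-disk index format.

```python
import gzip
import struct
import zlib

FEXTRA = 0x04  # FLG bit 2 in RFC 1952: an extra field follows the fixed header


def make_member(payload: bytes) -> bytes:
    """Build one gzip member whose extra field records its compressed size."""
    deflater = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)  # raw deflate
    body = deflater.compress(payload) + deflater.flush()
    # Hypothetical subfield: ID "ZZ", data length 4, little-endian body size.
    subfield = b"ZZ" + struct.pack("<HI", 4, len(body))
    # Fixed 10-byte header: magic, CM=8, FLG, MTIME=0, XFL=0, OS=255 (unknown).
    header = struct.pack("<BBBBIBB", 0x1F, 0x8B, 8, FEXTRA, 0, 0, 255)
    header += struct.pack("<H", len(subfield)) + subfield
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload) & 0xFFFFFFFF)
    return header + body + trailer


def member_offsets(data: bytes) -> list[int]:
    """Find where each member starts by reading only the extra fields."""
    offsets, pos = [], 0
    while pos < len(data):
        magic1, magic2, method, flags = struct.unpack_from("<BBBB", data, pos)
        if (magic1, magic2, method) != (0x1F, 0x8B, 8) or not flags & FEXTRA:
            raise ValueError("not a member written by make_member")
        (xlen,) = struct.unpack_from("<H", data, pos + 10)
        _sid, _sublen, body_len = struct.unpack_from("<2sHI", data, pos + 12)
        offsets.append(pos)
        # Skip header (10), XLEN (2), extra field, deflate body, trailer (8).
        pos += 10 + 2 + xlen + body_len + 8
    return offsets


data = make_member(b"hello ") + make_member(b"world")
print(gzip.decompress(data))  # the standard gzip module ignores the extra field
print(member_offsets(data))   # start offset of each member
```

Because readers that do not understand a subfield simply skip it, the same bytes decompress with the built-in gzip module, while an index-aware reader can hand each member to a separate worker.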

When benchmarked on a 24-core machine, pgzip is ~25x faster for compression and ~7x faster for decompression. Performance is limited only by I/O and the Python interpreter.

Theoretically, compression and decompression speed should scale linearly with the number of available cores. In practice, I/O and the general performance of the language limit the speedup.
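The following sketch illustrates why block-wise compression can scale with the number of cores. It is not pgzip's implementation: it uses only the standard library, and the function name `compress_blocks` and its parameters are hypothetical. Each block becomes an independent gzip member, so the blocks can be compressed by separate threads and concatenated afterwards.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor


def compress_blocks(data: bytes, blocksize: int, threads: int) -> bytes:
    """Compress fixed-size blocks in a thread pool and join the gzip members."""
    # An empty input still yields one (empty) member, so the output stays valid.
    blocks = [data[i:i + blocksize] for i in range(0, len(data), blocksize)] or [b""]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() returns results in input order, so the members stay in sequence.
        return b"".join(pool.map(gzip.compress, blocks))


raw = b"ACGT" * 250_000
packed = compress_blocks(raw, blocksize=100_000, threads=4)

# A multi-member gzip stream decompresses to the concatenated blocks.
assert gzip.decompress(packed) == raw
```

The blocks share no state, so in principle the work divides evenly across workers; reading the input, writing the output, and interpreter overhead are the parts that do not.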

Usage and Examples

CLI

❯ python -m pgzip -h
usage: __main__.py [-h] [-o OUTPUT] [-f FILENAME] [-d] [-l {0-9}] [-t THREADS] input

positional arguments:
  input                 Input file or '-' for stdin

options:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output file or '-' for stdout (Default: Input file with 'gz' extension or stdout)
  -f FILENAME, --filename FILENAME
                        Name for the original file when compressing
  -d, --decompress      Decompress instead of compress
  -l {0-9}, --compression-level {0-9}
                        Compression level; 0 = no compression (Default: 9)
  -t THREADS, --threads THREADS
                        Number of threads to use (Default: Determine automatically)

Programmatically

Using pgzip is the same as using the built-in gzip module.

Compressing data and writing it to a file:

import pgzip

s = "a big string..."

# An explanation of parameters:
# `thread=8` - Use 8 threads to compress. `None` or `0` uses all cores (default)
# `blocksize=2*10**8` - Use a compression block size of 200MB
with pgzip.open("test.txt.gz", "wt", thread=8, blocksize=2*10**8) as fw:
    fw.write(s)

Decompressing data from a file:

import pgzip

s = "a big string..."

with pgzip.open("test.txt.gz", "rt", thread=8) as fr:
    assert fr.read(len(s)) == s
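As a further illustration of the drop-in claim, the sketch below writes a file with pgzip when it is installed and falls back to the built-in gzip module otherwise, then reads the result back with the built-in module. The alias `gz`, the `extra` dictionary, the file name, and the chosen thread count and block size are assumptions made for this example.

```python
import gzip
import os
import tempfile

try:
    import pgzip as gz  # parallel implementation, if installed
    # Same keyword arguments as in the examples above, with smaller values.
    extra = {"thread": 2, "blocksize": 10**6}
except ImportError:
    gz = gzip  # standard library fallback; it does not accept those arguments
    extra = {}

payload = b"a big string..." * 100_000

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.bin.gz")
    # Binary mode ("wb") takes bytes; the examples above use text mode ("wt").
    with gz.open(path, "wb", **extra) as fw:
        fw.write(payload)
    # Read back with the built-in module to check gzip compatibility.
    with gzip.open(path, "rb") as fr:
        restored = fr.read()

assert restored == payload
```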

Performance

Compression Performance

[Figure: compression performance chart]

Decompression Performance

[Figure: decompression performance chart]

Decompression was benchmarked using an 8.0GB FASTQ text file with 48 threads across 24 cores on a machine with Xeon(R) E5-2650 v4 @ 2.20GHz CPUs.

The compressed file used in this benchmark was created with a blocksize of 200MB.

Warning

pgzip only replaces the following methods of gzip's GzipFile class:

  • open()
  • compress()
  • decompress()

Other class methods and functionality have not been well tested.

Contributions or improvements are appreciated for methods such as:

  • seek()
  • tell()

History

Created initially by Vincent Li (@vinlyx), this project is a fork of https://github.com/vinlyx/mgzip. We had several bug fixes to implement, but we could not contact them. The pgzip team would like to thank Vincent Li (@vinlyx) for their hard work. We hope that they will contact us when they discover this project.
