A multi-threaded implementation of the Python gzip module
pgzip
pgzip is a multi-threaded gzip implementation for Python that increases compression and decompression performance.
The performance gains come from processing the indexed blocks of a gzip file in parallel. Block indexing uses gzip's FEXTRA feature, which records the index of compressed members. FEXTRA is defined in the official gzip specification starting at version 4.3. Because FEXTRA is part of the gzip specification, pgzip is compatible with regular gzip files.
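To make the FEXTRA mechanism concrete, here is a minimal sketch that uses only the standard library. It builds one gzip member whose header carries an extra subfield, parses that subfield back out, and can be used to check that the stock gzip module still decompresses the member. The helper names `build_member` and `read_extra_subfields`, the subfield ID `b"ZZ"`, and its payload are hypothetical and chosen for illustration; this is not the layout pgzip itself writes.

```python
import gzip
import struct
import zlib

FEXTRA = 0x04  # FLG bit 2 in the gzip header


def build_member(payload: bytes, extra: bytes) -> bytes:
    """Build one gzip member whose header carries `extra` in the FEXTRA field."""
    # ID1, ID2, CM=8 (deflate), FLG, MTIME=0, XFL=0, OS=255 (unknown): 10 bytes
    header = struct.pack("<BBBBIBB", 0x1F, 0x8B, 8, FEXTRA, 0, 0, 255)
    # XLEN (2 bytes, little-endian) followed by the extra field itself
    header += struct.pack("<H", len(extra)) + extra
    # Negative wbits selects a raw deflate stream (no zlib wrapper)
    compressor = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)
    body = compressor.compress(payload) + compressor.flush()
    # Trailer: CRC32 of the uncompressed data, then its length modulo 2**32
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload) & 0xFFFFFFFF)
    return header + body + trailer


def read_extra_subfields(member: bytes) -> dict:
    """Return {subfield id: data} parsed from a member's FEXTRA field."""
    id1, id2, _cm, flg = struct.unpack_from("<BBBB", member, 0)
    if (id1, id2) != (0x1F, 0x8B) or not flg & FEXTRA:
        return {}
    (xlen,) = struct.unpack_from("<H", member, 10)
    pos, end = 12, 12 + xlen
    subfields = {}
    while pos + 4 <= end:
        subfield_id = member[pos:pos + 2]
        (length,) = struct.unpack_from("<H", member, pos + 2)
        subfields[subfield_id] = member[pos + 4:pos + 4 + length]
        pos += 4 + length
    return subfields


payload = b"ACGT" * 1000
# Hypothetical subfield: id b"ZZ", data = uncompressed length as a 4-byte integer
extra = b"ZZ" + struct.pack("<H", 4) + struct.pack("<I", len(payload))
member = build_member(payload, extra)
```

Calling `gzip.decompress(member)` on the result returns the original payload, because a reader that does not understand a subfield simply skips the extra field.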
pgzip is ~25X faster for compression and ~7X faster for decompression when benchmarked on a 24 core machine. Performance is limited only by I/O and the Python interpreter.
Theoretically, compression and decompression speed should scale linearly with the number of available cores. In practice, I/O and the general performance of the language limit the speed that can be reached.
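The general idea of per-block parallelism can be sketched with the standard library alone. The sketch below assumes that each block is compressed as its own gzip member in a thread pool and that the members are concatenated in order; a gzip file may consist of several members, so the stock gzip module can read the result. The function name `parallel_compress` and its defaults are hypothetical, and this is not pgzip's actual implementation (in particular, it writes no block index).

```python
import gzip
from concurrent.futures import ThreadPoolExecutor


def parallel_compress(data: bytes, blocksize: int = 1 << 20, threads: int = 4) -> bytes:
    """Compress `data` block by block in a thread pool and join the members."""
    # Split the input into fixed-size blocks; keep one (empty) block for empty input
    blocks = [data[i:i + blocksize] for i in range(0, len(data), blocksize)] or [data]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() preserves input order, so members come back in block order
        members = list(pool.map(gzip.compress, blocks))
    return b"".join(members)


data = bytes(range(256)) * 4096  # 1 MiB of sample data
compressed = parallel_compress(data, blocksize=1 << 18, threads=4)
```

Threads can help here because CPython's zlib bindings release the global interpreter lock while compressing larger buffers. How close the speedup gets to linear still depends on I/O and interpreter overhead, as noted above.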
Usage and Examples
CLI
❯ python -m pgzip -h
usage: __main__.py [-h] [-o OUTPUT] [-f FILENAME] [-d] [-l {0-9}] [-t THREADS] input

positional arguments:
  input                 Input file or '-' for stdin

options:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output file or '-' for stdout (Default: Input file with 'gz' extension or stdout)
  -f FILENAME, --filename FILENAME
                        Name for the original file when compressing
  -d, --decompress      Decompress instead of compress
  -l {0-9}, --compression-level {0-9}
                        Compression level; 0 = no compression (Default: 9)
  -t THREADS, --threads THREADS
                        Number of threads to use (Default: Determine automatically)
Programmatically
Using pgzip is the same as using the built-in gzip module.
Compressing data and writing it to a file:
import pgzip
s = "a big string..."
# An explanation of parameters:
# `thread=8` - Use 8 threads to compress. `None` or `0` uses all cores (default)
# `blocksize=2*10**8` - Use a compression block size of 200MB
with pgzip.open("test.txt.gz", "wt", thread=8, blocksize=2*10**8) as fw:
    fw.write(s)
Decompressing data from a file:
import pgzip
s = "a big string..."
with pgzip.open("test.txt.gz", "rt", thread=8) as fr:
    assert fr.read(len(s)) == s
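Because pgzip mirrors the built-in module's interface, one possible pattern is to treat it as an optional dependency and fall back to gzip when it is not installed. The sketch below is an illustration under that assumption: `open_gzip` and `roundtrip` are hypothetical helpers, and the only pgzip call used is `pgzip.open()` with the `thread` keyword shown in the examples above.

```python
import gzip
import os
import tempfile

try:
    import pgzip  # optional third-party dependency
except ImportError:
    pgzip = None


def open_gzip(path, mode="rt", threads=None):
    """Open with pgzip when it is installed, else with the stdlib gzip module."""
    if pgzip is not None:
        # `thread` is the keyword used in the pgzip examples above
        return pgzip.open(path, mode, thread=threads)
    return gzip.open(path, mode)


def roundtrip(text: str) -> str:
    """Write `text` through open_gzip(), then read it back with stdlib gzip."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "test.txt.gz")
        with open_gzip(path, "wt") as fw:
            fw.write(text)
        # Reading with the stdlib module checks that the file is a regular gzip file
        with gzip.open(path, "rt") as fr:
            return fr.read()
```

Reading the file back with the standard module is a quick way to check the compatibility claim on your own data.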
Performance
Compression Performance
Decompression Performance
Decompression was benchmarked using an 8.0GB FASTQ text file with 48 threads across 24 cores on a machine with Xeon(R) E5-2650 v4 @ 2.20GHz CPUs.
The compressed file used in this benchmark was created with a blocksize of 200MB.
Warning
pgzip only replaces the following methods of gzip's GzipFile class:
- open()
- compress()
- decompress()
Other class methods and functionality have not been well tested.
Contributions or improvements are appreciated for methods such as:
- seek()
- tell()
History
Created initially by Vincent Li (@vinlyx), this project is a fork of https://github.com/vinlyx/mgzip. We had several bug fixes to implement, but we could not contact them. The pgzip team would like to thank Vincent Li (@vinlyx) for their hard work. We hope that they will contact us when they discover this project.
File details
Details for the file pgzip-0.3.5.tar.gz.
File metadata
- Download URL: pgzip-0.3.5.tar.gz
- Upload date:
- Size: 14.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.17
File hashes
Algorithm | Hash digest
---|---
SHA256 | dd35510f59f6bd6b64e31c4baf90c10cdbb2775235fcc079b14b404fbd7f46bf
MD5 | 4c72e3911f4160cf013b0a9ad92ecb94
BLAKE2b-256 | de64547af1d8616a1dc29ba67d5b0a6722dcd7b569e15520dfff0f390c4c3024
File details
Details for the file pgzip-0.3.5-py3-none-any.whl.
File metadata
- Download URL: pgzip-0.3.5-py3-none-any.whl
- Upload date:
- Size: 13.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.17
File hashes
Algorithm | Hash digest
---|---
SHA256 | 4e13ab66ecface5c51c5af51d8cd676aa51675cf85df000f501a86cf38c208c1
MD5 | 09071694c8154806603151c122f1eebd
BLAKE2b-256 | 619f1a97b17fb29d7d1b6293faf13899c483460a2c524c3e06fe4226f6916133