
Parallel Random Access Gzip (pragzip)


This module provides a PragzipFile class, which can be used to seek inside gzip files without having to decompress them first. Alternatively, you can use it simply as a parallelized gzip decoder, as a replacement for Python's built-in gzip module, in order to fully utilize all your cores.

The random seeking support is the same as that provided by indexed_gzip, but further speedups are realized, at the cost of higher memory usage, thanks to a least-recently-used cache combined with a parallelized prefetcher.
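The interplay of the cache and the prefetcher can be illustrated with a small standard-library-only sketch. This is not pragzip's actual code, and all names here (PrefetchingLRUCache, fetch) are hypothetical; it only models the idea: serve blocks from an LRU cache and, on each access, speculatively decode the next blocks in background threads.

```python
from collections import OrderedDict
from concurrent.futures import ThreadPoolExecutor

class PrefetchingLRUCache:
    """Toy model of an LRU block cache combined with a sequential prefetcher."""

    def __init__(self, fetch, capacity=4, prefetch_count=2):
        self._fetch = fetch          # function: block index -> decoded block
        self._capacity = capacity
        self._prefetch_count = prefetch_count
        self._cache = OrderedDict()  # block index -> decoded data, in LRU order
        self._pending = {}           # block index -> Future for in-flight decodes
        self._pool = ThreadPoolExecutor(max_workers=prefetch_count)

    def get(self, index):
        if index in self._cache:
            # Cache hit: mark the block as most recently used.
            self._cache.move_to_end(index)
        else:
            # Use the prefetched result if one is in flight, else decode now.
            future = self._pending.pop(index, None)
            data = future.result() if future is not None else self._fetch(index)
            self._insert(index, data)
        # Assume sequential access: start decoding the next blocks in the background.
        for i in range(index + 1, index + 1 + self._prefetch_count):
            if i not in self._cache and i not in self._pending:
                self._pending[i] = self._pool.submit(self._fetch, i)
        return self._cache[index]

    def _insert(self, index, data):
        self._cache[index] = data
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict the least recently used block
```

Sequential reads then mostly hit either the cache or an already-running prefetch, which is where the parallel speedup over a purely serial decoder comes from.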

Table of Contents

  1. Installation
  2. Usage
    1. Command Line Tool
    2. Python Library
    3. Via Ratarmount
    4. C++ Library
  3. Performance comparison with gzip module when a gzip index exists
  4. Internal Architecture
  5. Tracing the Decoder

Installation

You can simply install it from PyPI:

python3 -m pip install --upgrade pip  # Recommended for newer manylinux wheels
python3 -m pip install pragzip

The latest unreleased development version can be tested out with:

python3 -m pip install --force-reinstall 'git+https://github.com/mxmlnkn/indexed_bzip2.git@master#egg=pragzip&subdirectory=python/pragzip'

To build locally, you can use the build package and then install the wheel:

cd python/pragzip
rm -rf dist
python3 -m build .
python3 -m pip install --force-reinstall --user dist/*.whl

Usage

Command Line Tool

pragzip --help

# Parallel decoding: 1.7 s
time pragzip -d -c -P 0 sample.gz | wc -c

# Serial decoding: 22 s
time gzip -d -c sample.gz | wc -c

Python Library

Simple open, seek, read, and close

import os

from pragzip import PragzipFile

file = PragzipFile( "example.gz", parallelization = os.cpu_count() )

# You can now use it like a normal file
file.seek( 123 )
data = file.read( 100 )
file.close()

The first call to seek ensures that the block offset list is complete and therefore might have to create it first. Because of this, the first call to seek can take a while.

Use with context manager

import os
import pragzip

with pragzip.open( "example.gz", parallelization = os.cpu_count() ) as file:
    file.seek( 123 )
    data = file.read( 100 )

Storing and loading the block offset map

The creation of the list of gzip blocks can take a while because it has to decode the gzip file completely. To avoid this setup when opening a gzip file, the block offset list can be exported and imported.
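This section does not restate pragzip's exact index API (set_block_offsets appears in the performance section below), so here is a standard-library-only sketch of the underlying idea: record pairs of compressed and uncompressed offsets, then answer a seek by decompressing only from the nearest recorded entry point. A multi-member gzip file stands in for the indexed seek points, since each member can be decompressed independently.

```python
import gzip
import io

# Build a gzip file with several members; each member boundary is an
# independent decompression entry point, like an index seek point.
chunks = [b"a" * 1000, b"b" * 1000, b"c" * 1000]
compressed = io.BytesIO()
offsets = []  # (compressed offset, uncompressed offset) pairs
uncompressed_offset = 0
for chunk in chunks:
    offsets.append((compressed.tell(), uncompressed_offset))
    compressed.write(gzip.compress(chunk))
    uncompressed_offset += len(chunk)

def read_at(target, size):
    """Seek using the offset map: decompress only from the nearest
    preceding entry point instead of from the start of the file."""
    comp_off, uncomp_off = max(
        (o for o in offsets if o[1] <= target), key=lambda o: o[1])
    compressed.seek(comp_off)
    data = gzip.GzipFile(fileobj=compressed).read(target - uncomp_off + size)
    return data[target - uncomp_off : target - uncomp_off + size]

print(read_at(1500, 4))  # b'bbbb', without touching the first member
```

Exporting the index amounts to persisting the offsets list; importing it restores random access without the full initial decode.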

Open a pure Python file-like object for indexed reading

import io
import os
import pragzip

with open( "example.gz", 'rb' ) as file:
    in_memory_file = io.BytesIO( file.read() )

with pragzip.open( in_memory_file, parallelization = os.cpu_count() ) as file:
    file.seek( 123 )
    data = file.read( 100 )

Via Ratarmount

Because pragzip can be used as a backend in ratarmount, you can use ratarmount to mount single gzip files easily. Furthermore, since ratarmount 0.11.0, parallelization is the default and does not have to be specified explicitly with -P.

base64 /dev/urandom | head -c $(( 4 * 1024 * 1024 * 1024 )) | gzip > sample.gz
# Serial decoding: 23 s
time gzip -c -d sample.gz | wc -c

python3 -m pip install --user ratarmount
ratarmount sample.gz mounted

# Parallel decoding: 3.5 s
time cat mounted/sample | wc -c

# Random seeking to the middle of the file and reading 1 MiB: 0.287 s
time dd if=mounted/sample bs=$(( 1024 * 1024 )) \
       iflag=skip_bytes,count_bytes skip=$(( 2 * 1024 * 1024 * 1024 )) count=$(( 1024 * 1024 )) | wc -c

C++ Library

Because it is written in C++, it can of course also be used as a C++ library. In order to make heavy use of templates and to simplify compiling with Python setuptools, it is mostly header-only, so integrating it into another project should be easy. The license is also permissive enough for most use cases.

I have not yet tested integration into other projects beyond simply copying the sources in src/core and src/pragzip, plus src/external/zlib if the bundled zlib is desired. If you have suggestions or wishes, such as CMake or Conan support, please open an issue.

Performance comparison with gzip module when a gzip index exists

These are simple timing tests for reading all the contents of a gzip file sequentially.

import gzip
import time

with gzip.open( gzipFilePath ) as file:
    t0 = time.time()
    while file.read( 4*1024*1024 ):
        pass
    t1 = time.time()
    print( f"Decoded file in {t1-t0}s" )

The usage of pragzip is slightly different:

import indexed_gzip
import pragzip
import time

with indexed_gzip.IndexedGzipFile(gzipFilePath) as file:
    file.build_full_index()
    file.export_index(gzipFilePath + ".index")

# parallelization = 0 means that all available cores are used automatically.
for parallelization in [0, 1, 2, 6, 12, 24, 32]:
    with pragzip.PragzipFile(gzipFilePath, parallelization = parallelization) as file:
        file.set_block_offsets(open(gzipFilePath + ".index", 'rb'))

        t0 = time.time()
        while file.read( 4*1024*1024 ):
            pass
        t1 = time.time()
        print( f"Decoded file in {t1-t0}s" )

Results for an AMD Ryzen 3900X 12-core (24 virtual cores) processor and with gzipFilePath=4GB-base64.gz, which is a 4 GiB gzip compressed file with base64 random data.

Module                               Runtime / s
gzip                                        17.2
pragzip with parallelization = 0            1.25
pragzip with parallelization = 1            13.8
pragzip with parallelization = 2             7.0
pragzip with parallelization = 6             2.5
pragzip with parallelization = 12           1.47
pragzip with parallelization = 24           1.25
pragzip with parallelization = 32           1.33

The speedup of pragzip over the gzip module with parallelization = 0 is 17.2/1.25 ≈ 14. When using only one core, pragzip is faster by (17.2-13.8)/17.2 ≈ 20%.

Internal Architecture

The main part of the internal architecture used for parallelizing is the same as used for indexed_bzip2.
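The core idea of that architecture, decoding independent compressed blocks on multiple cores and concatenating the results in order, can be sketched with the standard library alone. This is only an illustration, not pragzip's implementation: here the block boundaries are multi-member gzip boundaries known by construction, whereas pragzip has to discover usable entry points inside a single deflate stream itself.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

# A multi-member gzip file: the member boundaries play the role of the
# block boundaries that pragzip finds inside the compressed stream.
members = [gzip.compress(bytes([65 + i]) * 100) for i in range(8)]
blob = b"".join(members)

# Compressed offset and length of each member (known by construction here).
boundaries = []
position = 0
for member in members:
    boundaries.append((position, len(member)))
    position += len(member)

def decode_block(block):
    offset, length = block
    # Each slice is a self-contained gzip member, decodable independently.
    return gzip.decompress(blob[offset : offset + length])

# Decode all blocks in parallel and reassemble them in order. CPython's
# zlib releases the GIL during decompression, so threads parallelize here.
with ThreadPoolExecutor() as pool:
    decoded = b"".join(pool.map(decode_block, boundaries))

print(decoded[:5])  # b'AAAAA'
```

pool.map preserves the input order, so the concatenation reproduces the original byte stream even though blocks finish out of order.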

