Parallel random access to gzip files

Parallel Random Access Gzip (pragzip)

This module provides a PragzipFile class, which can be used to seek inside gzip files without having to decompress them first. Alternatively, you can use it simply as a parallelized gzip decoder that replaces Python's built-in gzip module and fully utilizes all your cores.

The random seeking support is the same as that provided by indexed_gzip, but further speedups are achieved, at the cost of higher memory usage, by combining a least-recently-used cache with a parallelized prefetcher.

Table of Contents

  1. Installation
  2. Performance
    1. Decompression with Existing Index
    2. Decompression from Scratch
  3. Usage
    1. Command Line Tool
    2. Python Library
    3. Via Ratarmount
    4. C++ Library
  4. Internal Architecture
  5. Tracing the Decoder

Performance

These are simple timing tests for reading all the contents of a gzip file sequentially.

Results are shown for an AMD Ryzen 3900X 12-core (24 virtual cores) processor and with gzipFilePath=4GB-base64.gz, which is a gzip-compressed file containing 4 GiB of base64-encoded random data.

Decompression with Existing Index

| Module | Runtime / s | Speedup |
|---|---|---|
| gzip | 17.2 | 1x |
| pragzip with parallelization = 0 | 1.25 | 13.8x |
| pragzip with parallelization = 1 | 13.8 | 1.25x |
| pragzip with parallelization = 2 | 7.0 | 2.46x |
| pragzip with parallelization = 6 | 2.5 | 6.88x |
| pragzip with parallelization = 12 | 1.47 | 11.7x |
| pragzip with parallelization = 24 | 1.25 | 13.8x |
| pragzip with parallelization = 32 | 1.33 | 12.9x |

With parallelization = 0 (all available cores), pragzip is 17.2/1.25 ≈ 13.8 times faster than the gzip module. Even when using only one core, pragzip needs (17.2-13.8)/17.2 ≈ 20% less time.
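The two ratios quoted above can be recomputed from the runtimes in the table. The following is only a small arithmetic sketch; the function names are made up for illustration and are not part of pragzip.

```python
# Runtimes in seconds, copied from the table above.
GZIP_RUNTIME = 17.2
PRAGZIP_RUNTIMES = {0: 1.25, 1: 13.8, 2: 7.0, 6: 2.5, 12: 1.47, 24: 1.25, 32: 1.33}


def speedup(baseline, runtime):
    """How many times faster 'runtime' is compared to 'baseline'."""
    return baseline / runtime


def time_saved_fraction(baseline, runtime):
    """Fraction of the baseline runtime that is saved."""
    return (baseline - runtime) / baseline


for parallelization, runtime in PRAGZIP_RUNTIMES.items():
    print(f"parallelization = {parallelization:2d}: "
          f"{speedup(GZIP_RUNTIME, runtime):.2f}x, "
          f"{time_saved_fraction(GZIP_RUNTIME, runtime):.0%} less time")
```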

Benchmark Code
import gzip
import time

gzipFilePath = "4GB-base64.gz"  # the benchmark file described above

with gzip.open(gzipFilePath) as file:
    t0 = time.time()
    while file.read(4*1024*1024):
        pass
    t1 = time.time()
    print(f"Decoded file in {t1-t0}s")

The usage of pragzip is slightly different:

import indexed_gzip
import pragzip
import time

gzipFilePath = "4GB-base64.gz"  # the benchmark file described above

with indexed_gzip.IndexedGzipFile(gzipFilePath) as file:
    file.build_full_index()
    file.export_index(gzipFilePath + ".index")

# parallelization = 0 means that it is automatically using all available cores.
for parallelization in [0, 1, 2, 6, 12, 24, 32]:
    with pragzip.PragzipFile(gzipFilePath, parallelization = parallelization) as file:
        file.import_index(open(gzipFilePath + ".index", 'rb'))
        t0 = time.time()
        while file.read( 4*1024*1024 ):
            pass
        t1 = time.time()
        print( f"Decoded file in {t1-t0}s" )

Decompression from Scratch

| Module | Runtime / s | Speedup |
|---|---|---|
| gzip | 17.2 | 1x |
| pragzip with parallelization = 0 | 2.04 | 8.43x |
| pragzip with parallelization = 1 | 31.0 | 0.55x |
| pragzip with parallelization = 2 | 16.9 | 1.02x |
| pragzip with parallelization = 6 | 6.10 | 2.82x |
| pragzip with parallelization = 12 | 3.23 | 5.32x |
| pragzip with parallelization = 24 | 2.04 | 8.43x |
| pragzip with parallelization = 32 | 2.06 | 8.35x |

Benchmark Code
import pragzip
import time

gzipFilePath = "4GB-base64.gz"  # the benchmark file described above

# parallelization = 0 means that it is automatically using all available cores.
for parallelization in [0, 1, 2, 6, 12, 24, 32]:
    with pragzip.PragzipFile(gzipFilePath, parallelization = parallelization) as file:
        t0 = time.time()
        while file.read(4*1024*1024):
            pass
        t1 = time.time()
        print(f"Decoded file in {t1-t0}s")
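All benchmark loops in this section follow the same pattern: read fixed-size chunks until the file is exhausted and measure the elapsed time. A possible reusable form is sketched below. The helper time_sequential_read is hypothetical and not part of pragzip; it only assumes a file object with a read method, so it is demonstrated here with the standard gzip module on a small in-memory file.

```python
import gzip
import io
import time


def time_sequential_read(fileobj, chunk_size=4 * 1024 * 1024):
    """Read fileobj to the end and return (elapsed_seconds, bytes_read)."""
    total = 0
    t0 = time.perf_counter()
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
    return time.perf_counter() - t0, total


if __name__ == "__main__":
    # Small in-memory stand-in for the 4 GiB benchmark file.
    payload = gzip.compress(b"pragzip " * 1_000_000)
    with gzip.open(io.BytesIO(payload)) as file:
        elapsed, size = time_sequential_read(file)
    print(f"Decoded {size} bytes in {elapsed:.3f} s")
```

Returning the byte count alongside the time makes it easy to check that every run decoded the same amount of data before comparing runtimes.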

Installation

You can simply install it from PyPI:

python3 -m pip install --upgrade pip  # Recommended for newer manylinux wheels
python3 -m pip install pragzip

The latest unreleased development version can be tested out with:

python3 -m pip install --force-reinstall 'git+https://github.com/mxmlnkn/indexed_bzip2.git@master#egginfo=pragzip&subdirectory=python/pragzip'

To build locally, use the build package to create a wheel and then install it:

cd python/pragzip
rm -rf dist
python3 -m build .
python3 -m pip install --force-reinstall --user dist/*.whl

Usage

Command Line Tool

pragzip --help

# Parallel decoding: 1.7 s
time pragzip -d -c -P 0 sample.gz | wc -c

# Serial decoding: 22 s
time gzip -d -c sample.gz | wc -c

Python Library

Simple open, seek, read, and close

import os

from pragzip import PragzipFile

file = PragzipFile( "example.gz", parallelization = os.cpu_count() )

# You can now use it like a normal file
file.seek( 123 )
data = file.read( 100 )
file.close()

The first call to seek ensures that the block offset list is complete and therefore might have to create it first. Because of this, the first call to seek might take a while.

Use with context manager

import os
import pragzip

with pragzip.open( "example.gz", parallelization = os.cpu_count() ) as file:
    file.seek( 123 )
    data = file.read( 100 )

Storing and loading the block offset map

The creation of the list of gzip blocks can take a while because it has to decode the gzip file completely. To avoid this setup when opening a gzip file, the block offset list can be exported and imported.
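To illustrate why a stored offset list avoids decoding from the start, here is a simplified, standard-library-only sketch. It is not pragzip's index format or API. It sidesteps the hard part by compressing each chunk as an independent gzip member, whereas a typical gzip file is a single stream whose blocks can reference up to 32 KiB of preceding data. The names build_members and read_at are made up for this example.

```python
import bisect
import gzip
import json


def build_members(data, chunk_size):
    """Compress data as concatenated gzip members; return (blob, index)."""
    blob = bytearray()
    index = []  # entries: [uncompressed_offset, compressed_offset]
    for start in range(0, len(data), chunk_size):
        index.append([start, len(blob)])
        blob += gzip.compress(data[start:start + chunk_size])
    return bytes(blob), index


def read_at(blob, index, offset, size):
    """Read size bytes at an uncompressed offset, decoding only needed members."""
    starts = [uncompressed for uncompressed, _ in index]
    first = max(bisect.bisect_right(starts, offset) - 1, 0)
    out = bytearray()
    for i in range(first, len(index)):
        uncompressed_start, compressed_start = index[i]
        if uncompressed_start >= offset + size:
            break
        compressed_end = index[i + 1][1] if i + 1 < len(index) else len(blob)
        chunk = gzip.decompress(blob[compressed_start:compressed_end])
        begin = max(offset - uncompressed_start, 0)
        end = offset + size - uncompressed_start
        out += chunk[begin:end]
    return bytes(out)


if __name__ == "__main__":
    data = bytes(range(256)) * 100
    blob, index = build_members(data, chunk_size=1000)
    exported = json.dumps(index)      # "export": the index is plain data
    imported = json.loads(exported)   # "import": no decoding pass needed
    assert read_at(blob, imported, 12345, 2500) == data[12345:12345 + 2500]
```

Building the index requires one pass over all the data, which mirrors the setup cost described above. Once exported, the index can be loaded again without that pass.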

Open a pure Python file-like object for indexed reading

import io
import os
import pragzip

with open( "example.gz", 'rb' ) as file:
    in_memory_file = io.BytesIO( file.read() )

with pragzip.open( in_memory_file, parallelization = os.cpu_count() ) as file:
    file.seek( 123 )
    data = file.read( 100 )

Via Ratarmount

pragzip is planned to be used as a backend in ratarmount starting with version 0.12. You will then be able to use ratarmount to mount single gzip files easily.

base64 /dev/urandom | head -c $(( 4 * 1024 * 1024 * 1024 )) | gzip > sample.gz
# Serial decoding: 23 s
time gzip -c -d sample.gz | wc -c

python3 -m pip install --user ratarmount
ratarmount sample.gz mounted

# Parallel decoding: 3.5 s
time cat mounted/sample | wc -c

# Random seeking to the middle of the file and reading 1 MiB: 0.287 s
time dd if=mounted/sample bs=$(( 1024 * 1024 )) \
       iflag=skip_bytes,count_bytes skip=$(( 2 * 1024 * 1024 * 1024 )) count=$(( 1024 * 1024 )) | wc -c
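The same kind of random read can be expressed in Python for any seekable binary file object, for example the mounted file opened with open( "mounted/sample", 'rb' ). The helper read_range below is a hypothetical illustration, not part of pragzip or ratarmount, and is demonstrated on an in-memory buffer.

```python
import io


def read_range(fileobj, offset, size):
    """Seek to offset and read up to size bytes, tolerating short reads."""
    fileobj.seek(offset)
    parts = []
    remaining = size
    while remaining > 0:
        chunk = fileobj.read(remaining)
        if not chunk:  # end of file reached
            break
        parts.append(chunk)
        remaining -= len(chunk)
    return b"".join(parts)


if __name__ == "__main__":
    buffer = io.BytesIO(bytes(range(256)) * 16)
    middle = len(buffer.getvalue()) // 2
    data = read_range(buffer, middle, 1024)
    print(f"Read {len(data)} bytes starting at offset {middle}")
```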

C++ Library

Because it is written in C++, it can of course also be used as a C++ library. In order to make heavy use of templates and to simplify compiling with Python setuptools, it is mostly header-only, so integrating it into another project should be easy. The license is also permissive enough for most use cases.

I have not yet tested integrating it into other projects beyond manually copying the source in src/core, src/pragzip, and, if the integrated zlib is desired, src/external/zlib. If you have suggestions or wishes, such as support for CMake or Conan, please open an issue.

Internal Architecture

The main part of the internal architecture used for parallelizing is the same as used for indexed_bzip2.
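The combination of a least-recently-used cache and a parallelized prefetcher mentioned in the introduction can be pictured with the toy sketch below. It is not the actual C++ implementation. It assumes independently compressed chunks so that each one can be decoded in isolation, and it uses Python threads only to keep the example short. The class name and its parameters are invented for illustration.

```python
import gzip
from collections import OrderedDict
from concurrent.futures import ThreadPoolExecutor


class PrefetchingChunkReader:
    """Toy reader: LRU cache of decoded chunks plus a thread-pool prefetcher."""

    def __init__(self, compressed_chunks, cache_size=4, prefetch=2, workers=2):
        self._chunks = compressed_chunks   # independently compressed chunks
        self._cache = OrderedDict()        # chunk index -> Future with decoded bytes
        self._cache_size = cache_size
        self._prefetch = prefetch
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def _submit(self, index):
        if index in self._cache:
            self._cache.move_to_end(index)  # mark as most recently used
        else:
            self._cache[index] = self._pool.submit(gzip.decompress, self._chunks[index])
            while len(self._cache) > self._cache_size:
                self._cache.popitem(last=False)  # evict least recently used
        return self._cache[index]

    def get(self, index):
        """Return decoded chunk 'index' and start decoding the following chunks."""
        future = self._submit(index)
        last = min(index + 1 + self._prefetch, len(self._chunks))
        for ahead in range(index + 1, last):
            self._submit(ahead)
        if index in self._cache:
            self._cache.move_to_end(index)  # keep the requested chunk most recent
        return future.result()

    def close(self):
        self._pool.shutdown()


if __name__ == "__main__":
    chunks = [gzip.compress(bytes([i]) * 1000) for i in range(10)]
    reader = PrefetchingChunkReader(chunks)
    for i in range(len(chunks)):
        assert reader.get(i) == bytes([i]) * 1000
    reader.close()
```

With sequential access, each call to get finds its chunk already requested by an earlier call, which is where the speedup comes from. The cache size bounds the extra memory this costs.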
