Parallel random access to gzip files

Parallel Random Access Gzip (pragzip)

This module provides a PragzipFile class, which can be used to seek inside gzip files without having to decompress them first. Alternatively, you can use it simply as a parallelized gzip decoder in place of Python's builtin gzip module in order to utilize all your cores.

The random seeking support is the same as provided by indexed_gzip but further speedups are realized at the cost of higher memory usage thanks to a least-recently-used cache in combination with a parallelized prefetcher.
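
To illustrate the idea of combining a least-recently-used cache with a parallelized prefetcher, here is a minimal conceptual sketch in pure Python. It is not pragzip's actual implementation, which is written in C++; the class name PrefetchingChunkCache, the decode_chunk callback, and all parameters are hypothetical and only serve the illustration.

```python
from collections import OrderedDict
from concurrent.futures import ThreadPoolExecutor


class PrefetchingChunkCache:
    """LRU cache of decoded chunks that also decodes upcoming chunks in the background."""

    def __init__(self, decode_chunk, chunk_count, max_cached=16, prefetch_count=4, workers=4):
        self._decode_chunk = decode_chunk  # hypothetical callback: chunk index -> bytes
        self._chunk_count = chunk_count
        self._max_cached = max_cached
        self._prefetch_count = prefetch_count
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._cache = OrderedDict()  # chunk index -> Future, least recently used first

    def _submit(self, index):
        if index in self._cache:
            self._cache.move_to_end(index)  # mark as most recently used
        else:
            self._cache[index] = self._pool.submit(self._decode_chunk, index)
            while len(self._cache) > self._max_cached:
                self._cache.popitem(last=False)  # evict the least recently used entry
        return self._cache[index]

    def get(self, index):
        """Return the decoded chunk and queue the following chunks for decoding."""
        future = self._submit(index)
        last = min(index + 1 + self._prefetch_count, self._chunk_count)
        for upcoming in range(index + 1, last):
            self._submit(upcoming)
        return future.result()

    def close(self):
        self._pool.shutdown(wait=True)


# Toy usage: "decoding" chunk i simply yields four bytes with value i.
cache = PrefetchingChunkCache(lambda i: bytes([i]) * 4, chunk_count=10, max_cached=4)
first_chunk = cache.get(0)  # chunks 1 to 4 are decoded in the background meanwhile
cache.close()
```

The larger max_cached and prefetch_count are, the more decoded data is held in memory, which is the memory-for-speed trade-off mentioned above.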

Table of Contents

  1. Performance
    1. Decompression with Existing Index
    2. Decompression from Scratch
  2. Installation
  3. Usage
    1. Command Line Tool
    2. Python Library
    3. Via Ratarmount
    4. C++ Library
  4. Internal Architecture
  5. Tracing the Decoder

Performance

These are simple timing tests for reading all the contents of a gzip file sequentially.

Results are shown for an AMD Ryzen 3900X 12-core (24 virtual cores) processor and with gzipFilePath="4GiB-base64.gz", which is a 4 GiB gzip compressed file with base64 random data.

Be aware that the chunk size requested from the Python code heavily influences the performance. The pragzip benchmarks use a chunk size of 512 KiB, while the gzip baseline reads in chunks of 4 MiB.
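
To check how much the chunk size matters on your own machine, a small timing helper along the following lines can be used. This is only a sketch using the builtin gzip module on a small in-memory test file; the helper name time_sequential_read is hypothetical, and any file-like object with a read method can be passed instead.

```python
import base64
import gzip
import io
import os
import time


def time_sequential_read(file_object, chunk_size):
    """Read file_object to the end in chunk_size pieces and return (bytes_read, seconds)."""
    total = 0
    t0 = time.perf_counter()
    while True:
        chunk = file_object.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
    return total, time.perf_counter() - t0


# Small stand-in for the benchmark file: gzip-compressed base64-encoded random data.
payload = base64.b64encode(os.urandom(3 * 1024 * 1024))
compressed = gzip.compress(payload)

for chunk_size in (4 * 1024, 512 * 1024, 4 * 1024 * 1024):
    with gzip.open(io.BytesIO(compressed)) as file:
        size, seconds = time_sequential_read(file, chunk_size)
    print(f"chunk size {chunk_size:>8} B: read {size} B in {seconds:.3f} s")
```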

Decompression with Existing Index

| Module | Runtime / s | Bandwidth / (MB/s) | Speedup |
|---|---|---|---|
| gzip | 17.9 | 180 | 1 |
| pragzip with parallelization = 0 | 1.21 | 2700 | 14.8 |
| pragzip with parallelization = 1 | 14.0 | 230 | 1.3 |
| pragzip with parallelization = 2 | 7.2 | 450 | 2.5 |
| pragzip with parallelization = 6 | 2.51 | 1300 | 7.1 |
| pragzip with parallelization = 12 | 1.40 | 2330 | 12.8 |
| pragzip with parallelization = 24 | 1.11 | 2940 | 16.1 |
| pragzip with parallelization = 32 | 1.12 | 2920 | 16.0 |

Benchmark Code
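import gzip
import os
import time

gzipFilePath = "4GiB-base64.gz"
# Size of the compressed file; all bandwidths are given relative to it.
fileSize = os.stat(gzipFilePath).st_size

with gzip.open(gzipFilePath) as file:
    t0 = time.time()
    # Read in 4 MiB chunks until the end of the file is reached.
    while file.read(4*1024*1024):
        pass
    gzipDuration = time.time() - t0
    print(f"Decoded file in {gzipDuration:.2f}s, bandwidth: {fileSize / gzipDuration / 1e6:.0f} MB/s")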

The usage of pragzip is slightly different:

import os
import time

import indexed_gzip
import pragzip

fileSize = os.stat(gzipFilePath).st_size

if not os.path.exists(gzipFilePath + ".index"):
    with indexed_gzip.IndexedGzipFile(gzipFilePath) as file:
        file.build_full_index()
        file.export_index(gzipFilePath + ".index")

# parallelization = 0 means that it is automatically using all available cores.
for parallelization in [0, 1, 2, 6, 12, 24, 32]:
    with pragzip.PragzipFile(gzipFilePath, parallelization = parallelization) as file:
        file.import_index(open(gzipFilePath + ".index", 'rb'))
        t0 = time.time()
        # Unfortunately, the chunk size is very performance critical! It might depend on the cache size.
        while file.read(512*1024):
            pass
        pragzipDuration = time.time() - t0
        print(f"Decoded file in {pragzipDuration:.2f}s"
              f", bandwidth: {fileSize / pragzipDuration / 1e6:.0f} MB/s"
              f", speedup: {gzipDuration/pragzipDuration:.1f}")

Decompression from Scratch

Python

| Module | Runtime / s | Bandwidth / (MB/s) | Speedup |
|---|---|---|---|
| gzip | 17.5 | 190 | 1 |
| pragzip with parallelization = 0 | 1.22 | 2670 | 14.3 |
| pragzip with parallelization = 1 | 18.2 | 180 | 1.0 |
| pragzip with parallelization = 2 | 9.3 | 350 | 1.9 |
| pragzip with parallelization = 6 | 3.28 | 1000 | 5.3 |
| pragzip with parallelization = 12 | 1.82 | 1800 | 9.6 |
| pragzip with parallelization = 24 | 1.25 | 2620 | 14.0 |
| pragzip with parallelization = 32 | 1.30 | 2520 | 13.5 |

Note that pragzip is generally faster when given an index because it can then delegate the decompression to zlib, while it has to use its own gzip decompression engine when no index exists yet.

Note that values deviate roughly by 10% and therefore are rounded.
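
The speedup column is simply the gzip runtime divided by the pragzip runtime. The following sketch recomputes it from the rounded runtimes of the table in this section. The "scaling efficiency" relative to pragzip with parallelization = 1 is not part of the original tables; it is an additional, hypothetical metric added here for illustration.

```python
# Runtimes in seconds, copied from the "Decompression from Scratch" table.
GZIP_RUNTIME = 17.5
PRAGZIP_RUNTIMES = {1: 18.2, 2: 9.3, 6: 3.28, 12: 1.82, 24: 1.25}


def speedup(baseline_seconds, seconds):
    """Speedup as used in the tables: baseline runtime divided by measured runtime."""
    return baseline_seconds / seconds


def scaling_efficiency(single_core_seconds, seconds, parallelization):
    """Fraction of ideal linear scaling relative to pragzip with parallelization = 1."""
    return speedup(single_core_seconds, seconds) / parallelization


for parallelization, runtime in PRAGZIP_RUNTIMES.items():
    efficiency = scaling_efficiency(PRAGZIP_RUNTIMES[1], runtime, parallelization)
    print(f"parallelization = {parallelization:2}: "
          f"speedup over gzip {speedup(GZIP_RUNTIME, runtime):.1f}, "
          f"scaling efficiency {efficiency:.0%}")
```

Keep in mind that the benchmark machine has 12 physical cores, so linear scaling up to 24 virtual cores is not to be expected.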

Benchmark Code
import gzip
import os
import time

import pragzip

fileSize = os.stat(gzipFilePath).st_size

with gzip.open(gzipFilePath) as file:
    t0 = time.time()
    while file.read(4*1024*1024):
        pass
    gzipDuration = time.time() - t0
    print(f"Decoded file in {gzipDuration:.2f}s, bandwidth: {fileSize / gzipDuration / 1e6:.0f} MB/s")

# parallelization = 0 means that it is automatically using all available cores.
for parallelization in [0, 1, 2, 6, 12, 24, 32]:
    with pragzip.PragzipFile(gzipFilePath, parallelization = parallelization) as file:
        t0 = time.time()
        # Unfortunately, the chunk size is very performance critical! It might depend on the cache size.
        while file.read(512*1024):
            pass
        pragzipDuration = time.time() - t0
        print(f"Decoded file in {pragzipDuration:.2f}s"
              f", bandwidth: {fileSize / pragzipDuration / 1e6:.0f} MB/s"
              f", speedup: {gzipDuration/pragzipDuration:.1f}")

Installation

You can simply install it from PyPI:

python3 -m pip install --upgrade pip  # Recommended for newer manylinux wheels
python3 -m pip install pragzip

The latest unreleased development version can be tested out with:

python3 -m pip install --force-reinstall 'git+https://github.com/mxmlnkn/indexed_bzip2.git@master#egg=pragzip&subdirectory=python/pragzip'

To build locally, you can use the build package and install the resulting wheel:

cd python/pragzip
rm -rf dist
python3 -m build .
python3 -m pip install --force-reinstall --user dist/*.whl

Usage

Command Line Tool

pragzip --help

# Parallel decoding: 1.7 s
time pragzip -d -c -P 0 sample.gz | wc -c

# Serial decoding: 22 s
time gzip -d -c sample.gz | wc -c

Python Library

Simple open, seek, read, and close

import os

from pragzip import PragzipFile

file = PragzipFile( "example.gz", parallelization = os.cpu_count() )

# You can now use it like a normal file
file.seek( 123 )
data = file.read( 100 )
file.close()

The first call to seek ensures that the block offset list is complete and therefore might have to create it first. Because of this, the first call to seek might take a while.
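
The snippets in this section assume that a file named example.gz exists. The following sketch creates such a test file using only the builtin gzip module and reads a reference range from it, which can be compared against the data returned by pragzip. The helper names write_example_gzip and reference_read are hypothetical.

```python
import gzip
import os
import tempfile


def write_example_gzip(path, size=1024 * 1024):
    """Write `size` random bytes into a gzip file and return the uncompressed bytes."""
    data = os.urandom(size)
    with gzip.open(path, "wb") as file:
        file.write(data)
    return data


def reference_read(path, offset, size):
    """Seek and read with the builtin gzip module, which decompresses everything before offset."""
    with gzip.open(path, "rb") as file:
        file.seek(offset)
        return file.read(size)


# A temporary folder is used here to avoid overwriting an existing example.gz.
with tempfile.TemporaryDirectory() as folder:
    example_path = os.path.join(folder, "example.gz")
    original = write_example_gzip(example_path)
    # Same access pattern as in the snippet above: seek( 123 ), then read( 100 ).
    reference = reference_read(example_path, 123, 100)
    print("Reference matches original:", reference == original[123:223])
```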

Use with context manager

import os
import pragzip

with pragzip.open( "example.gz", parallelization = os.cpu_count() ) as file:
    file.seek( 123 )
    data = file.read( 100 )

Storing and loading the block offset map

The creation of the list of gzip blocks can take a while because it has to decode the gzip file completely. To avoid this setup when opening a gzip file, the block offset list can be exported and imported.
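
A minimal sketch of this workflow, using only the calls that also appear in the benchmark code above (index creation and export via indexed_gzip, import via PragzipFile.import_index), could look as follows. The helper names and the ".index" file suffix are hypothetical conventions, and it is assumed that import_index reads the index during the call so that the index file can be closed afterwards.

```python
import os


def index_path_for(gzip_path):
    """Hypothetical naming convention: store the index next to the archive."""
    return gzip_path + ".index"


def open_with_index(gzip_path, parallelization=0):
    """Open gzip_path with pragzip, creating the index with indexed_gzip if it is missing."""
    # Imported lazily so that index_path_for stays usable without the packages installed.
    import indexed_gzip
    import pragzip

    index_path = index_path_for(gzip_path)
    if not os.path.exists(index_path):
        # One-time cost: the whole file has to be decoded to build the index.
        with indexed_gzip.IndexedGzipFile(gzip_path) as file:
            file.build_full_index()
            file.export_index(index_path)

    file = pragzip.PragzipFile(gzip_path, parallelization=parallelization)
    # Assumption: the index is read completely during this call.
    with open(index_path, "rb") as index_file:
        file.import_index(index_file)
    return file
```

With such a helper, `file = open_with_index("example.gz")` would only pay the indexing cost on the first call, after which the returned object can be used for seeking and reading and should be closed by the caller.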

Open a pure Python file-like object for indexed reading

import io
import os
import pragzip

with open( "example.gz", 'rb' ) as file:
    in_memory_file = io.BytesIO( file.read() )

with pragzip.open( in_memory_file, parallelization = os.cpu_count() ) as file:
    file.seek( 123 )
    data = file.read( 100 )

Via Ratarmount

pragzip is planned to be used as a backend inside ratarmount with version 0.12. Then, you can use ratarmount to mount single gzip files easily.

base64 /dev/urandom | head -c $(( 4 * 1024 * 1024 * 1024 )) | gzip > sample.gz
# Serial decoding: 23 s
time gzip -c -d sample.gz | wc -c

python3 -m pip install --user ratarmount
ratarmount sample.gz mounted

# Parallel decoding: 3.5 s
time cat mounted/sample | wc -c

# Random seeking to the middle of the file and reading 1 MiB: 0.287 s
time dd if=mounted/sample bs=$(( 1024 * 1024 )) \
       iflag=skip_bytes,count_bytes skip=$(( 2 * 1024 * 1024 * 1024 )) count=$(( 1024 * 1024 )) | wc -c
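
Because the mounted file behaves like a regular file, the same random access can also be done from Python with plain file operations. The following is a sketch of the dd call above; the helper name read_range is hypothetical, and the demonstration uses a small temporary file instead of mounted/sample so that it runs without a mount.

```python
import os
import tempfile

MIB = 1024 * 1024
GIB = 1024 * MIB


def read_range(path, offset, size):
    """Return up to `size` bytes starting at byte `offset`, like dd with skip_bytes and count_bytes."""
    with open(path, "rb") as file:
        file.seek(offset)
        return file.read(size)


# With a mounted archive, the dd call above would correspond to:
#     read_range("mounted/sample", 2 * GIB, MIB)
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "sample")
    with open(sample_path, "wb") as file:
        file.write(bytes(range(256)) * 16)
    middle = read_range(sample_path, 2048, 16)
    print(f"Read {len(middle)} bytes from the middle of the file")
```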

C++ Library

Because it is written in C++, it can of course also be used as a C++ library. In order to make heavy use of templates and to simplify compiling with Python setuptools, it is mostly header-only, so integrating it into another project should be easy. The license is also permissive enough for most use cases.

I have not yet tested integrating it into other projects beyond manually copying the sources in src/core, src/pragzip, and, if the integrated zlib is desired, also src/external/zlib. If you have suggestions or wishes, such as support for CMake or Conan, please open an issue.

Internal Architecture

The main part of the internal architecture used for parallelizing is the same as used for indexed_bzip2.
