DictZip - Random Access gzip files

python-idzip

Seekable, gzip compatible, compression format

The gzip format allows extra fields to be stored in the gzip header. Idzip uses them to store the offsets needed for efficient seeking.
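To make the extra-field mechanism concrete, here is a minimal sketch that builds a gzip member with one extra subfield by hand and reads it back, following the header layout in RFC 1952. The subfield ID `XX` and its 4-byte payload are made up for illustration; they are not the subfield that idzip or dictzip actually writes, and `make_member` and `read_extra_subfields` are hypothetical helper names.

```python
import gzip
import struct
import zlib

FEXTRA = 0x04  # FLG bit that signals an extra field (RFC 1952)


def make_member(payload: bytes, subfield_id: bytes, subfield_data: bytes) -> bytes:
    """Build one gzip member whose header carries a single extra subfield."""
    extra = subfield_id + struct.pack("<H", len(subfield_data)) + subfield_data
    # ID1, ID2, CM=8 (deflate), FLG, MTIME=0, XFL=0, OS=255 (unknown)
    header = struct.pack("<BBBBIBB", 0x1F, 0x8B, 8, FEXTRA, 0, 0, 255)
    header += struct.pack("<H", len(extra)) + extra
    # Negative wbits produces a raw deflate stream without a zlib wrapper.
    deflater = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)
    body = deflater.compress(payload) + deflater.flush()
    trailer = struct.pack("<II", zlib.crc32(payload) & 0xFFFFFFFF,
                          len(payload) & 0xFFFFFFFF)
    return header + body + trailer


def read_extra_subfields(member: bytes) -> dict:
    """Return {subfield_id: data} parsed from the member's extra field."""
    id1, id2, _method, flags = struct.unpack_from("<BBBB", member, 0)
    if (id1, id2) != (0x1F, 0x8B) or not flags & FEXTRA:
        return {}
    (xlen,) = struct.unpack_from("<H", member, 10)
    pos, end = 12, 12 + xlen
    subfields = {}
    while pos < end:
        sid = member[pos:pos + 2]
        (length,) = struct.unpack_from("<H", member, pos + 2)
        subfields[sid] = member[pos + 4:pos + 4 + length]
        pos += 4 + length
    return subfields


payload = b"hello idzip"
member = make_member(payload, b"XX", struct.pack("<I", 1234))
subfields = read_extra_subfields(member)
plain = gzip.decompress(member)  # ordinary gzip readers skip the extra field
```

Because a standard reader ignores subfields it does not understand, a file that carries seek information this way stays readable by plain gzip tools.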

Install

python setup.py install

or

[python-idzip RHEL6 signed RPM](http://pkgs.bauman.in/repoview/python-idzip.html)

Acknowledgement

Based on https://code.google.com/p/idzip/

The file format was designed by Rik Faith for dictzip. Idzip uses multiple gzip members so that there is no file size limit.
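The multi-member idea relies on a property of the gzip format itself: several members written back to back form a valid gzip stream. Each member has its own header, and the length of the extra field in a header is a 16-bit value (RFC 1952), so a single member can only hold a limited amount of seek information. The sketch below uses only the standard library and shows the concatenation property; it does not use idzip.

```python
import gzip

parts = [b"first member\n", b"second member\n", b"third member\n"]

# Compress each part as an independent gzip member.
members = [gzip.compress(part) for part in parts]

# Concatenating the members yields one valid gzip stream.
stream = b"".join(members)
restored = gzip.decompress(stream)
```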

Idzip means Improved Dictzip.

This package adds a Writer class.

Sizing

Downloaded http://textfiles.com/stories/bureau.txt and concatenated copies of it into input.txt, up to about 20 GB.

gzfile.txt.gz was generated using standard gzip.

dzfile.txt.dz was generated using this library.

    total 50172612
    -rw-rw-r--. 1 dan dan 21313751280 May 10 15:58 input.txt
    -rw-rw-r--. 1 dan dan  8576570661 May 10 17:21 dzfile.txt.dz
    -rw-rw-r--. 1 dan dan  8076548622 May 10 16:28 gzfile.txt.gz

The idzip file is about 6% larger than the standard gzip file.
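The size comparison can be checked with a few lines of arithmetic on the byte counts from the listing above:

```python
# Byte counts copied from the directory listing.
input_size = 21313751280
dz_size = 8576570661
gz_size = 8076548622

dz_ratio = dz_size / input_size      # compressed size relative to the input
gz_ratio = gz_size / input_size
overhead = dz_size / gz_size - 1     # extra space idzip needs compared to gzip

print(f"idzip: {dz_ratio:.1%} of input")
print(f"gzip:  {gz_ratio:.1%} of input")
print(f"idzip overhead over gzip: {overhead:.1%}")
```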

Seek Timing

    import gzip
    from time import time

    import idzip

    seekpos = 21313751280 - 15

    # Baseline: seek in the uncompressed file.
    start = time()
    original = open("/home/dan/ziptest/input.txt", "rb")
    original.seek(seekpos)
    original.close()
    print("Raw Seek to end", time() - start, "seconds")

    # Standard gzip has to decompress everything before the target position.
    start = time()
    verify = gzip.open("/home/dan/ziptest/gzfile.txt.gz", "rb")
    verify.seek(seekpos)
    verify.close()
    print("Standard GZIP Seek to end", time() - start, "seconds")

    # idzip seeks using the offsets stored in the gzip header.
    start = time()
    verify = idzip.open("/home/dan/ziptest/input.txt.dz")
    verify.seek(seekpos)
    verify.close()
    print("idzip Seek to end", time() - start, "seconds")

Output:

    Raw Seek to end 0.000866889953613 seconds
    Standard GZIP Seek to end 255.133864164 seconds
    idzip Seek to end 0.0381989479065 seconds
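The gap between the gzip and idzip timings comes from being able to jump straight to the chunk that contains the target position. The following is a simplified sketch of that principle using only `zlib`; it is not idzip's actual implementation or file layout, and `CHUNK_SIZE`, `compress_chunks`, and `read_at` are names invented for the illustration.

```python
import zlib

CHUNK_SIZE = 64 * 1024  # illustrative chunk size, not idzip's actual value


def compress_chunks(data: bytes):
    """Compress each chunk independently and record where each one starts."""
    blob = bytearray()
    offsets = []
    for start in range(0, len(data), CHUNK_SIZE):
        offsets.append(len(blob))
        blob += zlib.compress(data[start:start + CHUNK_SIZE])
    offsets.append(len(blob))  # sentinel marking the end of the last chunk
    return bytes(blob), offsets


def read_at(blob: bytes, offsets: list, pos: int, size: int) -> bytes:
    """Return up to `size` bytes of the original data starting at `pos`."""
    out = bytearray()
    index = pos // CHUNK_SIZE   # chunk that contains the target position
    skip = pos % CHUNK_SIZE     # bytes to skip inside that chunk
    while size > 0 and index < len(offsets) - 1:
        chunk = zlib.decompress(blob[offsets[index]:offsets[index + 1]])
        piece = chunk[skip:skip + size]
        out += piece
        size -= len(piece)
        skip = 0
        index += 1
    return bytes(out)


data = bytes(range(256)) * 1000
blob, offsets = compress_chunks(data)
tail = read_at(blob, offsets, len(data) - 15, 15)  # touches only the last chunk
```

Only the chunks that overlap the requested range are decompressed, so the cost of a seek depends on the chunk size rather than on the position in the file.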

Stream Writer

The Writer class allows streaming: the input is read and written in chunks.

    from idzip import Writer

    outfile = "/home/dan/ziptest/input1.txt.dz"
    writer = Writer(outfile, sync_size=1048576*100)
    infile = open("/home/dan/ziptest/input.txt", "rb")
    while True:
        data = infile.read(1048576+1)
        if not data:
            break
        writer.write(data)
    writer.close()
    infile.close()

Alternatively, you can open an IdzipFile in write mode and accomplish the same task:

    import idzip

    infile = open("/home/dan/ziptest/input.txt", "rb")
    writer = idzip.IdzipFile("/home/dan/ziptest/input1.txt.dz", "wb", sync_size=1048576*100)

    with infile, writer:
        while True:
            data = infile.read(1048576 + 1)
            if not data:
                break
            writer.write(data)
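The read/write loop in the examples above is the same pattern that `shutil.copyfileobj` implements for any pair of file-like objects. The sketch below shows it with in-memory buffers and `gzip.GzipFile` standing in for the idzip writer, so it runs without idzip installed. Passing an idzip writer instead is an assumption that relies only on it exposing `write()`, as the examples above show.

```python
import gzip
import io
import shutil

CHUNK = 1024 * 1024  # copy 1 MiB at a time

source = io.BytesIO(b"example data " * 100_000)
compressed = io.BytesIO()

# gzip.GzipFile is a stand-in here; any object with a write() method works.
with gzip.GzipFile(fileobj=compressed, mode="wb") as writer:
    shutil.copyfileobj(source, writer, CHUNK)

restored = gzip.decompress(compressed.getvalue())
```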
