DictZip - Random Access gzip files
python-idzip
Seekable, gzip compatible, compression format
Gzip allows extra fields to be stored in the gzip header. Idzip uses these fields to store the offsets needed for efficient seeking.
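To make the header mechanism concrete, here is a minimal sketch that builds one gzip member carrying an extra subfield, reads the subfield back, and confirms that the standard `gzip` module still decompresses the member. The header layout follows RFC 1952. The subfield ID `XX` and its payload are placeholders chosen for illustration and are not idzip's actual layout; `build_member` and `read_subfields` are hypothetical helper names.

```python
import gzip
import struct
import zlib

FEXTRA = 0x04  # FLG bit that signals an extra field (RFC 1952)


def build_member(payload: bytes, subfield_id: bytes, subfield_data: bytes) -> bytes:
    """Build one gzip member whose header carries a single extra subfield."""
    extra = subfield_id + struct.pack("<H", len(subfield_data)) + subfield_data
    header = struct.pack(
        "<BBBBIBBH",
        0x1F, 0x8B,  # gzip magic bytes
        8,           # CM: deflate
        FEXTRA,      # FLG: extra field present
        0,           # MTIME: not set
        0,           # XFL
        255,         # OS: unknown
        len(extra),  # XLEN: total length of the extra field
    ) + extra
    deflater = zlib.compressobj(wbits=-15)  # raw deflate, no zlib wrapper
    body = deflater.compress(payload) + deflater.flush()
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload) & 0xFFFFFFFF)
    return header + body + trailer


def read_subfields(member: bytes) -> dict:
    """Return {subfield_id: data} parsed from the first member's header."""
    if not member[3] & FEXTRA:
        return {}
    (xlen,) = struct.unpack_from("<H", member, 10)
    pos, end = 12, 12 + xlen
    fields = {}
    while pos < end:
        sid = member[pos:pos + 2]
        (length,) = struct.unpack_from("<H", member, pos + 2)
        fields[sid] = member[pos + 4:pos + 4 + length]
        pos += 4 + length
    return fields


payload = b"hello world\n" * 100
index = struct.pack("<3H", 10, 20, 30)  # placeholder "offset" values
member = build_member(payload, b"XX", index)
fields = read_subfields(member)
roundtrip = gzip.decompress(member)  # a standard reader skips the extra field
```

Because ordinary gzip readers skip subfields they do not recognise, a file that carries seek information this way stays readable by standard tools.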
Install
python setup.py install
or
[python-idzip RHEL6 signed RPM](http://pkgs.bauman.in/repoview/python-idzip.html)
Acknowledgement
Based on https://code.google.com/p/idzip/
The file format was designed by Rik Faith for dictzip. Idzip uses multiple gzip members so that there is no file size limit.
Idzip means Improved Dictzip.
A Writer class was added.
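The multi-member approach mentioned above relies on a general property of the gzip format: several members written back to back still form one valid gzip stream. The sketch below shows only that property using the standard library; it does not produce an idzip file.

```python
import gzip

# Two independently compressed gzip members.
first = gzip.compress(b"first member\n")
second = gzip.compress(b"second member\n")

# Concatenating the members yields a stream that decompresses
# to the concatenation of the original payloads.
combined = first + second
restored = gzip.decompress(combined)
```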
Sizing
The test input was built by downloading
http://textfiles.com/stories/bureau.txt
and concatenating copies of it with cat until input.txt reached about 20GB.
gzfile was generated using standard gzip.
dzfile was generated using this library.
total 50172612
-rw-rw-r--. 1 dan dan 21313751280 May 10 15:58 input.txt
-rw-rw-r--. 1 dan dan 8576570661 May 10 17:21 dzfile.txt.dz
-rw-rw-r--. 1 dan dan 8076548622 May 10 16:28 gzfile.txt.gz
The idzip file is about 6% larger than the standard gzip file, so the size is almost the same.
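The comparison can be checked with a few lines of arithmetic. The sizes below are copied from the directory listing above; the variable names are only for illustration.

```python
# Sizes in bytes, copied from the directory listing above.
input_size = 21313751280
dz_size = 8576570661
gz_size = 8076548622

dz_ratio = dz_size / input_size            # idzip output relative to the input
gz_ratio = gz_size / input_size            # gzip output relative to the input
overhead = (dz_size - gz_size) / gz_size   # extra space idzip needs over gzip

print(f"idzip output is {dz_ratio:.1%} of the input")
print(f"gzip output is {gz_ratio:.1%} of the input")
print(f"idzip is {overhead:.1%} larger than gzip")
```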
Seek Timing
seekpos = 21313751280 - 15  # 15 bytes before the end of input.txt
from time import time
# Baseline: seek in the uncompressed file.
start = time()
original = open("/home/dan/ziptest/input.txt", "rb")  # binary mode for byte offsets
original.seek(seekpos)
original.close()
print("Raw Seek to end", time() - start, "seconds")
# Standard gzip: the seek has to decompress everything before seekpos.
import gzip
start = time()
verify = gzip.open("/home/dan/ziptest/gzfile.txt.gz", "rb")
verify.seek(seekpos)
verify.close()
print("Standard GZIP Seek to end", time() - start, "seconds")
# idzip: the seek uses the offsets stored in the gzip header.
import idzip
start = time()
verify = idzip.open("/home/dan/ziptest/input.txt.dz")
verify.seek(seekpos)
verify.close()
print("idzip Seek to end", time() - start, "seconds")
Raw Seek to end 0.000866889953613 seconds
Standard GZIP Seek to end 255.133864164 seconds
idzip Seek to end 0.0381989479065 seconds
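The gap between the gzip and idzip timings comes from how each reader reaches the target offset: a plain gzip stream has to be decompressed from the start, while stored offsets let a reader jump straight to the relevant compressed chunk. The sketch below illustrates that general technique with independently compressed chunks and a list of compressed sizes. It is not idzip's implementation; `CHUNK`, `compress_chunks`, and `read_at` are hypothetical names, and `zlib.compress` stands in for the real on-disk format.

```python
import zlib
from itertools import accumulate

CHUNK = 1024  # uncompressed bytes per chunk (illustrative value)


def compress_chunks(data: bytes):
    """Compress each chunk independently; return (blob, compressed_sizes)."""
    parts = [zlib.compress(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]
    return b"".join(parts), [len(p) for p in parts]


def read_at(blob: bytes, sizes, pos: int, n: int) -> bytes:
    """Read up to n bytes at uncompressed offset pos, touching only the needed chunks."""
    starts = [0, *accumulate(sizes)]  # compressed offset where each chunk begins
    out = bytearray()
    while n > 0 and pos // CHUNK < len(sizes):
        idx, skip = divmod(pos, CHUNK)
        chunk = zlib.decompress(blob[starts[idx]:starts[idx + 1]])
        piece = chunk[skip:skip + n]
        if not piece:  # pos is past the end of the data
            break
        out += piece
        pos += len(piece)
        n -= len(piece)
    return bytes(out)


data = bytes(range(256)) * 40  # 10240 bytes of sample data
blob, sizes = compress_chunks(data)
middle = read_at(blob, sizes, 5000, 3000)  # spans several chunks
tail = read_at(blob, sizes, 10230, 100)    # request runs past the end
```

The cost of a read depends on the number of chunks it touches rather than on its position in the file, which is consistent with the timings above.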
Stream Writer
The Writer class allows data to be written as a stream.
from idzip import Writer
outfile = "/home/dan/ziptest/input1.txt.dz"
writer = Writer(outfile, sync_size=1048576*100)
infile = open("/home/dan/ziptest/input.txt", "rb")
# Copy the input to the writer in pieces of just over 1 MiB.
while True:
    data = infile.read(1048576+1)
    if not data:
        break
    writer.write(data)
writer.close()
infile.close()
Alternatively, you can open an IdzipFile in write mode and accomplish the
same task:
import idzip
infile = open("/home/dan/ziptest/input.txt", "rb")
writer = idzip.IdzipFile("/home/dan/ziptest/input1.txt.dz", "wb", sync_size=1048576*100)
# Both files are closed automatically when the with block exits.
with infile, writer:
    while True:
        data = infile.read(1048576 + 1)
        if not data:
            break
        writer.write(data)
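The read-and-write loop in both examples is a plain chunked copy, so it could likely be replaced by `shutil.copyfileobj` from the standard library, assuming the writer only needs a `write(bytes)` method as in the loops above. This has not been checked against idzip here: `copy_stream` is a hypothetical helper, and `io.BytesIO` objects stand in for the real input file and idzip writer.

```python
import io
import shutil

CHUNK_SIZE = 1048576 + 1  # same read size as the loops above


def copy_stream(infile, writer, chunk_size=CHUNK_SIZE):
    """Copy infile to writer in chunk_size pieces until infile is exhausted."""
    shutil.copyfileobj(infile, writer, chunk_size)


# Stand-ins for the real files: any objects with read() and write() work.
source = io.BytesIO(b"example data " * 1000)
sink = io.BytesIO()
copy_stream(source, sink)
```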