Python library to load DBGZ files
dbgz
DBGZ (Data block "GNU" zip) is a lightweight, seekable, compressed binary format for storing and retrieving data. A file holds a schema together with data made of primitive objects, such as numbers, strings and arrays, and supports more complex values, such as dictionaries and Python objects, via msgpack. It is based on the bgzip file format used in biology and other fields.
This utility can be used to write and read DBGZ files in Python.
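The idea behind a seekable, block-compressed format can be illustrated with the standard library alone. The sketch below is only an illustration of the general bgzip-style approach (independent gzip blocks concatenated into one stream); it is not the actual DBGZ on-disk layout, and the names `records`, `offsets` and `read_block` are made up for this example.

```python
import gzip

# Illustration only: this is NOT the real DBGZ file layout.
records = [b"first record", b"second record", b"third record"]

blocks = []    # one independent gzip member per record
offsets = []   # (start, length) of each block inside the stream
position = 0
for record in records:
    block = gzip.compress(record)
    offsets.append((position, len(block)))
    blocks.append(block)
    position += len(block)

# Concatenated gzip members still form a valid gzip stream.
stream = b"".join(blocks)


def read_block(stream, offsets, i):
    """Decompress only block i, without touching the other blocks."""
    start, length = offsets[i]
    return gzip.decompress(stream[start:start + length])


second = read_block(stream, offsets, 1)   # random access to one block
everything = gzip.decompress(stream)      # sequential read of all blocks
```

Because every block is compressed on its own, a reader that knows a block's offset can jump straight to it, which is what makes index-based lookups practical.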
Installation
Install using pip
pip install dbgz
or from source:
pip install git+https://github.com/filipinascimento/dbgz.git
Usage
First import dbgz:
import dbgz
Defining a scheme
scheme = [
    ("anInteger", "i"),
    ("aFloat", "f"),
    ("aString", "s"),
    ("anIntArray", "I"),
    ("aFloatArray", "F"),
    ("anStringArray", "S"),
    ("anyType", "a"),  # any data type
]
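As a rough aid for reasoning about the scheme above, the following sketch checks a record against it before writing. It is not part of dbgz: `CHECKS` and `find_problems` are hypothetical helpers, and the meaning assigned to each type code is inferred from the field names in the example, so treat it as an assumption rather than the library's own validation rules.

```python
scheme = [
    ("anInteger", "i"),
    ("aFloat", "f"),
    ("aString", "s"),
    ("anIntArray", "I"),
    ("aFloatArray", "F"),
    ("anStringArray", "S"),
    ("anyType", "a"),
]


def _is_int(value):
    return isinstance(value, int) and not isinstance(value, bool)


def _is_number(value):
    # Integers are accepted for float fields in this sketch.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _is_list_of(check):
    return lambda value: isinstance(value, (list, tuple)) and all(check(v) for v in value)


# Assumed meaning of each type code (hypothetical, inferred from the example).
CHECKS = {
    "i": _is_int,
    "f": _is_number,
    "s": lambda value: isinstance(value, str),
    "I": _is_list_of(_is_int),
    "F": _is_list_of(_is_number),
    "S": _is_list_of(lambda v: isinstance(v, str)),
    "a": lambda value: True,
}


def find_problems(scheme, record):
    """Return a list of human-readable problems; fields may be omitted."""
    types = dict(scheme)
    problems = []
    for name, value in record.items():
        if name not in types:
            problems.append(f"unknown field: {name}")
        elif not CHECKS[types[name]](value):
            problems.append(f"bad value for {name} (type code {types[name]!r})")
    return problems
```

For instance, `find_problems(scheme, {"anInteger": 1, "aString": "1"})` returns an empty list, while a string passed for `anInteger` is reported.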
Writing some data to a dbgz file
from tqdm.auto import tqdm  # Optional, to print a progress bar
# pip install tqdm
totalCount = 1000000
with dbgz.DBGZWriter("test.dbgz", scheme) as fd:
    # New entries can be added as:
    fd.write(anInteger=1, aString="1")
    fd.write(anInteger=2, aString="2", aFloat=5)
    fd.write(anInteger=3, aString="3", anIntArray=list(range(10)), aFloatArray=[0.1, 0.2, 0.3, 0.5])
    # Here is a loop to write a lot of data:
    for index in tqdm(range(totalCount)):
        fd.write(
            anInteger=index,
            aFloat=index*0.01,
            anIntArray=list(range(index, index+10)),
            aString=str(index),
            aFloatArray=[index+0.1, index-0.2, index+0.3, index+0.4],
            anStringArray=[str(index), str(index+1), str(index+2), str(index+3)],
            anyType={"a": index, "b": index+1, "c": index+2}
        )
Reading the dbgz file sequentially:
with dbgz.DBGZReader("test.dbgz") as fd:
    print(fd.scheme)
    for entry in tqdm(fd.entries, total=fd.entriesCount):
        assert entry["anInteger"] == int(entry["aString"])
Loading a dbgz file manually by using the read() method:
with dbgz.DBGZReader("test.dbgz") as fd:
    pbar = tqdm(total=fd.entriesCount)
    print(fd.scheme)
    while True:
        entries = fd.read(10)
        if not entries:
            break
        for entry in entries:
            assert entry["anInteger"] == int(entry["aString"])
        pbar.update(len(entries))
    pbar.refresh()
    pbar.close()
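The manual read loop above can be wrapped in a small generator so that calling code only deals with batches. This is a sketch, not part of dbgz: `iter_batches` is a hypothetical helper that relies only on the behaviour shown above (`read(n)` returning a falsy value once the file is exhausted), and `FakeReader` is a stand-in so the example runs without a real file.

```python
def iter_batches(reader, batch_size=10):
    """Yield lists of entries until reader.read() returns nothing."""
    while True:
        entries = reader.read(batch_size)
        if not entries:
            break
        yield entries


class FakeReader:
    """Stand-in for dbgz.DBGZReader; only mimics read(count)."""

    def __init__(self, entries):
        self._entries = list(entries)
        self._position = 0

    def read(self, count):
        batch = self._entries[self._position:self._position + count]
        self._position += len(batch)
        return batch


fake = FakeReader({"anInteger": i, "aString": str(i)} for i in range(25))
batch_sizes = [len(batch) for batch in iter_batches(fake, 10)]
```

With a real file, the same helper would presumably be used as `for entries in iter_batches(fd, 10):` inside the `with dbgz.DBGZReader(...)` block.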
Generating an index dictionary in memory (indicesPath=None) and using it to look up entries:
with dbgz.DBGZReader("test.dbgz") as fd:
    indexDictionary = fd.generateIndex("anInteger",
        indicesPath=None,
        filterFunction=lambda entry: entry["anInteger"] < 10,
        useDictionary=True,
        showProgressbar=True
    )
    for key, values in indexDictionary.items():
        print(key, values)
        for value in values:
            assert int(key) == fd.readAt(value)[0]["anInteger"]
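The lookup pattern in this example (an index mapping each key to a list of positions, with `readAt(position)[0]` returning the entry) can be collected into one helper. The sketch below is not part of dbgz: `entries_for_key` is a hypothetical name, and `FakeReader` only imitates the `readAt` behaviour seen above. The real index may store keys in a different type (the example converts them with `int(key)`), so the key passed in must match whatever the index actually contains.

```python
def entries_for_key(reader, index, key):
    """Return every entry stored under key, using the positions in index."""
    return [reader.readAt(position)[0] for position in index.get(key, [])]


class FakeReader:
    """Stand-in for dbgz.DBGZReader; readAt returns a one-item list."""

    def __init__(self, entries_by_position):
        self._entries = entries_by_position

    def readAt(self, position):
        return [self._entries[position]]


# Positions here are arbitrary numbers standing in for file offsets.
fake = FakeReader({
    100: {"anInteger": 1, "aString": "1"},
    250: {"anInteger": 2, "aString": "2"},
    400: {"anInteger": 1, "aString": "1"},
})
index = {1: [100, 400], 2: [250]}

matches = entries_for_key(fake, index, 1)
missing = entries_for_key(fake, index, 99)
```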
Saving the index dictionary to a file and loading it again:
with dbgz.DBGZReader("test.dbgz") as fd:
    fd.generateIndex(
        key="anInteger",
        indicesPath="test_byAnInteger.idbgz",
        filterFunction=lambda entry: entry["anInteger"] < 10,
        useDictionary=True,
        showProgressbar=True
    )
    # Load the index from the same path it was written to above
    indexDictionary = dbgz.readIndicesDictionary("test_byAnInteger.idbgz")
    for key, values in indexDictionary.items():
        print(key, values)
        for value in values:
            assert int(key) == fd.readAt(value)[0]["anInteger"]
Using a custom key generator for the index:
with dbgz.DBGZReader("test.dbgz") as fd:
    fd.generateIndex(
        keyFunction=lambda entry: entry["anyType"]["b"] if entry["anyType"] else None,
        indicesPath="test_byAnyType_b.idbgz",
        filterFunction=lambda entry: entry["anInteger"] < 10,
        useDictionary=True,
        showProgressbar=True
    )
    indexDictionary = dbgz.readIndicesDictionary("test_byAnyType_b.idbgz")
    for key, values in indexDictionary.items():
        print(key, values)
        for value in values:
            entry = fd.readAt(value)[0]
            assert int(key) == entry["anyType"]["b"] if entry["anyType"] else None
Source Distribution
File details
Details for the file dbgz-0.5.4.tar.gz.
File metadata
- Download URL: dbgz-0.5.4.tar.gz
- Upload date:
- Size: 18.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.7.4
File hashes
Algorithm | Hash digest
---|---
SHA256 | b17c116b55cac911c95997825b6e6a311268dce01b181f831b51fa8e9fcdd4d2
MD5 | 734aadda2f2cc6e5ae5a6746ddedb7d0
BLAKE2b-256 | 9c415bd5d817e03ba8d1697808c2c469a32cc668e587be5450fcc1fa7334ac5a