Python library to load DBGZ files

dbgz

DBGZ (Data block "GNU" zip) is a lightweight, seekable, compressed binary format for storing and retrieving data. The format is designed to hold a schema together with data made of primitive objects, such as numbers, strings and arrays, as well as more complex values, such as dictionaries and Python objects, which are stored via msgpack. It is based on the bgzip file format used in biology and other fields.
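
To make "seekable compressed" concrete, here is a minimal sketch of the general idea behind block-compressed formats such as bgzip: data is compressed in independent blocks, so a reader can jump to one block and decompress only that block. This is an illustration using the standard library only. It is not the DBGZ on-disk layout, and `write_blocks` and `read_block` are hypothetical helper names.

```python
import gzip
import io

def write_blocks(stream, blocks):
    """Write each block as an independent gzip member.

    Returns a list of (start, length) pairs, one per block.
    """
    index = []
    for block in blocks:
        start = stream.tell()
        stream.write(gzip.compress(block))
        index.append((start, stream.tell() - start))
    return index

def read_block(stream, start, length):
    """Seek to one block and decompress only that block."""
    stream.seek(start)
    return gzip.decompress(stream.read(length))

# An in-memory buffer stands in for a file on disk.
buffer = io.BytesIO()
index = write_blocks(buffer, [b"first block", b"second block", b"third block"])

# Random access: only the second member is decompressed.
second = read_block(buffer, *index[1])

# The concatenated members still form one valid gzip stream.
everything = gzip.decompress(buffer.getvalue())
```

Because every block is a complete gzip member, the whole buffer can still be decompressed in one pass, while the recorded offsets allow jumping straight to a single block.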

This utility can be used to write and read DBGZ files in Python.

Installation

Install using pip:

pip install dbgz

or from source:

pip install git+https://github.com/filipinascimento/dbgz.git

Usage

First import dbgz:

import dbgz

Defining a scheme:

scheme = [
    ("anInteger","i"),
    ("aFloat","f"),
    ("aString","s"),
    ("anIntArray","I"),
    ("aFloatArray","F"),
    ("anStringArray","S"),
    ("anyType","a"), #any data type
]
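
The type codes above can be read as a small type system. As a rough sketch, the hypothetical helper below (`find_problems`, not part of dbgz) checks an entry against a scheme before it is written. The meaning of each code is inferred from the field names in the scheme, and the rules (integers accepted where floats are expected, missing or `None` fields allowed) are assumptions based on the writing examples in this document, not dbgz's own validation.

```python
from numbers import Real

def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)

def _is_float(value):
    # Assumption: integers are acceptable where a float is expected.
    return isinstance(value, Real) and not isinstance(value, bool)

def _is_str(value):
    return isinstance(value, str)

def _is_list_of(check):
    return lambda value: isinstance(value, (list, tuple)) and all(check(v) for v in value)

# Assumed meaning of each type code, inferred from the scheme's field names.
CHECKS = {
    "i": _is_int,
    "f": _is_float,
    "s": _is_str,
    "I": _is_list_of(_is_int),
    "F": _is_list_of(_is_float),
    "S": _is_list_of(_is_str),
    "a": lambda value: True,  # any data type
}

def find_problems(scheme, entry):
    """Return a list of problems; an empty list means the entry looks fine."""
    types = dict(scheme)
    problems = []
    for name, value in entry.items():
        if name not in types:
            problems.append(f"unknown field: {name}")
        elif value is not None and not CHECKS[types[name]](value):
            problems.append(f"{name}: does not match type code {types[name]!r}")
    return problems

scheme = [
    ("anInteger", "i"), ("aFloat", "f"), ("aString", "s"),
    ("anIntArray", "I"), ("aFloatArray", "F"), ("anStringArray", "S"),
    ("anyType", "a"),
]

ok = find_problems(scheme, {"anInteger": 2, "aString": "2", "aFloat": 5})
bad = find_problems(scheme, {"anInteger": "2", "anIntArray": [1, "x"], "extra": 1})
```

Here `ok` is empty, while `bad` reports the string passed as an integer, the mixed-type array and the unknown field.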

Writing some data to a dbgz file:

from tqdm.auto import tqdm # Optional, to print progress bar
# pip install tqdm

totalCount = 1000000
with dbgz.DBGZWriter("test.dbgz",scheme) as fd:
    # New entries can be added as:
    fd.write(anInteger=1, aString="1")
    fd.write(anInteger=2, aString="2", aFloat=5)
    fd.write(anInteger=3, aString="3",anIntArray=list(range(10)), aFloatArray=[0.1,0.2,0.3,0.5])

    # Here is a loop to write a lot of data:
    for index in tqdm(range(totalCount)):
        fd.write(
            anInteger=index,
            aFloat=index*0.01,
            anIntArray=list(range(index,index+10)),
            aString=str(index),
            aFloatArray=[index+0.1,index-0.2,index+0.3,index+0.4],
            anStringArray=[str(index),str(index+1),str(index+2),str(index+3)],
            anyType={"a": index, "b": index+1, "c": index+2}
        )

Reading the dbgz file sequentially:

with dbgz.DBGZReader("test.dbgz") as fd:
    print(fd.scheme)
    for entry in tqdm(fd.entries,total=fd.entriesCount):
        assert entry["anInteger"] == int(entry["aString"])

Loading a dbgz file manually by using the read() method:

with dbgz.DBGZReader("test.dbgz") as fd:
    pbar = tqdm(total=fd.entriesCount)
    print(fd.scheme)
    while True:
        entries = fd.read(10)
        if not entries:
            break
        for entry in entries:
            assert entry["anInteger"] == int(entry["aString"])
        pbar.update(len(entries))
pbar.refresh()
pbar.close()
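
The `while True` loop above can be wrapped in a small generator so that calling code becomes a plain `for` loop. This is a sketch: `iter_batches` is a hypothetical helper, and it assumes only what the loop above relies on, namely that `read(count)` returns an empty result once the file is exhausted. `FakeReader` is a stand-in so the example runs without a dbgz file.

```python
def iter_batches(read, batch_size):
    """Yield batches from read(batch_size) until it returns an empty batch."""
    while True:
        batch = read(batch_size)
        if not batch:
            return
        yield batch

class FakeReader:
    """Stand-in reader that mimics the assumed read(count) behaviour."""

    def __init__(self, entries):
        self._entries = list(entries)
        self._position = 0

    def read(self, count):
        batch = self._entries[self._position:self._position + count]
        self._position += len(batch)
        return batch

reader = FakeReader({"anInteger": i, "aString": str(i)} for i in range(25))

batch_sizes = []
for entries in iter_batches(reader.read, 10):
    for entry in entries:
        assert entry["anInteger"] == int(entry["aString"])
    batch_sizes.append(len(entries))
```

With a real reader the call would look like `iter_batches(fd.read, 10)`, under the same assumption about `fd.read`.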

Generating an index dictionary in memory and using it for random access:

with dbgz.DBGZReader("test.dbgz") as fd:
    indexDictionary = fd.generateIndex("anInteger",
        indicesPath=None,
        filterFunction=lambda entry: entry["anInteger"]<10,
        useDictionary=True,
        showProgressbar = True
        )
    for key,values in indexDictionary.items():
        print(key,values)
        for value in values:
            assert int(key) == fd.readAt(value)[0]["anInteger"]
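
Conceptually, such an index maps each key to the list of positions of the entries that produced it. The sketch below builds that mapping over a plain Python list to show how the key, key function and filter function interact. `build_index` is a hypothetical helper, and the positions here are list indices; what the values passed to `fd.readAt` actually represent in dbgz is not described in this document.

```python
from collections import defaultdict

def build_index(entries, key=None, key_function=None, filter_function=None):
    """Map each key to the positions of the entries that produced it."""
    if (key is None) == (key_function is None):
        raise ValueError("pass exactly one of key or key_function")
    get_key = key_function if key_function is not None else (lambda entry: entry[key])

    index = defaultdict(list)
    for position, entry in enumerate(entries):
        # Entries rejected by the filter are left out of the index.
        if filter_function is not None and not filter_function(entry):
            continue
        index[get_key(entry)].append(position)
    return dict(index)

entries = [{"anInteger": i % 4, "aString": str(i)} for i in range(12)]

index = build_index(
    entries,
    key="anInteger",
    filter_function=lambda entry: entry["anInteger"] < 2,
)

# Every indexed position points back to an entry with the matching key.
for found_key, positions in index.items():
    for position in positions:
        assert entries[position]["anInteger"] == found_key
```

A key can map to several positions, which is why the examples in this document iterate over `values` for each key.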

Saving the index dictionary to a file and loading it again:

with dbgz.DBGZReader("test.dbgz") as fd:
    fd.generateIndex(
        key = "anInteger",
        indicesPath="test_byAnInteger.idbgz", 
        filterFunction=lambda entry: entry["anInteger"]<10,
        useDictionary=True,
        showProgressbar = True
        )

    indexDictionary = dbgz.readIndicesDictionary("test_byAnInteger.idbgz")
    for key,values in indexDictionary.items():
        print(key,values)
        for value in values:
            assert int(key) == fd.readAt(value)[0]["anInteger"]

Using a custom key generator for the index:

with dbgz.DBGZReader("test.dbgz") as fd:
    fd.generateIndex(
        keyFunction=lambda entry: entry["anyType"]["b"] if entry["anyType"] else None,
        indicesPath="test_byAnyType_b.idbgz", 
        filterFunction=lambda entry: entry["anInteger"]<10,
        useDictionary=True,
        showProgressbar = True
        )

    indexDictionary = dbgz.readIndicesDictionary("test_byAnyType_b.idbgz")
    for key,values in indexDictionary.items():
        print(key,values)
        for value in values:
            entry = fd.readAt(value)[0]
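            # Only compare when the entry has an anyType value; the original one-line
            # conditional asserted None (always failing) for entries without one.
            if entry["anyType"]:
                assert int(key) == entry["anyType"]["b"]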

Source Distribution

dbgz-0.5.1.tar.gz (19.1 kB)

File metadata

  • Download URL: dbgz-0.5.1.tar.gz
  • Size: 19.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.62.3 importlib-metadata/4.11.1 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.8.12

File hashes

Hashes for dbgz-0.5.1.tar.gz

Algorithm    Hash digest
SHA256       22e4d961b1047a36390f8370dd0f95bdc74ad25dfc5eeabf2d8537edab0f6ff9
MD5          625bbee73a615946ebd9dbecd7c45855
BLAKE2b-256  e983dba7eda141f8c12152dc57eeb0871e7e8884b31652fc634a5d09e3aa25d2
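
A downloaded archive can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch: `sha256_of` and `matches_published_hash` are hypothetical helper names, and the local path passed in is an assumption about where the file was saved.

```python
import hashlib

# SHA256 digest published above for dbgz-0.5.1.tar.gz.
EXPECTED_SHA256 = "22e4d961b1047a36390f8370dd0f95bdc74ad25dfc5eeabf2d8537edab0f6ff9"

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published_hash(path, expected=EXPECTED_SHA256):
    """Return True when the file's SHA256 digest equals the expected one."""
    return sha256_of(path) == expected.lower()
```

For example, `matches_published_hash("dbgz-0.5.1.tar.gz")` would return `True` only if the file in the current directory has the published digest.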
