serde_mol2

Python/Rust module for mol2 format (de)serialization

Installation

Install from PyPI (requires Python >= 3.8):

pip install serde-mol2

After that:

-> python3
Python 3.9.5 (default, Jun  4 2021, 12:28:51)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import serde_mol2
>>> m = serde_mol2.read_file('example.mol2')
>>> m
[<builtins.Mol2 object at 0x7f6da9ebcae0>]

Or using a binary:

-> serde-mol2 -h
serde-mol2 0.2.2
CSC - IT Center for Science Ltd. (Jaroslaw Kalinowski <jaroslaw.kalinowski@csc.fi>)

USAGE:
    serde-mol2 [OPTIONS]

OPTIONS:
    -a, --append                       Append to mol2 files when writing rather than truncate
    -c, --compression <COMPRESSION>    Level of compression for BLOB data, 0 means no compression
                                       [default: 3]
        --comment <COMMENT>            Comment to add/filter to/by the molecule comment field
        --desc <DESC>                  Description to add/filter to/by entries when writing to the
                                       database
        --filename-desc                Add filename to the desc field when adding a batch of files
                                       to the database
    -h, --help                         Print help information
    -i, --input <INPUT_FILE>...        Input mol2 file
        --limit <LIMIT>                Limit the number of structures retrieved from the database.
                                       Zero means no limit. [default: 0]
        --list-desc                    List available row descriptions present in the database
        --no-shm                       Do not try using shm device when writing to databases
    -o, --output <OUTPUT_FILE>         Output mol2 file
        --offset <OFFSET>              Offset when limiting the number of structures retrieved from
                                       the database. Zero means no offset. [default: 0]
    -s, --sqlite <SQLITE_FILE>         Sqlite database file
    -V, --version                      Print version information
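
The options above can also be assembled from a script. The following is a minimal sketch that builds an argument list using only flags shown in the help output. The helper name `build_serde_mol2_args` is hypothetical, and the sketch assumes that combining `-i` with `-s` loads the input files into the database, which the help text suggests but does not state outright.

```python
def build_serde_mol2_args(inputs, sqlite_file, compression=3, desc=None, no_shm=False):
    """Assemble a serde-mol2 command line (hypothetical helper; flags taken from -h)."""
    args = ["serde-mol2", "-s", str(sqlite_file), "-c", str(compression)]
    if desc is not None:
        args += ["--desc", desc]
    if no_shm:
        args.append("--no-shm")
    # -i accepts one or more files (<INPUT_FILE>...), so it is placed last
    args.append("-i")
    args += [str(path) for path in inputs]
    return args
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the `serde-mol2` binary is on the PATH.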

Usage a.k.a. quick function reference

class Mol2

Mol2.to_json()

Return a JSON string for a Mol2 object.

Mol2.as_string()

Return a mol2 string for a Mol2 object.

Mol2.write_mol2( filename, append=False )

Write the Mol2 object to a mol2 file.

Mol2.serialized()

Return the Mol2 object in a serialized Python form.
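
As a small illustration of `to_json()`, the sketch below decodes a list of structures into plain Python objects. It relies only on `to_json()` returning a JSON string, as stated above. The helper name `mol2_list_to_dicts` is made up for this example, and the layout of the decoded data is not documented here, so inspect it before depending on particular keys.

```python
import json


def mol2_list_to_dicts(structures):
    """Decode each object's to_json() output with the standard json module.

    Hypothetical helper: works on any objects exposing a to_json() method.
    """
    return [json.loads(structure.to_json()) for structure in structures]
```

With the package installed, this could be called as `mol2_list_to_dicts(serde_mol2.read_file('example.mol2'))`.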

Functions

write_mol2( list, filename, append=False )

list is a list of Mol2 objects. The function writes all structures in the list to a mol2 file named filename.

db_insert( list, filename, compression=3, shm=True )

Insert a vector of structures into a database. Append if the database already exists.

Input:
- list: vector of structures
- filename: path to the database
- compression: compression level
- shm: should we try to use the database from a temporary location?

read_db_all( filename, shm=False, desc=None, comment=None, limit=0, offset=0 )

Read all structures from a database and return them as a vector.

Input:
- filename: path to the database
- shm: should we try to use the database from a temporary location?
- desc: return only entries containing desc in the desc field
- comment: return only entries containing comment in the molecule comment
- limit: limit the number of structures retrieved from the database; zero means no limit
- offset: offset when limiting the number of structures retrieved from the database; zero means no offset
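
The limit and offset parameters make it possible to walk through a large database in batches instead of loading everything at once. The sketch below shows one way to do that. The helper `iter_batches` is hypothetical, it takes the reader function as an argument so it does not depend on the package being importable, and it assumes that an offset past the last row returns an empty list.

```python
def iter_batches(read_page, db_path, batch_size=1000, **filters):
    """Yield lists of structures, at most batch_size at a time.

    read_page is expected to accept (filename, limit=..., offset=...), like
    read_db_all as documented above. Extra keyword filters (for example desc)
    are passed through unchanged.
    """
    if batch_size <= 0:
        # limit=0 means "no limit", so it cannot be used as a batch size
        raise ValueError("batch_size must be positive")
    offset = 0
    while True:
        batch = read_page(db_path, limit=batch_size, offset=offset, **filters)
        if not batch:
            break
        yield batch
        if len(batch) < batch_size:
            break  # a short batch means the end was reached
        offset += batch_size
```

With the package installed, a loop such as `for batch in iter_batches(serde_mol2.read_db_all, 'structures.db', batch_size=500): ...` would process 500 structures at a time.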

read_db_all_serialized( filename, shm=True, desc=None, comment=None, limit=0, offset=0 )

Read all structures from a database and return them as a vector, but keep the structures in a serialized Python form rather than a binary form.

Input:
- filename: path to the database
- shm: should we try to use the database from a temporary location?
- desc: return only entries containing desc in the desc field
- comment: return only entries containing comment in the molecule comment
- limit: limit the number of structures retrieved from the database; zero means no limit
- offset: offset when limiting the number of structures retrieved from the database; zero means no offset

read_file_to_db( filename, db-filename, compression=3, shm=True , desc=None, comment=None )

Convenience function. Read structures from a mol2 file and write them directly to the database.

Input:
- filename: path to the mol2 file
- db-filename: path to the database
- compression: compression level
- shm: should we use the database from a temporary location?
- desc: add this description to the structures read
- comment: add this comment to the molecule comment field

read_file_to_db_batch( filenames, db-filename, compression=3, shm=True, desc=None, comment=None )

Convenience function. Read structures from a set of files directly into the database.

Input:
- filenames: vector of paths to mol2 files
- db-filename: path to the database
- compression: compression level
- shm: should we use the database from a temporary location?
- desc: add this description to the structures read
- comment: add this comment to the molecule comment field
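
The batch function expects a vector of file paths. A minimal sketch for gathering such a list from a directory tree is shown below. The helper name `collect_mol2_files` is hypothetical, and the sketch assumes that plain strings are accepted as paths.

```python
from pathlib import Path


def collect_mol2_files(directory, pattern="*.mol2"):
    """Return a sorted list of string paths for matching files under directory."""
    return sorted(str(path) for path in Path(directory).rglob(pattern) if path.is_file())
```

With the package installed, the result could be passed positionally, for example `serde_mol2.read_file_to_db_batch(collect_mol2_files('ligands'), 'structures.db')`.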

read_file( filename, desc=None, comment=None )

Read a mol2 file and return a vector of structures.

Input:
- filename: path to the mol2 file
- desc: add this description to the structures read
- comment: add this comment to the molecule comment field

read_file_serialized( filename, desc=None, comment=None )

Read a mol2 file and return a vector of structures, but as serialized Python structures rather than in a binary form.

Input:
- filename: path to the mol2 file
- desc: add this description to the structures read
- comment: add this comment to the molecule comment field

desc_list( filename, shm=False )

List the unique entry descriptions found in a database.

Input:
- filename: path to a database
- shm: should we use the database from a temporary location?

Notes

Compression

Compression applies to sections other than MOLECULE. Those sections contain multiple rows, so they are stored in the database in binary form (BLOB). Since the binary data is not human-readable anyway, it makes sense to apply at least some compression. The algorithm currently used is zstd, and the default compression level is 3. Note that in zstd itself level 0 selects the default compression level, whereas in this module level 0 means no compression.

At the time of writing, the overhead of (de)compressing the data is negligible compared to the I/O and CPU cost of reading, writing, and parsing.
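
The "level 0 means no compression" convention can be illustrated with a short sketch. This is not how the module stores its BLOBs: it uses the standard library's zlib as a stand-in for zstd, and the one-byte markers and function names are hypothetical, chosen only so the example can tell raw and compressed payloads apart.

```python
import zlib

# Hypothetical one-byte markers, not part of any real storage format
RAW = b"\x00"
COMPRESSED = b"\x01"


def pack_blob(data: bytes, level: int = 3) -> bytes:
    """Store data uncompressed when level is 0, otherwise compress it."""
    if level == 0:
        return RAW + data
    # zlib accepts levels 1-9 here; zstd supports a wider range
    return COMPRESSED + zlib.compress(data, level)


def unpack_blob(blob: bytes) -> bytes:
    """Reverse pack_blob by inspecting the marker byte."""
    marker, payload = blob[:1], blob[1:]
    if marker == RAW:
        return payload
    if marker == COMPRESSED:
        return zlib.decompress(payload)
    raise ValueError("unknown blob marker")
```

The point of the sketch is the branch on level 0: a caller asking for level 0 gets the bytes back untouched rather than the compressor's default level.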

SHM

When writing to the database, the module writes just one row at a time. On shared filesystems this kind of writing is very slow. When the shm functionality is enabled, the module tries to copy the database to /dev/shm and use it there, essentially performing all operations in memory. However, this means that the file in the original location is effectively unusable by other processes while this happens, because it will be overwritten at the end.
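
The copy, work, copy-back pattern described above can be sketched with the standard library. This is an illustration of the general idea, not the module's actual implementation. The name `staged_database` is hypothetical, and the sketch falls back to the default temporary directory when the scratch directory does not exist.

```python
import shutil
import sqlite3
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def staged_database(db_path, scratch_dir="/dev/shm"):
    """Copy db_path into scratch_dir, yield a connection to the copy, copy it back."""
    db_path = Path(db_path)
    # Use the default temp location if scratch_dir is unavailable (e.g. not on Linux)
    scratch_root = scratch_dir if Path(scratch_dir).is_dir() else None
    with tempfile.TemporaryDirectory(dir=scratch_root) as tmp:
        staged = Path(tmp) / db_path.name
        if db_path.exists():
            shutil.copy2(db_path, staged)
        conn = sqlite3.connect(staged)
        try:
            yield conn
            conn.commit()
        finally:
            conn.close()
        # Reached only if the body succeeded. This overwrites the original,
        # so changes made there by other processes in the meantime are lost.
        shutil.copy2(staged, db_path)
```

If the body of the `with` block raises, the copy-back step is skipped and the original file is left as it was.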

Another problem with working in /dev/shm is that a database that is too big can exhaust the available space. Make sure your database fits into the available memory.
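
A quick check before enabling shm can catch the out-of-space case early. The sketch below compares the database size with the free space in the scratch directory. The helper name `fits_in_scratch` and the headroom factor of 2 are assumptions made for illustration, since the database grows while structures are inserted.

```python
import os
import shutil


def fits_in_scratch(db_path, scratch_dir="/dev/shm", headroom=2.0):
    """Return True if scratch_dir has at least headroom * size(db_path) bytes free."""
    # A database that does not exist yet has size 0
    size = os.path.getsize(db_path) if os.path.exists(db_path) else 0
    free = shutil.disk_usage(scratch_dir).free
    return free >= size * headroom
```

If the check fails, passing `shm=False` to the functions above (or `--no-shm` on the command line) avoids the temporary copy.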

In the future there will be an option to choose a different TMPDIR than /dev/shm, for example one that points to a fast NVMe storage.

By default, shm is used only when writing to the database, since reading does not seem to be affected as much.

Limitations

The biggest limitation at the moment is that only the following sections are read:

MOLECULE
ATOM
BOND
SUBSTRUCTURE

All other sections are currently just dropped silently.
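
Because unsupported sections are dropped without warning, it can be useful to check a file beforehand. The sketch below scans for `@<TRIPOS>` record headers, which mark the start of each section in the mol2 format, and counts those outside the supported set. The function name `unsupported_sections` is hypothetical and the check is independent of this module.

```python
from collections import Counter

SUPPORTED_SECTIONS = {"MOLECULE", "ATOM", "BOND", "SUBSTRUCTURE"}
TAG = "@<TRIPOS>"


def unsupported_sections(lines):
    """Count section headers in lines that are not in SUPPORTED_SECTIONS."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line.startswith(TAG):
            name = line[len(TAG):].strip().upper()
            if name not in SUPPORTED_SECTIONS:
                counts[name] += 1
    return dict(counts)
```

For a file on disk, something like `with open('example.mol2') as fh: print(unsupported_sections(fh))` lists the sections that would be lost.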

Release history

- 0.2.4 (Jan 28, 2022), this version
- 0.2.1 (Jan 12, 2022)
- 0.2.0 (Jan 12, 2022)
- 0.1.2 (Jan 10, 2022)
- 0.1.1 (Jan 10, 2022)

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

- serde_mol2-0.2.4-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.whl (3.1 MB), uploaded Jan 28, 2022; CPython 3.10; manylinux: glibc 2.5+; x86-64
- serde_mol2-0.2.4-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl (3.1 MB), uploaded Jan 28, 2022; CPython 3.9; manylinux: glibc 2.5+; x86-64
- serde_mol2-0.2.4-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl (3.1 MB), uploaded Jan 28, 2022; CPython 3.8; manylinux: glibc 2.5+; x86-64

Hashes for serde_mol2-0.2.4-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm	Hash digest
SHA256	`e5cb26c6b3e72456d89c69ee1f4e281c760a5bfd1c4c035dfb5a00fbebd548f0`
MD5	`7fcff589c3d940739e77d2b80e27895b`
BLAKE2b-256	`973598f9e7c0a04e5ce7c3dc67a196e6e97b31b6f72cb0fe8b58e99fee701dc6`

Hashes for serde_mol2-0.2.4-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm	Hash digest
SHA256	`13e31c0f22510997968e42701354d3c1690e7b71d3c41ab0c24366bf46017627`
MD5	`5f345cf97419dc7b59ff61679352e81e`
BLAKE2b-256	`9d26e91208885964ff3b27ae882e20d14cf63c10da4a88414e2287026e2f034a`

Hashes for serde_mol2-0.2.4-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm	Hash digest
SHA256	`348fee2481074456247af31294478496a198ab3677a517636f497cd9310607c7`
MD5	`ee5714944913101f9cf1173bb9cc84c3`
BLAKE2b-256	`704e997663a4f3d32a7afc85a57c9be1debc0563abaeda6d366d41390d6b625f`