An implementation of Roger Sayle's SmiZip algorithm for compressing short strings
SmiZip is a compression method for short strings. It was developed by Roger Sayle in 1998 while at Metaphorics LLC to compress SMILES strings.
This repo is a Python implementation by Noel O’Boyle of the SmiZip algorithm, as described by Roger in a MUG01 presentation in 2001: https://www.daylight.com/meetings/mug01/Sayle/SmiZip/index.htm
Quick start
Install as follows:
pip install smizip
Let’s compress and then decompress a .smi file of canonical SMILES from RDKit, using the n-grams trained for this purpose and listed in rdkit.slow.json (available from the GitHub site):
smizip -i test.smi -o test.smiz -n example-ngrams/rdkit.slow.json
smizip -d -i test.smiz -o test.2.smi -n example-ngrams/rdkit.slow.json
To create your own JSON file of n-grams, you can train on a dataset (find_best_ngrams), or modify an existing JSON (add_char_to_json).
To use from Python:
import json
from smizip import SmiZip

json_file = "rdkit.slow.json"
with open(json_file) as inp:
    ngrams = json.load(inp)

zipper = SmiZip(ngrams)
zipped = zipper.zip("c1ccccc1C(=O)Cl")  # gives bytes
unzipped = zipper.unzip(zipped)
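To give a feel for what zip and unzip do conceptually, here is a minimal, self-contained sketch of n-gram byte coding. It is not the smizip implementation: the functions toy_zip and toy_unzip, the tiny toy_ngrams list, and the greedy longest-match strategy are all assumptions made for illustration, and the real library may choose n-grams differently. The only idea taken from the text above is that each n-gram maps to a single byte.

```python
def toy_zip(text, ngrams):
    """Greedy longest-match encoding (illustrative only, not smizip's code).

    Each n-gram is written as one byte equal to its index in `ngrams`.
    """
    index = {ng: i for i, ng in enumerate(ngrams)}
    longest = max(len(ng) for ng in ngrams)
    out = bytearray()
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink until one matches.
        for size in range(min(longest, len(text) - pos), 0, -1):
            code = index.get(text[pos:pos + size])
            if code is not None:
                out.append(code)
                pos += size
                break
        else:
            raise ValueError(f"no n-gram covers {text[pos]!r}")
    return bytes(out)


def toy_unzip(data, ngrams):
    """Decode by looking up each byte value as an index into `ngrams`."""
    return "".join(ngrams[b] for b in data)


# Hypothetical n-gram list; a real one would have up to 256 entries.
toy_ngrams = ["c", "1", "C", "(", ")", "=", "O", "l", "cc", "Cl", "C(=O)"]

smiles = "c1ccccc1C(=O)Cl"
zipped = toy_zip(smiles, toy_ngrams)
print(len(smiles), "characters ->", len(zipped), "bytes")
print(toy_unzip(zipped, toy_ngrams) == smiles)
```

With this toy list, the multi-character n-grams "cc", "Cl" and "C(=O)" each collapse to a single byte, which is where the saving comes from.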
Note
You should include \n (newline) as a single-character n-gram if you intend to store the zipped representation in a file with lines terminated by \n. Otherwise, the byte value of \n will be assigned to a multi-gram, and zipped SMILES containing \n will be generated.
A similar warning goes for any SMILES termination character in a file. If you expect to store zipped SMILES that terminate in a TAB or SPACE character, you should add these characters as single-character n-grams. Otherwise the zipped representation may contain these and you won’t know which TABs are terminations and which are part of the representation.
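The check described in this note can be sketched in a few lines. This is a hedged illustration, not part of smizip: the helper colliding_delimiters is hypothetical, and it assumes the n-grams are held in a list whose index is the byte value written to the zipped output. If your n-gram data is organised differently, adapt the lookup accordingly.

```python
def colliding_delimiters(ngrams, delimiters=("\n", "\t", " ")):
    """Return the delimiters whose byte value is assigned to another n-gram.

    Assumption: ngrams[i] is the n-gram encoded as byte value i.
    """
    collisions = []
    for delim in delimiters:
        slot = ord(delim)
        # A slot beyond the end of the list is never emitted, so it is safe.
        if slot < len(ngrams) and ngrams[slot] != delim:
            collisions.append(delim)
    return collisions


# Hypothetical 256-entry list in which byte 10 (newline) holds a multi-gram.
example_ngrams = [chr(i) for i in range(256)]
example_ngrams[ord("\n")] = "cc"
print(colliding_delimiters(example_ngrams))

# Reserving the slot for the delimiter itself removes the collision.
example_ngrams[ord("\n")] = "\n"
print(colliding_delimiters(example_ngrams))
```

Running a check like this before writing zipped SMILES to a line-oriented or TAB-separated file catches the problem early, rather than when the file fails to parse.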