An implementation of Roger Sayle's SmiZip algorithm for compressing short strings

SmiZip is a compression method for short strings. It was developed by Roger Sayle in 1998 while at Metaphorics LLC to compress SMILES strings.

This repo is an implementation in Python by Noel O’Boyle of the SmiZip algorithm as described by Roger in a Mug01 presentation in 2001: https://www.daylight.com/meetings/mug01/Sayle/SmiZip/index.htm

Quick start

Install as follows:

pip install smizip

Let’s compress and decompress a .smi file of canonical SMILES from RDKit, using n-grams trained for this purpose and listed in rdkit.slow.json (available from the GitHub site):

smizip    -i test.smi  -o test.smiz  -n example-ngrams/rdkit.slow.json
smizip -d -i test.smiz -o test.2.smi -n example-ngrams/rdkit.slow.json

To create your own JSON file of n-grams, you can train on a dataset (find_best_ngrams), or modify an existing JSON (add_char_to_json).

To use from Python:

import json
from smizip import SmiZip

# Load the trained n-grams from a JSON file
json_file = "rdkit.slow.json"
with open(json_file) as inp:
    ngrams = json.load(inp)

zipper = SmiZip(ngrams)
zipped = zipper.zip("c1ccccc1C(=O)Cl")  # gives bytes
unzipped = zipper.unzip(zipped)  # gives back the original string
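
To give a feel for what a one-byte-per-n-gram scheme does, here is a minimal toy sketch. It is not the smizip implementation: the helper names (build_table, toy_zip, toy_unzip), the tiny n-gram list, and the greedy longest-match strategy are all assumptions made for illustration, and the real library may choose n-grams differently. The one assumption taken from the note in this README is that a single-character n-gram keeps its own byte value.

```python
def build_table(multigrams, single_chars):
    """Toy 256-slot table (hypothetical layout): single characters keep
    their own byte value, multi-character n-grams fill the free slots."""
    table = [None] * 256
    for ch in single_chars:
        table[ord(ch)] = ch
    free_slots = (i for i in range(256) if table[i] is None)
    for gram in multigrams:
        table[next(free_slots)] = gram
    return table


def toy_zip(text, table):
    """Greedy longest-match encoding: one output byte per matched n-gram."""
    lookup = {gram: i for i, gram in enumerate(table) if gram is not None}
    longest = max(len(gram) for gram in lookup)
    out = bytearray()
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(longest, len(text) - pos), 0, -1):
            code = lookup.get(text[pos:pos + size])
            if code is not None:
                out.append(code)
                pos += size
                break
        else:
            raise ValueError(f"no n-gram covers {text[pos]!r}")
    return bytes(out)


def toy_unzip(data, table):
    """Decoding is a plain table lookup per byte."""
    return "".join(table[b] for b in data)


# Illustrative n-grams only; real n-gram sets are trained on a dataset.
table = build_table(["cccc", "C(=O)", "c1", "Cl"], "c1C(=O)l")
smiles = "c1ccccc1C(=O)Cl"
packed = toy_zip(smiles, table)
restored = toy_unzip(packed, table)
```

With this toy table the 15-character SMILES packs into 5 bytes ("c1", "cccc", "c1", "C(=O)", "Cl") and unpacks to the original string.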

Note

You should include \n (newline) as a single-character n-gram if you intend to store the zipped representation in a file with lines terminated by \n. Otherwise, the byte value of \n will be assigned to a multi-gram, and zipped SMILES containing \n will be generated.

A similar warning applies to any character that terminates a SMILES string in a file. If you expect to store zipped SMILES that terminate in a TAB or SPACE character, you should add these characters as single-character n-grams. Otherwise the zipped representation may contain them, and you won’t know which TABs are terminators and which are part of the representation.
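
A quick sanity check for the warnings above might look like the following sketch. It assumes the n-grams are held as a Python list in which position i is the n-gram assigned to byte value i; that layout is an assumption for illustration, and the actual structure of the smizip JSON files may differ. The function name colliding_delimiters is hypothetical.

```python
def colliding_delimiters(ngrams, delimiters="\n\t "):
    """Return the delimiters whose byte value is assigned to some other n-gram.

    Assumes (hypothetically) that ngrams[i] is the n-gram encoded as byte i.
    A delimiter is safe when its own slot holds the delimiter itself, or
    when the slot does not exist because the list is shorter.
    """
    return [
        d for d in delimiters
        if ord(d) < len(ngrams) and ngrams[ord(d)] != d
    ]


# Toy table: every byte maps to its own character, except that the
# slot of "\n" (byte value 10) has been given to a multi-gram.
toy_ngrams = [chr(i) for i in range(256)]
toy_ngrams[ord("\n")] = "cc"

problems = colliding_delimiters(toy_ngrams)
```

Here problems contains only "\n", since TAB and SPACE still occupy their own slots in the toy table.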
