A compressor and decompressor library for cheminformatics SMILES strings
Project description
SMILEZ - compression for SMILES strings
=======================================
SMILEZ home page: https://bitbucket.org/dalke/smilez
Smilez is a simple compression library for SMILES strings. It's
closely based on the SMAZ compression library for short strings,
available from https://github.com/antirez/smaz .
Smilez is not a general purpose compression algorithm. It can compress
SMILES strings by about 50-60%, including many SMILES strings which
are only a few bytes long. In the best case it can compress up to 75%,
but if you give it non-SMILES data the result might be 50% larger.
You can use SMILEZ through the C API or the Python API.
COMPRESSION EXAMPLES
====================
The SMILEZ code dictionaries were trained on a mixture of PubChem and
ChEMBL data. Compression performance will depend on how closely the
SMILES strings match that training set. In general it does best with
"normal" SMILES, and poorly with uncommon features like reaction maps
and isotopes.
from to
size size SMILES
---- ---- ------------------------------------------
1 1 C
2 1 CC
3 2 CCC
4 1 CCCC
7 4 F/C=C/F
6 8 [NH4+]
9 4 c1ccccc1O
26 13 Cn1cnc2c1c(=O)n(c(=O)n2C)C
28 10 CN1C=NC2=C1C(=O)N(C(=O)N2C)C
42 17 COc1ccc2nc([nH]c2c1)S(=O)Cc1ncc(C)c(OC)c1C
43 46 [CH2:1]=[CH:2][CH2:1][CH2:3][C:4](C)[CH2:3]
60 30 C[N+]1(C2CC(CC1C3C2O3)OC(=O)C(C4=CC=CS4)(C5=CC=CS5)O)C.[Br-]
ChEMBL Example
==============
I have a SMILES data set derived from ChEMBL 16 with 1,292,344
records.
uncompressed: 66,896,248 bytes
compressed: 25,455,759 bytes
=> 60% smaller
I timed the process using the Python API. It took about 2.6 seconds to
compress the strings and 1.4 seconds to decompress them.
By comparison, it takes zlib about 51 seconds to compress the same
data set. The result takes up 61,406,214 bytes, which is only 8%
smaller than the original. (This is why you don't use zlib on small
strings.)
I double-checked using the 24,323 SMILES from the NCI data set. The
compressed SMILES are 58% smaller than the original SMILES (1.21 MB to
0.51 MB).
C API
=====
int smilez_compress(const char *in, int inlen, char *out, int outlen,
int dictionary)
Compress the SMILES string in 'in' of length 'inlen' and put the
compressed data into 'out' of maximum length 'outlen' bytes. If the
output buffer is too short to hold the whole compressed string,
outlen+1 is returned. Otherwise, the length of the compressed string
(less then or equal to outlen) is returned.
The 'dictionary' option specifies the encoding dictionary. If
SMILEZ_BYTE_DICTIONARY (=0) then the compressed string may use the
bytes 0-255. If SMILEZ_WHITESPACE_DICTIONARY (=1) then the compressed
strings will not use the bytes "\n", "\r", "\t", or " ". This lets you
use a SMILEZ string as a field in a tab or space separated file, or in
an SD tag, for a very small size penalty.
The dictionary only affects the encoder. The decoder handles both
dictionaries automatically. If the decoder value is out of range then
smilez_compress returns 0.
int smaz_decompress(char *in, int inlen, char *out, int outlen);
Decompress the buffer 'in' of length 'inlen' and put the decompressed data into
'out' of max length 'outlen' bytes. If the output buffer is too short to hold
the whole decompressed string, outlen+1 is returned. Otherwise the length of the
compressed string (less then or equal to outlen) is returned. This function will
not automatically put a nul-term at the end of the string if the original
compressed string didn't included a nulterm.
The following macros are nearly self-explanatory:
#define SMILEZ_VERSION "1.0"
#define SMILEZ_MAJOR_VERSION 1
#define SMILEZ_MINOR_VERSION 0
#define SMILEZ_COMPRESSION_VERSION 1
#define SMILEZ_BYTE_DICTIONARY 0
#define SMILEZ_WHITESPACE_DICTIONARY 1
#define SMILEZ_NUM_DICTIONARIES 2
The SMILEZ_COMPRESSION_VERSION will change only when the compression
format changes.
The SMILEZ_VERSION and SMILEZ_COMPRESSION_VERSION values are also
available through function calls:
const char *smilez_get_version(void);
int smilez_get_compression_version(void);
Python API
==========
Use "python setup.py install" to build and install the Python module
"smilez".
The module defines a few constants, based on smilez_get_version() and
smilez_get_compression_version():
>>> import smilez
>>> smilez.__version__
'1.0'
>>> smilez.compression_version
1
as well as constants for the two dictionaries:
>>> smilez.BYTE_DICTIONARY
0
>>> smilez.WHITESPACE_DICTIONARY
1
Here's an example of how to compress and decompress phenol:
>>> smilez.compress("c1ccccc1O")
'I\xda\xc9\xb9'
>>> smilez.decompress('I\xda\xc9\xb9')
'c1ccccc1O'
and an example of compressing with the whitespace dictionary:
>>> smilez.compress("P", smilez.WHITESPACE_DICTIONARY)
'\xfeP'
>>> smilez.compress("P")
' '
Command-line programs
=====================
Running "make" will make three demonstration or test programs:
smilezip
--------
Convert a SMILES file into a SMILEZ file.
A SMILES file contains one record per line. Each line contains space
or tab delimited fields where the SMILES is the first field. A SMILEZ
file is the same format, after compressing the first field.
% ./smilezip nci_09425001_09450000.smi > nci.smiz
% ls -l nci_09425001_09450000.smi nci.smiz
-rw-r--r-- 1 dalke staff 1433190 Jun 8 2008 nci_09425001_09450000.smi
-rw-r--r-- 1 dalke staff 724140 Jul 2 02:52 nci.smiz
NOTE! There is no real reason to have a SMILEZ file. The gzip'ed
version of the same file is only 187483 bytes, or almost 1/4th the
size of the SMILEZ file. This is a demonstration of how one might use
the whitespace dictionary.
smilezcat
---------
Convert a SMILEZ file into a SMILES file:
% ./smilezcat nci.smiz > nci.smi
% cmp nci_09425001_09450000.smi nci.smi
smilez_test
-----------
This runs a set of self-tests on the C API.
CREDITS
=======
Smaz was written by Salvatore Sanfilippo and is released under the BSD
license. Check the COPYING file for more information.
Smilez was written by Andrew Dalke and is released under the same BSD
license.
=======================================
SMILEZ home page: https://bitbucket.org/dalke/smilez
Smilez is a simple compression library for SMILES strings. It's
closely based on the SMAZ compression library for short strings,
available from https://github.com/antirez/smaz .
Smilez is not a general purpose compression algorithm. It can compress
SMILES strings by about 50-60%, including many SMILES strings which
are only a few bytes long. In the best case it can compress up to 75%,
but if you give it non-SMILES data the result might be 50% larger.
You can use SMILEZ through the C API or the Python API.
COMPRESSION EXAMPLES
====================
The SMILEZ code dictionaries were trained on a mixture of PubChem and
ChEMBL data. Compression performance will depend on how closely the
SMILES strings match that training set. In general it does best with
"normal" SMILES, and poorly with uncommon features like reaction maps
and isotopes.
from to
size size SMILES
---- ---- ------------------------------------------
1 1 C
2 1 CC
3 2 CCC
4 1 CCCC
7 4 F/C=C/F
6 8 [NH4+]
9 4 c1ccccc1O
26 13 Cn1cnc2c1c(=O)n(c(=O)n2C)C
28 10 CN1C=NC2=C1C(=O)N(C(=O)N2C)C
42 17 COc1ccc2nc([nH]c2c1)S(=O)Cc1ncc(C)c(OC)c1C
43 46 [CH2:1]=[CH:2][CH2:1][CH2:3][C:4](C)[CH2:3]
60 30 C[N+]1(C2CC(CC1C3C2O3)OC(=O)C(C4=CC=CS4)(C5=CC=CS5)O)C.[Br-]
ChEMBL Example
==============
I have a SMILES data set derived from ChEMBL 16 with 1,292,344
records.
uncompressed: 66,896,248 bytes
compressed: 25,455,759 bytes
=> 60% smaller
I timed the process using the Python API. It took about 2.6 seconds to
compress the strings and 1.4 seconds to decompress them.
By comparison, it takes zlib about 51 seconds to compress the same
data set. The result takes up 61,406,214 bytes, which is only 8%
smaller than the original. (This is why you don't use zlib on small
strings.)
I double-checked using the 24,323 SMILES from the NCI data set. The
compressed SMILES are 58% smaller than the original SMILES (1.21 MB to
0.51 MB).
C API
=====
int smilez_compress(const char *in, int inlen, char *out, int outlen,
int dictionary)
Compress the SMILES string in 'in' of length 'inlen' and put the
compressed data into 'out' of maximum length 'outlen' bytes. If the
output buffer is too short to hold the whole compressed string,
outlen+1 is returned. Otherwise, the length of the compressed string
(less then or equal to outlen) is returned.
The 'dictionary' option specifies the encoding dictionary. If
SMILEZ_BYTE_DICTIONARY (=0) then the compressed string may use the
bytes 0-255. If SMILEZ_WHITESPACE_DICTIONARY (=1) then the compressed
strings will not use the bytes "\n", "\r", "\t", or " ". This lets you
use a SMILEZ string as a field in a tab or space separated file, or in
an SD tag, for a very small size penalty.
The dictionary only affects the encoder. The decoder handles both
dictionaries automatically. If the decoder value is out of range then
smilez_compress returns 0.
int smaz_decompress(char *in, int inlen, char *out, int outlen);
Decompress the buffer 'in' of length 'inlen' and put the decompressed data into
'out' of max length 'outlen' bytes. If the output buffer is too short to hold
the whole decompressed string, outlen+1 is returned. Otherwise the length of the
compressed string (less then or equal to outlen) is returned. This function will
not automatically put a nul-term at the end of the string if the original
compressed string didn't included a nulterm.
The following macros are nearly self-explanatory:
#define SMILEZ_VERSION "1.0"
#define SMILEZ_MAJOR_VERSION 1
#define SMILEZ_MINOR_VERSION 0
#define SMILEZ_COMPRESSION_VERSION 1
#define SMILEZ_BYTE_DICTIONARY 0
#define SMILEZ_WHITESPACE_DICTIONARY 1
#define SMILEZ_NUM_DICTIONARIES 2
The SMILEZ_COMPRESSION_VERSION will change only when the compression
format changes.
The SMILEZ_VERSION and SMILEZ_COMPRESSION_VERSION values are also
available through function calls:
const char *smilez_get_version(void);
int smilez_get_compression_version(void);
Python API
==========
Use "python setup.py install" to build and install the Python module
"smilez".
The module defines a few constants, based on smilez_get_version() and
smilez_get_compression_version():
>>> import smilez
>>> smilez.__version__
'1.0'
>>> smilez.compression_version
1
as well as constants for the two dictionaries:
>>> smilez.BYTE_DICTIONARY
0
>>> smilez.WHITESPACE_DICTIONARY
1
Here's an example of how to compress and decompress phenol:
>>> smilez.compress("c1ccccc1O")
'I\xda\xc9\xb9'
>>> smilez.decompress('I\xda\xc9\xb9')
'c1ccccc1O'
and an example of compressing with the whitespace dictionary:
>>> smilez.compress("P", smilez.WHITESPACE_DICTIONARY)
'\xfeP'
>>> smilez.compress("P")
' '
Command-line programs
=====================
Running "make" will make three demonstration or test programs:
smilezip
--------
Convert a SMILES file into a SMILEZ file.
A SMILES file contains one record per line. Each line contains space
or tab delimited fields where the SMILES is the first field. A SMILEZ
file is the same format, after compressing the first field.
% ./smilezip nci_09425001_09450000.smi > nci.smiz
% ls -l nci_09425001_09450000.smi nci.smiz
-rw-r--r-- 1 dalke staff 1433190 Jun 8 2008 nci_09425001_09450000.smi
-rw-r--r-- 1 dalke staff 724140 Jul 2 02:52 nci.smiz
NOTE! There is no real reason to have a SMILEZ file. The gzip'ed
version of the same file is only 187483 bytes, or almost 1/4th the
size of the SMILEZ file. This is a demonstration of how one might use
the whitespace dictionary.
smilezcat
---------
Convert a SMILEZ file into a SMILES file:
% ./smilezcat nci.smiz > nci.smi
% cmp nci_09425001_09450000.smi nci.smi
smilez_test
-----------
This runs a set of self-tests on the C API.
CREDITS
=======
Smaz was written by Salvatore Sanfilippo and is released under the BSD
license. Check the COPYING file for more information.
Smilez was written by Andrew Dalke and is released under the same BSD
license.