Tools and helpers for RDKit.
RDKIT-TOOLS
Tools for use with RDKit. Motivated by and intended for use with CFDE and CFChemDb, developed by the IDG-CFDE team.
See also:
- CFChemDb (repository)
- CFChemDb_UI (repository)
- rdktools (Pypi package)
- CFDE: Common Fund Data Ecosystem
RDKit:
Dependencies
- RDKit Python package (via conda recommended).
$ conda create -n rdktools -c conda-forge rdkit ipykernel
$ conda activate rdktools
(rdktools) $ conda install -c conda-forge pyvis
(rdktools) $ conda install -c conda-forge networkx=2.5
See also: conda/environment.yml
Contents
- Formats - chemical file format conversion
- Depictions - 2D molecular depictions
- Standardization - molecular standardization
- Fingerprints - molecular path and pattern based binary feature vectors, similarity, and clustering tools
- Conformations - distance geometry based 3D conformation generation
- Properties - molecular property calculation: Lipinski, Wildman-Crippen LogP, Kier-Hall electrotopological descriptors, solvent accessible surface area (SASA), and more.
- Scaffolds - Bemis-Murcko and BRICS scaffold analysis, rdScaffoldNetworks.
- SMARTS - molecular pattern matching (subgraph isomorphism)
- Reactions - Reaction SMILES, SMARTS, and SMIRKS based reaction analytics
- util.sklearn - Scikit-learn utilities for processing molecular fingerprints and other feature vectors.
Formats
(rdktools) $ python3 -m rdktools.formats.App -h
usage: App.py [-h] [--i IFILE] [--o OFILE] [--kekulize] [--sanitize] [--header]
[--delim DELIM] [--smilesColumn SMILESCOLUMN] [--nameColumn NAMECOLUMN]
[-v]
{mdl2smi,mdl2tsv,smi2mdl,smiclean,mdlclean,mol2inchi,mol2inchikey,demo}
RDKit chemical format utility
positional arguments:
{mdl2smi,mdl2tsv,smi2mdl,smiclean,mdlclean,mol2inchi,mol2inchikey,demo}
operation
optional arguments:
-h, --help show this help message and exit
--i IFILE input file (SMILES/TSV or SDF)
--o OFILE output file (specify '-' for stdout)
--kekulize Kekulize
--sanitize Sanitize
--header input SMILES/TSV file has header line
--delim DELIM delimiter for SMILES/TSV
--smilesColumn SMILESCOLUMN
input SMILES column
--nameColumn NAMECOLUMN
input name column
-v, --verbose
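The --header, --delim, --smilesColumn and --nameColumn options describe how a delimited SMILES file is laid out. As a rough illustration of that layout only, the sketch below reads such a file with the Python standard library. The `read_smiles_table` helper is hypothetical, integer column indices are an assumption, and this is not the parser rdktools uses.

```python
import csv
import io

def read_smiles_table(handle, delim="\t", smiles_column=0, name_column=1, header=False):
    """Yield (smiles, name) pairs from a delimited SMILES file (hypothetical helper)."""
    reader = csv.reader(handle, delimiter=delim)
    if header:
        next(reader, None)  # skip the header line
    for row in reader:
        if not row:
            continue  # ignore blank lines
        name = row[name_column] if len(row) > name_column else ""
        yield row[smiles_column], name

# Illustrative in-memory file with a header line and two records.
sample = io.StringIO("smiles\tname\nCCO\tethanol\nc1ccccc1\tbenzene\n")
records = list(read_smiles_table(sample, header=True))
```

Here `records` holds one (SMILES, name) tuple per data line, with the header skipped.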
Depictions
(rdktools) $ python3 -m rdktools.depict.App -h
usage: App.py [-h] [--i IFILE] [--ifmt {AUTO,SMI,MDL}] [--ofmt {PNG,JPEG,PDF}]
[--smilesColumn SMILESCOLUMN] [--nameColumn NAMECOLUMN] [--header]
[--delim DELIM] [--height HEIGHT] [--width WIDTH] [--kekulize]
[--wedgebonds] [--pdf_title PDF_TITLE] [--batch_dir BATCH_DIR]
[--batch_prefix BATCH_PREFIX] [--o OFILE] [-v]
{single,batch,pdf,demo,demo2}
RDKit molecule depiction utility
positional arguments:
{single,batch,pdf,demo,demo2}
OPERATION
optional arguments:
-h, --help show this help message and exit
--i IFILE input molecule file
--ifmt {AUTO,SMI,MDL}
input file format
--ofmt {PNG,JPEG,PDF}
output file format
--smilesColumn SMILESCOLUMN
--nameColumn NAMECOLUMN
--header SMILES/TSV file has header
--delim DELIM SMILES/TSV field delimiter
--height HEIGHT height of image
--width WIDTH width of image
--kekulize display Kekule form
--wedgebonds stereo wedge bonds
--pdf_title PDF_TITLE
PDF doc title
--batch_dir BATCH_DIR
destination for batch files
--batch_prefix BATCH_PREFIX
prefix for batch files
--o OFILE output file
-v, --verbose
Modes: single = one image; batch = multiple images; pdf = multi-page
python3 -m rdktools.depict.App single --height 500 --width 600 --i valium.smiles --o valium.png
Scaffolds
(rdktools) $ python3 -m rdktools.scaffold.App -h
usage: App.py [-h] [--i IFILE] [--o OFILE] [--o_html OFILE_HTML]
[--scratchdir SCRATCHDIR] [--smicol SMICOL] [--namcol NAMCOL]
[--idelim IDELIM] [--odelim ODELIM] [--iheader] [--oheader]
[--brics] [-v]
{bmscaf,scafnet,demobm,demonet,demonetvis}
RDKit scaffold analysis
positional arguments:
{bmscaf,scafnet,demobm,demonet,demonetvis}
OPERATION
optional arguments:
-h, --help show this help message and exit
--i IFILE input file, TSV or SDF
--o OFILE output file, TSV|SDF
--o_html OFILE_HTML output file, HTML
--scratchdir SCRATCHDIR
--smicol SMICOL SMILES column from TSV (counting from 0)
--namcol NAMCOL name column from TSV (counting from 0)
--idelim IDELIM delim for input TSV
--odelim ODELIM delim for output TSV
--iheader input TSV has header
--oheader output TSV has header
--brics BRICS fragmentation rules (Degen, 2008)
-v, --verbose
python3 -m rdktools.scaffold.App bmscaf --i drugs.smiles --o drugs_bmscaf.tsv
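The bmscaf operation computes Bemis-Murcko scaffolds. As a rough illustration of the underlying idea only (not the code that rdktools or RDKit runs), the sketch below prunes terminal atoms from a plain adjacency map until only rings and the linkers between them remain. The `murcko_framework` name and the toy graph are hypothetical, and details such as exocyclic double bonds, which real implementations may handle differently, are ignored.

```python
def murcko_framework(adjacency):
    """Return the atoms left after repeatedly removing terminal (degree <= 1) atoms."""
    # Copy so the caller's graph is not modified.
    graph = {atom: set(nbrs) for atom, nbrs in adjacency.items()}
    terminal = [atom for atom, nbrs in graph.items() if len(nbrs) <= 1]
    while terminal:
        atom = terminal.pop()
        if atom not in graph:
            continue  # already removed
        for nbr in graph.pop(atom):
            graph[nbr].discard(atom)
            if len(graph[nbr]) <= 1:
                terminal.append(nbr)  # neighbor became terminal
    return set(graph)

# Toluene-like graph: six-membered ring (atoms 0-5) with one substituent (atom 6).
toluene = {0: {1, 5, 6}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4, 0}, 6: {0}}
# Propane-like chain: no rings, so nothing survives pruning.
propane = {0: {1}, 1: {0, 2}, 2: {1}}
```

For the toluene-like graph only the six ring atoms remain; the acyclic chain reduces to an empty set.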
Standardization
(rdktools) $ python3 -m rdktools.standard.App -h
usage: App.py [-h] [--i IFILE] [--o OFILE] [--delim DELIM] [--smilesColumn SMILESCOLUMN]
[--nameColumn NAMECOLUMN] [--nameSDField NAMESDFIELD] [--header]
[--sanitize] [--kekuleSmiles] [--normset {DEFAULT,UNM}]
[--i_normset IFILE_NORMSET] [--isomericSmiles] [--metalRemove]
[--largestFragment] [--neutralize] [-v]
{standardize,canonicalize,saltremove,list_norms,show_params,demo}
RDKit chemical standardizer
positional arguments:
{standardize,canonicalize,saltremove,list_norms,show_params,demo}
OPERATION
options:
-h, --help show this help message and exit
--i IFILE input file, SMI or SDF
--o OFILE output file, SMI or SDF
--delim DELIM SMILES/TSV delimiter
--smilesColumn SMILESCOLUMN
--nameColumn NAMECOLUMN
--nameSDField NAMESDFIELD
SD field to use as name
--header SMILES/TSV has header line
--sanitize Sanitize molecules as read.
--kekuleSmiles Kekule SMILES output.
--normset {DEFAULT,UNM}
normalization sets
--i_normset IFILE_NORMSET
input normalizations file, format: SMIRKS<space>NAME
--isomericSmiles If false, output SMILES isomerism removed
--metalRemove Remove disconnected metals like salts charges (use with
saltremove).
--largestFragment Remove non-largest fragments (use with saltremove).
--neutralize Neutralize charges (use with saltremove).
-v, --verbose
For documentation on RDKit Molecular Sanitization, see The RDKit Book. Briefly:
The idea is to generate useful computed properties (like hybridization, ring membership, etc.) for the rest of the code and to ensure that the molecules are "reasonable": that they can be represented with octet-complete Lewis dot structures.
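One part of checking that a molecule is "reasonable" is a valence check. The sketch below is a toy version of that idea in plain Python, working on heavy atoms and bond orders only. The `MAX_VALENCE` table and the `overvalent_atoms` helper are hypothetical and far simpler than what RDKit sanitization does; they are here only to make the concept concrete.

```python
# Typical maximum valences for a few neutral atoms (illustrative, not RDKit's table).
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2}

def overvalent_atoms(elements, bonds):
    """Return indices of atoms whose summed bond order exceeds MAX_VALENCE."""
    totals = [0] * len(elements)
    for begin, end, order in bonds:
        totals[begin] += order
        totals[end] += order
    return [i for i, (element, total) in enumerate(zip(elements, totals))
            if total > MAX_VALENCE[element]]

# Ethanol heavy atoms C-C-O: every atom is within its allowed valence.
ethanol = (["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
# A carbon with five single bonds: not representable as a sensible Lewis structure.
pentavalent = (["C"] * 6, [(0, i, 1) for i in range(1, 6)])
```

The ethanol-like input yields no flagged atoms, while the central carbon of the second input (index 0) is reported.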
Conformations
(rdktools) $ python3 -m rdktools.conform.App -h
usage: App.py [-h] [--i IFILE] [--o OFILE] [--ff {UFF,MMFF}] [--optiters OPTITERS]
[--nconf NCONF] [--etol ETOL] [--title_in_header] [-v]
RDKit Conformer Generation
optional arguments:
-h, --help show this help message and exit
--i IFILE input file, SMI or SDF
--o OFILE output SDF with 3D
--ff {UFF,MMFF} force-field
--optiters OPTITERS optimizer iterations per conf
--nconf NCONF # confs per mol
--etol ETOL energy tolerance
--title_in_header title line in header
-v, --verbose
Based on distance geometry method by Blaney et al.
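Distance geometry methods start from upper and lower bounds on interatomic distances and tighten them before sampling coordinates. The sketch below shows only the upper-bound half of triangle-inequality smoothing, on a hypothetical three-atom bounds matrix with made-up distances. It is a teaching aid under those assumptions, not the routine that RDKit or rdktools executes.

```python
def smooth_upper_bounds(upper):
    """Tighten upper distance bounds so that u[i][j] <= u[i][k] + u[k][j]."""
    n = len(upper)
    u = [row[:] for row in upper]  # work on a copy
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if u[i][k] + u[k][j] < u[i][j]:
                    u[i][j] = u[i][k] + u[k][j]
    return u

INF = float("inf")
# Atoms 0-1 and 1-2 are at most 1.5 apart; the 0-2 upper bound is unknown.
bounds = [[0.0, 1.5, INF],
          [1.5, 0.0, 1.5],
          [INF, 1.5, 0.0]]
smoothed = smooth_upper_bounds(bounds)
```

After smoothing, the unknown 0-2 upper bound becomes 3.0, the sum of the two known bounds through atom 1.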
Fingerprints
By default, RDKit and Morgan fingerprints are generated with a length of 2048 bits, using the following methods:
RDKit path-based, Daylight-like:
Chem.RDKFingerprint(mol, minPath=1, maxPath=7, fpSize=2048, nBitsPerHash=2, useHs=False, minSize=2048)
Morgan ECFP-like:
AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
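The minSize and fpSize arguments in the RDKFingerprint call above relate to folding, which the help text below describes as reducing the fingerprint size until a target bit density is reached. The sketch that follows illustrates folding in plain Python on a set of on-bit indices. `fold_once` and `fold_to_density` are hypothetical names, sizes are assumed to be powers of two, and this is not RDKit's internal code.

```python
def fold_once(on_bits, size):
    """Fold a fingerprint in half by OR-ing the upper half onto the lower half."""
    half = size // 2
    return {bit % half for bit in on_bits}, half

def fold_to_density(on_bits, size, target_density, min_size):
    """Fold until the on-bit density reaches the target or min_size is hit."""
    bits = set(on_bits)
    while len(bits) / size < target_density and size // 2 >= min_size:
        bits, size = fold_once(bits, size)
    return bits, size

# A sparse 64-bit toy fingerprint with four bits set.
folded_bits, folded_size = fold_to_density({1, 9, 20, 33}, 64, 0.2, 8)
# With min_size equal to the starting size, no folding can happen.
unfolded_bits, unfolded_size = fold_to_density({1, 9, 20, 33}, 64, 0.2, 64)
```

In the first call the toy fingerprint is folded down to 8 bits, where colliding bits merge and the density target is met. In the second call it is returned unchanged.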
(rdktools) $ python3 -m rdktools.fp.App -h
usage: App.py [-h] [--i IFILE] [--iheader] [--o OFILE] [--output_as_dataframe]
[--output_as_tsv] [--useHs] [--useValence] [--dbName DBNAME]
[--tableName TABLENAME] [--minSize MINSIZE] [--maxSize MAXSIZE]
[--density DENSITY] [--outTable OUTTABLE] [--outDbName OUTDBNAME]
[--fpColName FPCOLNAME] [--minPath MINPATH] [--maxPath MAXPATH]
[--nBitsPerHash NBITSPERHASH] [--discrim] [--smilesColumn SMILESCOLUMN]
[--molPkl MOLPKL] [--input_format {SMILES,SD}] [--idColumn IDCOLUMN]
[--maxMols MAXMOLS] [--fpAlgo {RDKIT,MACCS,MORGAN}]
[--morgan_nbits MORGAN_NBITS] [--morgan_radius MORGAN_RADIUS]
[--replaceTable] [--smilesTable SMILESTABLE] [--topN TOPN]
[--thresh THRESH] [--querySmiles QUERYSMILES]
[--metric {ALLBIT,ASYMMETRIC,DICE,COSINE,KULCZYNSKI,MCCONNAUGHEY,ONBIT,RUSSEL,SOKAL,TANIMOTO,TVERSKY}]
[--tversky_alpha TVERSKY_ALPHA] [--tversky_beta TVERSKY_BETA]
[--clusterAlgo {WARD,SLINK,CLINK,UPGMA,BUTINA}] [--actTable ACTTABLE]
[--actName ACTNAME] [--reportFreq REPORTFREQ] [--showVis] [-v]
{FingerprintMols,MolSimilarity,ClusterMols}
RDKit fingerprint-based analytics
positional arguments:
{FingerprintMols,MolSimilarity,ClusterMols}
OPERATION
optional arguments:
-h, --help show this help message and exit
--i IFILE input file; if provided and no tableName is specified, data will
be read from the input file. Text files delimited with either
commas (extension .csv) or tabs (extension .txt) are supported.
--iheader input file has header line
--o OFILE output file (pickle file with one label,fingerprint entry for
each molecule).
--output_as_dataframe
Output FPs as Pandas dataframe (pickled) with names as index,
columns as feature names, if available.
--output_as_tsv Output FPs as TSV with names as index, columns as feature names,
if available.
--useHs include Hs in the fingerprint Default is *false*.
--useValence include valence information in the fingerprints Default is
*false*.
--dbName DBNAME name of the database from which to pull input molecule
information. If output is going to a database, this will also be
used for that unless the --outDbName option is used.
--tableName TABLENAME
name of the database table from which to pull input molecule
information
--minSize MINSIZE minimum size of the fingerprints to be generated (limits the
amount of folding that happens) [64].
--maxSize MAXSIZE base size of the fingerprints to be generated [2048].
--density DENSITY target bit density in the fingerprint. The fingerprint will be
folded until this density is reached [0.3].
--outTable OUTTABLE name of the output db table used to store fingerprints. If this
table already exists, it will be replaced.
--outDbName OUTDBNAME
name of output database, if it's being used. Defaults to be the
same as the input db.
--fpColName FPCOLNAME
name to use for the column which stores fingerprints (in pickled
format) in the output db table [AutoFragmentFP].
--minPath MINPATH minimum path length to be included in fragment-based
fingerprints [1].
--maxPath MAXPATH maximum path length to be included in fragment-based
fingerprints [7].
--nBitsPerHash NBITSPERHASH
number of bits to be set in the output fingerprint for each
fragment [2].
--discrim use of path-based discriminators to hash bits.
--smilesColumn SMILESCOLUMN
name of the SMILES column in the input database [#SMILES].
--molPkl MOLPKL
--input_format {SMILES,SD}
SMILES table or SDF file [{DEFAULTS['input_format']}].
--idColumn IDCOLUMN, --nameColumn IDCOLUMN
name of the id column in the input database. Defaults to the
first column for dbs [Name].
--maxMols MAXMOLS maximum number of molecules to be fingerprinted.
--fpAlgo {RDKIT,MACCS,MORGAN}
RDKIT = Daylight path-based; MACCS = MDL MACCS 166 keys [RDKIT]
--morgan_nbits MORGAN_NBITS
[1024]
--morgan_radius MORGAN_RADIUS
[2]
--replaceTable
--smilesTable SMILESTABLE
name of database table which contains SMILES for the input
fingerprints. If provided with --smilesName, output will contain
SMILES data.
--topN TOPN top N similar; precedence over threshold [12].
--thresh THRESH similarity threshold.
--querySmiles QUERYSMILES
query smiles for similarity screening.
--metric {ALLBIT,ASYMMETRIC,DICE,COSINE,KULCZYNSKI,MCCONNAUGHEY,ONBIT,RUSSEL,SOKAL,TANIMOTO,TVERSKY}
similarity algorithm [TANIMOTO]
--tversky_alpha TVERSKY_ALPHA
Tversky alpha parameter, weights query molecule features [0.8]
--tversky_beta TVERSKY_BETA
Tversky beta parameter, weights target molecule features [0.2]
--clusterAlgo {WARD,SLINK,CLINK,UPGMA,BUTINA}
clustering algorithm: WARD = Ward's minimum variance; SLINK =
single-linkage clustering algorithm; CLINK = complete-linkage
clustering algorithm; UPGMA = group-average clustering
algorithm; BUTINA = Butina JCICS 39 747-750 (1999) [WARD]
--actTable ACTTABLE name of table containing activity values (used to color points
in the cluster tree).
--actName ACTNAME name of column with activities in the activity table. The values
in this column should either be integers or convertible into
integers.
--reportFreq REPORTFREQ
[100]
--showVis show visualization if available.
-v, --verbose
This app employs custom, updated versions of RDKit FingerprintMols.py, MolSimilarity.py,
ClusterMols.py, with enhanced command-line functionality for molecular fingerprint-based
analytics.
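To make the TANIMOTO and TVERSKY metrics and the --tversky_alpha and --tversky_beta weights concrete, here is a minimal sketch on fingerprints represented as Python sets of on-bit indices. The function names and the toy bit sets are illustrative assumptions; the app itself works on RDKit bit vectors.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two sets of on-bit indices."""
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0

def tversky(query, target, alpha, beta):
    """Tversky index; alpha weights query-only bits, beta weights target-only bits."""
    common = len(query & target)
    denom = common + alpha * len(query - target) + beta * len(target - query)
    return common / denom if denom else 0.0

query = {1, 2, 3, 4}
target = {3, 4, 5}
sim_tanimoto = tanimoto(query, target)           # 2 shared bits out of 5 distinct
sim_tversky = tversky(query, target, 0.8, 0.2)   # asymmetric weighting
```

With alpha = beta = 1 the Tversky index reduces to the Tanimoto coefficient, which is a quick sanity check on any implementation.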
Examples:
(rdktools) $ python3 -m rdktools.fp.App FingerprintMols --i drugcentral.smiles --smilesColumn "smiles" --idColumn "name" --fpAlgo MORGAN --morgan_nbits 2048 --output_as_tsv --o drugcentral_morganfp.tsv
(rdktools) $ python3 -m rdktools.fp.App MolSimilarity --i drugcentral.smiles --smilesColumn "smiles" --idColumn "name" --querySmiles "NCCc1ccc(O)c(O)c1 dopamine" --fpAlgo MORGAN --morgan_nbits 512 --metric TVERSKY --tversky_alpha 0.8 --tversky_beta 0.2
(rdktools) $ python3 -m rdktools.fp.App ClusterMols --i drugcentral.smiles --smilesColumn "smiles" --idColumn "name" --fpAlgo MORGAN --morgan_nbits 512 --clusterAlgo BUTINA --metric TANIMOTO
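The BUTINA option used in the last example refers to a greedy, neighbor-count-based clustering scheme. The sketch below is a simplified plain-Python version of that idea, assuming a caller-supplied distance function (for fingerprints this would typically be 1 minus a similarity). The `butina_clusters` name and the integer toy data are hypothetical, and the real RDKit routine differs in details.

```python
def butina_clusters(distance, n_items, cutoff):
    """Greedy Butina-style clustering.

    distance(i, j) returns a dissimilarity; items within `cutoff` of a
    centroid join its cluster. The first member of each cluster is the centroid.
    """
    neighbors = {i: [j for j in range(n_items)
                     if j != i and distance(i, j) <= cutoff]
                 for i in range(n_items)}
    # Visit candidates with the most neighbors first (ties broken by index).
    order = sorted(range(n_items), key=lambda i: (-len(neighbors[i]), i))
    assigned = set()
    clusters = []
    for centroid in order:
        if centroid in assigned:
            continue
        members = [centroid] + [j for j in neighbors[centroid] if j not in assigned]
        assigned.update(members)
        clusters.append(members)
    return clusters

# Toy data: five items on a line, clustered with a distance cutoff of 1.
points = [0, 1, 2, 9, 10]
clusters = butina_clusters(lambda i, j: abs(points[i] - points[j]), len(points), 1)
```

Item 1 has the most neighbors and becomes the first centroid, so the toy data split into one cluster of three items and one of two.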
SMARTS
(rdktools) $ python3 -m rdktools.smarts.App -h
usage: App.py [-h]
{matchCounts,matchFilter,matchCountsMulti,matchFilterMulti,filterPAINS,demo}
...
RDKit SMARTS utility
positional arguments:
{matchCounts,matchFilter,matchCountsMulti,matchFilterMulti,filterPAINS,demo}
operation
matchCounts Count matches of a single SMARTS in each molecule
matchFilter Filter molecules that match a single SMARTS
matchCountsMulti Count matches of multiple SMARTS (from file) in each molecule
matchFilterMulti Filter molecules that match multiple SMARTS (from file)
filterPAINS Filter molecules that match PAINS
demo Demo of matchCounts
options:
-h, --help show this help message and exit
Additional information for a specific operation can be found by using the -h
flag after providing the operation. For example:
(rdktools) $ python3 -m rdktools.smarts.App matchCountsMulti -h
usage: App.py matchCountsMulti [-h] [--log_fname LOG_FNAME] [-v] --smartsfile SMARTSFILE
[--strict] [--usa] --i IFILE [--o OFILE] [--delim DELIM]
[--smiles_column SMILES_COLUMN] [--name_column NAME_COLUMN]
[--iheader] [--exclude_mol_props]
options:
-h, --help show this help message and exit
--log_fname LOG_FNAME
File to save logs to. If not given will log to stdout. (default:
None)
-v, --verbose verbosity of logging (default: 0)
--smartsfile SMARTSFILE
input SMARTS file (for multi-ops)
--strict raise error if any SMARTS cannot be parsed. If not set, will ignore
invalid SMARTS. (default: False)
--usa unique set-of-atoms match counts (default: False)
--i IFILE input file, SMI or SDF
--o OFILE output file, TSV. Will use stdout if not specified. (default: None)
--delim DELIM delimiter for SMILES/TSV (default: )
--smiles_column SMILES_COLUMN
(integer) column where SMILES are located (for SMI file) (default: 0)
--name_column NAME_COLUMN
(integer) column where molecule names are located (for SMI file)
(default: 1)
--iheader input SMILES/TSV has header line (default: False)
--exclude_mol_props exclude molecular properties present in input SMILES/SDF in output
(i.e., only include SMILES & Name properties) (default: False)
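The --usa flag counts unique sets of atoms rather than every match. A minimal sketch of that distinction follows, assuming matches arrive as tuples of atom indices. The `count_matches` helper and the example matches are hypothetical, and whether rdktools implements the flag this way is an assumption.

```python
def count_matches(matches, unique_atom_sets=False):
    """Count substructure matches given as tuples of atom indices."""
    if not unique_atom_sets:
        return len(matches)
    # Matches covering the same atoms in a different order collapse to one.
    return len({frozenset(match) for match in matches})

# Hypothetical matches of a symmetric two-atom pattern on a three-atom chain:
# each bond is matched once in each direction.
matches = [(0, 1), (1, 0), (1, 2), (2, 1)]
all_count = count_matches(matches)
usa_count = count_matches(matches, unique_atom_sets=True)
```

Here the four raw matches cover only two distinct atom sets, so the unique count is half the raw count.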
Reactions
$ python3 -m rdktools.reactions.App -h
usage: App.py [-h] [--i IFILES] [--o OFILE] [--output_mode {products,reactions}]
[--o_depict OFILE_DEPICT] [--smirks SMIRKS] [--kekulize] [--sanitize]
[--depict] [--header] [--delim DELIM] [--smilesColumn SMILESCOLUMN]
[--nameColumn NAMECOLUMN] [-v]
{enumerateLibrary,react,demo,demo2,demo3,demo4}
RDKit chemical reactions utility
positional arguments:
{enumerateLibrary,react,demo,demo2,demo3,demo4}
OPERATION
optional arguments:
-h, --help show this help message and exit
--i IFILES input file[s] (SMILES/TSV or SDF)
--o OFILE output file (SMILES) [stdout]
--output_mode {products,reactions}
products|reactions [products]
--o_depict OFILE_DEPICT
output depiction file (PNG) [display]
--smirks SMIRKS SMIRKS reaction transform
--kekulize Kekulize
--sanitize Sanitize
--depict Depict (1st reaction or product only)
--header input SMILES/TSV file has header line
--delim DELIM delimiter for SMILES/TSV
--smilesColumn SMILESCOLUMN
input SMILES column
--nameColumn NAMECOLUMN
input name column
-v, --verbose
For the 'react' operation, reactants are specified as disconnected components of a single
input molecule record. For 'enumerateLibrary', reactants for each role are specified from
separate input files, ordered as in the SMIRKS.
python3 -m rdktools.reactions.App react --smirks '[O:2]=[C:1][OH].[N:3]>>[O:2][C:1][N:3]' --i reactants.smiles --nameColumn 0 --depict --o_depict reaction.png
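The relationship between reactant roles in the SMIRKS and disconnected components in an input record can be sketched with plain string handling. The helpers and the example record below are hypothetical, and simple splitting ignores SMARTS features such as component-level grouping, so a toolkit parser is the better choice for real work.

```python
def split_reaction(smirks):
    """Split a 'reactants>>products' string into two lists of templates."""
    reactants, products = smirks.split(">>")
    return reactants.split("."), products.split(".")

def split_record(smiles):
    """Split one input record into its disconnected components."""
    return smiles.split(".")

smirks = "[O:2]=[C:1][OH].[N:3]>>[O:2][C:1][N:3]"
reactant_templates, product_templates = split_reaction(smirks)

# Illustrative 'react' input: an acid and an amine in a single record.
record_components = split_record("CC(=O)O.NCC")
```

The SMIRKS above defines two reactant roles, and the example record supplies two components in the same order, one per role.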
Properties
(rdktools) $ python3 -m rdktools.properties.App -h
usage: App.py [-h] --i IFILE [--o OFILE] [--iheader] [--oheader] [--kekulize]
[--sanitize] [--delim DELIM] [--smilesColumn SMILESCOLUMN]
[--nameColumn NAMECOLUMN] [-v]
{descriptors,descriptors3d,lipinski,logp,estate,freesasa,demo}
RDKit molecular properties utility
positional arguments:
{descriptors,descriptors3d,lipinski,logp,estate,freesasa,demo}
OPERATION
optional arguments:
-h, --help show this help message and exit
--i IFILE input molecule file
--o OFILE output file with data (TSV)
--iheader input file has header line
--oheader include TSV header line with smiles output
--kekulize Kekulize
--sanitize Sanitize
--delim DELIM SMILES/TSV delimiter
--smilesColumn SMILESCOLUMN
input SMILES column
--nameColumn NAMECOLUMN
input name column
-v, --verbose
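As an illustration of the kind of property logic behind the lipinski operation, the sketch below counts rule-of-five violations from property values that are assumed to be already computed. The `lipinski_violations` helper and the example numbers are hypothetical; what the rdktools operation actually outputs is not specified here.

```python
def lipinski_violations(mol_weight, logp, h_donors, h_acceptors):
    """Count rule-of-five violations from precomputed property values."""
    checks = [
        mol_weight > 500,   # molecular weight above 500 Da
        logp > 5,           # LogP above 5
        h_donors > 5,       # more than 5 hydrogen-bond donors
        h_acceptors > 10,   # more than 10 hydrogen-bond acceptors
    ]
    return sum(checks)

# Made-up property values for two hypothetical molecules.
small_molecule = lipinski_violations(350.0, 2.1, 2, 5)
large_molecule = lipinski_violations(720.5, 6.3, 6, 12)
```

The first set of values passes all four criteria and the second fails all four.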
util.sklearn
Scikit-learn utilities for processing molecular fingerprints and other feature vectors.
(rdktools) $ python3 -m rdktools.util.sklearn.ClusterFingerprints -h
usage: ClusterFingerprints.py [-h] [--i IFILE] [--o OFILE] [--o_vis OFILE_VIS]
[--scratchdir SCRATCHDIR] [--idelim IDELIM]
[--odelim ODELIM]
[--affinity {euclidean,l1,l2,manhattan,cosine,precomputed}]
[--linkage {ward,complete,average,single}]
[--truncate_level TRUNCATE_LEVEL] [--iheader] [--oheader]
[--dendrogram_orientation {left,top,right,bottom}]
[--display] [-v]
{cluster,demo}
Hierarchical, agglomerative clustering by Scikit-learn
positional arguments:
{cluster,demo} OPERATION
optional arguments:
-h, --help show this help message and exit
--i IFILE input file, TSV
--o OFILE output file, TSV
--o_vis OFILE_VIS output file, PNG or HTML
--scratchdir SCRATCHDIR
--idelim IDELIM delim for input TSV
--odelim ODELIM delim for output TSV
--affinity {euclidean,l1,l2,manhattan,cosine,precomputed}
--linkage {ward,complete,average,single}
--truncate_level TRUNCATE_LEVEL
Level from root of hierarchy for clusters and dendrogram.
--iheader input TSV has header
--oheader output TSV has header
--dendrogram_orientation {left,top,right,bottom}
--display display dendrogram
-v, --verbose
(rdktools) $ python3 -m rdktools.util.sklearn.ClusterFingerprints cluster --i drugcentral_morganfp.tsv --truncate_level 5 --o_vis drugcentral_morganfp_ward-clusters_dendrogram.png