A collection of tools for daily cheminformatics tasks.
cheminftools
Installation
From GitHub repo:
pip install git+https://github.com/marcossantanaioc/cheminftools.git
From PyPI:
pip install cheminftools
How to use
Chemtools offers a collection of cheminformatics scripts for daily tasks. Currently supported tasks include:
1 - Standardization of chemical structures
2 - Calculation of molecular descriptors
3 - Filtering datasets using predefined alerts (e.g. PAINS, Dundee, Glaxo, etc.)
Standardization
A dataset of molecules can be standardized in a single line of code.
import pandas as pd
import numpy as np
from chemtools.tools.sanitizer import MolCleaner
from chemtools.tools.featurizer import MolFeaturizer
from chemtools.tools.filtering import MolFiltering
from rdkit import Chem
import json
data = pd.read_csv('../data/example_data.csv')
Sanitizing
The MolCleaner class performs sanitization tasks, including:
1. Standardize unknown stereochemistry (Handled by the RDKit Mol file parser)
i) Fix wiggly bonds on sp3 carbons - sets atoms and bonds marked as unknown stereo to no stereo
ii) Fix wiggly bonds on double bonds – set double bond to crossed bond
2. Clear S-group data from the mol file
3. Kekulize the structure
4. Remove H atoms (See the page on explicit Hs for more details)
5. Normalization:
Fix hypervalent nitro groups
Fix KO to K+ O- and NaO to Na+ O- (Also add Li+ to this)
Correct amides with N=COH
Standardise sulphoxides to charge separated form
Standardize diazonium N (atom :2 here: [*:1]-[N;X2:2]#[N;X1:3]>>[*:1]) to N+
Ensure quaternary N is charged
Ensure trivalent O ([*:1]=[O;X2;v3;+0:2]-[#6:3]) is charged
Ensure trivalent S ([O:1]=[S;D2;+0:2]-[#6:3]) is charged
Ensure halogen with no neighbors ([F,Cl,Br,I;X0;+0:1]) is charged
6. The molecule is neutralized, if possible. See the page on neutralization rules for more details.
7. Remove stereo from tartrate to simplify salt matching
8. Normalise (straighten) triple bonds and allenes
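A few of the steps above can be approximated with RDKit's own standardization module. The sketch below is not the MolCleaner implementation or the ChEMBL structure pipeline; it assumes RDKit's default `rdMolStandardize` rules are an acceptable stand-in for steps 4 to 6, and `basic_cleanup` is a hypothetical helper name used only for illustration.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def basic_cleanup(smiles):
    """Illustrative stand-in for a few cleaning steps (hypothetical helper, not MolCleaner)."""
    mol = Chem.MolFromSmiles(smiles)  # parsing also sanitizes the molecule
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    mol = Chem.RemoveHs(mol)  # step 4: remove explicit H atoms
    mol = rdMolStandardize.Normalizer().normalize(mol)  # step 5: RDKit's default normalization rules
    mol = rdMolStandardize.Uncharger().uncharge(mol)  # step 6: neutralize where possible
    return Chem.MolToSmiles(mol)


# Acetate is neutralized to acetic acid.
print(basic_cleanup("CC(=O)[O-]"))
```

RDKit's default normalization rules overlap with, but are not identical to, the rules listed above, so results may differ from MolCleaner for some structures.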
The curation steps of the ChEMBL structure pipeline are augmented with additional steps to identify duplicated entries:
9. Find stereo centers
10. Generate InChI keys
11. Find duplicated SMILES. If the same SMILES is present multiple times, two outcomes are possible.
i. The same compound (e.g. same ID and same SMILES)
ii. Isomers with different SMILES, IDs and/or activities
In case i), the compounds are merged by taking the median values of all numeric columns in the dataframe.
For case ii), the compounds are further classified as 'to merge' or 'to keep' depending on the activity values.
a) Compounds are marked for merging ('to merge') if the difference in activities is less than 1 log unit.
b) Compounds are kept as individual entries ('to keep') if the difference in activities is larger than 1 log unit. In this case, the user can select which compound to keep: the one with the highest or the lowest activity.
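The merge/keep rule described above can be sketched in plain Python. This is only an illustration of the decision logic, not the code used by MolCleaner: the function name `resolve_duplicates` and its parameters are hypothetical, the use of the median for the 'to merge' case is an assumption carried over from case i), and a difference of exactly 1 log unit (which the description leaves open) is treated here as 'to keep'.

```python
from statistics import median


def resolve_duplicates(activities, threshold=1.0, keep="highest"):
    """Classify a group of duplicated entries by the spread of their activities.

    Hypothetical helper for illustration; not part of the package API.
    """
    spread = max(activities) - min(activities)
    if spread < threshold:
        # Activities agree within the threshold: merge into a single value.
        # Assumption: the merged value is the median, as in case i).
        return "to merge", median(activities)
    # Activities disagree: keep one entry, chosen by the user's preference.
    chosen = max(activities) if keep == "highest" else min(activities)
    return "to keep", chosen


print(resolve_duplicates([6.0, 6.4, 6.2]))
print(resolve_duplicates([5.0, 7.5], keep="lowest"))
```

The first call falls under rule a) because the values span 0.4 log units; the second falls under rule b) and returns the lowest activity because `keep="lowest"` was requested.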
processed_data = MolCleaner.from_df(data, smiles_col='smiles', act_col='pIC50', id_col='molecule_chembl_id')
+-------------------------------------------------------------+-------------------------------------------------------------+
| processed_smiles | smiles |
+=============================================================+=============================================================+
| N#Cc1cnc(Nc2cccc(Br)c2)c2cc(NC(=O)c3ccco3)ccc12 | N#Cc1cnc(Nc2cccc(Br)c2)c2cc(NC(=O)c3ccco3)ccc12 |
+-------------------------------------------------------------+-------------------------------------------------------------+
| COc1cccc(-c2cn(-c3ccc(CNCCO)cc3)c3ncnc(N)c23)c1 | COc1cccc(-c2cn(-c3ccc(CNCCO)cc3)c3ncnc(N)c23)c1 |
+-------------------------------------------------------------+-------------------------------------------------------------+
| Cc1ncc([N+](=O)[O-])n1C/C(=N/NC(=O)c1ccc(O)cc1)c1ccc(Br)cc1 | Cc1ncc([N+](=O)[O-])n1C/C(=N/NC(=O)c1ccc(O)cc1)c1ccc(Br)cc1 |
+-------------------------------------------------------------+-------------------------------------------------------------+
| C1CCC(C(CC2CCCCN2)C2CCCCC2)CC1 | C1CCC(C(CC2CCCCN2)C2CCCCC2)CC1 |
+-------------------------------------------------------------+-------------------------------------------------------------+
| Cc1cc2cc(Nc3ccnc4cc(-c5ccc(CNCCN6CCNCC6)cc5)sc34)ccc2[nH]1 | Cc1cc2cc(Nc3ccnc4cc(-c5ccc(CNCCN6CCNCC6)cc5)sc34)ccc2[nH]1 |
+-------------------------------------------------------------+-------------------------------------------------------------+
Filtering
The MolFiltering class is responsible for removing compounds that match defined substructural alerts, including PAINS and rules defined by different organizations, such as GSK and University of Dundee.
with open('../data/libraries/Glaxo_alerts.json') as f:
    alerts_dict = json.load(f)['structural_alerts']
structural_alerts = alerts_dict.get('structural_alerts', None)
alerts_data = MolFiltering.from_df(processed_data, smiles_col='processed_smiles', alerts_dict=alerts_dict)
+----------------------------------------------------------------------------+-----------------------+---------------------------+------------------+------------------+
| _smiles | Alert_SMARTS | Alert_description | Alert_rule_set | Alert_num_hits |
+============================================================================+=======================+===========================+==================+==================+
| Cc1ncc([N+](=O)[O-])n1C/C(=N/NC(=O)c1ccc(O)cc1)c1ccc(Br)cc1 | [N;R0][N;R0]C(=O) | R17 acylhydrazide | Glaxo | 1 |
+----------------------------------------------------------------------------+-----------------------+---------------------------+------------------+------------------+
| O=NN(CCCl)C(=O)Nc1ccc2ncnc(Nc3cccc(Cl)c3)c2c1 | [Br,Cl,I][CX4;CH,CH2] | R1 Reactive alkyl halides | Glaxo | 1 |
+----------------------------------------------------------------------------+-----------------------+---------------------------+------------------+------------------+
| O=NN(CCCl)C(=O)Nc1ccc2ncnc(Nc3cccc(Cl)c3)c2c1 | [N;R0][N;R0]C(=O) | R17 acylhydrazide | Glaxo | 1 |
+----------------------------------------------------------------------------+-----------------------+---------------------------+------------------+------------------+
| O=NN(CCCl)C(=O)Nc1ccc2ncnc(Nc3cccc(Cl)c3)c2c1 | [N&D2](=O) | R21 Nitroso | Glaxo | 1 |
+----------------------------------------------------------------------------+-----------------------+---------------------------+------------------+------------------+
| CS(=O)(=O)O[C@H]1CN[C@H](C#Cc2cc3ncnc(Nc4ccc(OCc5cccc(F)c5)c(Cl)c4)c3s2)C1 | COS(=O)(=O)[C,c] | R5 Sulphonates | Glaxo | 1 |
+----------------------------------------------------------------------------+-----------------------+---------------------------+------------------+------------------+
Quinone
mol = Chem.MolFromSmiles(r'COC1/C=C\OC2(C)Oc3c(C)c(O)c4c(c3C2=O)C(=O)C=C(NC(=O)/C(C)=C\C=C/C(C)C(O)C(C)C(O)C(C)C(OC(C)=O)C1C)C4=O')
mol.GetSubstructMatches(Chem.MolFromSmarts('O=C1[#6]~[#6]C(=O)[#6]~[#6]1'))
mol
Cynamide
mol1 = Chem.MolFromSmiles('Cc1cccc(C[C@H](NC(=O)c2cc(C(C)(C)C)nn2C)C(=O)NCC#N)c1')
mol1.GetSubstructMatches(Chem.MolFromSmarts('N[CH2]C#N'))
mol1
R18 Quaternary C, Cl, I, P or S
mol = Chem.MolFromSmiles('CC[C@H](NC(=O)c1c([S+](C)[O-])c(-c2ccccc2)nc2ccccc12)c1ccccc1')
mol.GetSubstructMatches(Chem.MolFromSmarts('[C+,Cl+,I+,P+,S+]'))
mol
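The alert matching shown in this section can be reproduced with plain RDKit substructure searches. The sketch below is not the MolFiltering implementation; `find_alerts` and `ALERTS` are hypothetical names, and the three SMARTS patterns are copied from the Glaxo rows of the output table above.

```python
from rdkit import Chem

# Three Glaxo alerts copied from the output table above (description -> SMARTS).
ALERTS = {
    "R1 Reactive alkyl halides": "[Br,Cl,I][CX4;CH,CH2]",
    "R17 acylhydrazide": "[N;R0][N;R0]C(=O)",
    "R21 Nitroso": "[N&D2](=O)",
}


def find_alerts(smiles, alerts=ALERTS):
    """Return {alert description: number of hits} for every alert the molecule matches.

    Hypothetical helper for illustration; not part of the package API.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    hits = {}
    for description, smarts in alerts.items():
        pattern = Chem.MolFromSmarts(smarts)
        n_hits = len(mol.GetSubstructMatches(pattern))
        if n_hits:
            hits[description] = n_hits
    return hits


# The nitroso compound from the table triggers all three alerts.
hits = find_alerts("O=NN(CCCl)C(=O)Nc1ccc2ncnc(Nc3cccc(Cl)c3)c2c1")
print(sorted(hits))
```

A molecule with no matching substructure, such as benzene, yields an empty dictionary and would pass this filter.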
Featurization
The MolFeaturizer class converts SMILES into molecular descriptors. The current version supports Morgan fingerprints, atom pairs, torsion fingerprints, RDKit fingerprints, MACCS keys, and 200 constitutional descriptors.
fingerprinter = MolFeaturizer('rdkit2d')
X = fingerprinter.process_smiles_list(processed_data['processed_smiles'].values)
X[0:5,0:5]
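To make the shape of such a feature matrix concrete, the sketch below builds one directly with RDKit for MACCS keys, one of the descriptor types listed above. It does not use MolFeaturizer and makes no claim about how the package computes its features; `maccs_matrix` is a hypothetical helper, and the two SMILES are taken from the standardization table.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import MACCSkeys


def maccs_matrix(smiles_list):
    """Convert SMILES strings into a (n_molecules, 167) matrix of MACCS key bits.

    Hypothetical helper for illustration; not part of the package API.
    """
    rows = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            raise ValueError(f"Could not parse SMILES: {smi}")
        fp = MACCSkeys.GenMACCSKeys(mol)  # RDKit returns a 167-bit vector
        rows.append([int(bit) for bit in fp.ToBitString()])
    return np.array(rows, dtype=np.uint8)


X_maccs = maccs_matrix([
    "N#Cc1cnc(Nc2cccc(Br)c2)c2cc(NC(=O)c3ccco3)ccc12",
    "C1CCC(C(CC2CCCCN2)C2CCCCC2)CC1",
])
print(X_maccs.shape)
```

Each row corresponds to one molecule and each column to one key, so the matrix can be sliced in the same way as `X` above.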