Learning a CoNCISE language for small-molecule binding
Rapid advances in deep learning have improved in silico methods for drug-target interaction (DTI) prediction. However, current methods do not scale to the massive catalogs that list millions or billions of commercially available small molecules. Here, we introduce CoNCISE, a method that accelerates DTI prediction by 2-3 orders of magnitude while maintaining high accuracy. CoNCISE uses a novel vector-quantized codebook approach and residual-learning-based training of hierarchical codes. Our DTI architecture, which combines these compact ligand representations with fixed-length protein embeddings in a cross-attention framework, achieves state-of-the-art prediction accuracy at unprecedented speed.
Getting Started
Currently CoNCISE is in closed access and does not have a pip wheel. We suggest installing it in a clean Python environment with Python >= 3.12.
Installation
pip install concise
Install from source
pip install .
Usage
CoNCISE achieves extremely fast drug-target interaction (DTI) prediction by discretizing the vast space of small molecules into $32^3 = 32768$ discrete, hierarchically organized codes, requiring only the SMILES molecular representation as input. The CoNCISE DTI pipeline consists of the following three steps:
- Use the CoNCISE ligand module to convert each ligand's SMILES representation into discrete codes.
- Take any protein and find its binding affinity against all $32^3$ possible codes (figure 2 above).
- Combine steps 1 and 2 to find protein-drug pairs with high binding affinity.
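To get a feel for the size of the discretized search space, here is a short sketch. It assumes each hierarchical code is a triple with 32 options per level, which matches the $32^3$ count above but is only an illustration, not the package's internal data layout:

```python
from itertools import product

# CoNCISE discretizes ligand space into 32^3 hierarchically organized codes.
# Assume a code is a triple (coarse, mid, fine), 32 options per level.
LEVELS, CODES_PER_LEVEL = 3, 32
code_space = list(product(range(CODES_PER_LEVEL), repeat=LEVELS))
print(len(code_space))  # 32768 codes a protein is scored against
```

Because every protein is scored against this fixed, small set of codes rather than against each catalog molecule individually, the per-protein scoring cost is constant in the catalog size.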
All three steps can be performed through the single easy_query option, which is invoked with the following command:
concise easy_query --config-path configs/easy_query.yaml
The arguments to the easy_query option are stored in a YAML file, whose contents are described below:
```yaml
# 1. fasta_file: the path to the FASTA file
fasta_file: data/SwissProtMini/seqs.fasta
# 2. rec_embed_file: the path to create/load the receptor embeddings
rec_embed_file: data/SwissProtMini/receptors.h5
# 3. ligand_file: the path to the ligand file
ligand_file: data/DrugsMini/smiles.csv
# 4. lig_embed_file: the path to create/load ligand embeddings
lig_embed_file: data/DrugsMini/ligands.h5
# 5. save location
save_path: data/QueryDrugsMini/scores.csv
# 6. search parameters
num_codes_per_protein: 10
num_smiles_per_code: 20
# 7. device: the device to run the model on
device:
  _target_: torch.device
  device: 0
## HYDRA configurations. IGNORE
protein_dataset:
  rec_embed_file: ${..rec_embed_file}
  fasta_file: ${..fasta_file}
  protein_dataset:
    _target_: concise.dataset.ScoreFastaDataset
    fasta_file: ${..fasta_file}
    rec_embed_file: ${..rec_embed_file}
    max_entries: -1
  protein_dataloader:
    _target_: torch.utils.data.DataLoader
    dataset: ${..protein_dataset}
    batch_size: 1
    num_workers: 2
    shuffle: false
ligand_dataset:
  lig_embed_file: ${..lig_embed_file}
  ligand_dataset:
    _target_: concise.dataset.LigandDataset
    lig_file: ${ligand_file}
    lig_embed_file: ${..lig_embed_file}
  ligand_dataloader:
    _target_: torch.utils.data.DataLoader
    dataset: ${..ligand_dataset}
    batch_size: 32
    num_workers: 4
    shuffle: false
```
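The two search parameters bound the amount of work done per protein: at most `num_codes_per_protein` codes are kept, and at most `num_smiles_per_code` SMILES are drawn from each kept code. With the values above, a quick sanity check:

```python
# Search parameters from easy_query.yaml above.
num_codes_per_protein = 10  # top-scoring codes kept per protein
num_smiles_per_code = 20    # SMILES drawn from each kept code
max_candidates = num_codes_per_protein * num_smiles_per_code
print(max_candidates)  # at most 200 candidate ligands per protein
```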
Data Availability and Download
The relevant datasets are provided in the data folder. Additionally, they can be downloaded using the option:
concise download [DOWNLOAD_PATH]
Note: this is only recommended if you don't intend to run repeated queries, as intermediate results are not saved.
Advanced Usage
The three steps in the CoNCISE pipeline can also be performed separately using the additional options included in the CoNCISE package. We describe them in order below.
- Converting SMILES to codes using smiles_to_codes

The smiles_to_codes option requires a configuration file specifying the location of the SMILES CSV file used to produce the discretized code representation. Given a configuration file, configs/smiles_to_codes.yaml, we can invoke this API the following way:

concise smiles_to_codes --config-path configs/smiles_to_codes.yaml

Example smiles_to_codes configuration:
```yaml
## configs/smiles_to_codes.yaml
## Parameters to change
# 1. working directory: the folder where the input, output, and intermediate h5py files are stored
prefix: data/DrugsMini
# 2. the ligand CSV file. Should be comma-separated and contain at least the header `smiles`,
#    corresponding to the SMILES molecular representation.
ligand_file: ${prefix}/smiles.csv
# 3. output CSV location
save_path: ${prefix}/codes.csv
# 4. option to save as a SQLite file
save_as_sqlite: True
device:
  _target_: torch.device
  device: 0  # SPECIFY the device
## HYDRA component. IGNORE
dataset:
  lig_embed_file: ${..prefix}/ligands.h5
  ligand_dataset:
    _target_: concise.dataset.LigandDataset
    lig_file: ${ligand_file}
    lig_embed_file: ${..lig_embed_file}
  ligand_dataloader:
    _target_: torch.utils.data.DataLoader
    dataset: ${..ligand_dataset}
    batch_size: 32
    num_workers: 4
    shuffle: false
```
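Per the config comments, the ligand CSV only needs a comma-separated `smiles` header column. A minimal sketch for building such an input file (the output path and example molecules are illustrative):

```python
import csv

# Build a minimal ligand input file for `smiles_to_codes`.
# The CSV must be comma-separated and contain at least the header `smiles`.
molecules = [
    "CC(=O)Oc1ccccc1C(=O)O",  # aspirin
    "CCO",                    # ethanol
]
with open("smiles.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["smiles"])  # required header
    writer.writerows([m] for m in molecules)
```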
- Assigning proteins to codes using protein_code_assignment

Since the ligand space has been discretized into a small set of possible hierarchical codes, it is now feasible to predict the binding affinity of a protein with all possible code combinations. This is accomplished using the protein_code_assignment option. Given a configuration file configs/protein_to_codes.yaml specifying the protein and other auxiliary information, we can obtain the code binding probabilities of the protein using the following command:

concise protein_code_assignment --config-path configs/protein_to_codes.yaml

Example protein_to_codes configuration:
```yaml
## configs/protein_to_codes.yaml
## Parameters to change
# 1. prefix: the path where data is stored
prefix: data/SwissProtMini
# 2. fasta_file: the path to the FASTA file. Can accept more than one protein in the FASTA record
fasta_file: ${prefix}/seqs.fasta
# 3. save location
save_path: ${prefix}/scores.csv
device:
  _target_: torch.device
  device: cpu  # specify the device. CUDA devices recommended for faster operation
## HYDRA component. IGNORE
dataset:
  rec_embed_file: ${..prefix}/receptors.h5
  fasta_file: ${..fasta_file}
  protein_dataset:
    _target_: concise.dataset.ScoreFastaDataset
    fasta_file: ${..fasta_file}
    rec_embed_file: ${..rec_embed_file}
    max_entries: -1
  protein_dataloader:
    _target_: torch.utils.data.DataLoader
    dataset: ${..protein_dataset}
    batch_size: 1
    num_workers: 16
    shuffle: false
```
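As noted above, the FASTA input may contain more than one protein record. A dependency-free sketch of the expected format (the parser here is illustrative, not the one CoNCISE uses internally; the example IDs and sequence fragments are from SwissProt hemoglobin entries):

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into {record_id: sequence}. Illustrative only."""
    records: dict[str, str] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]  # id is the first token after '>'
            records[current] = ""
        elif current is not None:
            records[current] += line
    return records

example = """>sp|P69905|HBA_HUMAN
MVLSPADKTNVKAAW
>sp|P68871|HBB_HUMAN
MVHLTPEEKSAVTAL
"""
print(parse_fasta(example))
```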
- Querying proteins with SMILES using query

After the binding codes have been identified for each protein (using the protein_code_assignment option), we can now associate each protein with the highest-binding SMILES corresponding to the top-binding codes. This step requires that steps 1 and 2 have already been successfully performed.

Given the configuration file pointing to the protein-to-codes and SMILES-to-codes assignment files, we can query the most likely protein-ligand binding partners using the following command:

concise query --config-path configs/query.yaml

Example query configuration:
```yaml
## configs/query.yaml
# 1. the codes SQLite file, which assigns discrete codes to ligands, obtained using the `smiles_to_codes` option
codes_file: data/DrugsMini/codes.sqlite
# 2. the protein-ligand binding file, obtained using the `protein_code_assignment` option
protein_scores_file: data/SwissProtMini/scores.csv
# 3. save location
save_path: ${prefix}/assignments.csv
## Search parameters
# maximum number of codes assigned to a protein
num_codes_per_protein: 10
# maximum number of SMILES randomly selected from each assigned code
num_smiles_per_code: 20
```
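Conceptually, the query step keeps each protein's top `num_codes_per_protein` codes by score and then randomly samples up to `num_smiles_per_code` SMILES from each kept code. A toy sketch under those assumptions (the data structures and function here are hypothetical, not the package's internals):

```python
import random

def query(code_scores, code_to_smiles,
          num_codes_per_protein=10, num_smiles_per_code=20, seed=0):
    """Pick SMILES for the top-scoring codes of one protein. Illustrative only."""
    rng = random.Random(seed)
    top_codes = sorted(code_scores, key=code_scores.get,
                       reverse=True)[:num_codes_per_protein]
    hits = []
    for code in top_codes:
        pool = code_to_smiles.get(code, [])
        # randomly select up to num_smiles_per_code SMILES from this code
        hits.extend(rng.sample(pool, min(num_smiles_per_code, len(pool))))
    return hits

# Toy data: two codes with binding scores and their ligand pools.
scores = {(0, 1, 2): 0.9, (3, 4, 5): 0.2}
pools = {(0, 1, 2): ["CCO", "CCN"], (3, 4, 5): ["c1ccccc1"]}
print(query(scores, pools, num_codes_per_protein=1, num_smiles_per_code=2))
```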
Note that all commands have accompanying example configurations in data/ with the same name as the command. They can be used as reference for custom configurations.
License
The copyrights of this software are owned by Tufts and Duke Universities. Two licenses for this software are offered:

- An open-source license under the CC-BY-NC-SA 4.0 license for non-commercial academic use.
- A custom license with the two universities, for commercial use or uses without the CC-BY-NC-SA 4.0 license restrictions.
As a recipient of this software, you may choose which license to receive the code under.
To enter a custom license agreement without the CC-BY-NC-SA 4.0 license restrictions, please contact the Digital Innovations department at the Duke Office for Translation & Commercialization (OTC) (https://otc.duke.edu/digital-innovations/#DI-team) at otcquestions@duke.edu.
Please note that this software is distributed AS IS, WITHOUT ANY WARRANTY; and without the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Contact
File details
Details for the file concise_dti-1.0.0.tar.gz.
File metadata
- Download URL: concise_dti-1.0.0.tar.gz
- Upload date:
- Size: 22.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `44e420d1017cd6c8ae5f996f24d691fbb7f71f33c84abcdf6d63c6c419e65f8e` |
| MD5 | `c8c35f4d808bbaa4872da6ff6837e536` |
| BLAKE2b-256 | `317062ba26744145f3ce01e72a825ad7f6cd2b8edd4065705512fa560894a5a4` |
File details
Details for the file concise_dti-1.0.0-py2.py3-none-any.whl.
File metadata
- Download URL: concise_dti-1.0.0-py2.py3-none-any.whl
- Upload date:
- Size: 23.7 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `cc26369b59bae9c2a2c485db0d9af3e35e252550508e75096ff7f3e55d367875` |
| MD5 | `e91ab6e53dc9a1734a61cc2a27cc7818` |
| BLAKE2b-256 | `524ad0a50b1f98c17e75ca811ebf68a79256d500d1b3858f5cc2ea4192d887b1` |