
CoNCISE

Learning a CoNCISE language for small-molecule binding.

Rapid advances in deep learning have improved in silico methods for drug-target interaction (DTI) prediction. However, current methods do not scale to the massive catalogs that list millions or billions of commercially available small molecules. Here, we introduce CoNCISE, a method that accelerates DTI prediction by 2-3 orders of magnitude while maintaining high accuracy. CoNCISE uses a novel vector-quantized codebook approach and residual-learning-based training of hierarchical codes. Our DTI architecture, which combines these compact ligand representations with fixed-length protein embeddings in a cross-attention framework, achieves state-of-the-art prediction accuracy at unprecedented speed.
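To give a feel for the codebook idea, here is a minimal residual-quantization sketch in pure Python. It is illustrative only: the real CoNCISE codebooks, embedding dimensions, and training procedure are not reproduced here, and the random "codebooks" below are placeholders.

```python
# Residual quantization sketch (illustrative; not the actual CoNCISE model).
import random

random.seed(0)
DIM, CODEBOOK_SIZE, LEVELS = 8, 32, 3   # three levels of 32 codes -> 32**3 total

# Hypothetical "trained" codebooks: one table of 32 vectors per level.
codebooks = [[[random.gauss(0, 1) for _ in range(DIM)] for _ in range(CODEBOOK_SIZE)]
             for _ in range(LEVELS)]

def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def encode(x):
    """One code index per level; each level quantizes the previous residual."""
    residual, codes = list(x), []
    for book in codebooks:
        idx = min(range(CODEBOOK_SIZE), key=lambda i: dist2(book[i], residual))
        codes.append(idx)
        # the next level only has to model what this level failed to capture
        residual = [r - c for r, c in zip(residual, book[idx])]
    return codes

def decode(codes):
    """Reconstruct the embedding as the sum of the selected code vectors."""
    out = [0.0] * DIM
    for book, i in zip(codebooks, codes):
        out = [o + c for o, c in zip(out, book[i])]
    return out

codes = encode([random.gauss(0, 1) for _ in range(DIM)])
print(codes)   # three indices, each in [0, 32)
```

Because each level refines the residual of the previous one, the three small codebooks jointly address a space of $32^3$ distinct codes.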

(Figure: CoNCISE method overview)

Table of Contents
  1. Getting Started
  2. Usage
  3. License
  4. Contact

Getting Started

CoNCISE is currently in closed access and does not have a pip wheel. We suggest installing it in a clean Python environment with Python >= 3.12.

Installation

pip install concise

Install from source

pip install .

(back to top)

Usage

CoNCISE achieves extremely fast Drug-Target Interaction (DTI) prediction by discretizing the vast space of small molecules into $32^3 = 32768$ discrete, hierarchically organized codes, requiring only the SMILES representation of each molecule as input. The CoNCISE DTI pipeline consists of the following three steps:

  1. Use the CoNCISE ligand module to convert each ligand's SMILES representation to discrete codes.
  2. Take any protein and find its binding affinity against all $32^3$ possible codes (figure 2 above).
  3. Use the results of steps 1 and 2 to find protein-drug pairs with high binding affinity.
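Conceptually, the three steps combine as follows. The data structures in this sketch are hypothetical (the real pipeline goes through the `concise` CLI and HDF5/CSV files), but the ranking logic is the same: score a protein against the small code space once, then look up the ligands assigned to its top codes.

```python
# Conceptual sketch of the three-step query (data structures are hypothetical).

# Step 1 output: each ligand SMILES -> its discrete code (a flat index here).
ligand_codes = {"CCO": 1042, "c1ccccc1": 77, "CC(=O)O": 1042}

# Step 2 output: per protein, an affinity score for each of the 32**3 codes
# (shown sparse here for brevity).
protein_code_scores = {"P12345": {1042: 0.91, 77: 0.12}}

# Step 3: rank codes per protein, then pull the SMILES assigned to top codes.
def top_pairs(protein, k_codes=1):
    scores = protein_code_scores[protein]
    best = set(sorted(scores, key=scores.get, reverse=True)[:k_codes])
    return [s for s, c in ligand_codes.items() if c in best]

print(top_pairs("P12345"))  # -> ['CCO', 'CC(=O)O'] (both mapped to code 1042)
```

The key speedup: the protein is scored against at most 32768 codes instead of against every molecule in a billion-entry catalog.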

All three steps can be performed through the single easy_query option, which can be invoked using the following command:

concise easy_query --config-path configs/easy_query.yaml

The arguments to the easy_query option are stored in a YAML file, whose contents are described below:

# 1. fasta_file: the path to the fasta file
fasta_file: data/SwissProtMini/seqs.fasta
# 2. rec_embed_file: the path to create/load the receptor embeddings
rec_embed_file: data/SwissProtMini/receptors.h5
# 3. ligand_file: the path to the ligand file
ligand_file: data/DrugsMini/smiles.csv
# 4. lig_embed_file: the path to create/load ligand embeddings
lig_embed_file: data/DrugsMini/ligands.h5
# 5. save location 
save_path: data/QueryDrugsMini/scores.csv
# 6. Search Parameters
num_codes_per_protein: 10
num_smiles_per_code: 20
# 7. device: the device to run the model on
device:
  _target_: torch.device
  device: 0


## HYDRA configurations. IGNORE
protein_dataset:
  rec_embed_file: ${..rec_embed_file}
  fasta_file: ${..fasta_file}
  protein_dataset:
    _target_: concise.dataset.ScoreFastaDataset
    fasta_file: ${..fasta_file}
    rec_embed_file: ${..rec_embed_file}
    max_entries: -1
  protein_dataloader:
    _target_: torch.utils.data.DataLoader
    dataset: ${..protein_dataset}
    batch_size: 1
    num_workers: 2
    shuffle: false

ligand_dataset:
  lig_embed_file: ${..lig_embed_file}
  ligand_dataset:
    _target_: concise.dataset.LigandDataset
    lig_file: ${ligand_file}
    lig_embed_file: ${..lig_embed_file}
  ligand_dataloader:
    _target_: torch.utils.data.DataLoader
    dataset: ${..ligand_dataset}
    batch_size: 32
    num_workers: 4
    shuffle: false
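The `_target_` entries in these configurations follow Hydra's instantiation convention: the dotted path names a class or callable, which is imported and called with the remaining keys as keyword arguments. The stand-in below is a minimal pure-Python sketch of that behavior (Hydra itself also handles `${...}` interpolation and recursive instantiation); the `fractions.Fraction` example is a stdlib substitute so the sketch runs without torch installed.

```python
# Minimal mimic of Hydra-style `_target_` instantiation.
import importlib

def instantiate(cfg):
    """Import cfg['_target_'] and call it with the remaining keys as kwargs."""
    cfg = dict(cfg)
    module, _, name = cfg.pop("_target_").rpartition(".")
    return getattr(importlib.import_module(module), name)(**cfg)

# The config's `{_target_: torch.device, device: 0}` resolves the same way,
# producing a torch device object. Stdlib demonstration:
frac = instantiate({"_target_": "fractions.Fraction",
                    "numerator": 3, "denominator": 4})
print(frac)  # -> 3/4
```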

Data Availability and Download

The relevant datasets are provided in the data folder. Alternatively, they can be downloaded using the option:

concise download [DOWNLOAD_PATH]

Note: this is only recommended if you do not intend to run repeated queries, as intermediate results are not saved.


Advanced Usage

The three steps in the CoNCISE pipeline can also be performed separately using the additional options included in the CoNCISE package. We describe them in order below.

  1. Converting SMILES to Codes using smiles_to_codes

    smiles_to_codes requires a configuration file specifying the location of the SMILES CSV file used to produce the discretized code representation. Given a configuration file, configs/smiles_to_codes.yaml, we can invoke this API the following way:

    concise smiles_to_codes --config-path configs/smiles_to_codes.yaml
    
    Example smiles_to_codes configuration
    ## configs/smiles_to_codes.yaml
    ## Parameters to change
    
    # 1. Working directory
    prefix: data/DrugsMini             # the folder where the input, output, and intermediate h5py files are stored
    
    # 2. the ligand CSV file. Should be comma separated and contain at least the header `smiles`
    # corresponding to the SMILES molecular representation. 
    ligand_file: ${prefix}/smiles.csv 
    
    # 3. output CSV location
    save_path: ${prefix}/codes.csv
    
    # 4. option to save as an SQLite file.
    save_as_sqlite: True
    
    device:
      _target_: torch.device
      device: 0                        # SPECIFY the device
    
    ## HYDRA component. IGNORE
    dataset:
      lig_embed_file: ${..prefix}/ligands.h5
      ligand_dataset:
        _target_: concise.dataset.LigandDataset
        lig_file: ${ligand_file}
        lig_embed_file: ${..lig_embed_file}
      ligand_dataloader:
        _target_: torch.utils.data.DataLoader
        dataset: ${..ligand_dataset}
        batch_size: 32
        num_workers: 4
        shuffle: false
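The only hard requirement on the ligand file, per the configuration comments above, is a comma-separated CSV with at least a `smiles` header column. A minimal way to produce a valid input (the example molecules are arbitrary placeholders, written to an in-memory buffer here rather than to `data/DrugsMini/smiles.csv`):

```python
# Build a minimal ligand CSV in the expected shape: at least a `smiles` column.
import csv
import io

rows = [{"smiles": "CCO"},                       # ethanol
        {"smiles": "c1ccccc1"},                  # benzene
        {"smiles": "CC(=O)Oc1ccccc1C(=O)O"}]     # aspirin

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["smiles"])
writer.writeheader()
writer.writerows(rows)

print(buf.getvalue().splitlines()[0])  # -> smiles
```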
    
  2. Assigning proteins to codes:

    Since the ligand space has been discretized into a small set of possible hierarchical codes, it is now feasible to predict the binding affinity of a protein with all possible code combinations. This is accomplished using the protein_code_assignment option.

    Given a configuration file configs/protein_to_codes.yaml specifying the protein and other auxiliary information, we can obtain the code binding probabilities of the protein using the following command:

    concise protein_code_assignment --config-path configs/protein_to_codes.yaml
    
    Example protein_to_codes configuration
    ## configs/protein_to_codes.yaml
     
    ## Parameters to change
    
    # 1. prefix: the path where data is stored
    prefix: data/SwissProtMini
     
    # 2. fasta_file: the path to the FASTA file. Can accept more than one protein in the FASTA record
    fasta_file: ${prefix}/seqs.fasta
    
    # 3. save location 
    save_path: ${prefix}/scores.csv
     
    device:
      _target_: torch.device
      device: cpu                 # specify the device. CUDA devices recommended for faster operation
     
     
    ## HYDRA component. IGNORE.
    dataset:
      rec_embed_file: ${..prefix}/receptors.h5
      fasta_file: ${..fasta_file}
      protein_dataset:
        _target_: concise.dataset.ScoreFastaDataset
        fasta_file: ${..fasta_file}
        rec_embed_file: ${..rec_embed_file}
        max_entries: -1
      protein_dataloader:
        _target_: torch.utils.data.DataLoader
        dataset: ${..protein_dataset}
        batch_size: 1
        num_workers: 16
        shuffle: false
    
  3. Querying proteins with SMILES. After the binding codes have been identified for each protein (using the protein_code_assignment option), we can now associate each protein with the highest-binding SMILES corresponding to the top-binding codes. This step requires that steps 1 and 2 have already been completed successfully.

    Given a configuration file pointing to the protein-to-codes and SMILES-to-codes assignment files, we can query the most likely protein-ligand binding partners using the following command:

    concise query --config-path configs/query.yaml
    
    Example query configuration
    ## configs/query.yaml
    
    # 1. the codes SQLite file, which assigns discrete codes to ligands, obtained using the `smiles_to_codes` option
    codes_file: data/DrugsMini/codes.sqlite
    # 2. the protein-to-code binding scores file, obtained using the `protein_code_assignment` option
    protein_scores_file: data/SwissProtMini/scores.csv
    # 3. save location
    save_path: ${prefix}/assignments.csv
      
    ## Search Parameters
    # maximum number of codes assigned to protein 
    num_codes_per_protein: 10
    # maximum number of SMILES randomly selected from each assigned code
    num_smiles_per_code: 20
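How the two search parameters interact can be sketched as follows (the selection details inside `concise query` are assumptions; the buckets and scores below are made up): keep the `num_codes_per_protein` highest-scoring codes for the protein, then randomly draw up to `num_smiles_per_code` SMILES from each selected code's bucket.

```python
# Sketch of the two search parameters (hypothetical data).
import random

num_codes_per_protein = 2
num_smiles_per_code = 2

code_scores = {101: 0.9, 205: 0.7, 333: 0.1}      # protein-vs-code affinities
code_to_smiles = {101: ["CCO", "CCN", "CCC"],     # ligands assigned to each code
                  205: ["c1ccccc1"],
                  333: ["CO"]}

# Keep the top-N codes, then sample up to M SMILES from each code's bucket.
top_codes = sorted(code_scores, key=code_scores.get,
                   reverse=True)[:num_codes_per_protein]
hits = []
for code in top_codes:
    pool = code_to_smiles[code]
    hits += random.sample(pool, min(num_smiles_per_code, len(pool)))

print(top_codes)  # -> [101, 205]
```

So a protein is reported with at most `num_codes_per_protein * num_smiles_per_code` candidate ligands.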
    

Note that all commands have accompanying example configurations in configs/ with the same name as the command. They can be used as reference for custom configurations.

(back to top)


License

The copyrights of this software are owned by Tufts and Duke Universities. Two licenses for this software are offered:

  1. An open-source license under the CC-BY-NC-SA 4.0 license for non-commercial academic use.

  2. A custom license with the two universities, for commercial use or uses without the CC-BY-NC-SA 4.0 license restrictions. 

As a recipient of this software, you may choose which license to receive the code under.

To enter a custom license agreement without the CC-BY-NC-SA 4.0 license restrictions, please contact the Digital Innovations department at the Duke Office for Translation & Commercialization (OTC) (https://otc.duke.edu/digital-innovations/#DI-team) at otcquestions@duke.edu.

Please note that this software is distributed AS IS, WITHOUT ANY WARRANTY; and without the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

(back to top)

Contact
