Learning a CoNCISE language for small-molecule binding
Rapid advances in deep learning have improved in silico methods for drug-target interaction (DTI) prediction. However, current methods do not scale to the massive catalogs that list millions or billions of commercially available small molecules. Here, we introduce CoNCISE, a method that accelerates DTI prediction by 2-3 orders of magnitude while maintaining high accuracy. CoNCISE uses a novel vector-quantized codebook approach and residual-learning-based training of hierarchical codes. Our DTI architecture, which combines these compact ligand representations with fixed-length protein embeddings in a cross-attention framework, achieves state-of-the-art prediction accuracy at unprecedented speed.
Getting Started
Currently CoNCISE is in closed access and does not have a pip wheel. We suggest installing it in a clean Python environment with Python >= 3.12.
Installation
pip install concise
Install from source
pip install .
Usage
CoNCISE achieves extremely fast drug-target interaction (DTI) prediction by discretizing the vast space of small molecules into $32^3 = 32768$ discrete, hierarchically organized codes, requiring only the SMILES molecular representation as input. The CoNCISE DTI pipeline consists of the following three steps:
- Use the CoNCISE ligand module to convert each ligand's SMILES representation into discrete codes.
- Take any protein and find its binding affinity against all $32^3$ possible codes (figure 2 above).
- Combine steps 1 and 2 to find protein-drug pairs with high binding affinity.
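To get a feel for the size of the discretized search space, here is a short sketch. It assumes each hierarchical code is a triple with 32 options per level, which matches the $32^3$ count above but is only an illustration, not the package's internal data layout:

```python
from itertools import product

# CoNCISE discretizes ligand space into 32^3 hierarchically organized codes.
# Assume a code is a triple (coarse, mid, fine), 32 options per level.
LEVELS, CODES_PER_LEVEL = 3, 32
code_space = list(product(range(CODES_PER_LEVEL), repeat=LEVELS))
print(len(code_space))  # 32768 codes a protein is scored against
```

Because every protein is scored against this fixed, small set of codes rather than against each catalog molecule individually, the per-protein scoring cost is constant in the catalog size.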
All three steps can be performed through the single easy_query option, which is invoked with the following command:
concise easy_query --config-path configs/easy_query.yaml
The arguments to the easy_query option are stored in a YAML file, whose contents are described below:
```yaml
# 1. fasta_file: the path to the FASTA file
fasta_file: data/SwissProtMini/seqs.fasta
# 2. rec_embed_file: the path to create/load the receptor embeddings
rec_embed_file: data/SwissProtMini/receptors.h5
# 3. ligand_file: the path to the ligand file
ligand_file: data/DrugsMini/smiles.csv
# 4. lig_embed_file: the path to create/load ligand embeddings
lig_embed_file: data/DrugsMini/ligands.h5
# 5. save location
save_path: data/QueryDrugsMini/scores.csv
# 6. search parameters
num_codes_per_protein: 10
num_smiles_per_code: 20
# 7. device: the device to run the model on
device:
  _target_: torch.device
  device: 0
## HYDRA configurations. IGNORE
protein_dataset:
  rec_embed_file: ${..rec_embed_file}
  fasta_file: ${..fasta_file}
  protein_dataset:
    _target_: concise.dataset.ScoreFastaDataset
    fasta_file: ${..fasta_file}
    rec_embed_file: ${..rec_embed_file}
    max_entries: -1
  protein_dataloader:
    _target_: torch.utils.data.DataLoader
    dataset: ${..protein_dataset}
    batch_size: 1
    num_workers: 2
    shuffle: false
ligand_dataset:
  lig_embed_file: ${..lig_embed_file}
  ligand_dataset:
    _target_: concise.dataset.LigandDataset
    lig_file: ${ligand_file}
    lig_embed_file: ${..lig_embed_file}
  ligand_dataloader:
    _target_: torch.utils.data.DataLoader
    dataset: ${..ligand_dataset}
    batch_size: 32
    num_workers: 4
    shuffle: false
```
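The two search parameters bound the amount of work done per protein: at most `num_codes_per_protein` codes are kept, and at most `num_smiles_per_code` SMILES are drawn from each kept code. With the values above, a quick sanity check:

```python
# Search parameters from easy_query.yaml above.
num_codes_per_protein = 10  # top-scoring codes kept per protein
num_smiles_per_code = 20    # SMILES drawn from each kept code
max_candidates = num_codes_per_protein * num_smiles_per_code
print(max_candidates)  # at most 200 candidate ligands per protein
```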
Data Availability and Download
The relevant datasets are provided in the data folder. Additionally, they can be downloaded using the option:
concise download [DOWNLOAD_PATH]
Note: this is only recommended if you don't intend to run repeated queries, as intermediate results are not saved.
Advanced Usage
The three steps in the CoNCISE pipeline can also be performed separately using the additional options included in the CoNCISE package. We describe them in order below.
- Converting SMILES to codes using smiles_to_codes

The smiles_to_codes option requires a configuration file specifying the location of the SMILES CSV file used to produce the discretized code representation. Given a configuration file, configs/smiles_to_codes.yaml, we can invoke this API the following way:

concise smiles_to_codes --config-path configs/smiles_to_codes.yaml

Example smiles_to_codes configuration:
```yaml
## configs/smiles_to_codes.yaml
## Parameters to change
# 1. working directory: the folder where the input, output, and intermediate h5py files are stored
prefix: data/DrugsMini
# 2. the ligand CSV file. Should be comma-separated and contain at least the header `smiles`,
#    corresponding to the SMILES molecular representation.
ligand_file: ${prefix}/smiles.csv
# 3. output CSV location
save_path: ${prefix}/codes.csv
# 4. option to save as a SQLite file
save_as_sqlite: True
device:
  _target_: torch.device
  device: 0  # SPECIFY the device
## HYDRA component. IGNORE
dataset:
  lig_embed_file: ${..prefix}/ligands.h5
  ligand_dataset:
    _target_: concise.dataset.LigandDataset
    lig_file: ${ligand_file}
    lig_embed_file: ${..lig_embed_file}
  ligand_dataloader:
    _target_: torch.utils.data.DataLoader
    dataset: ${..ligand_dataset}
    batch_size: 32
    num_workers: 4
    shuffle: false
```
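Per the config comments, the ligand CSV only needs a comma-separated `smiles` header column. A minimal sketch for building such an input file (the output path and example molecules are illustrative):

```python
import csv

# Build a minimal ligand input file for `smiles_to_codes`.
# The CSV must be comma-separated and contain at least the header `smiles`.
molecules = [
    "CC(=O)Oc1ccccc1C(=O)O",  # aspirin
    "CCO",                    # ethanol
]
with open("smiles.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["smiles"])  # required header
    writer.writerows([m] for m in molecules)
```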
- Assigning proteins to codes using protein_code_assignment

Since the ligand space has been discretized into a small set of possible hierarchical codes, it is now feasible to predict the binding affinity of a protein with all possible code combinations. This is accomplished using the protein_code_assignment option. Given a configuration file configs/protein_to_codes.yaml specifying the protein and other auxiliary information, we can obtain the code binding probabilities of the protein using the following command:

concise protein_code_assignment --config-path configs/protein_to_codes.yaml

Example protein_to_codes configuration:
```yaml
## configs/protein_to_codes.yaml
## Parameters to change
# 1. prefix: the path where data is stored
prefix: data/SwissProtMini
# 2. fasta_file: the path to the FASTA file. Can accept more than one protein in the FASTA record
fasta_file: ${prefix}/seqs.fasta
# 3. save location
save_path: ${prefix}/scores.csv
device:
  _target_: torch.device
  device: cpu  # specify the device. CUDA devices recommended for faster operation
## HYDRA component. IGNORE
dataset:
  rec_embed_file: ${..prefix}/receptors.h5
  fasta_file: ${..fasta_file}
  protein_dataset:
    _target_: concise.dataset.ScoreFastaDataset
    fasta_file: ${..fasta_file}
    rec_embed_file: ${..rec_embed_file}
    max_entries: -1
  protein_dataloader:
    _target_: torch.utils.data.DataLoader
    dataset: ${..protein_dataset}
    batch_size: 1
    num_workers: 16
    shuffle: false
```
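As noted above, the FASTA input may contain more than one protein record. A dependency-free sketch of the expected format (the parser here is illustrative, not the one CoNCISE uses internally; the example IDs and sequence fragments are from SwissProt hemoglobin entries):

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into {record_id: sequence}. Illustrative only."""
    records: dict[str, str] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]  # id is the first token after '>'
            records[current] = ""
        elif current is not None:
            records[current] += line
    return records

example = """>sp|P69905|HBA_HUMAN
MVLSPADKTNVKAAW
>sp|P68871|HBB_HUMAN
MVHLTPEEKSAVTAL
"""
print(parse_fasta(example))
```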
- Querying proteins with SMILES using query

After the binding codes have been identified for each protein (using the protein_code_assignment option), we can now associate each protein with the highest-binding SMILES corresponding to the top-binding codes. This step requires that steps 1 and 2 have already been successfully performed.

Given the configuration file pointing to the protein-to-codes and SMILES-to-codes assignment files, we can query the most likely protein-ligand binding partners using the following command:

concise query --config-path configs/query.yaml

Example query configuration:
```yaml
## configs/query.yaml
# 1. the codes SQLite file, which assigns discrete codes to ligands, obtained using the `smiles_to_codes` option
codes_file: data/DrugsMini/codes.sqlite
# 2. the protein-ligand binding file, obtained using the `protein_code_assignment` option
protein_scores_file: data/SwissProtMini/scores.csv
# 3. save location
save_path: ${prefix}/assignments.csv
## Search parameters
# maximum number of codes assigned to a protein
num_codes_per_protein: 10
# maximum number of SMILES randomly selected from each assigned code
num_smiles_per_code: 20
```
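Conceptually, the query step keeps each protein's top `num_codes_per_protein` codes by score and then randomly samples up to `num_smiles_per_code` SMILES from each kept code. A toy sketch under those assumptions (the data structures and function here are hypothetical, not the package's internals):

```python
import random

def query(code_scores, code_to_smiles,
          num_codes_per_protein=10, num_smiles_per_code=20, seed=0):
    """Pick SMILES for the top-scoring codes of one protein. Illustrative only."""
    rng = random.Random(seed)
    top_codes = sorted(code_scores, key=code_scores.get,
                       reverse=True)[:num_codes_per_protein]
    hits = []
    for code in top_codes:
        pool = code_to_smiles.get(code, [])
        # randomly select up to num_smiles_per_code SMILES from this code
        hits.extend(rng.sample(pool, min(num_smiles_per_code, len(pool))))
    return hits

# Toy data: two codes with binding scores and their ligand pools.
scores = {(0, 1, 2): 0.9, (3, 4, 5): 0.2}
pools = {(0, 1, 2): ["CCO", "CCN"], (3, 4, 5): ["c1ccccc1"]}
print(query(scores, pools, num_codes_per_protein=1, num_smiles_per_code=2))
```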
Note that all commands have accompanying example configurations in data/ with the same name as the command. They can be used as reference for custom configurations.
License
The copyrights of this software are owned by Tufts and Duke Universities. Two licenses for this software are offered:

- An open-source license under the CC-BY-NC-SA 4.0 license for non-commercial academic use.
- A custom license with the two universities, for commercial use or uses without the CC-BY-NC-SA 4.0 license restrictions.
As a recipient of this software, you may choose which license to receive the code under.
To enter a custom license agreement without the CC-BY-NC-SA 4.0 license restrictions, please contact the Digital Innovations department at the Duke Office for Translation & Commercialization (OTC) (https://otc.duke.edu/digital-innovations/#DI-team) at otcquestions@duke.edu.
Please note that this software is distributed AS IS, WITHOUT ANY WARRANTY; and without the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Contact
File details
Details for the file concise_dti-1.0.0.tar.gz.
File metadata
- Download URL: concise_dti-1.0.0.tar.gz
- Upload date:
- Size: 22.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `44e420d1017cd6c8ae5f996f24d691fbb7f71f33c84abcdf6d63c6c419e65f8e` |
| MD5 | `c8c35f4d808bbaa4872da6ff6837e536` |
| BLAKE2b-256 | `317062ba26744145f3ce01e72a825ad7f6cd2b8edd4065705512fa560894a5a4` |
File details
Details for the file concise_dti-1.0.0-py2.py3-none-any.whl.
File metadata
- Download URL: concise_dti-1.0.0-py2.py3-none-any.whl
- Upload date:
- Size: 23.7 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `cc26369b59bae9c2a2c485db0d9af3e35e252550508e75096ff7f3e55d367875` |
| MD5 | `e91ab6e53dc9a1734a61cc2a27cc7818` |
| BLAKE2b-256 | `524ad0a50b1f98c17e75ca811ebf68a79256d500d1b3858f5cc2ea4192d887b1` |