A dynamic, high-performance cheminformatics framework integrating 6 distinct molecular embeddings into a robust unified latent representation.

Chemical Dice Integrator (CDI)

CDI (Chemical Dice Integrator) is a high-performance deep learning framework designed to unify heterogeneous chemical representations into a single, information-rich latent space. By fusing six complementary molecular embeddings, CDI produces a consolidated vector optimized for large-scale cheminformatics, bioinformatics, and AI-driven molecular discovery tasks.

Overview

CDI extends the Chemical Dice Integrator featurization ecosystem by performing unsupervised integration of six distinct molecular embeddings:

  • Quantum Descriptors
  • Bioactivity Signatures
  • Language Model Embeddings
  • Graph-Derived Representations
  • Physicochemical Profiles
  • 2D Molecular Image Features

Each compound’s six feature types are combined to create a single latent embedding that captures chemical, structural, and biological semantics. These embeddings can be directly used for tasks such as QSAR modeling, virtual screening, drug-target interaction prediction, and bioactivity clustering.

Installation

1. Prerequisites & System Requirements

2. Install Python Dependencies

Open a terminal or a Jupyter notebook and run the following command to install all required Python packages.

pip install numpy pandas rdkit tqdm requests

3. Install the ChemicalDice Python Package

pip install -i https://test.pypi.org/simple/ ChemicalDice

Usage

Feature Extraction from a CSV File

The primary function, smiles_to_embeddings, processes a CSV file containing SMILES strings, validates and canonicalizes them, and streams the data to the ChemicalDice API to generate molecular embeddings.

Step 1: Prepare Your Input CSV

Your input file must meet the following requirements:

  • Column Name: The file must contain a column named exactly SMILES.
  • File Size: The input file size must not exceed 20 MB.

Example smiles.csv:

SMILES,Compound_ID
CCO,Ethanol
Cc1ccccc1,Toluene
C1CCCCC1,Cyclohexane
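
Both requirements can be checked locally before anything is sent to the API. The following is a minimal sketch using only the standard library; check_input_csv is a hypothetical helper and not part of ChemicalDice, and it assumes that "20 MB" means 20 * 1024 * 1024 bytes.

```python
import csv
import os

# Assumption: the 20 MB limit is read as 20 * 1024 * 1024 bytes.
MAX_BYTES = 20 * 1024 * 1024


def check_input_csv(path):
    """Return a list of problems; an empty list means the file looks usable."""
    problems = []
    if os.path.getsize(path) > MAX_BYTES:
        problems.append("file is larger than 20 MB")
    # utf-8-sig also strips a byte-order mark, which some spreadsheet exports add.
    with open(path, newline="", encoding="utf-8-sig") as handle:
        header = next(csv.reader(handle), [])
    if "SMILES" not in header:
        problems.append("no column named exactly 'SMILES'")
    return problems
```

Calling check_input_csv("smiles.csv") on the example file above should return an empty list; a header such as smiles or Smiles would be reported, because the column name is matched exactly.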

Step 2: Run the Feature Extraction

from ChemicalDice import smiles_to_embeddings

# Generate embeddings from CSV 
CDI_embeddings = smiles_to_embeddings.collect_features_from_csv(
    filepath="smiles.csv",
    convert_to_canonical=True
)

# CDI_embeddings is a pandas.DataFrame; save it to CSV
CDI_embeddings.to_csv("CDI_embeddings.csv", index=False)

Function Details: smiles_to_embeddings.collect_features_from_csv

  • Purpose: Processes a CSV file to generate molecular feature embeddings.
  • Input: Path to a CSV file with a SMILES column.
  • Process:
    1. Validation: Uses RDKit to validate each SMILES string. Invalid entries are flagged and skipped.
    2. Canonicalization (optional): The SMILES column in your input CSV is converted to canonical SMILES. To skip this step, set the convert_to_canonical argument to False.
    3. Feature Extraction: The CSV is streamed to the ChemicalDice API, which returns a data frame of molecular features.
  • Output: A dataframe where the first column contains the input SMILES, other columns correspond to the extracted features, and rows correspond to successfully processed molecules.
    This standardized output can be used directly for downstream tasks such as QSAR modeling, clustering, virtual screening, or integration into machine learning pipelines.
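
To make the validation and canonicalization steps concrete, here is a minimal sketch of how such a check is commonly written with RDKit. It is not the package's internal code: rdkit_canonical and partition_smiles are hypothetical names, and the canonicalizer is passed in as an argument only so that the splitting logic can be tried without RDKit installed.

```python
def rdkit_canonical(smiles):
    """Return the canonical SMILES, or None if RDKit cannot parse the input."""
    # Imported lazily so partition_smiles can be used with another canonicalizer.
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)  # returns None for unparsable SMILES
    return None if mol is None else Chem.MolToSmiles(mol)


def partition_smiles(smiles_list, canonicalize=rdkit_canonical):
    """Split inputs into (original, canonical) pairs and a list of rejected strings."""
    valid, invalid = [], []
    for smi in smiles_list:
        canonical = canonicalize(smi)
        if canonical is None:
            invalid.append(smi)
        else:
            valid.append((smi, canonical))
    return valid, invalid
```

Running partition_smiles on your own list before calling collect_features_from_csv gives an early view of which entries are likely to be skipped.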

Troubleshooting & Notes

  • Backup Your Data: The input CSV file is modified in-place. Always work on a copy of your original data to prevent data loss.
  • Invalid SMILES: Molecules with invalid SMILES are skipped during processing and do not appear in the output feature dataframe. To see which entries were invalid, check the function's messages or the is_valid column of your overwritten CSV.
  • Network Connection: A stable internet connection is required to communicate with the ChemicalDice API.
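
The first two notes can be handled with a few lines of standard-library Python. This is a sketch under stated assumptions: make_working_copy and invalid_rows are hypothetical helpers, and the exact text written to the is_valid column is not documented here, so the code assumes values such as False or 0 mark an invalid entry.

```python
import csv
import shutil


def make_working_copy(original, working):
    """Copy the original CSV so the in-place rewrite touches only the copy."""
    shutil.copyfile(original, working)
    return working


def invalid_rows(path, flag_column="is_valid"):
    """Return the rows whose flag column reads as false.

    Assumption: invalid entries are marked with text such as False or 0.
    Rows without the column, or with any other value, are not reported.
    """
    falsy = {"false", "0"}
    with open(path, newline="", encoding="utf-8-sig") as handle:
        return [
            row
            for row in csv.DictReader(handle)
            if (row.get(flag_column) or "").strip().lower() in falsy
        ]
```

A typical flow would be to call make_working_copy("smiles.csv", "smiles_working.csv"), pass the copy as filepath to collect_features_from_csv, and then call invalid_rows("smiles_working.csv") to list the skipped molecules.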

For technical issues, please ensure all prerequisites are met and your configuration is correct. For API-related problems, contact the ChemicalDice service administrators.


CDI Bot

Chemical Dice Integrator — Conversational Molecular Embedding Platform

CDI Bot is a fully containerised, LLM-powered web application that gives researchers and chemists a natural-language interface to the Chemical Dice Integrator (CDI).

Tip: Watch the CDI Bot in action in the video.


For all other detailed information, please visit our complete documentation.

