A tool for evaluating molecular SMILES strings

Overview

mol_eval is a tool for evaluating SMILES data, particularly for distinguishing between real and fake SMILES sequences. It uses configurable thresholds and molecular descriptors to assess similarity and other properties such as solubility.

Features

  • Real vs Fake SMILES Evaluation: Compare real and synthetic SMILES sequences based on various similarity thresholds.
  • Similarity Metrics: Uses Levenshtein distance, Tanimoto coefficient, and molecular descriptors for comparison.
  • Configurable Analysis: Easily tweak similarity thresholds, solubility labels, and molecular descriptors through a configuration file.
  • Reports: Generate detailed evaluation reports based on the results.
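
To make the two similarity metrics concrete, here is a minimal pure-Python sketch. It is an illustration only, not mol_eval's own implementation: the function names are hypothetical, the Levenshtein similarity is normalized by the longer string (an assumption), and the Tanimoto coefficient is computed on plain Python sets, whereas a cheminformatics tool would normally compute it on molecular fingerprints.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; normalizing by the longer string is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def tanimoto(x: set, y: set) -> float:
    """Tanimoto (Jaccard) coefficient: |x & y| / |x | y|."""
    union = len(x | y)
    # Two empty feature sets are scored 0.0 here; this convention is an assumption.
    return 0.0 if union == 0 else len(x & y) / union


# Example: compare two SMILES strings character by character.
real_smiles = "CCO"  # ethanol
fake_smiles = "CCN"  # ethylamine
similarity = levenshtein_similarity(real_smiles, fake_smiles)
```

A score like this would then be compared against a configured threshold such as LEVENSHTEIN_THRESHOLD.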

Installation

To install mol_eval, you can use pip:

pip install mol_eval

Configuration

Before running the tool, you'll need to prepare your dataset and configuration file.

Step 1: Prepare Your Dataset Files

  • real_data.csv: must contain two columns:
      • cmpd_name: the name of the compound.
      • smile: the SMILES string representing the molecule.
  • fake_data.csv: must contain one column:
      • smile: the SMILES string of a synthetic molecule.
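
As a rough sketch of the expected layout, the following snippet writes two small example files with the column names described above. The helper name write_example_datasets and the sample molecules are hypothetical and only serve as an illustration.

```python
import csv
from pathlib import Path


def write_example_datasets(folder):
    """Write tiny real_data.csv and fake_data.csv files into `folder`."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    real_path = folder / "real_data.csv"
    fake_path = folder / "fake_data.csv"

    # real_data.csv: compound name plus SMILES string.
    with real_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["cmpd_name", "smile"])
        writer.writerows([("ethanol", "CCO"), ("benzene", "c1ccccc1")])

    # fake_data.csv: SMILES strings only.
    with fake_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["smile"])
        writer.writerows([("CCN",), ("c1ccncc1",)])

    return real_path, fake_path
```

Calling write_example_datasets("data") would create data/real_data.csv and data/fake_data.csv, which can then be passed to the command-line tool.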

Step 2: Configuration File (config.json)

The configuration file allows you to set various thresholds and other parameters used in the evaluation. Here's an example configuration file:

{
    "LEVENSHTEIN_THRESHOLD": 0.5,
    "VERY_HIGH_SIMILARITY_THRESHOLD": 0.9,
    "HIGH_SIMILARITY_THRESHOLD": 0.88,
    "LOW_SIMILARITY_THRESHOLD": 0.3,
    "SOLUBILITY_THRESHOLDS": {
        "VERY_HIGH": -1,
        "HIGH": 0,
        "MODERATE": 2,
        "LOW": 4,
        "VERY_LOW": "Infinity"
    },
    "RELEVANT_DESCRIPTORS": [
        "fr_Al_COO", "fr_NH1", "fr_ketone", "fr_halogen",
        "MaxEStateIndex", "MinEStateIndex", "MinPartialCharge", "MaxPartialCharge",
        "fr_COO", "fr_Ar_N", "fr_Ar_OH",
        "MolWt", "ExactMolWt", "HeavyAtomCount", "NumRotatableBonds",
        "FractionCSP3", "LabuteASA", "RingCount",
        "MolLogP", "TPSA",
        "SlogP_VSA1", "SlogP_VSA2", "SlogP_VSA3", "SlogP_VSA4",
        "SlogP_VSA5", "SlogP_VSA6", "SlogP_VSA7", "SlogP_VSA8", "SlogP_VSA9", "SlogP_VSA10",
        "PEOE_VSA1", "PEOE_VSA2", "PEOE_VSA3", "PEOE_VSA4", "PEOE_VSA5", "PEOE_VSA6",
        "PEOE_VSA7", "PEOE_VSA8", "PEOE_VSA9", "PEOE_VSA10", "PEOE_VSA11", "PEOE_VSA12",
        "PEOE_VSA13", "PEOE_VSA14",
        "NumAromaticRings", "NumSaturatedRings", "fr_benzene", "fr_bicyclic",
        "Chi0", "Chi0n", "Chi0v", "Chi1", "Chi1n", "Chi1v",
        "Chi2n", "Chi2v", "Chi3n", "Chi3v", "Chi4n", "Chi4v", "HallKierAlpha"
    ],
    "TANIMOTO_THRESHOLDS": {
        "VERY_HIGH": 0.9,
        "HIGH": 0.88,
        "MODERATE": 0.3
    },
    "VALID_SOLUBILITY_LABELS": ["VERY_HIGH", "HIGH", "MODERATE"],
    "VALID_TANIMOTO_LABELS": ["HIGH", "MODERATE", "LOW"],
    "MAX_SUBSTRUCTURES_MATCHES": 0,
    "REPORT_FOLDER": "./report"
}
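
Note that the example stores VERY_LOW as the JSON string "Infinity" rather than as a number. The sketch below shows one way to load such a file in Python and turn the solubility thresholds into floats. The function names are hypothetical, the range check on the similarity thresholds is an assumption, and mol_eval's own loader may behave differently.

```python
import json


def normalize_config(config: dict) -> dict:
    """Return a copy of the config with solubility thresholds as floats."""
    normalized = dict(config)
    # float("Infinity") yields inf, so the string form in the example is handled.
    normalized["SOLUBILITY_THRESHOLDS"] = {
        label: float(value)
        for label, value in config["SOLUBILITY_THRESHOLDS"].items()
    }
    # Assumption: the similarity thresholds are fractions between 0 and 1.
    for key in (
        "LEVENSHTEIN_THRESHOLD",
        "VERY_HIGH_SIMILARITY_THRESHOLD",
        "HIGH_SIMILARITY_THRESHOLD",
        "LOW_SIMILARITY_THRESHOLD",
    ):
        if not 0.0 <= normalized[key] <= 1.0:
            raise ValueError(f"{key} must be between 0 and 1, got {normalized[key]}")
    return normalized


def load_config(path):
    """Read config.json from disk and normalize it."""
    with open(path, encoding="utf-8") as handle:
        return normalize_config(json.load(handle))
```

Checking the file this way before a long run can surface typos in threshold values early.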

Key Parameters Explained:

  • Thresholds: LEVENSHTEIN_THRESHOLD and the *_SIMILARITY_THRESHOLD values set the similarity cut-offs; SOLUBILITY_THRESHOLDS sets the solubility cut-offs.
  • Descriptors: RELEVANT_DESCRIPTORS lists the molecular descriptors used in the evaluation, such as molecular weight (MolWt), logP (MolLogP), and polar surface area (TPSA).
  • Tanimoto and Levenshtein: TANIMOTO_THRESHOLDS and LEVENSHTEIN_THRESHOLD tune how molecular similarity is classified.
  • Solubility Labels: SOLUBILITY_THRESHOLDS defines the solubility categories from the solubility values; VALID_SOLUBILITY_LABELS lists the categories treated as valid.
  • Report Folder: REPORT_FOLDER sets where evaluation reports are saved.
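
The following sketch shows how threshold and label settings like these could interact, using the Tanimoto values from the example configuration. The mapping rule is an assumption made for illustration (a score receives the label of the highest cut-off it reaches, and anything below every cut-off is labeled LOW), and the function name is hypothetical. mol_eval may apply the thresholds differently.

```python
TANIMOTO_THRESHOLDS = {"VERY_HIGH": 0.9, "HIGH": 0.88, "MODERATE": 0.3}
VALID_TANIMOTO_LABELS = ["HIGH", "MODERATE", "LOW"]


def tanimoto_label(score: float, thresholds: dict) -> str:
    """Map a similarity score to a label (assumed rule: highest cut-off reached)."""
    ordered = sorted(thresholds.items(), key=lambda item: item[1], reverse=True)
    for label, cutoff in ordered:
        if score >= cutoff:
            return label
    return "LOW"  # assumed fallback for scores below every cut-off


# Keep only scores whose label is listed as valid.
scores = [0.95, 0.89, 0.5, 0.1]
labels = [tanimoto_label(s, TANIMOTO_THRESHOLDS) for s in scores]
accepted = [s for s, label in zip(scores, labels) if label in VALID_TANIMOTO_LABELS]
```

Under this reading, a generated molecule that is nearly identical to a real one (VERY_HIGH) would be filtered out, while less similar candidates would be kept.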

Usage

After installing the package and preparing your dataset and configuration file, you can run the evaluation tool from the command line.

Run the Evaluation

Use the following command to evaluate your datasets:

mol_eval --real_data /path/to/real_data.csv --fake_data /path/to/fake_data.csv --configs /path/to/config.json

Run mol_eval -h to print the usage message:

usage: mol_eval [-h] --real_data REAL_DATA --fake_data FAKE_DATA --configs CONFIGS

Molecule Evaluator: Evaluate real and fake SMILES data using a configuration file.

options:
  -h, --help            Show this help message and exit.
  --real_data REAL_DATA Path to the real SMILES data file (CSV).
  --fake_data FAKE_DATA Path to the fake SMILES data file (CSV).
  --configs CONFIGS     Path to the configuration JSON file.
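
If you prefer to start the evaluation from a Python script, the command can be wrapped with the standard subprocess module. This is a minimal sketch that assumes the mol_eval executable is on your PATH; the helper names build_command and run_mol_eval are hypothetical and are not part of the package.

```python
import subprocess


def build_command(real_data, fake_data, configs):
    """Assemble the mol_eval command line from the three required options."""
    return [
        "mol_eval",
        "--real_data", str(real_data),
        "--fake_data", str(fake_data),
        "--configs", str(configs),
    ]


def run_mol_eval(real_data, fake_data, configs):
    """Run mol_eval and raise CalledProcessError if it exits with an error."""
    return subprocess.run(build_command(real_data, fake_data, configs), check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.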

Report Generation

The tool generates a report in the folder specified by REPORT_FOLDER in the configuration file (default is ./report). The report contains detailed information on the evaluation of the SMILES sequences, including similarity metrics, solubility predictions, and substructure matching.


Contributing

Contributions are welcome! Feel free to open issues or submit pull requests. Please ensure all tests pass and that the code follows the PEP 8 style guide.


License

This project is licensed under the terms of the GNU General Public License, Version 3.
