FAME3R: a re-implementation of the FAME3 model

FAME3R is a random forest model predicting the phase 1 and phase 2 sites of metabolism (SOMs) in small organic molecules.

Installation

  1. Create a conda environment with the required Python version:
conda create --name fame3r-env python=3.10
  2. Activate the environment:
conda activate fame3r-env
  3. Install the package:
pip install fame3r

Usage

Input data

The input data must be provided in SD file format. Any number and type of molecular properties are accepted. For labeled data, the true sites of metabolism (SOMs) should be specified as a list of atom indices under the soms property. For example, if atoms 1 and 6 are SOMs, the soms property should be written as [1, 6]. For unlabeled data, the soms property can be omitted.
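To illustrate the labeling convention, the following minimal sketch reads the soms data item from a single SD record using only the Python standard library. The helper read_soms and the example record are hypothetical and not part of FAME3R. Whether the indices are zero- or one-based is not stated here, so check this against your own data.

```python
import ast
import re

# Hypothetical tail of one SD record: the data items that follow the MOL block.
EXAMPLE_RECORD = """\
M  END
>  <soms>
[1, 6]

>  <name>
example

$$$$
"""


def read_soms(record: str) -> list[int]:
    """Return the atom indices stored under the `soms` data item.

    Returns an empty list when the item is absent (unlabeled data).
    """
    # A data header looks like ">  <soms>"; its value is on the next line.
    match = re.search(r"^>\s*<soms>[^\n]*\n([^\n]*)", record, flags=re.MULTILINE)
    if match is None:
        return []
    # literal_eval safely parses the list literal without executing code.
    value = ast.literal_eval(match.group(1).strip())
    return [int(index) for index in value]


labeled = read_soms(EXAMPLE_RECORD)
unlabeled = read_soms("M  END\n$$$$\n")
print(labeled, unlabeled)
```

In practice a cheminformatics toolkit would normally be used to read SD files; the regular expression above only serves to show where the property sits in the record.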

Each of the core scripts (cv_hp_search.py, train.py, test.py, and infer.py) automatically computes FAME descriptors for each atom in the input molecules. These descriptors are saved to *_descriptors.csv files in the output directory to ensure transparency and reproducibility.

For more information on FAME descriptors, see Šícho, Martin, et al. "FAME 3: predicting the sites of metabolism in synthetic compounds and natural products for phase 1 and phase 2 metabolic enzymes." Journal of Chemical Information and Modeling 59.8 (2019): 3400-3412.

Determining the optimal hyperparameters via k-fold cross-validation

Different training datasets may require hyperparameters that differ from the provided defaults. To identify the optimal hyperparameters for your data, run k-fold cross-validation with a grid search using the cv_hp_search.py script.

The search space is defined in the param_grid dictionary within the script. After running, the script saves:

  • The best hyperparameter set (based on validation performance) to a text file.

  • The optimal binary decision threshold. The threshold is chosen to maximize the Matthews Correlation Coefficient (MCC) on the validation set, and the final value is determined by majority vote across folds.

  • The mean and standard deviation of the model’s performance metrics across all folds.
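The threshold selection described in the list above can be sketched as follows. This is one plausible reading, not the code in cv_hp_search.py: it assumes that the best threshold is picked per fold by MCC and that the most frequent per-fold choice wins. The function names, candidate thresholds, and toy fold data are all hypothetical.

```python
from collections import Counter
from math import sqrt


def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels (0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is set to 0 when the denominator vanishes.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def best_threshold(y_true, y_prob, thresholds):
    """Return the candidate threshold with the highest MCC (first one on ties)."""
    return max(thresholds, key=lambda th: mcc(y_true, [int(p >= th) for p in y_prob]))


def majority_vote(values):
    """Return the most frequent value (first seen on ties)."""
    return Counter(values).most_common(1)[0][0]


# Toy validation data for three folds: (true labels, predicted probabilities).
folds = [
    ([1, 0, 0, 1, 0], [0.80, 0.35, 0.10, 0.45, 0.20]),
    ([0, 1, 0, 0, 1], [0.25, 0.90, 0.05, 0.10, 0.35]),
    ([1, 0, 1, 0, 0], [0.60, 0.38, 0.42, 0.10, 0.20]),
]
candidates = [0.2, 0.3, 0.4, 0.5]

per_fold = [best_threshold(y, p, candidates) for y, p in folds]
chosen = majority_vote(per_fold)
print(per_fold, chosen)
```

A majority vote keeps the final threshold on the same grid as the candidates, whereas averaging the per-fold optima could yield a value that no fold actually selected.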

Note:

  • The atom environment radius is not part of the hyperparameter search but can be set via the --radius command-line argument (default: 5).

  • The number of folds used in cross-validation can be set with the --num_folds command-line argument (default: 10).

fame3r-cv-hp-search -i INPUT_FILE -o OUTPUT_FOLDER -r RADIUS[OPTIONAL, DEFAULT=5] -n NUM_FOLDS[OPTIONAL, DEFAULT=10]

Training a model

Use the train.py script to train a random forest classifier with pre-defined hyperparameters.

The trained model is saved as a .joblib file in the specified output folder.

Note:

  • This script does not perform hyperparameter optimization or radius tuning. For that, see the section "Determining the optimal hyperparameters via k-fold cross-validation".

  • You can manually adjust the model's hyperparameters in the RandomForestClassifier constructor within the script.

  • The atom environment radius can be set via the --radius command-line argument (default: 5).

fame3r-train -i INPUT_FILE -o OUTPUT_FOLDER -r RADIUS[OPTIONAL, DEFAULT=5]

Testing a trained model on labeled test data

Use the test.py script to evaluate a trained model on labeled test data.

After execution, the script saves:

  • Test performance metrics to a text file. The metrics include the area under the receiver operating characteristic curve (AUROC), the area under the precision-recall curve (average precision), the F1 score, the Matthews Correlation Coefficient (MCC), precision, recall, and the top-2 correctness rate. The top-2 correctness rate is the percentage of molecules for which at least one true site of metabolism (SOM) is ranked among the top two atoms in the molecule by predicted SOM probability.

  • Per-atom predictions (including probabilities, binary classifications, and true labels) to a CSV file.
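As a concrete reading of the top-2 correctness rate defined above, the following sketch computes it from per-atom predictions. It is an illustration under stated assumptions rather than the code in test.py: the function name and the input layout (one molecule identifier, label, and probability per atom) are hypothetical, and the result is returned as a fraction rather than a percentage.

```python
from collections import defaultdict


def top2_correctness_rate(mol_ids, y_true, y_prob):
    """Fraction of molecules with at least one true SOM among their two top-ranked atoms."""
    atoms_by_mol = defaultdict(list)
    for mol_id, label, prob in zip(mol_ids, y_true, y_prob):
        atoms_by_mol[mol_id].append((prob, label))
    if not atoms_by_mol:
        raise ValueError("no atoms provided")

    hits = 0
    for atoms in atoms_by_mol.values():
        # Rank the atoms of this molecule by predicted probability, highest first.
        top_two = sorted(atoms, key=lambda atom: atom[0], reverse=True)[:2]
        if any(label == 1 for _, label in top_two):
            hits += 1
    return hits / len(atoms_by_mol)


# Toy data: molecule A has its SOM ranked first, molecule B has its SOM ranked third.
mol_ids = ["A", "A", "A", "B", "B", "B", "B"]
y_true = [0, 1, 0, 0, 0, 1, 0]
y_prob = [0.1, 0.7, 0.4, 0.9, 0.6, 0.5, 0.2]

rate = top2_correctness_rate(mol_ids, y_true, y_prob)
print(rate)
```

Multiply the result by 100 to obtain a percentage. Unlike MCC or F1, this metric depends only on the ranking within each molecule, so it is unaffected by the decision threshold.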

Note:

  • This script performs bootstrapping to estimate the uncertainty in the metrics. The number of bootstraps can be set by changing the NUM_BOOTSTRAPS variable (default: 1000).

  • The atom environment radius can be set via the --radius command-line argument (default: 5).

  • The decision threshold can be set via the --threshold command-line argument (default: 0.3).

  • The script also computes FAME scores if the -fs flag is set. FAME scores indicate how well the test data are represented in the training data. They are defined as the average Tanimoto similarity to the three nearest neighbors in the training data, computed on FAME descriptors. The higher the score, the more trustworthy the prediction.
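The bootstrapping mentioned in the notes above can be sketched with the standard library alone. This is a generic bootstrap, not the procedure in test.py: it assumes resampling of individual atoms with replacement (the script may resample differently, for example by molecule), and the names bootstrap_metric and recall are hypothetical.

```python
import random
from statistics import mean, stdev


def recall(y_true, y_pred):
    """Fraction of true SOMs that were predicted as SOMs."""
    positives = sum(y_true)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return tp / positives if positives else 0.0


def bootstrap_metric(metric, y_true, y_pred, num_bootstraps=1000, seed=0):
    """Return the mean and standard deviation of a metric over bootstrap resamples."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    n = len(y_true)
    scores = []
    for _ in range(num_bootstraps):
        # Draw n indices with replacement and evaluate the metric on that resample.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    return mean(scores), stdev(scores)


# Toy per-atom labels and binary predictions.
y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]

mean_recall, std_recall = bootstrap_metric(recall, y_true, y_pred, num_bootstraps=200)
print(f"recall = {mean_recall:.3f} +/- {std_recall:.3f}")
```

The standard deviation across resamples serves as the uncertainty estimate; more resamples reduce the noise in that estimate at the cost of runtime.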

fame3r-test -i INPUT_FILE -m MODEL_FOLDER -o OUTPUT_FOLDER -r RADIUS[OPTIONAL, DEFAULT=5] -t THRESHOLD[OPTIONAL, DEFAULT=0.3] -fs[OPTIONAL]

Inference mode: computing the SOMs of unlabeled data

The infer.py script applies a trained model to unlabeled input data and saves the per-atom predictions to a CSV file. Each row contains the predicted SOM probability and the corresponding binary classification based on a decision threshold. If the --compute_fame_scores (-fs) flag is set, the script also computes FAME scores, which indicate how well each atom's environment is represented in the training data. These scores are calculated as the average Tanimoto similarity to the three nearest neighbors in the training set, based on FAME descriptors. The higher the score, the more trustworthy the prediction. The atom environment radius can be set via the --radius argument (default: 5), and the decision threshold via the --threshold argument (default: 0.3).
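The FAME score definition above can be illustrated with a short sketch. It assumes the continuous form of the Tanimoto coefficient, dot(a, b) / (|a|^2 + |b|^2 - dot(a, b)), applied directly to descriptor vectors; the variant and any preprocessing used by FAME3R are not specified here. The function names and toy vectors are hypothetical.

```python
def tanimoto(a, b):
    """Continuous Tanimoto coefficient between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    # Undefined for two all-zero vectors; 0.0 is an arbitrary choice in this sketch.
    return dot / denom if denom else 0.0


def fame_score(query, training, k=3):
    """Average similarity of `query` to its k most similar training vectors."""
    if not training:
        raise ValueError("training set is empty")
    similarities = sorted((tanimoto(query, t) for t in training), reverse=True)
    nearest = similarities[:k]
    return sum(nearest) / len(nearest)


# Toy descriptor vectors for four training atoms and one query atom.
training = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 1],
]
query = [1, 0, 1, 0]

score = fame_score(query, training)
print(round(score, 3))
```

Here the three nearest neighbors have similarities 1, 2/3, and 1/3, so the score is 2/3. An atom environment with no close counterpart in the training set yields a score near 0, signaling a less trustworthy prediction.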

fame3r-infer -i INPUT_FILE -m MODEL_FOLDER -o OUTPUT_FOLDER -r RADIUS[OPTIONAL, DEFAULT=5] -t THRESHOLD[OPTIONAL, DEFAULT=0.3] -fs[OPTIONAL]
