Marginalized Graph Kernel Library for Molecular Property Prediction

mgktools

mgktools is a Python package for molecular property prediction using Marginalized Graph Kernels (MGK). It provides a comprehensive framework for training Gaussian Process and Support Vector Machine models on molecular datasets, with built-in support for hyperparameter optimization and model interpretability.

Features

  • Graph Kernel Methods: Marginalized Graph Kernel (MGK) for molecular similarity computation
  • Multiple Model Types: Gaussian Process Regression (GPR), Gaussian Process Classification (GPC), and SVM models
  • Hyperparameter Optimization: Bayesian optimization via Optuna and gradient-based optimization
  • Model Interpretability: Atomic and molecular attribution for understanding predictions
  • Flexible Molecular Representations: Support for 13+ fingerprint types and molecular descriptors
  • Scalable Methods: Nyström approximation and local expert models for large datasets
  • Cross-Validation: Built-in k-fold, leave-one-out, and Monte Carlo cross-validation

Installation

Requirements

  • Python == 3.12
  • GCC == 9 or 11
  • CUDA >= 11 (graph kernel computation has no CPU implementation)
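
The requirements above can be checked before installing. The following is a minimal sketch using only the standard library; the helper name `check_environment` is hypothetical and not part of mgktools. It only tests whether `gcc` and `nvcc` are on `PATH`, so it does not confirm the GCC or CUDA version.

```python
import shutil
import sys


def check_environment():
    """Report which of the listed requirements appear to be met (hypothetical helper)."""
    return {
        # The requirements list names Python 3.12 as the interpreter version.
        "python_3_12": sys.version_info[:2] == (3, 12),
        # Presence on PATH only; the installed versions are not inspected.
        "gcc_on_path": shutil.which("gcc") is not None,
        "nvcc_on_path": shutil.which("nvcc") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```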

Install from PyPI

# Install graphdot dependency (required)
pip install git+https://gitlab.com/Xiangyan93/graphdot.git@v0.8.2

# Install mgktools from PyPI
pip install mgktools

Install from Source

# Install graphdot dependency (required)
pip install git+https://gitlab.com/Xiangyan93/graphdot.git@v0.8.2

git clone https://github.com/Xiangyan93/mgktools.git
cd mgktools
pip install -e .

Quick Start with Google Colab Tutorial

A GPU is required to compute graph kernels. The interactive Google Colab tutorial lets you try mgktools without a local GPU.

Command-Line Tools

mgktools provides several CLI commands:

mgk_cache_data

Pre-cache graph objects and molecular features for faster processing.

mgk_cache_data --data_paths data.csv --smiles_columns smiles --cache_graph --cache_path cache.pkl
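
The commands in this section read a CSV file through the `--smiles_columns` and `--targets_columns` options. As a sketch, a small `data.csv` with a `smiles` and a `target` column could be produced as follows. The helper name `write_example_csv` is hypothetical, and the five rows are the toy values from the Basic Usage example, not real measurements.

```python
import csv
from pathlib import Path


def write_example_csv(path):
    """Write a tiny SMILES/target table using the column names from the commands."""
    rows = [
        ("CCO", 1.0),
        ("CCC", 2.0),
        ("CCCC", 3.0),
        ("CCCCC", 4.0),
        ("CCCCCC", 5.0),
    ]
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["smiles", "target"])  # header row
        writer.writerows(rows)
    return Path(path)


if __name__ == "__main__":
    print(write_example_csv("data.csv"))
```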

mgk_kernel_calc

Compute and save kernel matrices.

mgk_kernel_calc --save_dir output/ --data_path data.csv --smiles_columns smiles \
    --graph_kernel_type graph --graph_hyperparameters additive-pnorm.json

mgk_cross_validation

Run cross-validation experiments.

mgk_cross_validation --save_dir output/ --data_path data.csv --smiles_columns smiles \
    --targets_columns target --graph_kernel_type graph --graph_hyperparameters additive-pnorm.json \
    --model_type gpr --task_type regression --cross_validation kFold --n_splits 5 \
    --alpha 0.01 --metric rmse
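
To illustrate what `--cross_validation kFold --n_splits 5 --metric rmse` refers to, here is a minimal pure-Python sketch of k-fold splitting and RMSE scoring. It is not the mgktools implementation: the function names are hypothetical, the "model" simply predicts the training mean, and how mgktools aggregates scores across folds is not assumed here.

```python
import math
import random


def k_fold_indices(n_samples, n_splits, seed=0):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Strided slices give folds whose sizes differ by at most one.
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i, test in enumerate(folds):
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train, test


def rmse(y_true, y_pred):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Stand-in model: predict the mean of the training targets for every test sample.
targets = [float(i) for i in range(1, 11)]
scores = []
for train, test in k_fold_indices(len(targets), n_splits=5):
    mean = sum(targets[j] for j in train) / len(train)
    scores.append(rmse([targets[j] for j in test], [mean] * len(test)))

print(f"mean RMSE over {len(scores)} folds: {sum(scores) / len(scores):.3f}")
```

Setting `n_splits` equal to the number of samples turns the same splitter into leave-one-out.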

mgk_optuna

Bayesian hyperparameter optimization with Optuna.

mgk_optuna --save_dir output/ --data_path data.csv --smiles_columns smiles \
    --targets_columns target --graph_kernel_type graph --graph_hyperparameters additive-pnorm.json \
    --model_type gpr --task_type regression --cross_validation leave-one-out \
    --alpha 0.01 --alpha_bounds 0.001 0.1 --metric rmse --num_iters 100

mgk_embedding

Compute molecular embeddings (t-SNE or kPCA).

mgk_embedding --save_dir output/ --data_path data.csv --smiles_columns smiles \
    --targets_columns target --graph_kernel_type graph --graph_hyperparameters additive-pnorm.json \
    --embedding_algorithm tSNE --n_components 2 --perplexity 30
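
When these commands are launched from a Python script, the argument list can be assembled programmatically. The sketch below is a hypothetical helper (`build_command` is not part of mgktools) that only reuses the flags shown in the examples above. It builds the command but does not run it.

```python
import shlex


def build_command(tool, **options):
    """Turn keyword options into an argument list for one of the mgk_* commands."""
    args = [tool]
    for name, value in options.items():
        flag = f"--{name}"
        if value is False:
            continue  # switch turned off: emit nothing
        if value is True:
            args.append(flag)  # bare switch such as --cache_graph
        elif isinstance(value, (list, tuple)):
            args.extend([flag, *map(str, value)])  # multi-value flag such as --alpha_bounds
        else:
            args.extend([flag, str(value)])
    return args


command = build_command(
    "mgk_optuna",
    save_dir="output/",
    data_path="data.csv",
    smiles_columns="smiles",
    targets_columns="target",
    graph_kernel_type="graph",
    graph_hyperparameters="additive-pnorm.json",
    model_type="gpr",
    task_type="regression",
    cross_validation="leave-one-out",
    alpha=0.01,
    alpha_bounds=(0.001, 0.1),
    metric="rmse",
    num_iters=100,
)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where mgktools is installed.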

Basic Usage

import pandas as pd
from mgktools.data.data import Dataset
from mgktools.kernels.utils import get_kernel_config
from mgktools.evaluators.cross_validation import Evaluator

# Load data from DataFrame
df = pd.DataFrame({
    'smiles': ['CCO', 'CCC', 'CCCC', 'CCCCC', 'CCCCCC'],
    'target': [1.0, 2.0, 3.0, 4.0, 5.0]
})

# Create dataset
dataset = Dataset.from_df(
    df,
    smiles_columns=['smiles'],
    targets_columns=['target']
)

# Set up for graph kernel computation
dataset.set_status(graph_kernel_type='graph')
dataset.create_graphs(n_jobs=4)
dataset.unify_datatype()

# Configure kernel (additive kernel with p-normalization);
# the JSON path below is relative to the repository root
kernel_config = get_kernel_config(
    dataset=dataset,
    graph_kernel_type='graph',
    mgk_hyperparameters_files=['mgktools/hyperparameters/configs/additive-pnorm.json']
)

# Run cross-validation
evaluator = Evaluator(
    dataset=dataset,
    model_type='gpr',
    task_type='regression',
    kernel_config=kernel_config,
    split_type='random',
    n_splits=5,
    metrics=['rmse', 'r2']
)
results = evaluator.run_cross_validation()

Using Molecular Fingerprints

from mgktools.features_mol.features_generators import FeaturesGenerator

# Create feature generator
fg = FeaturesGenerator('morgan', radius=2, num_bits=2048)

# Set up the dataset from Basic Usage with molecular features
dataset.set_status(
    graph_kernel_type='graph',
    features_generators=[fg],
    features_combination='concat'
)
dataset.create_graphs(n_jobs=4)
dataset.create_features_mol(n_jobs=4)

Hyperparameter Optimization with Optuna

from mgktools.hyperparameters.optuna import bayesian_optimization

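# train_dataset and val_dataset are Dataset objects prepared as in Basic Usage;
# kernel_config is the object returned by get_kernel_config.
# Run Bayesian optimization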
best_params, best_score = bayesian_optimization(
    dataset_train=train_dataset,
    dataset_val=val_dataset,
    kernel_config=kernel_config,
    model_type='gpr',
    task_type='regression',
    metric='rmse',
    n_trials=100
)

Package Structure

mgktools/
├── data/               # Dataset handling and caching
│   ├── data.py        # CachedDict, Datapoint, Dataset classes
│   └── split.py       # Train/val/test splitting utilities
├── kernels/           # Kernel implementations
│   ├── base.py        # MicroKernel, BaseKernelConfig
│   ├── GraphKernel.py # MGK and GraphKernelConfig
│   ├── FeatureKernel.py
│   ├── HybridKernel.py
│   └── utils.py       # get_kernel_config factory
├── models/            # Model implementations
│   ├── regression/    # GPR, NLE, ensemble models
│   └── classification/# GPC, SVM classifiers
├── evaluators/        # Evaluation utilities
│   ├── cross_validation.py  # Evaluator class
│   └── metric.py      # Metric computation
├── features_mol/      # Molecular feature generators
│   └── features_generators.py
├── graph/             # Graph conversion utilities
│   └── hashgraph.py   # HashGraph class
├── interpret/         # Model interpretability
│   └── interpret.py   # Atomic/molecular attribution
├── hyperparameters/   # Optimization and configs
│   ├── optuna.py      # Bayesian optimization
│   └── configs/       # Pre-defined kernel configs
└── exe/               # CLI entry points
    └── run.py         # Command-line tools

Supported Feature Generators

Name                  Description                        Default Size
morgan                Binary Morgan fingerprint          2048
morgan_count          Count-based Morgan fingerprint     2048
rdkit_2d              RDKit 2D descriptors               ~200
rdkit_2d_normalized   Normalized RDKit 2D descriptors    ~200
rdkit_208             RDKit 208 descriptors              ~210
rdkit_topol           RDKit topological fingerprint      2048
layered               Layered fingerprint                2048
torsion               Topological torsion fingerprint    2048
atom_pair             Atom pair fingerprint              2048
avalon                Avalon fingerprint                 2048
avalon_count          Count-based Avalon fingerprint     2048
maccskey              MACCS keys fingerprint             167
pattern               Pattern fingerprint                2048
Hyperparameter Configurations

Pre-defined kernel configurations are available in mgktools/hyperparameters/configs/:

  • Additive kernels: additive-norm.json, additive-pnorm.json, additive-msnorm.json
  • Product kernels: product-norm.json, product-pnorm.json, product-msnorm.json
  • Feature kernels: rbf.json, dot_product.json
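
As a sketch of how these files might be discovered from a source checkout, the helper below groups the JSON files in a directory by the prefix before the first hyphen. `group_configs` is hypothetical, the grouping rule is inferred from the file names listed above, and the contents of the JSON files are not inspected.

```python
from pathlib import Path


def group_configs(config_dir):
    """Group *.json files by kernel family, inferred from the file name prefix."""
    groups = {"additive": [], "product": [], "feature": []}
    for path in sorted(Path(config_dir).glob("*.json")):
        family = path.stem.split("-")[0]
        # Names without an additive/product prefix (rbf, dot_product) are
        # treated as feature kernels, following the list above.
        groups[family if family in groups else "feature"].append(path.name)
    return groups


if __name__ == "__main__":
    # Path taken from the text above; adjust it to where the package lives.
    for family, names in group_configs("mgktools/hyperparameters/configs").items():
        print(family, names)
```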

Testing

# Run all tests
pytest tests/

# Run specific test file
pytest tests/kernel/test_kernel.py

# Run with verbose output
pytest tests/ -v

License

This project is licensed under the MIT License; see the LICENSE file for details.