SynFrag: A Synthetic Accessibility Predictor Based on Fragment Assembly Autoregressive Pretraining

SynFrag: Synthetic Accessibility via Fragment Assembly Generation

Predict the synthetic accessibility of molecules like an experienced synthetic chemist

🎯 What Makes SynFrag Different

SynFrag predicts synthetic accessibility using a pretraining strategy in which the model learns to generate molecules by autoregressive fragment assembly. Unlike traditional approaches that learn synthesis patterns directly, SynFrag first learns the fundamentals of molecular construction (how molecules are assembled from fragments) and then applies this knowledge to predict synthetic accessibility.

Two-Stage Learning:

  • Stage 1: Pretrain on 9.2M unlabeled molecules to learn molecular assembly patterns
  • Stage 2: Finetune on 800K labeled molecules for synthetic accessibility prediction

This mirrors human chemical intuition: experienced chemists understand molecular construction before assessing synthetic difficulty.

✨ Key Features

  • Easy Integration - Simple CSV input/output format
  • Batch Prediction - One-click synthetic accessibility scoring
  • High Accuracy - Achieves state-of-the-art performance on multiple test sets, measured by accuracy, AUROC, and specificity

🌐 Online Service

Instant synthetic accessibility prediction in the cloud. Upload a CSV file with SMILES and receive synthetic accessibility scores in seconds.

🚀 Quick Start

1. Installation

    # Clone repository
    git clone https://github.com/simmzx/SynFrag.git
    cd SynFrag

    # Create environment and install dependencies
    conda create -n SynFrag python=3.8
    conda activate SynFrag
    pip install -r requirements.txt

2. Prepare Data

Create a CSV file with a "smiles" column:

    molecule_id,smiles
    Palbociclib,CC1=C(C(=O)N(C2=NC(=NC=C12)NC3=NC=C(C=C3)N4CCNCC4)C5CCCC5)C(=O)C
    (+)-Eburnamonine,[C@]12(C3=C4CCN1CCC[C@@]2(CC(=O)N3C1C4=CC=CC=1)CC)[H]
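
If you are generating the input programmatically, the file can be written with Python's standard csv module. This is a minimal sketch that assumes a comma-delimited file with a header row; write_smiles_csv is a hypothetical helper, not part of SynFrag, and the file name example.csv is taken from the prediction step.

```python
import csv

def write_smiles_csv(path, molecules):
    """Write (molecule_id, smiles) pairs to a CSV file with a header row.

    Hypothetical helper, not part of SynFrag. Assumes comma-delimited input.
    """
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["molecule_id", "smiles"])
        writer.writerows(molecules)

molecules = [
    ("Palbociclib", "CC1=C(C(=O)N(C2=NC(=NC=C12)NC3=NC=C(C=C3)N4CCNCC4)C5CCCC5)C(=O)C"),
    ("(+)-Eburnamonine", "[C@]12(C3=C4CCN1CCC[C@@]2(CC(=O)N3C1C4=CC=CC=1)CC)[H]"),
]
write_smiles_csv("example.csv", molecules)
```

Using csv.writer rather than manual string joining means any field that happens to contain a comma or quote is escaped correctly.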

3. Run Prediction

CSV File Mode

    python synfrag.py --input_file example.csv

Direct SMILES Mode

    # Single molecule
    python synfrag.py --smiles "CCO"
    # Multiple molecules
    python synfrag.py --smiles "CCO" "CC(=O)O" "c1ccccc1"
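
To call either mode from a Python script, one option is to build the argument list and hand it to subprocess. The sketch below uses only the --input_file and --smiles flags shown above; build_synfrag_command and run_synfrag are hypothetical helper names, and it assumes synfrag.py is in the current working directory.

```python
import subprocess
import sys

def build_synfrag_command(input_file=None, smiles=None, script="synfrag.py"):
    """Return the argument list for one synfrag.py run (hypothetical helper)."""
    # Exactly one of the two modes must be selected.
    if (input_file is None) == (not smiles):
        raise ValueError("pass exactly one of input_file or smiles")
    command = [sys.executable, script]
    if input_file is not None:
        command += ["--input_file", input_file]
    else:
        command += ["--smiles", *smiles]
    return command

def run_synfrag(**kwargs):
    """Run the prediction script and raise CalledProcessError on failure."""
    return subprocess.run(build_synfrag_command(**kwargs), check=True)

# Show the arguments that would be passed, without running the model.
print(build_synfrag_command(smiles=["CCO", "CC(=O)O", "c1ccccc1"])[1:])
```

Passing the arguments as a list avoids shell quoting problems, which matters because SMILES strings often contain parentheses, brackets, and other characters the shell would otherwise interpret.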

4. View Results

The output file adds a "synfrag" column with the predicted score for each molecule:

    molecule_id,smiles,synfrag
    Palbociclib,CC1=C(C(=O)N(C2=NC(=NC=C12)NC3=NC=C(C=C3)N4CCNCC4)C5CCCC5)C(=O)C,0.9453
    (+)-Eburnamonine,[C@]12(C3=C4CCN1CCC[C@@]2(CC(=O)N3C1C4=CC=CC=1)CC)[H],0.0286

SynFrag Interpretation:

  • Close to 1: Easy to synthesize
  • Close to 0: Hard to synthesize
  • Threshold 0.5: Binary classification cutoff
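
The interpretation above can be applied to an output file in a few lines. This sketch assumes a comma-delimited file with the molecule_id, smiles, and synfrag columns, and it treats a score exactly at 0.5 as easy, which is an assumption since the boundary case is not specified here. label_scores is a hypothetical helper, and the inline text stands in for a real output file.

```python
import csv
import io

THRESHOLD = 0.5  # binary classification cutoff stated above

def label_scores(rows, threshold=THRESHOLD):
    """Attach an 'easy' or 'hard' label to each row based on its synfrag score."""
    labeled = []
    for row in rows:
        score = float(row["synfrag"])
        # Assumption: a score exactly at the threshold counts as easy.
        label = "easy" if score >= threshold else "hard"
        labeled.append((row["molecule_id"], score, label))
    return labeled

# Inline stand-in for the output file, using the two example molecules.
output_csv = io.StringIO(
    "molecule_id,smiles,synfrag\n"
    "Palbociclib,CC1=C(C(=O)N(C2=NC(=NC=C12)NC3=NC=C(C=C3)N4CCNCC4)C5CCCC5)C(=O)C,0.9453\n"
    "(+)-Eburnamonine,[C@]12(C3=C4CCN1CCC[C@@]2(CC(=O)N3C1C4=CC=CC=1)CC)[H],0.0286\n"
)
results = label_scores(csv.DictReader(output_csv))
for name, score, label in results:
    print(f"{name}: {score:.4f} -> {label}")
```

To process a real file, replace the io.StringIO object with an open file handle, for example open(path, newline="").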

📖 Advanced Usage

Custom Pretraining and Finetuning

Pretrain Model

    python synfrag_pretrain.py \
        --dataset smiles.txt \
        --vocab fragment.txt 

Note: smiles.txt contains the unlabeled molecules. fragment.txt is the fragment vocabulary used for fragment assembly autoregressive pretraining; it is generated from smiles.txt by ./scripts/utils/mol/cls.py.

Finetune Model

    python synfrag_finetune.py \
        --input_model_file gnn_pretrained.pth \
        --dataset dataset.csv

Note: gnn_pretrained.pth is the model saved during the pretraining stage. dataset.csv contains the labeled molecules used to finetune the model on a specific downstream task.

🔧 Requirements

  • Python 3.8-3.10
  • CUDA-enabled GPU (recommended)
  • Key dependencies: PyTorch, RDKit, DGL, DeepChem
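
Before running predictions, it can help to confirm that the environment matches these requirements. The check below is a rough sketch: the import names torch, rdkit, dgl, and deepchem are the usual module names for the listed packages and are assumed here, and both functions are hypothetical helpers rather than part of SynFrag.

```python
import importlib.util
import sys

SUPPORTED_VERSIONS = {(3, 8), (3, 9), (3, 10)}
# Assumed import names for the key dependencies listed above.
DEPENDENCIES = {"PyTorch": "torch", "RDKit": "rdkit", "DGL": "dgl", "DeepChem": "deepchem"}

def is_supported_python(version=None):
    """Return True if the (major, minor) version is within 3.8 to 3.10."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) in SUPPORTED_VERSIONS

def missing_dependencies():
    """Return the names of listed packages that cannot be found for import."""
    return [
        name
        for name, module in DEPENDENCIES.items()
        if importlib.util.find_spec(module) is None
    ]

print("Python supported:", is_supported_python())
print("Missing packages:", missing_dependencies() or "none")
```

The check only looks for the packages on the import path without importing them, so it does not verify versions or GPU availability.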

📄 Citation

If this program is useful to you, please cite our paper:

📧 Contact

For questions, please contact: Xiang Zhang (Email: zhangxiang@simm.ac.cn)


🌟 Like this project? Give us a Star
