SynFrag: A Synthetic Accessibility Predictor Based on Fragment Assembly Autoregressive Pretraining
SynFrag: Synthetic Accessibility via Fragment Assembly Generation
Predict the synthetic accessibility of molecules like an experienced synthetic chemist
🎯 What Makes SynFrag Different
SynFrag predicts synthetic accessibility using a pretraining strategy in which the model generates molecules by autoregressive fragment assembly. Unlike traditional approaches that learn synthesis patterns directly, SynFrag first learns the fundamentals of molecular construction, that is, how molecules are assembled from fragments, and then applies this knowledge to predict synthetic accessibility.
Two-Stage Learning:
- Stage 1: Pretrain on 9.2M unlabeled molecules to learn molecular assembly patterns
- Stage 2: Finetune on 800K labeled molecules for synthetic accessibility prediction
This mirrors human chemical intuition: experienced chemists understand molecular construction before assessing synthetic difficulty.
✨ Key Features
- Easy Integration - Simple CSV input/output format
- Batch Prediction - One-click synthetic accessibility scoring
- High Accuracy - Achieves state-of-the-art accuracy, AUROC, and specificity on multiple test sets
🌐 Online Service
Synthetic accessibility prediction in the cloud. Upload a CSV file of SMILES and receive synthetic accessibility scores in seconds.
🚀 Quick Start
1. Installation
# Clone repository
git clone https://github.com/simmzx/SynFrag.git
cd SynFrag
# Create environment and install dependencies
conda create -n SynFrag python=3.8
conda activate SynFrag
pip install -r requirements.txt
2. Prepare Data
Create a CSV file with a "smiles" column:
| molecule_id | smiles |
|---|---|
| Palbociclib | CC1=C(C(=O)N(C2=NC(=NC=C12)NC3=NC=C(C=C3)N4CCNCC4)C5CCCC5)C(=O)C |
| (+)-Eburnamonine | [C@]12(C3=C4CCN1CCC[C@@]2(CC(=O)N3C1C4=CC=CC=1)CC)[H] |
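The input file can also be written programmatically. The following is a minimal sketch using Python's standard csv module. The helper name write_input_csv is hypothetical and not part of SynFrag; the column names follow the table above, and example.csv is the file name used in the prediction command.

```python
import csv

def write_input_csv(path, molecules):
    """Write (molecule_id, smiles) pairs to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["molecule_id", "smiles"])  # header expected by the "smiles" field
        writer.writerows(molecules)

# The two example molecules from the table above
molecules = [
    ("Palbociclib", "CC1=C(C(=O)N(C2=NC(=NC=C12)NC3=NC=C(C=C3)N4CCNCC4)C5CCCC5)C(=O)C"),
    ("(+)-Eburnamonine", "[C@]12(C3=C4CCN1CCC[C@@]2(CC(=O)N3C1C4=CC=CC=1)CC)[H]"),
]
write_input_csv("example.csv", molecules)
```

Using csv.writer rather than manual string joining means any field that happens to contain a comma or quote is escaped correctly.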
3. Run Prediction
CSV File Mode
python synfrag.py --input_file example.csv
Direct SMILES Mode
# Single molecule
python synfrag.py --smiles "CCO"
# Multiple molecules
python synfrag.py --smiles "CCO" "CC(=O)O" "c1ccccc1"
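To call the Direct SMILES Mode from another Python program, the command line above can be assembled and launched with subprocess. This is a sketch under assumptions: build_smiles_command and run_synfrag are hypothetical helper names, synfrag.py is assumed to be in the current working directory, and only the --smiles flag shown above is used.

```python
import subprocess
import sys

def build_smiles_command(smiles_list, script="synfrag.py"):
    """Build the argument list for Direct SMILES Mode."""
    if not smiles_list:
        raise ValueError("at least one SMILES string is required")
    # sys.executable is the interpreter running this code, so the call
    # stays inside the active conda environment.
    return [sys.executable, script, "--smiles", *smiles_list]

def run_synfrag(smiles_list):
    """Run synfrag.py on the given SMILES; raises CalledProcessError on failure."""
    return subprocess.run(build_smiles_command(smiles_list), check=True)

# Example (not executed here): run_synfrag(["CCO", "CC(=O)O", "c1ccccc1"])
```

Passing the arguments as a list avoids shell quoting problems with characters such as parentheses and "=" that are common in SMILES.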
4. View Results
The output file contains a SynFrag score for each molecule:
| molecule_id | smiles | synfrag |
|---|---|---|
| Palbociclib | CC1=C(C(=O)N(C2=NC(=NC=C12)NC3=NC=C(C=C3)N4CCNCC4)C5CCCC5)C(=O)C | 0.9453 |
| (+)-Eburnamonine | [C@]12(C3=C4CCN1CCC[C@@]2(CC(=O)N3C1C4=CC=CC=1)CC)[H] | 0.0286 |
SynFrag Interpretation:
- Close to 1: Easy to synthesize
- Close to 0: Hard to synthesize
- Threshold 0.5: Binary classification cutoff
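The interpretation rules above can be applied to the output with a few lines of Python. This is a minimal sketch: label_scores is a hypothetical helper, the inline sample stands in for the real output file, and treating a score of exactly 0.5 as "easy" is an assumption, since the source only states that 0.5 is the cutoff.

```python
import csv
import io

def label_scores(rows, threshold=0.5):
    """Add an 'easy'/'hard' label to each row based on its synfrag score."""
    labeled = []
    for row in rows:
        score = float(row["synfrag"])
        # Assumption: scores at or above the threshold count as easy.
        label = "easy" if score >= threshold else "hard"
        labeled.append({**row, "label": label})
    return labeled

# Inline sample copied from the results table above, standing in for the output file
sample_output = io.StringIO(
    "molecule_id,smiles,synfrag\n"
    "Palbociclib,CC1=C(C(=O)N(C2=NC(=NC=C12)NC3=NC=C(C=C3)N4CCNCC4)C5CCCC5)C(=O)C,0.9453\n"
    "(+)-Eburnamonine,[C@]12(C3=C4CCN1CCC[C@@]2(CC(=O)N3C1C4=CC=CC=1)CC)[H],0.0286\n"
)
labeled = label_scores(csv.DictReader(sample_output))
for row in labeled:
    print(row["molecule_id"], row["synfrag"], row["label"])
```

With the two example molecules, Palbociclib (0.9453) is labeled easy and (+)-Eburnamonine (0.0286) is labeled hard.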
📖 Advanced Usage
Custom Pretraining and Finetuning Tasks
Pretrain Model
python synfrag_pretrain.py \
--dataset smiles.txt \
--vocab fragment.txt
Note: smiles.txt contains unlabeled molecules. fragment.txt is the fragment vocabulary generated from smiles.txt by ./scripts/utils/mol/cls.py and is used for fragment assembly autoregressive pretraining.
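If the unlabeled molecules currently live in a CSV file, a plain-text SMILES file can be derived from it. The sketch below assumes that smiles.txt holds one SMILES string per line, which the source does not state explicitly; csv_to_smiles_txt is a hypothetical helper and not part of SynFrag.

```python
import csv

def csv_to_smiles_txt(csv_path, txt_path, column="smiles"):
    """Copy unique, non-empty SMILES from a CSV column to a text file.

    Assumes one SMILES per line in the output. Returns the number written.
    """
    seen = set()
    ordered = []
    with open(csv_path, newline="", encoding="utf-8") as src:
        for row in csv.DictReader(src):
            smiles = (row.get(column) or "").strip()
            if smiles and smiles not in seen:  # skip blanks and exact duplicates
                seen.add(smiles)
                ordered.append(smiles)
    with open(txt_path, "w", encoding="utf-8") as dst:
        for smiles in ordered:
            dst.write(smiles + "\n")
    return len(ordered)

# Example (not executed here): csv_to_smiles_txt("example.csv", "smiles.txt")
```

Deduplication here is by exact string match only; two different SMILES strings for the same molecule are both kept.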
Finetune Model
python synfrag_finetune.py \
--input_model_file gnn_pretrained.pth \
--dataset dataset.csv
Note: gnn_pretrained.pth is the model saved during the pretraining stage. dataset.csv contains labeled molecules for finetuning on a specific downstream task.
🔧 Requirements
- Python 3.8-3.10
- CUDA-enabled GPU (recommended)
- Key dependencies: PyTorch, RDKit, DGL, DeepChem
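Before running predictions, the requirements above can be checked from Python. This is a rough sketch: the helper names are hypothetical, and the import names torch, rdkit, dgl, and deepchem are the usual module names for the listed packages rather than something the source specifies. It does not check for a GPU.

```python
import importlib.util
import sys

# Usual import names for PyTorch, RDKit, DGL, and DeepChem (assumed)
REQUIRED_MODULES = ["torch", "rdkit", "dgl", "deepchem"]

def python_supported(version_info=sys.version_info):
    """Return True if the interpreter is within the stated 3.8-3.10 range."""
    return (3, 8) <= tuple(version_info[:2]) <= (3, 10)

def missing_modules(names=REQUIRED_MODULES):
    """Return the names that cannot be located, without importing them."""
    return [name for name in names if importlib.util.find_spec(name) is None]

print("Python version supported:", python_supported())
print("Missing packages:", missing_modules() or "none")
```

find_spec only locates a package and does not import it, so the check stays fast even for heavy libraries.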
📄 Citation
If this program is useful to you, please cite our paper:
📧 Contact
For questions, please contact: Xiang Zhang (Email: zhangxiang@simm.ac.cn)
🌟 Like this project? Give us a Star
File details
Details for the file synfrag-1.0.0.tar.gz.
File metadata
- Download URL: synfrag-1.0.0.tar.gz
- Upload date:
- Size: 14.5 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 645e2db03ad4012073de0dec21b2a05c32232135dd672d74a9c7b85812fa5fca |
| MD5 | 678b98af16249df06e530362ae066523 |
| BLAKE2b-256 | 078987678768043df42af64acf15b17cfc722d5cb8322f1af11bb16cef017b2f |
File details
Details for the file synfrag-1.0.0-py3-none-any.whl.
File metadata
- Download URL: synfrag-1.0.0-py3-none-any.whl
- Upload date:
- Size: 14.5 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bae31b4f481f5d5ee3bca01144bcfbeeda0a023d809cbc5b80b1bbd63e13a566 |
| MD5 | 883f675e79bd12fe477bdac9286ca814 |
| BLAKE2b-256 | 2df97cf571b773c41d71328b8f449657e7c4ef7fd1bc1af5dd56862a592a30a3 |