ANNalog — a SMILES-to-SMILES seq2seq model for medchem analogue generation
ANNalog is a SMILES-to-SMILES generative model for medicinal chemistry analogue design.
- Paper (ChemRxiv): https://chemrxiv.org/doi/10.26434/chemrxiv-2025-9c1v6
ANNalog is a transformer-based sequence-to-sequence (Seq2Seq) model designed to generate medicinal-chemistry-relevant analogues of an input molecule. It supports:
- local chemical-space exploration (small, SAR-like modifications), and
- scaffold hopping (changing the core scaffold while remaining chemically relevant).
Dependencies / environment (recommended)
A tested dependency set is provided in seq2seq_environment.yml in this repo (recommended for reproducibility).
Notes:
- The PyPI package does not pin or install PyTorch for you. Please install a PyTorch build that matches your system (CPU/CUDA).
- The provided conda YAML is the recommended environment for ANNalog generation and includes the required chembl-gen-check dependency. chembl-gen-check requires Python>=3.10; the provided environment targets Python 3.12.
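Because PyTorch is not installed automatically, it can help to check the environment before running generation. The snippet below is a minimal sketch using only the standard library; the import name chembl_gen_check is an assumption based on the package name, and environment_report is a hypothetical helper, not part of ANNalog.

```python
import importlib.util
import sys


def environment_report():
    """Report whether the prerequisites named in the notes above look satisfied."""
    return {
        # chembl-gen-check requires Python >= 3.10.
        "python_ok": sys.version_info >= (3, 10),
        # PyTorch must be installed separately; find_spec returns None if it is missing.
        "torch_installed": importlib.util.find_spec("torch") is not None,
        # Import name assumed from the PyPI package name.
        "chembl_gen_check_installed": importlib.util.find_spec("chembl_gen_check") is not None,
    }


if __name__ == "__main__":
    for name, ok in environment_report().items():
        print(f"{name}: {'yes' if ok else 'no'}")
```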
Google Colab page
Try generating molecules online in this Colab notebook: https://colab.research.google.com/drive/1aJhaBOG7xuYFwMGzfUmbMsLe8T462Ptc#scrollTo=Ss1QOzXjzKSP
Installation
Option A — Install from PyPI (recommended for “just use it”)
pip install annalog
After installation, you can use the installed CLI:
annalog-generate -h
Option B — Install from GitHub (recommended for development / editing code)
git clone https://github.com/DVNecromancer/ANNalog.git
cd ANNalog
Conda (recommended):
conda env create -f seq2seq_environment.yml
conda activate <env_name_from_yml>
pip install -e .
Generating molecules
You have two ways to generate:
- Installed CLI (works after pip install annalog): annalog-generate ...
- Repo script generation.py (works from a cloned repo; easy to modify)
Both share the same core options:
- decoding methods: beam, BF-beam, sampling
- exploration modes: normal, variants, recursive
- post-generation structural checking with --check / --no-check (on by default)
- TSV/CSV output
Decoding methods (what they mean)
-m beam
Classical beam search. Keeps the top-k partial sequences at each decoding step.
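As an illustration of the top-k idea, here is a minimal sketch of classical beam search over a toy next-token model. The toy_next_probs table and the "<eos>" end token are assumptions made up for this example; this is not ANNalog's network or its decoding code.

```python
import math


def toy_next_probs(prefix):
    """Toy stand-in for a model: next-token probabilities given a partial sequence."""
    if not prefix:
        return {"C": 0.7, "N": 0.3}
    if len(prefix) >= 3:
        return {"<eos>": 1.0}
    return {"C": 0.5, "O": 0.3, "<eos>": 0.2}


def beam_search(next_probs, k, max_len=10):
    """Return up to k finished sequences as (tokens, log-probability), best first."""
    beams = [((), 0.0)]  # unfinished partial sequences
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, p in next_probs(tokens).items():
                extended = (tokens + (tok,), score + math.log(p))
                if tok == "<eos>":
                    finished.append(extended)
                else:
                    candidates.append(extended)
        # Keep only the k most probable unfinished partial sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if not beams:
            break
    return sorted(finished, key=lambda c: c[1], reverse=True)[:k]


results = beam_search(toy_next_probs, k=2)
for tokens, logp in results:
    print("".join(tokens[:-1]), round(math.exp(logp), 3))
```

Partial sequences that fall outside the top k at any step are dropped for good, which is what keeps memory use bounded.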
-m BF-beam
Best-first beam search. Expands the current best partial sequence while keeping unexplored partial sequences in memory. This is usually slower and more memory-hungry than classical beam search.
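The description above matches a generic best-first search driven by a priority queue. The sketch below shows that general technique on a toy model; toy_next_probs and the "<eos>" token are made up for illustration, and ANNalog's BF-beam implementation may differ in its details.

```python
import heapq
import math


def toy_next_probs(prefix):
    """Toy stand-in for a model: next-token probabilities given a partial sequence."""
    if not prefix:
        return {"C": 0.7, "N": 0.3}
    if len(prefix) >= 3:
        return {"<eos>": 1.0}
    return {"C": 0.5, "O": 0.3, "<eos>": 0.2}


def best_first_search(next_probs, n_outputs, max_len=10):
    """Pop the most probable partial sequence, expand it, and keep the rest queued."""
    # Heap entries are (negative log-probability, tokens); heapq pops the smallest
    # entry, which is the most probable sequence currently known.
    frontier = [(0.0, ())]
    finished = []
    while frontier and len(finished) < n_outputs:
        neg_logp, tokens = heapq.heappop(frontier)
        if tokens and tokens[-1] == "<eos>":
            finished.append((tokens, -neg_logp))
            continue
        if len(tokens) >= max_len:
            continue  # too long without finishing; discard
        for tok, p in next_probs(tokens).items():
            heapq.heappush(frontier, (neg_logp - math.log(p), tokens + (tok,)))
    return finished


for tokens, logp in best_first_search(toy_next_probs, n_outputs=2):
    print("".join(tokens[:-1]), round(math.exp(logp), 3))
```

Every unexpanded partial sequence stays in the frontier, so nothing is pruned; that is the source of the extra memory use compared with a fixed-width beam.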
-m sampling
Samples each next token from the model probability distribution.
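A minimal sketch of one sampling step, assuming the common convention that logits are divided by the temperature before the softmax. The logit values and the helper names temperature_probs and sample_token are made up for this example and are not taken from ANNalog.

```python
import math
import random


def temperature_probs(logits, temperature):
    """Softmax over logits / temperature, returned as {token: probability}."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {tok: value / temperature for tok, value in logits.items()}
    top = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - top) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def sample_token(logits, temperature, rng):
    """Draw one token from the temperature-adjusted distribution."""
    probs = temperature_probs(logits, temperature)
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]


toy_logits = {"C": 2.0, "N": 1.0, "O": 0.5}  # made-up values
rng = random.Random(42)  # a fixed seed makes the draws repeatable
draws = [sample_token(toy_logits, 1.2, rng) for _ in range(5)]
print(draws)
```

A higher temperature flattens the distribution so less likely tokens are drawn more often, while a lower temperature concentrates probability on the top token.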
Exploration methods (what they mean)
-e normal (default)
Generate directly from the input SMILES.
-e variants
- Create --variant-number SMILES variants of the same molecule by randomizing atom order and writing non-canonical SMILES (i.e., different syntactic representations of the same structure).
- Run generation from each variant and pool all results.
-e recursive
Run generation in multiple rounds. In round 1 you generate from the input SMILES.
In round 2, you generate again using the round-1 outputs as new inputs, and so on for --loops rounds.
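The round structure can be sketched as a simple loop. Here generate is a hypothetical callable standing in for one model call, toy_generate is a made-up placeholder, and skipping already-seen SMILES is an assumption of this sketch rather than documented ANNalog behaviour.

```python
def recursive_generate(input_smiles, generate, loops):
    """Run `loops` rounds; each round generates from the previous round's outputs."""
    seen = {input_smiles}
    frontier = [input_smiles]
    rounds = []
    for _ in range(loops):
        outputs = []
        for smiles in frontier:
            for candidate in generate(smiles):
                # Assumption: do not re-expand strings that were already produced.
                if candidate not in seen:
                    seen.add(candidate)
                    outputs.append(candidate)
        rounds.append(outputs)
        frontier = outputs  # the next round starts from this round's outputs
    return rounds


def toy_generate(smiles):
    """Placeholder generator: appends one atom. A real run would call the model."""
    return [smiles + "C", smiles + "O"]


for i, outputs in enumerate(recursive_generate("CCO", toy_generate, loops=2), start=1):
    print(f"round {i}: {outputs}")
```

The number of inputs can grow quickly from round to round, so the per-input output count and the number of loops both affect runtime.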
Main generation options
- --temperature: sampling temperature. Higher values increase diversity; lower values make sampling more conservative. Used only with -m sampling.
- --seed: random seed for reproducible sampling. Used only with -m sampling.
- --prefix: fixed starting prefix for generation. This can be either:
  - an integer number of starting characters taken from the beginning of the current input SMILES, or
  - a literal starting string that must match the beginning of the current input SMILES.
- --keep-invalid: keep invalid generated SMILES instead of filtering them out.
- --max-length: maximum generated sequence length.
- --variant-number: number of variants to create in -e variants mode.
- --loops: number of recursive rounds in -e recursive mode.
- --check / --no-check: enable or disable chembl-gen-check annotation of generated SMILES. Checking is on by default.
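To make the two forms of --prefix concrete, here is a small sketch of how such a value could be resolved against an input SMILES. The function resolve_prefix is hypothetical, and treating an all-digit value as a character count is an assumption of this sketch; ANNalog's own parsing may differ.

```python
def resolve_prefix(prefix, input_smiles):
    """Return the literal starting string implied by a --prefix style value."""
    if prefix.isdigit():
        # Integer form: take that many characters from the start of the input.
        count = int(prefix)
        if count > len(input_smiles):
            raise ValueError("prefix length exceeds the input SMILES length")
        return input_smiles[:count]
    # Literal form: the string must match the start of the input.
    if not input_smiles.startswith(prefix):
        raise ValueError("prefix does not match the start of the input SMILES")
    return prefix


print(resolve_prefix("3", "CC(Cl)Br"))
print(resolve_prefix("CC(", "CC(Cl)Br"))
```

Both calls resolve to the same starting string, since the first three characters of the input are exactly the literal given in the second call.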
chembl-gen-check is a lightweight structural sanity-check package for rapid verification of scaffold, generic scaffold (here reported as skeleton), and ring-system precedent in ChEMBL, and it can also report structural alerts and LACAN-related uncommon-bond information.
Reference: chembl-gen-check (PyPI)
A) Using the installed CLI (PyPI / installed package)
Help:
annalog-generate -h
Quick start (single SMILES, beam, 50 outputs; checks on by default):
annalog-generate -i "CCO" -n 50 -m beam -o gen.tsv
Sampling (10 outputs):
annalog-generate -i "CC(Cl)Br" -n 10 -m sampling --temperature 1.2 --seed 42 -o gen.tsv
Variants exploration:
annalog-generate -i "CCO" -n 20 -e variants --variant-number 10 -o gen_variants.tsv
Recursive exploration (2 loops):
annalog-generate -i "CCO" -n 10 -e recursive --loops 2 -o gen_recursive.tsv
Disable structural checking:
annalog-generate -i "CCO" -n 50 -m beam --no-check -o gen_no_check.tsv
You can also invoke the same CLI via Python module form:
python -m annalog.cli -i "CCO" -n 50 -o gen.tsv
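When driving the CLI from a script, it can be convenient to assemble the argument list in Python. The helper below is a hypothetical sketch that uses only the flags shown in the examples above; it builds the command without running it.

```python
import shlex


def build_command(smiles, n, output, method="beam", check=True,
                  temperature=None, seed=None):
    """Assemble an annalog-generate argument list from the documented flags."""
    cmd = ["annalog-generate", "-i", smiles, "-n", str(n), "-m", method, "-o", output]
    if method == "sampling":
        # --temperature and --seed are used only with -m sampling.
        if temperature is not None:
            cmd += ["--temperature", str(temperature)]
        if seed is not None:
            cmd += ["--seed", str(seed)]
    if not check:
        cmd.append("--no-check")
    return cmd


cmd = build_command("CC(Cl)Br", 10, "gen.tsv", method="sampling",
                    temperature=1.2, seed=42)
print(shlex.join(cmd))  # shell-quoted form, safe to paste into a terminal
# To execute it (requires annalog to be installed):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems with SMILES characters such as parentheses.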
Resources (ckpt + vocab):
- For the installed CLI, the checkpoint + vocab are shipped inside the package and used by default.
- You can still override them if needed using --resources-dir or --checkpoint / --vocab.
B) Using the repo script (generation.py)
From the repo root (after pip install -e .), you can run:
python generation.py -h
Note about resources in the repo:
In this repository the checkpoint/vocab live under:
annalog/ckpt_and_vocab/
So when running generation.py, point it explicitly:
python generation.py \
-i "CCO" \
-n 50 \
-m beam \
--resources-dir annalog/ckpt_and_vocab \
-o gen.tsv
Disable structural checking:
python generation.py \
-i "CCO" \
-n 50 \
-m beam \
--no-check \
--resources-dir annalog/ckpt_and_vocab \
-o gen_no_check.tsv
Output format
The output file always includes these base columns:
- input_smiles
- rank (1-based)
- generated_smiles
- score
By default, structural checking is enabled, so five additional columns are also appended:
- check_scaffold
- check_skeleton
- check_ring_systems
- check_structural_alerts
- check_lacan
If you use --no-check, the output contains only the four base columns.
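A short sketch of reading such an output file with the standard library. It assumes the TSV has a header row with the column names listed above; the sample rows, including the score values, are made up for illustration and are not real model output.

```python
import csv
import io

BASE_COLUMNS = ["input_smiles", "rank", "generated_smiles", "score"]
CHECK_COLUMNS = ["check_scaffold", "check_skeleton", "check_ring_systems",
                 "check_structural_alerts", "check_lacan"]


def load_generations(handle):
    """Read a tab-separated output file; return (rows, has_check_columns)."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = list(reader)
    header = reader.fieldnames or []
    missing = [c for c in BASE_COLUMNS if c not in header]
    if missing:
        raise ValueError(f"missing base columns: {missing}")
    for row in rows:
        row["rank"] = int(row["rank"])      # rank is 1-based
        row["score"] = float(row["score"])
    has_checks = all(c in header for c in CHECK_COLUMNS)
    return rows, has_checks


# Made-up sample in the --no-check layout (four base columns only).
SAMPLE = (
    "input_smiles\trank\tgenerated_smiles\tscore\n"
    "CCO\t1\tCCCO\t-0.5\n"
    "CCO\t2\tCCOC\t-0.9\n"
)

rows, has_checks = load_generations(io.StringIO(SAMPLE))
print(len(rows), "rows; check columns present:", has_checks)
```

For a real file, replace the io.StringIO object with open("gen.tsv", newline="").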