
ANNalog — a SMILES-to-SMILES seq2seq model for medchem analogue generation


ANNalog

ANNalog is a transformer-based sequence-to-sequence (Seq2Seq) model that maps SMILES to SMILES to generate medicinal-chemistry-relevant analogues of an input molecule. It supports:

  • local chemical-space exploration (small, SAR-like modifications), and
  • scaffold hopping (changing the core scaffold while remaining chemically relevant).

Dependencies / environment (recommended)

A tested dependency set is provided in seq2seq_environment.yml in this repo (recommended for reproducibility).

Notes:

  • The PyPI package does not pin or install PyTorch for you. Please install a PyTorch build that matches your system (CPU/CUDA).
  • The provided conda YAML is the recommended environment for ANNalog generation and includes the required chembl-gen-check dependency.
  • chembl-gen-check requires Python >=3.10; the provided environment targets Python 3.12.
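
The notes above can be turned into a quick pre-flight check. The following is a minimal sketch, not part of ANNalog: the function name environment_problems is hypothetical, and the module names checked (torch, rdkit, annalog) are the import names assumed for the packages mentioned on this page.

```python
import importlib.util
import sys

# Minimum Python version, taken from the chembl-gen-check note above.
MIN_PYTHON = (3, 10)


def environment_problems(required_modules=("torch", "rdkit", "annalog")):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < MIN_PYTHON:
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ is required, "
            f"found {sys.version_info.major}.{sys.version_info.minor}"
        )
    for name in required_modules:
        # find_spec only locates a top-level module; it does not import it.
        if importlib.util.find_spec(name) is None:
            problems.append(f"module '{name}' is not installed")
    return problems


for problem in environment_problems():
    print("WARNING:", problem)
```

This only confirms that the packages can be found; it does not verify that the installed PyTorch build matches your CPU/CUDA setup.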

Google Colab page

You can try generating molecules online in this Colab notebook: https://colab.research.google.com/drive/1aJhaBOG7xuYFwMGzfUmbMsLe8T462Ptc#scrollTo=Ss1QOzXjzKSP


Installation

Option A — Install from PyPI (recommended for “just use it”)

pip install numpy==2.4.2 pandas==3.0.1 tqdm==4.67.3 torch==2.10.0 torchvision==0.25.0 rdkit==2025.09.6 scikit-learn==1.8.0 annalog

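The command lists torch and torchvision explicitly because the annalog package does not install PyTorch itself; adjust those pins if your system needs a different CPU/CUDA build.

After installation, you can use the installed CLI: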

annalog-generate -h

Option B — Install from GitHub (recommended for development / editing code)

git clone https://github.com/DVNecromancer/ANNalog.git
cd ANNalog

Conda (recommended):

conda env create -f seq2seq_environment.yml
conda activate <env_name_from_yml>
pip install -e .

Generating molecules

There are two ways to generate molecules:

  1. Installed CLI (works after pip install annalog): annalog-generate ...
  2. Repo script generation.py (works from a cloned repo; easy to modify)

Both share the same core options:

  • decoding methods: beam, BF-beam, sampling
  • exploration modes: normal, variants, recursive
  • post-generation structural checking with --check / --no-check (on by default)
  • TSV/CSV output

Decoding methods (what they mean)

-m beam

Classical beam search. Keeps the top-k partial sequences at each decoding step.
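
To make the idea concrete, here is a minimal sketch of classical beam search over a toy next-token function. It is an illustration only, not ANNalog's implementation: toy_next_token_logprobs is a made-up stand-in for the model, the tokens are plain strings, and the handling of finished sequences is one common variant among several.

```python
import math

EOS = "<eos>"


def toy_next_token_logprobs(prefix):
    """Hypothetical stand-in for the model: log-probabilities of the next token."""
    if len(prefix) >= 3:
        return {EOS: 0.0}  # force termination after three tokens
    return {"C": math.log(0.6), "O": math.log(0.3), EOS: math.log(0.1)}


def beam_search(next_logprobs, beam_width, max_length):
    """Keep the top `beam_width` partial sequences at each decoding step."""
    beams = [((), 0.0)]  # (token tuple, cumulative log-probability)
    finished = []
    for _ in range(max_length):
        # Extend every partial sequence by every possible next token.
        candidates = []
        for tokens, score in beams:
            for token, logp in next_logprobs(tokens).items():
                candidates.append((tokens + (token,), score + logp))
        candidates.sort(key=lambda item: item[1], reverse=True)
        # Prune to the beam width; sequences that emitted EOS leave the beam.
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == EOS:
                finished.append((tokens[:-1], score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.sort(key=lambda item: item[1], reverse=True)
    return finished[:beam_width]


results = beam_search(toy_next_token_logprobs, beam_width=2, max_length=5)
for tokens, score in results:
    print("".join(tokens), round(score, 3))
```

Because every step prunes to the beam width, a sequence dropped early can never be recovered later, which is the main behavioural difference from the best-first variant.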

-m BF-beam

Best-first beam search. Expands the current best partial sequence while keeping unexplored partial sequences in memory. This is usually slower and more memory-hungry than classical beam search.
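
The following sketch shows one way to realise this idea: a priority queue holds all unexplored partial sequences, the highest-scoring one is expanded next, and a per-length cap limits how many sequences of each length are expanded. It is an assumption-laden illustration rather than ANNalog's code; the toy model, the cap rule, and the names are all hypothetical.

```python
import heapq
import itertools
import math
from collections import defaultdict

EOS = "<eos>"


def toy_next_token_logprobs(prefix):
    """Hypothetical stand-in for the model: log-probabilities of the next token."""
    if len(prefix) >= 3:
        return {EOS: 0.0}
    return {"C": math.log(0.6), "O": math.log(0.3), EOS: math.log(0.1)}


def best_first_beam_search(next_logprobs, beam_width, n_results, max_length):
    """Always expand the best unexplored partial sequence; keep the rest queued."""
    tie_breaker = itertools.count()  # keeps heap ordering well-defined on equal scores
    heap = [(0.0, next(tie_breaker), ())]  # (negated log-prob, tie-breaker, tokens)
    expanded_at_length = defaultdict(int)
    finished = []
    while heap and len(finished) < n_results:
        neg_score, _, tokens = heapq.heappop(heap)
        if tokens and tokens[-1] == EOS:
            finished.append((tokens[:-1], -neg_score))
            continue
        # Assumed beam constraint: expand at most `beam_width` sequences per length.
        if len(tokens) >= max_length or expanded_at_length[len(tokens)] >= beam_width:
            continue
        expanded_at_length[len(tokens)] += 1
        for token, logp in next_logprobs(tokens).items():
            heapq.heappush(heap, (neg_score - logp, next(tie_breaker), tokens + (token,)))
    return finished


bf_results = best_first_beam_search(
    toy_next_token_logprobs, beam_width=2, n_results=2, max_length=5
)
for tokens, score in bf_results:
    print("".join(tokens), round(score, 3))
```

The queue retains every unexplored candidate until the search stops, which illustrates why this method tends to use more memory than classical beam search.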

-m sampling

Samples each next token from the model probability distribution.
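
As an illustration of how temperature and a seed interact with sampling, the sketch below draws tokens from softmax(logits / temperature) using a seeded random generator. The logits are made-up numbers and sample_token is a hypothetical helper, not an ANNalog function.

```python
import math
import random


def sample_token(logits, temperature, rng):
    """Sample one token from softmax(logits / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {token: value / temperature for token, value in logits.items()}
    peak = max(scaled.values())  # subtract the maximum for numerical stability
    weights = {token: math.exp(value - peak) for token, value in scaled.items()}
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]


toy_logits = {"C": 2.0, "O": 1.0, "N": 0.0}  # made-up scores for illustration

rng = random.Random(42)  # a fixed seed makes the draws repeatable
samples = [sample_token(toy_logits, temperature=1.2, rng=rng) for _ in range(1000)]
print({token: samples.count(token) for token in toy_logits})
```

Dividing by a temperature above 1 flattens the distribution (more diverse output), while a temperature below 1 concentrates probability on the highest-scoring tokens.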


Exploration methods (what they mean)

-e normal (default)

Generate directly from the input SMILES.

-e variants

  1. Create --variant-number SMILES variants of the same molecule by randomizing atom order and writing non-canonical SMILES (i.e., different syntactic representations of the same structure).
  2. Run generation from each variant and pool all results.
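
The pooling step can be sketched as follows. This is a hypothetical illustration: the generate callable stands in for a model call, the outputs and scores are invented, and both the "higher score is better" ordering and the deduplication by exact SMILES string are assumptions rather than documented ANNalog behaviour.

```python
def pool_variant_results(variants, generate):
    """Run `generate` on each variant and pool results, keeping the best score per SMILES."""
    best = {}
    for variant in variants:
        for smiles, score in generate(variant):
            # Assumption: a higher score is better.
            if smiles not in best or score > best[smiles]:
                best[smiles] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Three different SMILES strings for the same molecule (ethanol),
# each mapped to made-up generation results.
fake_outputs = {
    "CCO": [("CCN", -0.5), ("CCS", -1.2)],
    "OCC": [("CCN", -0.4), ("CCCl", -2.0)],
    "C(O)C": [("CCS", -1.5)],
}

pooled = pool_variant_results(list(fake_outputs), fake_outputs.__getitem__)
for smiles, score in pooled:
    print(smiles, score)
```

In practice the generated strings would need to be canonicalised (for example with RDKit) before deduplication, since two different strings can encode the same molecule.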

-e recursive

Run generation in multiple rounds. Round 1 generates from the input SMILES. Round 2 generates again using the round-1 outputs as new inputs, and so on for --loops rounds.
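
The round structure can be sketched like this. The generate callable is a toy stand-in for the model, and skipping molecules that were already produced is an assumption made for the example, not a documented ANNalog behaviour.

```python
def recursive_generate(seed_smiles, generate, loops):
    """Round 1 starts from the seed; each later round uses the previous round's outputs."""
    inputs = [seed_smiles]
    seen = {seed_smiles}
    rounds = []
    for _ in range(loops):
        outputs = []
        for smiles in inputs:
            for candidate in generate(smiles):
                if candidate not in seen:  # assumption: skip repeats
                    seen.add(candidate)
                    outputs.append(candidate)
        rounds.append(outputs)
        inputs = outputs  # the next round starts from this round's outputs
    return rounds


def toy_generate(smiles):
    """Toy stand-in for the model: appends one atom to the input string."""
    return [smiles + "C", smiles + "O"]


rounds = recursive_generate("CCO", toy_generate, loops=2)
for number, outputs in enumerate(rounds, start=1):
    print(f"round {number}: {outputs}")
```

Since every output becomes an input for the next round, the number of model calls can grow multiplicatively with the number of loops, so small values of -n and --loops are a sensible starting point.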


Main generation options

  • --temperature: sampling temperature. Higher values increase diversity; lower values make sampling more conservative. Used only with -m sampling.
  • --seed: random seed for reproducible sampling. Used only with -m sampling.
  • --prefix: fixed starting prefix for generation. This can be either:
    • an integer number of starting characters taken from the beginning of the current input SMILES, or
    • a literal starting string that must match the beginning of the current input SMILES.
  • --keep-invalid: keep invalid generated SMILES instead of filtering them out.
  • --max-length: maximum generated sequence length.
  • --variant-number: number of variants to create in -e variants mode.
  • --loops: number of recursive rounds in -e recursive mode.
  • --check / --no-check: enable or disable chembl-gen-check annotation of generated SMILES. Checking is on by default.
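
The two forms of --prefix described above can be illustrated with a small helper. This is a sketch of the described behaviour only: resolve_prefix is a hypothetical function, and treating an all-digit value as a character count is an assumption about how the two forms are told apart.

```python
def resolve_prefix(prefix, input_smiles):
    """Turn a --prefix style value into the literal starting string."""
    if prefix is None or prefix == "":
        return ""
    text = str(prefix)
    if text.isascii() and text.isdigit():
        # Integer form: take that many characters from the start of the input.
        count = int(text)
        if count > len(input_smiles):
            raise ValueError("prefix length exceeds the input SMILES length")
        return input_smiles[:count]
    # Literal form: the string must match the beginning of the input.
    if not input_smiles.startswith(text):
        raise ValueError(f"prefix {text!r} does not match the start of {input_smiles!r}")
    return text


print(resolve_prefix(3, "CC(Cl)Br"))
print(resolve_prefix("CC(", "CC(Cl)Br"))
```

Note that counting by characters can cut through a two-character atom symbol such as Cl or Br, so a literal prefix may be the safer choice near such atoms.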

chembl-gen-check is a lightweight structural sanity-check package. It rapidly verifies whether a molecule's scaffold, generic scaffold (reported here as skeleton), and ring systems have precedent in ChEMBL, and it can also report structural alerts and LACAN-related uncommon-bond information.

Reference: chembl-gen-check (PyPI)


A) Using the installed CLI (PyPI / installed package)

Help:

annalog-generate -h

Quick start (single SMILES, beam, 50 outputs; checks on by default):

annalog-generate -i "CCO" -n 50 -m beam -o gen.tsv

Sampling (10 outputs):

annalog-generate -i "CC(Cl)Br" -n 10 -m sampling --temperature 1.2 --seed 42 -o gen.tsv

Variants exploration:

annalog-generate -i "CCO" -n 20 -e variants --variant-number 10 -o gen_variants.tsv

Recursive exploration (2 loops):

annalog-generate -i "CCO" -n 10 -e recursive --loops 2 -o gen_recursive.tsv

Disable structural checking:

annalog-generate -i "CCO" -n 50 -m beam --no-check -o gen_no_check.tsv

You can also invoke the same CLI via Python module form:

python -m annalog.cli -i "CCO" -n 50 -o gen.tsv

Resources (ckpt + vocab):

  • For the installed CLI, the checkpoint + vocab are shipped inside the package and used by default.
  • You can still override them if needed using --resources-dir or --checkpoint/--vocab.
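
When driving the CLI from a script, it can help to assemble the argument list programmatically. The sketch below uses only the flags shown on this page; build_annalog_command is a hypothetical helper, not part of the annalog package, and it builds the command without running it.

```python
import shlex


def build_annalog_command(smiles, n, output, method="beam", exploration="normal",
                          temperature=None, seed=None,
                          variant_number=None, loops=None, check=True):
    """Assemble an annalog-generate argument list from the documented options."""
    if method not in ("beam", "BF-beam", "sampling"):
        raise ValueError(f"unknown decoding method: {method}")
    if exploration not in ("normal", "variants", "recursive"):
        raise ValueError(f"unknown exploration mode: {exploration}")
    cmd = ["annalog-generate", "-i", smiles, "-n", str(n),
           "-m", method, "-e", exploration]
    # --temperature and --seed are documented as sampling-only options.
    if method == "sampling":
        if temperature is not None:
            cmd += ["--temperature", str(temperature)]
        if seed is not None:
            cmd += ["--seed", str(seed)]
    if exploration == "variants" and variant_number is not None:
        cmd += ["--variant-number", str(variant_number)]
    if exploration == "recursive" and loops is not None:
        cmd += ["--loops", str(loops)]
    if not check:
        cmd.append("--no-check")
    cmd += ["-o", output]
    return cmd


cmd = build_annalog_command("CC(Cl)Br", 10, "gen.tsv", method="sampling",
                            temperature=1.2, seed=42)
print(shlex.join(cmd))
# To actually run it (requires annalog to be installed):
# import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems with SMILES characters such as parentheses, brackets, and backslashes.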

B) Using the repo script (generation.py)

From the repo root (after pip install -e .), you can run:

python generation.py -h

Note about resources in the repo:
In this repository the checkpoint/vocab live under:

annalog/ckpt_and_vocab/

So when running generation.py, point it to that directory explicitly:

python generation.py \
  -i "CCO" \
  -n 50 \
  -m beam \
  --resources-dir annalog/ckpt_and_vocab \
  -o gen.tsv

Disable structural checking:

python generation.py \
  -i "CCO" \
  -n 50 \
  -m beam \
  --no-check \
  --resources-dir annalog/ckpt_and_vocab \
  -o gen_no_check.tsv

Output format

The output file always includes these base columns:

  • input_smiles
  • rank (1-based)
  • generated_smiles
  • score

By default, structural checking is enabled, so five additional columns are also appended:

  • check_scaffold
  • check_skeleton
  • check_ring_systems
  • check_structural_alerts
  • check_lacan

If you use --no-check, the output contains only the four base columns.
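
A generated file can be loaded with the standard library alone. The sketch below assumes a tab-separated file with a header row that uses the column names listed above, and a numeric score column; the sample rows are invented for illustration and read_generation_table is a hypothetical helper.

```python
import csv
import io

BASE_COLUMNS = ["input_smiles", "rank", "generated_smiles", "score"]
CHECK_COLUMNS = ["check_scaffold", "check_skeleton", "check_ring_systems",
                 "check_structural_alerts", "check_lacan"]


def read_generation_table(handle, delimiter="\t"):
    """Read generation output; return (rows, has_checks)."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    fieldnames = reader.fieldnames or []
    missing = [column for column in BASE_COLUMNS if column not in fieldnames]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # The check columns are present only when checking was enabled.
    has_checks = all(column in fieldnames for column in CHECK_COLUMNS)
    rows = []
    for row in reader:
        row["rank"] = int(row["rank"])      # rank is 1-based
        row["score"] = float(row["score"])  # assumption: score is numeric
        rows.append(row)
    return rows, has_checks


# Made-up sample in the --no-check layout (four base columns only).
sample = (
    "input_smiles\trank\tgenerated_smiles\tscore\n"
    "CCO\t1\tCCN\t-0.4\n"
    "CCO\t2\tCCS\t-1.2\n"
)
rows, has_checks = read_generation_table(io.StringIO(sample))
print(len(rows), has_checks)
```

For a real file, replace the StringIO object with open("gen.tsv", newline=""), and pass delimiter="," for CSV output.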

Download files

Source Distribution

annalog-1.0.6.tar.gz (15.2 MB)

  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.6
  • SHA256: edb427c901b989749c11261ff6d72d91ffc53f8e8b51f42a4d5754d76407750e
  • MD5: 89b5c0395bde8485bb30c55182799592
  • BLAKE2b-256: 042b917842d7092b2c3dd236316f8fe015a9cfffdcfdd4a73e4d9733e9ab911a

Built Distribution

annalog-1.0.6-py3-none-any.whl (15.2 MB)

  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.6
  • SHA256: 49cf13abf67b5cfed1b8b9071bfebe357964583ea1e22d6ed17a1310deaf54d8
  • MD5: a2dcc89894148f07311592f2eaf4d7c5
  • BLAKE2b-256: 3f3097bffb8aec0585e3119a216d96f91a666614746f0051cb7fbb33f97b74fe