ANNalog — a SMILES-to-SMILES seq2seq model for medchem analogue generation

ANNalog

ANNalog is a transformer-based, SMILES-to-SMILES sequence-to-sequence (Seq2Seq) model that generates medicinal-chemistry-relevant analogues of an input molecule. It supports:

  • local chemical-space exploration (small, SAR-like modifications), and
  • scaffold hopping (changing the core scaffold while remaining chemically relevant).

Dependencies / environment (recommended)

A tested dependency set is provided in seq2seq_environment.yml in this repo (recommended for reproducibility).

Notes:

  • The PyPI package does not pin or install PyTorch for you. Please install a PyTorch build that matches your system (CPU/CUDA).
  • The provided conda YAML is the recommended environment for ANNalog generation and includes the required chembl-gen-check dependency.
  • chembl-gen-check requires Python >=3.10; the provided environment targets Python 3.12.

Google Colab page

Try generating molecules online in this Colab notebook: https://colab.research.google.com/drive/1aJhaBOG7xuYFwMGzfUmbMsLe8T462Ptc#scrollTo=Ss1QOzXjzKSP


Installation

Option A — Install from PyPI (recommended for “just use it”)

pip install annalog

After installation, you can use the installed CLI:

annalog-generate -h

Option B — Install from GitHub (recommended for development / editing code)

git clone https://github.com/DVNecromancer/ANNalog.git
cd ANNalog

Conda (recommended):

conda env create -f seq2seq_environment.yml
conda activate <env_name_from_yml>
pip install -e .

Generating molecules

You have two ways to generate:

  1. Installed CLI (works after pip install annalog): annalog-generate ...
  2. Repo script generation.py (works from a cloned repo; easy to modify)

Both share the same core options:

  • decoding methods: beam, BF-beam, sampling
  • exploration modes: normal, variants, recursive
  • post-generation structural checking with --check / --no-check (on by default)
  • TSV/CSV output

Decoding methods (what they mean)

-m beam

Classical beam search. Keeps the top-k partial sequences at each decoding step.
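
To make the "top-k partial sequences" idea concrete, here is a minimal sketch of classical beam search over a toy next-token table. It is not ANNalog's implementation: TOY_MODEL, next_token_probs, beam_search, and the probabilities are all hypothetical and exist only for illustration.

```python
import math

# Hypothetical toy "model": maps a prefix (tuple of tokens) to next-token
# probabilities. "$" marks end of sequence. All numbers are made up.
TOY_MODEL = {
    (): {"C": 0.6, "N": 0.4},
    ("C",): {"C": 0.5, "O": 0.3, "$": 0.2},
    ("N",): {"C": 0.9, "$": 0.1},
}


def next_token_probs(prefix):
    # Prefixes missing from the toy table simply end the sequence.
    return TOY_MODEL.get(prefix, {"$": 1.0})


def beam_search(k, max_len=4):
    """Return up to k finished sequences as (log-probability, tokens)."""
    beams = [(0.0, ())]  # partial sequences kept at the current step
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            for tok, prob in next_token_probs(tokens).items():
                cand = (score + math.log(prob), tokens + (tok,))
                if tok == "$":
                    finished.append(cand)
                else:
                    candidates.append(cand)
        # The defining step: keep only the k best partial sequences.
        beams = sorted(candidates, reverse=True)[:k]
        if not beams:
            break
    return sorted(finished, reverse=True)[:k]


for log_prob, tokens in beam_search(k=2):
    print("".join(tokens), round(math.exp(log_prob), 3))
```

Because pruning happens at every step, a sequence whose early tokens score poorly can be dropped even if its final probability would have been high. That is the trade-off the other decoding methods address differently.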

-m BF-beam

Best-first beam search. Expands the current best partial sequence while keeping unexplored partial sequences in memory. This is usually slower and more memory-hungry than classical beam search.
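
The following is a generic best-first search sketch using a priority queue, shown under the assumption that BF-beam behaves roughly like this; the README does not describe ANNalog's actual data structures. TOY_MODEL, next_token_probs, and best_first_search are hypothetical names with made-up probabilities.

```python
import heapq
import math

# Hypothetical toy "model": prefix (tuple of tokens) -> next-token
# probabilities. "$" marks end of sequence. All numbers are made up.
TOY_MODEL = {
    (): {"C": 0.6, "N": 0.4},
    ("C",): {"C": 0.5, "O": 0.3, "$": 0.2},
    ("N",): {"C": 0.9, "$": 0.1},
}


def next_token_probs(prefix):
    # Prefixes missing from the toy table simply end the sequence.
    return TOY_MODEL.get(prefix, {"$": 1.0})


def best_first_search(n, max_len=4):
    """Return up to n finished sequences as (log-probability, tokens)."""
    # heapq is a min-heap, so store the negative log-probability.
    heap = [(0.0, ())]
    results = []
    while heap and len(results) < n:
        neg_score, tokens = heapq.heappop(heap)  # current best candidate
        if tokens and tokens[-1] == "$":
            results.append((-neg_score, tokens))
            continue
        if len(tokens) >= max_len:
            continue  # too long without terminating; discard
        # Expand the best partial sequence. Every other partial sequence
        # stays in the heap, which is where the extra memory goes.
        for tok, prob in next_token_probs(tokens).items():
            heapq.heappush(heap, (neg_score - math.log(prob), tokens + (tok,)))
    return results


for log_prob, tokens in best_first_search(n=3):
    print("".join(tokens), round(math.exp(log_prob), 3))
```

Nothing is pruned here, so the heap can grow with every expansion, which is consistent with the higher memory use noted above.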

-m sampling

Samples each next token from the model probability distribution.
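
As an illustration of what sampling with a temperature and a seed typically means, here is a small sketch. It assumes the common convention of dividing logits by the temperature before the softmax; the README does not confirm that ANNalog does exactly this. TOY_LOGITS, softmax_with_temperature, and sample_token are hypothetical.

```python
import math
import random

# Hypothetical next-token logits for a single decoding step (made up).
TOY_LOGITS = {"C": 2.0, "N": 1.0, "O": 0.0}


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after scaling by 1/temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {tok: value / temperature for tok, value in logits.items()}
    top = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(value - top) for tok, value in scaled.items()}
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}


def sample_token(logits, temperature, rng):
    """Draw one token from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Lower temperature concentrates probability on the top token;
# higher temperature flattens the distribution.
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(TOY_LOGITS, temperature)
    print(temperature, {tok: round(p, 3) for tok, p in probs.items()})

# A seeded generator makes the draw repeatable.
print(sample_token(TOY_LOGITS, 1.2, random.Random(42)))
```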


Exploration methods (what they mean)

-e normal (default)

Generate directly from the input SMILES.

-e variants

  1. Create --variant-number SMILES variants of the same molecule by randomizing atom order and writing non-canonical SMILES (i.e., different syntactic representations of the same structure).
  2. Run generation from each variant and pool all results.
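
Step 1 can be sketched with RDKit's randomized SMILES output. This assumes RDKit is installed; the README does not say which toolkit ANNalog uses internally, and smiles_variants is a hypothetical helper, not part of the package.

```python
try:
    from rdkit import Chem  # assumption: RDKit is available
except ImportError:
    Chem = None


def smiles_variants(smiles, n, max_tries=1000):
    """Return up to n distinct non-canonical SMILES for one molecule."""
    if Chem is None:
        raise RuntimeError("RDKit is required for this sketch")
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"invalid SMILES: {smiles!r}")
    variants = set()
    for _ in range(max_tries):
        if len(variants) >= n:
            break
        # doRandom=True randomizes the atom order used to write the string.
        variants.add(Chem.MolToSmiles(mol, canonical=False, doRandom=True))
    return sorted(variants)


if Chem is not None:
    for variant in smiles_variants("CC(Cl)Br", 5):
        print(variant)
```

Small molecules have only a few distinct SMILES strings, so the helper may return fewer than n variants.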

-e recursive

Run generation in multiple rounds. Round 1 generates from the input SMILES. Round 2 generates again using the round-1 outputs as new inputs, and so on for --loops rounds.
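
The round structure can be sketched as a simple loop. Here toy_generate is a stand-in for the model (it only appends characters and is not chemically meaningful), and skipping already-seen SMILES is an assumption of this sketch rather than documented ANNalog behaviour.

```python
def toy_generate(smiles, n):
    # Hypothetical stand-in for the model: returns n placeholder "analogues".
    return [smiles + suffix for suffix in ("C", "N", "O")][:n]


def recursive_generate(seed, generate, loops, n):
    """Return one list of newly generated SMILES per round."""
    seen = {seed}
    frontier = [seed]  # inputs for the current round
    rounds = []
    for _ in range(loops):
        new_outputs = []
        for smiles in frontier:
            for out in generate(smiles, n):
                if out not in seen:  # assumption: skip duplicates
                    seen.add(out)
                    new_outputs.append(out)
        rounds.append(new_outputs)
        frontier = new_outputs  # outputs of this round feed the next one
    return rounds


for round_number, outputs in enumerate(recursive_generate("CCO", toy_generate, loops=2, n=2), start=1):
    print(round_number, outputs)
```

Each round multiplies the number of inputs by up to n, so the total amount of generation grows quickly with --loops.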


Main generation options

  • --temperature: sampling temperature. Higher values increase diversity; lower values make sampling more conservative. Used only with -m sampling.
  • --seed: random seed for reproducible sampling. Used only with -m sampling.
  • --prefix: fixed starting prefix for generation. This can be either:
    • an integer number of starting characters taken from the beginning of the current input SMILES, or
    • a literal starting string that must match the beginning of the current input SMILES.
  • --keep-invalid: keep invalid generated SMILES instead of filtering them out.
  • --max-length: maximum generated sequence length.
  • --variant-number: number of variants to create in -e variants mode.
  • --loops: number of recursive rounds in -e recursive mode.
  • --check / --no-check: enable or disable chembl-gen-check annotation of generated SMILES. Checking is on by default.
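
The two accepted forms of --prefix can be read as follows. This is a sketch of the rule as described in the list above, not the package's own parsing code; resolve_prefix is a hypothetical helper, and raising ValueError on a mismatch is an assumption.

```python
def resolve_prefix(prefix, input_smiles):
    """Return the literal starting string implied by a --prefix value."""
    if prefix.isdecimal():
        # Integer form: take that many leading characters of the input.
        count = int(prefix)
        if count > len(input_smiles):
            raise ValueError("prefix length exceeds input SMILES length")
        return input_smiles[:count]
    # Literal form: the string must match the start of the input.
    if not input_smiles.startswith(prefix):
        raise ValueError("prefix does not match the start of the input SMILES")
    return prefix


print(resolve_prefix("3", "CC(Cl)Br"))
print(resolve_prefix("CC(", "CC(Cl)Br"))
```

Note that the integer form counts characters, so a two-character element symbol such as Cl can be split if the count ends in the middle of it.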

chembl-gen-check is a lightweight structural sanity-check package. It checks whether a molecule's scaffold, generic scaffold (reported here as skeleton), and ring systems have precedent in ChEMBL, and it can also report structural alerts and LACAN-related uncommon-bond information.

Reference: chembl-gen-check (PyPI)


A) Using the installed CLI (PyPI / installed package)

Help:

annalog-generate -h

Quick start (single SMILES, beam, 50 outputs; checks on by default):

annalog-generate -i "CCO" -n 50 -m beam -o gen.tsv

Sampling (10 outputs):

annalog-generate -i "CC(Cl)Br" -n 10 -m sampling --temperature 1.2 --seed 42 -o gen.tsv

Variants exploration:

annalog-generate -i "CCO" -n 20 -e variants --variant-number 10 -o gen_variants.tsv

Recursive exploration (2 loops):

annalog-generate -i "CCO" -n 10 -e recursive --loops 2 -o gen_recursive.tsv

Disable structural checking:

annalog-generate -i "CCO" -n 50 -m beam --no-check -o gen_no_check.tsv

You can also invoke the same CLI via Python module form:

python -m annalog.cli -i "CCO" -n 50 -o gen.tsv
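
If you want to drive the CLI from a Python script, one option is to build the argument list and hand it to subprocess. The sketch below uses only flags documented in this README; build_command and run_generation are hypothetical helpers, not part of the annalog package.

```python
import shlex
import subprocess
import sys


def build_command(smiles, n, out_path, method="beam", check=True):
    """Build the argument list for `python -m annalog.cli`."""
    cmd = [
        sys.executable, "-m", "annalog.cli",
        "-i", smiles,
        "-n", str(n),
        "-m", method,
        "-o", out_path,
    ]
    if not check:
        cmd.append("--no-check")  # checking is on by default
    return cmd


def run_generation(smiles, n, out_path, **options):
    # Requires annalog to be installed in the current environment.
    subprocess.run(build_command(smiles, n, out_path, **options), check=True)


# Print the command instead of running it, so the sketch has no side effects.
print(shlex.join(build_command("CCO", 50, "gen.tsv")))
```

Passing the arguments as a list avoids shell quoting problems with SMILES characters such as parentheses, = and #.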

Resources (ckpt + vocab):

  • For the installed CLI, the checkpoint + vocab are shipped inside the package and used by default.
  • You can still override them if needed using --resources-dir or --checkpoint/--vocab.

B) Using the repo script (generation.py)

From the repo root (after pip install -e .), you can run:

python generation.py -h

Note about resources in the repo:
In this repository the checkpoint/vocab live under:

annalog/ckpt_and_vocab/

So when running generation.py, point it at that directory explicitly:

python generation.py \
  -i "CCO" \
  -n 50 \
  -m beam \
  --resources-dir annalog/ckpt_and_vocab \
  -o gen.tsv

Disable structural checking:

python generation.py \
  -i "CCO" \
  -n 50 \
  -m beam \
  --no-check \
  --resources-dir annalog/ckpt_and_vocab \
  -o gen_no_check.tsv

Output format

The output file always includes these base columns:

  • input_smiles
  • rank (1-based)
  • generated_smiles
  • score

By default, structural checking is enabled, so five additional columns are also appended:

  • check_scaffold
  • check_skeleton
  • check_ring_systems
  • check_structural_alerts
  • check_lacan

If you use --no-check, the output contains only the four base columns.
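
A minimal sketch for loading the output with the standard library follows. It assumes the file starts with a header row using the column names listed above; read_generations is a hypothetical helper, and SAMPLE_TSV contains made-up placeholder rows, not real model output.

```python
import csv
import io

BASE_COLUMNS = ["input_smiles", "rank", "generated_smiles", "score"]
CHECK_COLUMNS = [
    "check_scaffold",
    "check_skeleton",
    "check_ring_systems",
    "check_structural_alerts",
    "check_lacan",
]


def read_generations(handle, delimiter="\t"):
    """Read an output file; return (rows, has_check_columns)."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    fieldnames = reader.fieldnames or []
    missing = [name for name in BASE_COLUMNS if name not in fieldnames]
    if missing:
        raise ValueError(f"missing base columns: {missing}")
    has_checks = all(name in fieldnames for name in CHECK_COLUMNS)
    rows = []
    for row in reader:
        row["rank"] = int(row["rank"])  # rank is 1-based
        rows.append(row)
    return rows, has_checks


# Placeholder content in the --no-check layout (values are made up).
SAMPLE_TSV = (
    "input_smiles\trank\tgenerated_smiles\tscore\n"
    "CCO\t1\tCCN\t-0.10\n"
    "CCO\t2\tCCC\t-0.25\n"
)

rows, has_checks = read_generations(io.StringIO(SAMPLE_TSV))
print(len(rows), has_checks)
```

For a real file, pass an open file object instead, for example open("gen.tsv", newline=""), and use delimiter="," for CSV output. The score and check columns are left as strings because their exact value formats are not specified here.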
