ANNalog — a SMILES-to-SMILES seq2seq model for medchem analogue generation

ANNalog

ANNalog is a transformer-based sequence-to-sequence (Seq2Seq) model for medicinal chemistry analogue design. It takes the SMILES of an input molecule and generates SMILES for medicinal-chemistry-relevant analogues of it. It supports:

  • local chemical-space exploration (small, SAR-like modifications), and
  • scaffold hopping (changing the core scaffold while remaining chemically relevant).

Dependencies / environment (recommended)

A tested dependency set is provided in seq2seq_environment.yml in this repo (recommended for reproducibility).

Notes:

  • The PyPI package does not pin or install PyTorch for you. Please install a PyTorch build that matches your system (CPU/CUDA).
  • The provided conda YAML is the recommended environment for ANNalog generation and includes the required chembl-gen-check dependency.
  • chembl-gen-check requires Python >=3.10; the provided environment targets Python 3.12.

Google Colab page

You can try generating molecules online in this Colab notebook: https://colab.research.google.com/drive/1aJhaBOG7xuYFwMGzfUmbMsLe8T462Ptc#scrollTo=Ss1QOzXjzKSP


Installation

Option A — Install from PyPI (recommended for “just use it”)

pip install annalog

After installation, you can use the installed CLI:

annalog-generate -h

Option B — Install from GitHub (recommended for development / editing code)

git clone https://github.com/DVNecromancer/ANNalog.git
cd ANNalog

Conda (recommended):

conda env create -f seq2seq_environment.yml
conda activate <env_name_from_yml>
pip install -e .

Generating molecules

You have two ways to generate:

  1. Installed CLI (works after pip install annalog): annalog-generate ...
  2. Repo script generation.py (works from a cloned repo; easy to modify)

Both share the same core options:

  • decoding methods: beam, BF-beam, sampling
  • exploration modes: normal, variants, recursive
  • optional post-generation annotation with --cgc
  • TSV/CSV output

Decoding methods (what they mean)

-m beam

Classical beam search. Keeps the top-k partial sequences at each decoding step.
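
As an illustration of the idea, here is a minimal beam search sketch in Python. It is not ANNalog's implementation: the toy "model" (TOY_PROBS) is hypothetical and only looks at the last token, whereas a real decoder conditions on the whole prefix and on the encoded input SMILES.

```python
import math

# Hypothetical toy model: next-token probabilities given only the last token.
TOY_PROBS = {
    "<s>": {"C": 0.7, "N": 0.3},
    "C": {"O": 0.5, "</s>": 0.3, "C": 0.2},
    "N": {"C": 0.9, "</s>": 0.1},
    "O": {"</s>": 1.0},
}


def beam_search(k, max_length):
    """Return up to k finished sequences as (string, log-probability) pairs."""
    beams = [(0.0, ["<s>"])]  # (cumulative log-probability, tokens so far)
    finished = []
    for _ in range(max_length):
        # Extend every surviving partial sequence by one token.
        candidates = []
        for score, tokens in beams:
            for tok, p in TOY_PROBS[tokens[-1]].items():
                candidates.append((score + math.log(p), tokens + [tok]))
        # Keep only the k best extensions; everything else is pruned.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:k]:
            if tokens[-1] == "</s>":
                finished.append((score, tokens))
            else:
                beams.append((score, tokens))
        if not beams:
            break
    finished.sort(key=lambda c: c[0], reverse=True)
    # Strip the start and end markers before returning.
    return [("".join(tokens[1:-1]), score) for score, tokens in finished[:k]]


for smiles, logp in beam_search(k=2, max_length=6):
    print(smiles, round(math.exp(logp), 3))
```

The pruning is greedy: a sequence that falls outside the top k at any step is dropped for good. In this toy model with k=2, the complete sequence "C" (probability 0.21) is pruned at the second step, while "NCO" (probability 0.135) survives and is returned.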

-m BF-beam

Best-first beam search. Expands the current best partial sequence while keeping unexplored partial sequences in memory. This is usually slower and more memory-hungry than classical beam search.
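
The following Python sketch shows one way a best-first search can work, using a priority queue of partial sequences. It is an illustration under stated assumptions, not ANNalog's code: TOY_PROBS is a hypothetical stand-in for the model, and details such as pruning of the queue are omitted.

```python
import heapq
import math

# Hypothetical toy model: next-token probabilities given only the last token.
TOY_PROBS = {
    "<s>": {"C": 0.7, "N": 0.3},
    "C": {"O": 0.5, "</s>": 0.3, "C": 0.2},
    "N": {"C": 0.9, "</s>": 0.1},
    "O": {"</s>": 1.0},
}


def best_first_search(n, max_length):
    """Return up to n finished sequences as (string, log-probability) pairs."""
    # heapq is a min-heap, so store the negative log-probability as the key.
    heap = [(0.0, ["<s>"])]
    results = []
    while heap and len(results) < n:
        neg_logp, tokens = heapq.heappop(heap)  # current best partial sequence
        if tokens[-1] == "</s>":
            results.append(("".join(tokens[1:-1]), -neg_logp))
            continue
        if len(tokens) - 1 >= max_length:
            continue  # too long without finishing: discard
        # Expand only the best sequence; all others stay in the heap.
        for tok, p in TOY_PROBS[tokens[-1]].items():
            heapq.heappush(heap, (neg_logp - math.log(p), tokens + [tok]))
    return results


for smiles, logp in best_first_search(n=3, max_length=6):
    print(smiles, round(math.exp(logp), 3))
```

Because unexplored partial sequences are never thrown away, the heap can grow large, which is consistent with the higher memory use noted above.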

-m sampling

Samples each next token from the model probability distribution.
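
To make the roles of --temperature and --seed concrete, here is a small sketch of temperature-scaled sampling in Python. It shows the standard technique (divide logits by the temperature, apply softmax, draw from the result); whether ANNalog applies temperature in exactly this way is an assumption, and the function names and toy logits are made up for illustration.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_tokens(logits, temperature, seed, n):
    """Draw n token indices; the same seed gives the same draws."""
    rng = random.Random(seed)
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=n)


toy_logits = [2.0, 1.0, 0.0]  # hypothetical scores for three tokens
for t in (0.5, 1.0, 2.0):
    # Lower temperature sharpens the distribution; higher flattens it.
    print(t, [round(p, 3) for p in softmax_with_temperature(toy_logits, t)])
print(sample_tokens(toy_logits, temperature=1.2, seed=42, n=10))
```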


Exploration methods (what they mean)

-e normal (default)

Generate directly from the input SMILES.

-e variants

  1. Create --variant-number SMILES variants of the same molecule by randomizing atom order and writing non-canonical SMILES (i.e., different syntactic representations of the same structure).
  2. Run generation from each variant and pool all results.
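
The pooling in step 2 can be sketched as follows. This is a hedged illustration only: the generate callable and its canned outputs are hypothetical, and keeping the best score per duplicate is an assumption about how pooled results might be merged, not a description of ANNalog's behaviour. The three input strings are different SMILES spellings of ethanol.

```python
def pool_results(variants, generate):
    """Run generate() on each variant and merge the outputs.

    generate is a hypothetical callable returning (smiles, score) pairs.
    Duplicates are merged by keeping the highest score (an assumption).
    """
    best = {}
    for variant in variants:
        for smiles, score in generate(variant):
            if smiles not in best or score > best[smiles]:
                best[smiles] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Canned, made-up outputs standing in for a real model call.
FAKE_OUTPUTS = {
    "CCO": [("CCN", -0.5), ("CCC", -1.0)],
    "OCC": [("CCN", -0.3), ("CCS", -2.0)],
    "C(O)C": [("CCC", -1.5)],
}

pooled = pool_results(["CCO", "OCC", "C(O)C"], lambda s: FAKE_OUTPUTS[s])
print(pooled)
```

This sketch compares outputs as plain strings. Two different SMILES strings for the same molecule would only be merged if they were canonicalized first, for example with a cheminformatics toolkit.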

-e recursive

Run generation in multiple rounds. Round 1 generates from the input SMILES; round 2 generates again using the round-1 outputs as new inputs; and so on, for --loops rounds in total.
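
The round structure described above can be sketched in a few lines of Python. The generate callable is a hypothetical stand-in for a model call, and skipping already-seen outputs is an assumption made here to keep the example small; ANNalog may handle duplicates differently.

```python
def recursive_explore(input_smiles, generate, loops):
    """Return one list of new outputs per round.

    generate is a hypothetical callable mapping one SMILES to a list of
    generated SMILES. Outputs already seen are skipped (an assumption).
    """
    seen = {input_smiles}
    frontier = [input_smiles]  # inputs for the current round
    rounds = []
    for _ in range(loops):
        new_outputs = []
        for smiles in frontier:
            for out in generate(smiles):
                if out not in seen:
                    seen.add(out)
                    new_outputs.append(out)
        rounds.append(new_outputs)
        frontier = new_outputs  # outputs of this round feed the next one
    return rounds


def toy_generate(smiles):
    # Stand-in for the model: appends an atom to the string.
    return [smiles + "C", smiles + "O"]


for i, outputs in enumerate(recursive_explore("CCO", toy_generate, loops=2), 1):
    print(f"round {i}: {outputs}")
```

The number of inputs can grow with every round, so the total work may increase quickly as --loops goes up.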


Main generation options

  • --temperature: sampling temperature. Higher values increase diversity; lower values make sampling more conservative. Used only with -m sampling.
  • --seed: random seed for reproducible sampling. Used only with -m sampling.
  • --prefix: fixed starting prefix for generation. This can be either:
    • an integer number of starting characters taken from the beginning of the input SMILES, or
    • a literal starting string that must match the beginning of the input SMILES.
  • --keep-invalid: keep invalid generated SMILES instead of filtering them out.
  • --max-length: maximum generated sequence length.
  • --variant-number: number of variants to create in -e variants mode.
  • --loops: number of recursive rounds in -e recursive mode.
  • --cgc: run chembl-gen-check on each generated SMILES after generation and append five extra output columns:
    • cgc_scaffold
    • cgc_skeleton
    • cgc_ring_systems
    • cgc_structural_alerts
    • cgc_lacan

chembl-gen-check is a lightweight structural sanity-check package. It rapidly checks whether a molecule's scaffold, generic scaffold (reported here as skeleton), and ring systems have precedent in ChEMBL. It can also report structural alerts and LACAN-related uncommon-bond information.

Reference: chembl-gen-check (PyPI)
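
As a reading aid for the two accepted forms of --prefix listed above, here is a small Python sketch. The function resolve_prefix is hypothetical and not part of ANNalog; it only mirrors the documented rule (an integer means "take that many leading characters", anything else must match the start of the input).

```python
def resolve_prefix(input_smiles, prefix):
    """Turn a --prefix style value into the literal starting string.

    Hypothetical helper: mirrors the documented rule, not ANNalog's code.
    """
    if prefix.isdigit():
        # Integer form: take that many characters from the start.
        count = int(prefix)
        if count > len(input_smiles):
            raise ValueError("prefix length exceeds the input SMILES length")
        return input_smiles[:count]
    # Literal form: must match the beginning of the input SMILES.
    if not input_smiles.startswith(prefix):
        raise ValueError("prefix does not match the start of the input SMILES")
    return prefix


print(resolve_prefix("CC(Cl)Br", "2"))
print(resolve_prefix("CC(Cl)Br", "CC("))
```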


A) Using the installed CLI (PyPI / installed package)

Help:

annalog-generate -h

Quick start (single SMILES, beam, 50 outputs):

annalog-generate -i "CCO" -n 50 -m beam -o gen.tsv

Sampling (10 outputs):

annalog-generate -i "CC(Cl)Br" -n 10 -m sampling --temperature 1.2 --seed 42 -o gen.tsv

Variants exploration:

annalog-generate -i "CCO" -n 20 -e variants --variant-number 10 -o gen_variants.tsv

Recursive exploration (2 loops):

annalog-generate -i "CCO" -n 10 -e recursive --loops 2 -o gen_recursive.tsv

Generation with chembl-gen-check annotation:

annalog-generate -i "CCO" -n 50 -m beam --cgc -o gen_with_cgc.tsv

You can also invoke the same CLI via Python module form:

python -m annalog.cli -i "CCO" -n 50 -o gen.tsv
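
If you want to drive the CLI from a Python script, one option is to build the argument list and pass it to subprocess. The sketch below uses only flags documented above; the helper names build_command and run_generation are made up for this example, and run_generation assumes annalog-generate is on your PATH.

```python
import shlex
import subprocess


def build_command(smiles, n, out_path, method="beam",
                  temperature=None, seed=None, cgc=False):
    """Build an annalog-generate argument list from documented flags."""
    cmd = ["annalog-generate", "-i", smiles, "-n", str(n),
           "-m", method, "-o", out_path]
    if method == "sampling":
        # --temperature and --seed are only used with -m sampling.
        if temperature is not None:
            cmd += ["--temperature", str(temperature)]
        if seed is not None:
            cmd += ["--seed", str(seed)]
    if cgc:
        cmd.append("--cgc")
    return cmd


def run_generation(cmd):
    """Run the CLI; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True)


cmd = build_command("CC(Cl)Br", 10, "gen.tsv",
                    method="sampling", temperature=1.2, seed=42)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
# run_generation(cmd)   # uncomment to actually run the generation
```

Passing the arguments as a list avoids shell quoting problems with characters such as parentheses in SMILES.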

Resources (checkpoint + vocab):

  • For the installed CLI, the checkpoint + vocab are shipped inside the package and used by default.
  • You can still override them if needed using --resources-dir or --checkpoint/--vocab.

B) Using the repo script (generation.py)

From the repo root (after pip install -e .), you can run:

python generation.py -h

Note on resources: in this repository, the checkpoint and vocab files live under:

annalog/ckpt_and_vocab/

When running generation.py, pass that directory explicitly:

python generation.py \
  -i "CCO" \
  -n 50 \
  -m beam \
  --resources-dir annalog/ckpt_and_vocab \
  -o gen.tsv

Output format

The output file includes a header row with columns:

  • input_smiles
  • rank (1-based)
  • generated_smiles
  • score

When --cgc is enabled, five additional columns are appended:

  • cgc_scaffold
  • cgc_skeleton
  • cgc_ring_systems
  • cgc_structural_alerts
  • cgc_lacan
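
For downstream processing, the output can be read with Python's csv module. The sketch below assumes a tab-separated file with the four base columns listed above and assumes that score is numeric; the sample rows are made up for illustration and are not real model output.

```python
import csv
import io

BASE_COLUMNS = ["input_smiles", "rank", "generated_smiles", "score"]


def read_generations(stream, delimiter="\t"):
    """Parse generation output into a list of dicts with typed rank/score."""
    reader = csv.DictReader(stream, delimiter=delimiter)
    missing = [c for c in BASE_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    rows = []
    for row in reader:
        row["rank"] = int(row["rank"])        # rank is 1-based
        row["score"] = float(row["score"])    # assumption: score is numeric
        rows.append(row)
    return rows


# Made-up sample content; with a real file, use open("gen.tsv", newline="").
sample = (
    "input_smiles\trank\tgenerated_smiles\tscore\n"
    "CCO\t1\tCCN\t-0.42\n"
    "CCO\t2\tCCC\t-0.97\n"
)
rows = read_generations(io.StringIO(sample))
print([(r["rank"], r["generated_smiles"]) for r in rows])
```

Any cgc_* columns present in the file are kept as strings in each row. For CSV output, pass delimiter=",".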
