ANNalog — a SMILES-to-SMILES seq2seq model for medchem analogue generation

ANNalog

ANNalog is a SMILES-to-SMILES generative model for medicinal chemistry analogue design.

Introduction

ANNalog is a transformer-based sequence-to-sequence (Seq2Seq) model designed to generate medicinal-chemistry-relevant analogues of an input molecule. It supports:

  • local chemical-space exploration (small, SAR-like modifications), and
  • scaffold hopping (changing the core scaffold while remaining chemically relevant).

The accompanying preprint describes training on pairs of molecules drawn from the same bioactivity assay (extracted from ChEMBL), Levenshtein distance–guided SMILES alignment to improve learning of transformations, and a prefix-control feature to constrain generation.
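As background for the alignment step mentioned above, the following is a minimal sketch of plain character-level Levenshtein (edit) distance between two SMILES strings. It is not the alignment procedure from the preprint, only the underlying distance; the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    # prev[j] is the distance between the already-processed prefix of `a` and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca -> cb, or keep a match
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(levenshtein("CC(Cl)Br", "CC(F)Br"))
```

Because this sketch works on characters, a two-character atom symbol such as Cl counts as two positions; a tokenised variant would treat it as one.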

PAPER (ChemRxiv)

https://chemrxiv.org/doi/10.26434/chemrxiv-2025-9c1v6

INSTALLATION (Conda, recommended)

This repository includes a conda environment file (e.g. seq2seq_environment.yml).

  1. Create the environment: conda env create -f seq2seq_environment.yml

  2. Activate it (env name comes from the yml, e.g. "annalog"): conda activate annalog

  3. Install ANNalog into the environment: pip install -e .

Note:

  • If conda fails to solve the environment because of strict channel priority, run conda config --set channel_priority flexible and then re-run the environment creation step.

GENERATION (generation.py)

generation.py generates candidate SMILES strings from an input SMILES using a trained checkpoint and its vocabulary file.

RESOURCES (checkpoint + vocab)

By default, the script looks for the following files relative to the directory containing generation.py:

  • ckpt_and_vocab/Lev_extended.pt (checkpoint)
  • ckpt_and_vocab/stereo_experiment_vocab_ttf.pkl (vocabulary)

If your files are elsewhere, use --resources-dir or override --checkpoint/--vocab.

QUICK START

Single SMILES (sampling, 10 outputs):

  python generation.py -i "CC(Cl)Br" -m sampling -n 10 -p 0 -f tsv -o gen_single.tsv --temperature 1.2 --seed 42

Batch file (.smi, one SMILES per line):

  python generation.py -i inputs.smi -m beam -n 100 -o gen_batch.tsv
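To drive the script from Python rather than a shell, the command line can be assembled as an argument list. The sketch below uses only the flags documented in this README. The helper name `build_generation_command` is hypothetical, and it assumes generation.py is in the current working directory.

```python
import shlex


def build_generation_command(input_arg, n, method="beam", out="-", fmt="tsv",
                             prefix=0, temperature=None, seed=None):
    """Return an argv list for generation.py (hypothetical helper)."""
    cmd = [
        "python", "generation.py",
        "-i", str(input_arg),   # SMILES string or path to a .smi file
        "-n", str(n),           # beam width or number of samples
        "-m", method,
        "-p", str(prefix),
        "-f", fmt,
        "-o", out,
    ]
    # --temperature and --seed are documented as sampling-only options.
    if method == "sampling":
        if temperature is not None:
            cmd += ["--temperature", str(temperature)]
        if seed is not None:
            cmd += ["--seed", str(seed)]
    return cmd


if __name__ == "__main__":
    cmd = build_generation_command("CC(Cl)Br", 10, method="sampling",
                                   out="gen_single.tsv", temperature=1.2, seed=42)
    # shlex.join quotes the SMILES so the printed line is safe to paste into a shell.
    print(shlex.join(cmd))
```

Passing the returned list to `subprocess.run(cmd, check=True)` avoids shell quoting problems with characters such as parentheses in SMILES.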

REQUIRED ARGUMENTS

  • -i, --input Input SMILES string OR a path to a .smi file (one SMILES per line).

  • -n, --generation-number Number of SMILES to generate (the beam width, or the number of samples when sampling).
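The two accepted forms of -i can be illustrated with a small sketch that returns a list of SMILES either way. This is an assumption about how such an argument might be interpreted, not the script's actual code; the name `read_input_smiles` is hypothetical, and skipping blank lines is a choice made here.

```python
from pathlib import Path
from typing import List


def read_input_smiles(value: str) -> List[str]:
    """Interpret `value` as a .smi file path if one exists, else as a single SMILES."""
    path = Path(value)
    if value.endswith(".smi") and path.is_file():
        with path.open() as handle:
            # One SMILES per line; blank lines are skipped (assumption).
            return [line.strip() for line in handle if line.strip()]
    return [value]


if __name__ == "__main__":
    print(read_input_smiles("CC(Cl)Br"))
```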

OPTIONAL ARGUMENTS

Generation:

  • -m, --method {beam, BF-beam, sampling} (default: beam)
  • --temperature FLOAT (sampling only; default: 1.2)
  • --seed INT (sampling only; default: 42)
  • -p, --prefix PREFIX (default: 0)
    • 0 = no prefix constraint
    • integer like 5 = use first 5 characters of the input as prefix
    • string like "CC" = literal prefix (must match the start of the input)
  • -k, --keep-invalid Keep invalid SMILES (disables invalid filtering). By default, invalid filtering is ON.
  • --max-length INT (default: 102)
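The three forms of the -p value described above can be sketched as a single function. This is an illustration of the documented behaviour under stated assumptions, not the script's implementation: `resolve_prefix` is a hypothetical name, and treating any all-digit value as a character count is an assumption.

```python
def resolve_prefix(input_smiles: str, prefix) -> str:
    """Turn a -p style value into the literal prefix string (hypothetical helper)."""
    text = str(prefix)
    if text.isdigit():
        # 0 means no constraint (empty prefix); N means the first N characters.
        return input_smiles[:int(text)]
    # Otherwise the value is a literal prefix and must match the start of the input.
    if not input_smiles.startswith(text):
        raise ValueError(f"prefix {text!r} does not match the start of {input_smiles!r}")
    return text


if __name__ == "__main__":
    print(resolve_prefix("CC(Cl)Br", 5))
    print(resolve_prefix("CC(Cl)Br", "CC"))
```

Since an integer prefix counts characters, a cut can land inside a two-character atom symbol such as Cl or an open branch, so it is worth checking where the prefix ends.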

Model/resources:

  • --resources-dir PATH (default: <script_dir>/ckpt_and_vocab)
  • --checkpoint PATH / --ckpt PATH (default: Lev_extended.pt inside the resources directory)
  • --vocab PATH (default: stereo_experiment_vocab_ttf.pkl inside the resources directory)
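The way these three options combine can be sketched with pathlib: the resources directory defaults to ckpt_and_vocab next to the script, and explicit checkpoint or vocabulary paths take precedence over it. The function `resolve_resources` is hypothetical and only mirrors the defaults stated in this README.

```python
from pathlib import Path

DEFAULT_CHECKPOINT = "Lev_extended.pt"
DEFAULT_VOCAB = "stereo_experiment_vocab_ttf.pkl"


def resolve_resources(script_dir, resources_dir=None, checkpoint=None, vocab=None):
    """Return (checkpoint_path, vocab_path) following the documented defaults."""
    # --resources-dir falls back to <script_dir>/ckpt_and_vocab.
    base = Path(resources_dir) if resources_dir else Path(script_dir) / "ckpt_and_vocab"
    # Explicit --checkpoint / --vocab values override the files inside `base`.
    ckpt_path = Path(checkpoint) if checkpoint else base / DEFAULT_CHECKPOINT
    vocab_path = Path(vocab) if vocab else base / DEFAULT_VOCAB
    return ckpt_path, vocab_path


if __name__ == "__main__":
    # Directory of this file stands in for the directory of generation.py.
    print(resolve_resources(Path(__file__).resolve().parent))
```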

Output:

  • -f, --format {tsv,csv} (default: tsv)
  • -o, --out PATH output path, or '-' for stdout (default: -)

Device:

  • --device {cpu,cuda} force device (default: auto-detect)

OUTPUT FORMAT

The output includes a header row with: input_smiles, rank (1-based), generated_smiles, score
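A result file with this header can be read back using the standard csv module. The sketch below assumes the header names are exactly input_smiles, rank, generated_smiles and score, and that score parses as a float; the sample rows are made-up placeholders, not real model output.

```python
import csv
import io


def read_generation_output(handle, fmt="tsv"):
    """Parse generation output into a list of dicts with typed rank and score."""
    delimiter = "\t" if fmt == "tsv" else ","
    rows = []
    for row in csv.DictReader(handle, delimiter=delimiter):
        row["rank"] = int(row["rank"])      # 1-based rank
        row["score"] = float(row["score"])  # assumption: score is numeric
        rows.append(row)
    return rows


# Placeholder rows for illustration only; values are invented.
SAMPLE_TSV = (
    "input_smiles\trank\tgenerated_smiles\tscore\n"
    "CC(Cl)Br\t1\tCC(F)Br\t-0.5\n"
    "CC(Cl)Br\t2\tCC(Cl)I\t-1.25\n"
)

if __name__ == "__main__":
    for record in read_generation_output(io.StringIO(SAMPLE_TSV)):
        print(record["rank"], record["generated_smiles"], record["score"])
```

For a file on disk, open it with `open(path, newline="")` and pass the handle in place of the StringIO object.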
