A lightweight toolkit for de novo molecular generation (SMILES/SELFIES; CharRNN, MolGPT, VAE)

molgen — lightweight de novo molecular generation: SMILES and SELFIES tokenizers, Transformer β-TC-VAE / CharRNN / MolGPT generators, MOSES-style evaluation.

A lightweight toolkit for de novo molecular generation with deep sequence models. It provides atom-level SMILES and SELFIES tokenizers, several generator architectures, a mixed-precision training loop, configurable sampling, and a MOSES-style evaluation suite. It is small enough to train on a single GPU in minutes while still reflecting current practice.

[Figure: the molgen molecular generation pipeline (input, tokenize, model, sample, evaluate)]

Features

  • Representations: an atom-aware regex SMILES tokenizer and a SELFIES tokenizer, for which every token sequence decodes to a valid molecule.
  • Models: a Transformer β-TC-VAE, a GRU/LSTM CharRNN, and a decoder-only MolGPT.
  • Training: a teacher-forced loop with AdamW, gradient clipping, and automatic mixed precision (AMP) on CUDA.
  • Sampling: autoregressive generation with temperature, top-k, and top-p (nucleus) filtering.
  • Metrics: validity, uniqueness, novelty, internal diversity, unique scaffolds, SNN (similarity to nearest neighbor), and QED / logP / MW / SA-score property summaries.
  • Tooling: the molgen CLI, a bundled sample dataset, tests, CI, and ruff.
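
To make "atom-aware regex tokenizer" concrete, here is a minimal standalone sketch. It uses a regular expression of the kind commonly used for SMILES, which is an assumption and not necessarily the pattern in molgen.tokenizers; the name tokenize_smiles is hypothetical. The point is that multi-character atoms (Br, Cl) and bracket atoms ([NH3+]) become single tokens instead of being split character by character.

```python
import re

# Assumption: a commonly used atom-level SMILES pattern, not molgen's own regex.
# Order matters: bracket atoms and two-letter halogens are tried before single letters.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)


def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into atom-level tokens (hypothetical helper)."""
    tokens = SMILES_PATTERN.findall(smiles)
    # If any character was skipped by the pattern, the round trip fails.
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable characters in {smiles!r}")
    return tokens


if __name__ == "__main__":
    print(tokenize_smiles("BrCCl"))
    print(tokenize_smiles("C[NH3+]"))
```

The round-trip check is a cheap way to catch characters the pattern does not cover, rather than dropping them silently.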

Installation

git clone https://github.com/DaoyuanLi2816/Molecule-Generator.git
cd Molecule-Generator
pip install -e .            # add ".[selfies]" for SELFIES, ".[dev]" for tests

Quickstart (Python)

from molgen.data import build_dataloaders, load_sample_smiles
from molgen.tokenizers import SmilesTokenizer
from molgen.molgpt import MolGPT
from molgen.trainer import TrainConfig, train_language_model
from molgen.sampling import sample
from molgen.metrics import evaluate_generation

smiles = load_sample_smiles()                      # bundled sample, or your own list
tokenizer = SmilesTokenizer.from_smiles(smiles)
train_loader, val_loader = build_dataloaders(smiles, tokenizer, augment=True)

model = MolGPT(tokenizer.vocab_size, pad_idx=tokenizer.pad_id)
train_language_model(model, train_loader, val_loader, TrainConfig(epochs=20), pad_idx=tokenizer.pad_id)

generated = sample(model, tokenizer, num_samples=1000, top_p=0.95)
print(evaluate_generation(generated, reference=smiles))
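
The top_p=0.95 argument above selects nucleus sampling. As a rough illustration of what temperature, top-k, and top-p filtering do to a single next-token distribution, here is a dependency-free sketch. The functions filter_logits and sample_token are hypothetical and are not part of molgen.sampling, whose implementation may differ (for example in how top-k and top-p are combined).

```python
import math
import random


def filter_logits(logits, temperature=1.0, top_k=None, top_p=None):
    """Return next-token probabilities after temperature, top-k, and top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Token indices ordered from most to least probable.
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        keep = keep[:top_k]
    if top_p is not None:
        # Keep the smallest prefix whose cumulative probability reaches top_p.
        nucleus, cumulative = [], 0.0
        for i in keep:
            nucleus.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        keep = nucleus

    # Zero out everything else and renormalize the survivors.
    mass = sum(probs[i] for i in keep)
    filtered = [0.0] * len(probs)
    for i in keep:
        filtered[i] = probs[i] / mass
    return filtered


def sample_token(logits, rng, **kwargs):
    """Draw one token index from the filtered distribution."""
    probs = filter_logits(logits, **kwargs)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    print(filter_logits([2.0, 1.0, 0.0, -1.0], top_p=0.8))
    print(sample_token([2.0, 1.0, 0.0, -1.0], rng, temperature=0.7, top_k=2))
```

Lower temperatures and smaller top_k / top_p values concentrate probability on the most likely tokens, which usually trades diversity for validity.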

Quickstart (CLI)

molgen train  --data molecules.smi --model molgpt --epochs 20 --out model.pt
molgen sample --checkpoint model.pt --num 1000 --top-p 0.95 --out generated.smi
molgen eval   --generated generated.smi --reference molecules.smi

Example output

Training MolGPT on the bundled (synthetic) sample and sampling 300 molecules produces a report like:

n_generated: 300
validity: 0.30
uniqueness: 0.96
novelty: 0.90
internal_diversity: 0.90
unique_scaffolds: 0.32
snn: 0.47
properties: {'qed': 0.52, 'logp': 1.71, 'mol_weight': 133.2, 'sa_score': 2.70}

These numbers reflect the tiny bundled sample; for stronger models, train on a larger dataset such as MOSES, QM9, or ZINC. In SELFIES mode every generated sequence decodes to a valid molecule, so validity is 100% by construction.
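
For readers unfamiliar with the first three fields of the report, the sketch below shows the usual MOSES-style definitions: validity over all generated strings, uniqueness over valid ones, and novelty over unique valid ones. This is an assumption about how molgen.metrics computes them, not a copy of its code, and basic_generation_metrics is a hypothetical name. The toy canonicalizer only checks parentheses so that the example runs without RDKit; a real one would parse and canonicalize with RDKit. Under these definitions, a validity of 0.30 on 300 samples corresponds to 90 parseable molecules.

```python
def basic_generation_metrics(generated, reference, canonicalize):
    """Validity, uniqueness, and novelty under MOSES-style definitions (sketch).

    `canonicalize` maps a string to a canonical form, or None if it is invalid.
    """
    canonical = [canonicalize(s) for s in generated]
    valid = [c for c in canonical if c is not None]
    unique = set(valid)
    known = {c for c in map(canonicalize, reference) if c is not None}
    novel = unique - known
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


def toy_canonicalize(smiles):
    """Stand-in for an RDKit canonicalizer: accepts non-empty strings with balanced parentheses."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return None
    return smiles if smiles and depth == 0 else None


if __name__ == "__main__":
    generated = ["CCO", "CCO", "CC(C", "CCN", "c1ccccc1"]
    print(basic_generation_metrics(generated, ["CCO"], toy_canonicalize))
```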

Visualizations

Both figures come from real model output and are reproducible with python scripts/make_figures.py (trains a SELFIES MolGPT on the bundled sample).

Generated molecules — structures sampled directly from the trained model:

[Figure: molecules generated by the model]

Goal-directed generation: starting from a single base model, fine-tuning toward the most (or least) drug-like molecules shifts the generated QED distribution in either direction (a ~0.15 QED span) and moves the samples through QED-vs-SA property space. The model can therefore be steered toward a target rather than only imitating its training data:

[Figure: bidirectional QED steering and movement through QED–SA property space]

Models

| Model     | Module          | Description                                                   |
|-----------|-----------------|---------------------------------------------------------------|
| CharRNN   | molgen.char_rnn | GRU/LSTM next-token language model (classic strong baseline)  |
| MolGPT    | molgen.molgpt   | Decoder-only Transformer with causal attention                |
| BetaTCVAE | molgen.vae      | Transformer VAE for reconstruction and latent interpolation   |

Both CharRNN and MolGPT train and sample through the same trainer/sampler.
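
"Causal attention" in the table means that position i may attend only to positions 0..i, which is what lets a decoder-only model be trained with teacher forcing and sampled one token at a time. The sketch below illustrates the idea on a plain score matrix; it is not taken from molgen.molgpt, and causal_attention_weights is a hypothetical name.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over a square score matrix with future positions masked out."""
    n = len(scores)
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # position i sees positions 0..i only
        peak = max(visible)
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Masked (future) positions receive exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights


if __name__ == "__main__":
    for row in causal_attention_weights([[0.0] * 3 for _ in range(3)]):
        print(row)
```

With uniform scores, each row spreads its weight evenly over the positions it is allowed to see, so the result is lower triangular.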

Latent-space exploration (VAE)

The original VAE workflow is still available for generating molecules near a seed or interpolating between two molecules in latent space:

[Figure: VAE encoder, latent space, and decoder]

python -m molgen.synthetic     # build a synthetic dataset (molecules.csv)
python -m molgen.vae           # train the VAE
python -m molgen.generate      # perturb the latent space
python -m molgen.interpolate   # interpolate between two molecules
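
As a rough picture of what the last two commands do in latent space, the sketch below perturbs a latent vector with Gaussian noise and walks a straight line between two latent vectors. It assumes linear interpolation and isotropic noise; molgen.generate and molgen.interpolate may work differently, and both function names here are hypothetical. In a real run, each resulting vector would be passed to the VAE decoder to obtain a molecule.

```python
import random


def perturb_latent(z, sigma, rng):
    """Add independent Gaussian noise with standard deviation sigma to each coordinate."""
    return [value + rng.gauss(0.0, sigma) for value in z]


def interpolate_latents(z_start, z_end, steps):
    """Return `steps` evenly spaced points on the line from z_start to z_end, inclusive."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    if len(z_start) != len(z_end):
        raise ValueError("latent vectors must have the same dimension")
    path = []
    for k in range(steps):
        t = k / (steps - 1)  # t runs from 0.0 to 1.0
        path.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path


if __name__ == "__main__":
    rng = random.Random(0)
    print(perturb_latent([0.0, 2.0], sigma=0.1, rng=rng))
    print(interpolate_latents([0.0, 2.0], [1.0, 0.0], steps=3))
```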

Project structure

molgen/
├── chem.py              # validity / canonicalization / randomization (RDKit)
├── tokenizers.py        # atom-level regex SMILES tokenizer
├── selfies_tokenizer.py # SELFIES tokenizer (always-valid decoding)
├── data.py              # SmilesDataset, padding collate, augmentation, sample loader
├── synthetic.py         # synthetic dataset generators
├── vae.py               # Transformer β-TC-VAE
├── char_rnn.py          # GRU/LSTM language model
├── molgpt.py            # decoder-only Transformer
├── trainer.py           # AMP training loop
├── sampling.py          # temperature / top-k / top-p decoding
├── metrics.py           # validity, novelty, diversity, scaffolds, SNN, report
├── properties.py        # QED / logP / MW / SA score
├── checkpoint.py        # save & load model + tokenizer
├── cli.py               # `molgen` command-line interface
└── datasets/            # bundled sample SMILES

Notes

The bundled load_sample_smiles() set is synthetic (assembled from fragments) and intended for examples and tests; for real results, train on a dataset such as MOSES, QM9, or ZINC. SELFIES mode guarantees 100% validity; SMILES mode tends to learn the data distribution more faithfully.

Contributing

Contributions are welcome — see CONTRIBUTING.md. Please run ruff check ., ruff format ., and pytest before opening a pull request.

Citation

If you use this toolkit in your work, please cite it via the Cite this repository button on GitHub (metadata in CITATION.cff).

License

This project is licensed under the MIT License. See LICENSE.
