molgen

A lightweight toolkit for de novo molecular generation (SMILES/SELFIES; CharRNN, MolGPT, VAE)

molgen is a lightweight, modern toolkit for de novo molecular generation with deep sequence models. It provides atom-level SMILES and SELFIES tokenizers, several generator architectures, a mixed-precision training loop, configurable sampling, and a MOSES-style evaluation suite. It is small enough to train on a single GPU in minutes, yet follows current practice.

Figure: the molgen pipeline (input, tokenize, model, sample, evaluate).

Features

Representations — atom-aware regex SMILES tokenizer and a SELFIES tokenizer (every sequence decodes to a valid molecule).
Models — a Transformer β-TC-VAE, a GRU/LSTM CharRNN, and a decoder-only MolGPT.
Training — teacher-forced loop with AdamW, gradient clipping, and automatic mixed precision (AMP) on CUDA.
Sampling — autoregressive generation with temperature, top-k, and top-p (nucleus) filtering.
Metrics — validity, uniqueness, novelty, internal diversity, unique scaffolds, SNN, and QED / logP / MW / SA-score property summaries.
Tooling — molgen CLI, a bundled sample dataset, tests, CI, and ruff.
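
As a rough illustration of what "atom-aware regex tokenization" means, the sketch below splits a SMILES string into atom-level tokens with a pattern of the kind commonly used in the literature. The pattern and the function name tokenize_smiles are assumptions made for this example; the actual regex in molgen.tokenizers may differ.

```python
import re

# Assumed atom-level SMILES pattern (not taken from molgen's source).
# Order matters: bracket atoms and two-letter halogens are tried before single letters.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|%[0-9]{2}|[0-9]|[()=#\-+\\/:~@?>*$.])"
)


def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into atom-level tokens (hypothetical helper)."""
    tokens = SMILES_PATTERN.findall(smiles)
    # If the tokens do not reassemble the input, some character was not recognized.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


if __name__ == "__main__":
    # "Cl" and "Br" stay whole, and the bracket atom "[C@H]" is a single token.
    print(tokenize_smiles("C[C@H](Cl)Br"))
```

Keeping multi-character atoms as single tokens means the model never has to learn that "C" followed by "l" is chlorine, which is the main advantage over character-level splitting.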

Installation

git clone https://github.com/DaoyuanLi2816/Molecule-Generator.git
cd Molecule-Generator
pip install -e .            # add ".[selfies]" for SELFIES, ".[dev]" for tests

Quickstart (Python)

from molgen.data import build_dataloaders, load_sample_smiles
from molgen.tokenizers import SmilesTokenizer
from molgen.molgpt import MolGPT
from molgen.trainer import TrainConfig, train_language_model
from molgen.sampling import sample
from molgen.metrics import evaluate_generation

smiles = load_sample_smiles()                      # bundled sample, or your own list
tokenizer = SmilesTokenizer.from_smiles(smiles)
train_loader, val_loader = build_dataloaders(smiles, tokenizer, augment=True)

model = MolGPT(tokenizer.vocab_size, pad_idx=tokenizer.pad_id)
train_language_model(model, train_loader, val_loader, TrainConfig(epochs=20), pad_idx=tokenizer.pad_id)

generated = sample(model, tokenizer, num_samples=1000, top_p=0.95)
print(evaluate_generation(generated, reference=smiles))
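
The top_p argument above selects nucleus sampling. The following is a minimal, dependency-free sketch of how temperature, top-k, and top-p filtering are commonly combined on one step's logits. The names filter_logits and sample_token are hypothetical and this is not molgen.sampling's implementation, which presumably operates on batched tensors.

```python
import math
import random


def filter_logits(logits, temperature=1.0, top_k=None, top_p=None):
    """Return sampling probabilities after temperature, top-k, and top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Token indices sorted from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most probable tokens.
    keep = order if top_k is None else order[:top_k]

    # Top-p: keep the smallest prefix whose renormalized mass reaches top_p.
    if top_p is not None:
        mass = sum(probs[i] for i in keep)
        kept, cumulative = [], 0.0
        for i in keep:
            kept.append(i)
            cumulative += probs[i] / mass
            if cumulative >= top_p:
                break
        keep = kept

    # Renormalize the surviving tokens; everything else gets probability zero.
    mass = sum(probs[i] for i in keep)
    filtered = [0.0] * len(probs)
    for i in keep:
        filtered[i] = probs[i] / mass
    return filtered


def sample_token(logits, rng=random, **kwargs):
    """Draw one token index from the filtered distribution."""
    probs = filter_logits(logits, **kwargs)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    print(filter_logits([2.0, 1.0, 0.0, -1.0], top_p=0.9))
```

Lower temperatures and smaller top_k or top_p values make sampling more conservative, which typically raises validity at the cost of diversity.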

Quickstart (CLI)

molgen train  --data molecules.smi --model molgpt --epochs 20 --out model.pt
molgen sample --checkpoint model.pt --num 1000 --top-p 0.95 --out generated.smi
molgen eval   --generated generated.smi --reference molecules.smi

Example output

Training MolGPT on the bundled (synthetic) sample and sampling 300 molecules produces a report like:

n_generated: 300
validity: 0.30
uniqueness: 0.96
novelty: 0.90
internal_diversity: 0.90
unique_scaffolds: 0.32
snn: 0.47
properties: {'qed': 0.52, 'logp': 1.71, 'mol_weight': 133.2, 'sa_score': 2.70}

These numbers reflect the tiny bundled sample; train on MOSES, QM9, or ZINC for stronger models. SELFIES mode guarantees 100% validity.
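
To make the first three ratios in the report concrete, here is a small sketch using the usual MOSES-style definitions: validity over all generated strings, uniqueness over valid ones, and novelty over unique ones. The function generation_ratios and the toy_canonicalize stand-in are hypothetical; a real pipeline would canonicalize with RDKit (Chem.MolFromSmiles followed by Chem.MolToSmiles), and molgen.metrics may define the denominators differently.

```python
def toy_canonicalize(smiles):
    """Stand-in for a real canonicalizer: treats balanced parentheses as 'valid'.

    Illustrative assumption only; it does not check chemistry.
    """
    return smiles if smiles.count("(") == smiles.count(")") else None


def generation_ratios(generated, reference, canonicalize=toy_canonicalize):
    """Compute validity, uniqueness, and novelty for a list of generated strings."""
    canonical = [canonicalize(s) for s in generated]
    valid = [c for c in canonical if c is not None]   # parseable molecules
    unique = set(valid)                               # distinct valid molecules
    known = {canonicalize(s) for s in reference} - {None}
    novel = unique - known                            # not seen in the reference set
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


if __name__ == "__main__":
    print(generation_ratios(["CCO", "CCO", "CC(C", "CCN"], reference=["CCO"]))
```

Under these definitions each ratio is conditioned on the previous filter, so a report can show low validity alongside high uniqueness and novelty.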

Visualizations

Both figures come from real model output and are reproducible with python scripts/make_figures.py (trains a SELFIES MolGPT on the bundled sample).

Generated molecules — structures sampled directly from the trained model:

Figure: molecules generated by the model.

Goal-directed generation: starting from a single base model, fine-tuning toward the most (or least) drug-like molecules shifts the generated QED distribution in either direction (a span of about 0.15 QED) and moves the samples through QED-vs-SA property space. Generation can therefore be steered toward a target rather than only imitating the training data:

Figure: bidirectional QED steering and movement through QED–SA property space.

Models

Model	Module	Description
`CharRNN`	`molgen.char_rnn`	GRU/LSTM next-token language model (classic strong baseline)
`MolGPT`	`molgen.molgpt`	Decoder-only Transformer with causal attention
`BetaTCVAE`	`molgen.vae`	Transformer VAE for reconstruction and latent interpolation

Both CharRNN and MolGPT train and sample through the same trainer/sampler.

Latent-space exploration (VAE)

The original VAE workflow is still available for generating molecules near a seed or interpolating between two molecules in latent space:

Figure: VAE encoder, latent space, and decoder.

python -m molgen.synthetic     # build a synthetic dataset (molecules.csv)
python -m molgen.vae           # train the VAE
python -m molgen.generate      # perturb the latent space
python -m molgen.interpolate   # interpolate between two molecules
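
Interpolating between two molecules usually means encoding both to latent vectors, walking along a path between them, and decoding each point. The sketch below shows only the path step, using plain linear interpolation on Python lists. The function interpolate_latents is hypothetical, and molgen.interpolate may use a different scheme (for example spherical interpolation) and will operate on tensors rather than lists.

```python
def interpolate_latents(z_start, z_end, steps):
    """Return `steps` evenly spaced latent vectors from z_start to z_end (inclusive)."""
    if steps < 2:
        raise ValueError("steps must be at least 2 to include both endpoints")
    if len(z_start) != len(z_end):
        raise ValueError("latent vectors must have the same dimension")

    path = []
    for k in range(steps):
        t = k / (steps - 1)  # t runs from 0.0 to 1.0
        # Linear blend: z(t) = (1 - t) * z_start + t * z_end
        path.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path


if __name__ == "__main__":
    for z in interpolate_latents([0.0, 2.0], [1.0, 0.0], steps=5):
        print(z)
```

Each vector on the path would then be passed to the VAE decoder; intermediate points are not guaranteed to decode to valid molecules in SMILES mode.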

Project structure

molgen/
├── chem.py              # validity / canonicalization / randomization (RDKit)
├── tokenizers.py        # atom-level regex SMILES tokenizer
├── selfies_tokenizer.py # SELFIES tokenizer (always-valid decoding)
├── data.py              # SmilesDataset, padding collate, augmentation, sample loader
├── synthetic.py         # synthetic dataset generators
├── vae.py               # Transformer β-TC-VAE
├── char_rnn.py          # GRU/LSTM language model
├── molgpt.py            # decoder-only Transformer
├── trainer.py           # AMP training loop
├── sampling.py          # temperature / top-k / top-p decoding
├── metrics.py           # validity, novelty, diversity, scaffolds, SNN, report
├── properties.py        # QED / logP / MW / SA score
├── checkpoint.py        # save & load model + tokenizer
├── cli.py               # `molgen` command-line interface
└── datasets/            # bundled sample SMILES

Notes

The bundled load_sample_smiles() set is synthetic (assembled from fragments) and intended for examples and tests; for real results, train on a dataset such as MOSES, QM9, or ZINC. SELFIES mode guarantees 100% validity; SMILES mode tends to learn the data distribution more faithfully.

Contributing

Contributions are welcome — see CONTRIBUTING.md. Please run ruff check ., ruff format ., and pytest before opening a pull request.

Citation

If you use this toolkit in your work, please cite it via the Cite this repository button on GitHub (metadata in CITATION.cff).

License

This project is licensed under the MIT License. See LICENSE.

Release history

0.1.1 (Jun 11, 2026)
0.1.0 (Jun 11, 2026), this version

Download files

Source distribution: molgen-0.1.0.tar.gz (38.5 kB), uploaded Jun 11, 2026
Built distribution: molgen-0.1.0-py3-none-any.whl (33.5 kB), uploaded Jun 11, 2026, Python 3

File details

Details for the file molgen-0.1.0.tar.gz.

File metadata

Download URL: molgen-0.1.0.tar.gz
Upload date: Jun 11, 2026
Size: 38.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.11

File hashes

Hashes for molgen-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`f01c08d855f97e37184b5f7c306774cc6244e5297512d9e29c1a98ec3fa2efdf`
MD5	`02134fcb5e77dfb65a2e8c84ce04b9d1`
BLAKE2b-256	`791217a99096661decd5d453b6be905f09c0dcfa46248b0b3a16e2e847b0a3fd`

File details

Details for the file molgen-0.1.0-py3-none-any.whl.

File metadata

Download URL: molgen-0.1.0-py3-none-any.whl
Upload date: Jun 11, 2026
Size: 33.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.11

File hashes

Hashes for molgen-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`bf71f57cd716192d273b29634b91858127031638650bd1c4efd04b02c3f32229`
MD5	`960d3b6add4086aef24925f931936e9c`
BLAKE2b-256	`4843f0a47b34529a3f044d7c3403e606c09d575d6d7f6276f797d25a03ad82a8`
