Extended cheminformatics toolkit: median molecules, chemical subspace enumeration, and evolutionary molecular optimization.

TYCHE

License: MIT

Main developers: Robert Pollice, AkshatKumar Nigam


A Python toolkit for SMILES randomization, median molecule generation, local chemical subspace construction, and evolutionary molecular optimization — core tools for molecular machine learning and cheminformatics research.

This repository contains two pip-installable packages:

Package | Purpose | Dependencies
tyche-core | Lightweight SMILES randomization | selfies only
tyche-tools | Extended cheminformatics: median molecules, subspace enumeration, optimization | tyche-core, selfies, rdkit, numpy (optional: torch, pyyaml)

tyche-core is the minimal core: it provides fast, high-quality SMILES randomization with no RDKit dependency. If all you need is data augmentation via randomized SMILES, install tyche-core alone.

tyche-tools is the extended toolkit: it provides median molecule generation, local chemical subspace construction, and evolutionary molecular optimization. It depends on tyche-core for randomization and adds RDKit, NumPy, and optionally PyTorch for the full feature set.


Project status

This package is under active development, and several planned features are still to come. Nevertheless, the basic functionality is considered feature-complete. We are open to community contributions and feature requests.

Installation

Core package only (randomization)

pip install tyche-core

Or from source:

git clone https://git.lwp.rug.nl/pollice-research-group/artificial-design/tyche.git
cd tyche
pip install -e .

Extended package (all functionality)

pip install tyche-tools

Or from source:

git clone https://git.lwp.rug.nl/pollice-research-group/artificial-design/tyche.git
cd tyche
pip install -e .           # install tyche-core first
pip install -e ./tools     # install tyche-tools

To include the neural network classifier (requires PyTorch):

pip install -e "./tools[nn]"

Which package should I install?

  • Just need randomized SMILES for data augmentation? Install tyche-core. It is lightweight, depends only on selfies, and has no RDKit requirement.
  • Need median molecules, chemical subspace enumeration, or molecular optimization? Install tyche-tools. It pulls in tyche-core automatically as a dependency.

Overview

tyche-core (core)

Function | Description
randomize_smiles_tyche | Generate multiple randomized SMILES representations of a molecule

tyche-tools (extended)

Function | Description
get_median_mols | Find median molecules that interpolate between two input structures
get_local_chemical_subspace | Enumerate a large set of structurally related molecules around a given structure
optimize_molecules | Evolve a population of molecules toward a user-defined fitness objective

randomize_smiles_tyche

Generates randomized SMILES strings for a given molecule. Randomized SMILES represent the same molecule but with a different atom ordering, which is useful for data augmentation in molecular deep learning.

from tyche import randomize_smiles_tyche

results = randomize_smiles_tyche(smiles, n, unique=True)

Parameters

Parameter | Type | Default | Description
smiles | str | - | Input SMILES string
n | int | - | Number of randomized SMILES to return
unique | bool | True | If True, returns exactly n distinct SMILES; if False, returns n samples (may contain duplicates)

Returns

list[str] — List of n randomized SMILES strings.
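To make "same molecule, different atom ordering" concrete, the toy sketch below walks a small molecular graph depth-first from a random start atom with shuffled branch order. It is an illustration only and not TYCHE's algorithm; the graph encoding and the helper `random_atom_order` are hypothetical.

```python
import random

# Toy heavy-atom graph of ethanol (C0-C1-O2) as an adjacency dict.
# Hypothetical illustration; TYCHE's internals are not shown here.
ETHANOL = {0: [1], 1: [0, 2], 2: [1]}


def random_atom_order(graph, rng):
    """Return one depth-first atom ordering with a random start and branch order."""
    start = rng.choice(sorted(graph))
    order, seen, stack = [], set(), [start]
    while stack:
        atom = stack.pop()
        if atom in seen:
            continue
        seen.add(atom)
        order.append(atom)
        neighbors = [n for n in graph[atom] if n not in seen]
        rng.shuffle(neighbors)  # random branch priority
        stack.extend(neighbors)
    return order


rng = random.Random(0)
orderings = {tuple(random_atom_order(ETHANOL, rng)) for _ in range(50)}
for order in sorted(orderings):
    print(order)
```

Every ordering visits the same three atoms, so each one would describe the same molecule when written out as a string.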

Examples

Basic usage — 5 unique randomized SMILES for aspirin:

from tyche import randomize_smiles_tyche

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
results = randomize_smiles_tyche(smiles, n=5, unique=True)

for smi in results:
    print(smi)

Example output:

OC(=O)c1ccccc1OC(C)=O
O=C(O)c1ccccc1OC(=O)C
CC(=O)Oc1ccccc1C(=O)O
O=C(C)Oc1ccccc1C(=O)O
OC(=O)c1ccccc1OC(=O)C

Allow duplicates — exactly n sampling calls:

results = randomize_smiles_tyche(smiles, n=10, unique=False)
print(len(results))  # always 10, may contain repeats

Data augmentation for a molecular dataset:

from tyche import randomize_smiles_tyche

dataset = ["c1ccccc1", "CCO", "CC(=O)O"]
augmented = []
for smi in dataset:
    augmented.extend(randomize_smiles_tyche(smi, n=10))

print(f"Original: {len(dataset)} | Augmented: {len(augmented)}")
# Original: 3 | Augmented: 30

Parallel generation for large sample counts:

from tyche import randomize_smiles_tyche

smiles = "CC(=O)Oc1ccccc1C(=O)O"

if __name__ == "__main__":
    # Generate 1 million samples using all CPU cores
    results = randomize_smiles_tyche(smiles, n=1_000_000, unique=False, parallel=True)
    
    # Or specify the number of workers
    results = randomize_smiles_tyche(smiles, n=1_000_000, unique=False, parallel=True, num_workers=8)

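Note: parallel=True only applies when unique=False, because collecting unique strings would require coordinated deduplication across workers. The example above is meant to be run as a standalone .py script; the if __name__ == "__main__": guard is what multiprocessing-based code typically needs so that worker processes do not re-run the script's top-level code.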
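If distinct strings are needed from a parallel run, one option is to deduplicate the returned list afterwards. The sketch below is plain Python rather than a TYCHE feature; `dedupe_preserving_order` is a hypothetical helper and `samples` stands in for the output of a unique=False call.

```python
def dedupe_preserving_order(smiles_list):
    """Drop repeated strings while keeping first-seen order."""
    return list(dict.fromkeys(smiles_list))


# Stand-in for the list returned by a parallel, unique=False run.
samples = ["OCC", "CCO", "OCC", "C(O)C", "CCO"]
unique_samples = dedupe_preserving_order(samples)
print(unique_samples)  # ['OCC', 'CCO', 'C(O)C']
```

Unlike unique=True, this can leave fewer than n strings, so request more samples than you need.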


get_median_mols

Finds "median molecules" — structures that are chemically intermediate between two input molecules. Uses SELFIES-space interpolation across an ensemble of randomized SMILES orderings to generate diverse candidate structures, then ranks them by a joint Tanimoto similarity score that rewards proximity to both endpoints.

from tyche_tools import get_median_mols

best_smiles, best_scores = get_median_mols(starting_smile, target_smile)

Parameters

Parameter | Type | Default | Description
starting_smile | str | - | SMILES string for the source molecule
target_smile | str | - | SMILES string for the target molecule
num_tries | int | 25 | Number of interpolation path attempts per SMILES ordering pair
num_random_samples | int | 25 | Number of randomized SMILES orderings generated per molecule
collect_bidirectional | bool | True | If True, also collects target → starting paths, doubling coverage
num_top_iter | int | 100 | Number of top-ranked candidates to return

Returns

  • best_smiles (list[str]): Top-ranked median molecule SMILES, sorted by descending score.
  • best_scores (list[float]): Corresponding joint similarity scores (higher = more central).
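The exact formula behind the joint similarity score is not given here. As a rough mental model only, the sketch below assumes the simplest form consistent with the description: the mean similarity to both endpoints minus the absolute gap between them. The function name `joint_similarity` and this formula are assumptions, not TYCHE's implementation.

```python
def joint_similarity(sim_to_start, sim_to_target):
    """Mean similarity to both endpoints minus the gap between them (assumed form)."""
    mean = (sim_to_start + sim_to_target) / 2.0
    gap = abs(sim_to_start - sim_to_target)
    return mean - gap


balanced = joint_similarity(0.5, 0.5)  # equally close to both endpoints
lopsided = joint_similarity(0.9, 0.1)  # close to one endpoint only
print(balanced > lopsided)  # True
```

Both candidates have the same mean similarity, but the balanced one ranks higher because it is penalized less for the gap.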

How it works

  1. Generates an ensemble of randomized SMILES orderings for both molecules.
  2. For each pair of orderings, encodes both into SELFIES and constructs random interpolation paths by swapping tokens one at a time.
  3. Decodes all intermediate SELFIES back to SMILES and canonicalizes them.
  4. Scores each candidate by its average Tanimoto similarity to both endpoints, penalized by the gap between the two scores (favouring structures equidistant from both).
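Step 2 above can be sketched with plain token lists. The version below pads the shorter list, then swaps differing positions to the target token one at a time in random order, recording every intermediate string. The padding token and the helper `interpolation_path` are hypothetical, and the real implementation may differ in detail.

```python
import random

PAD = ""  # hypothetical padding token; it vanishes when tokens are joined


def interpolation_path(start_tokens, target_tokens, rng):
    """Return the intermediate token strings of one random start -> target path."""
    length = max(len(start_tokens), len(target_tokens))
    current = list(start_tokens) + [PAD] * (length - len(start_tokens))
    goal = list(target_tokens) + [PAD] * (length - len(target_tokens))

    # Visit the positions that differ in random order, swapping one per step.
    positions = [i for i in range(length) if current[i] != goal[i]]
    rng.shuffle(positions)

    path = ["".join(current)]
    for i in positions:
        current[i] = goal[i]
        path.append("".join(current))
    return path


start = ["[C]", "[C]", "[O]"]          # SELFIES tokens for CCO
target = ["[C]", "[N]", "[C]", "[F]"]  # SELFIES tokens for CNCF
path = interpolation_path(start, target, random.Random(1))
for step in path:
    print(step)
```

In a full pipeline each intermediate string would then be decoded to SMILES, canonicalized, and scored, as in steps 3 and 4.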

Examples

Find median molecules between two drug-like structures:

from tyche_tools import get_median_mols

# Dihydroergotamine and prinomastat (from original STONED paper)
smi_start = r"[H][C@]56C[C@@H](C(=O)N[C@]1(C)O[C@]4(O)N(C1=O)[C@@H](Cc2ccccc2)C(=O)N3CCC[C@]34[H])CN(C)[C@]5([H])Cc7c[nH]c8cccc6c78"
smi_target = r"CC1([C@@H](N(CCS1)S(=O)(=O)C2=CC=C(C=C2)OC3=CC=NC=C3)C(=O)NO)C"

best_smiles, best_scores = get_median_mols(
    smi_start,
    smi_target,
    num_tries=25,
    num_random_samples=25,
)

print(f"Found {len(best_smiles)} median molecule candidates")
print(f"\nTop 5 results:")
for smi, score in zip(best_smiles[:5], best_scores[:5]):
    print(f"  {score:.4f}  {smi}")

Quick exploratory run with reduced compute:

best_smiles, best_scores = get_median_mols(
    smi_start,
    smi_target,
    num_tries=5,
    num_random_samples=5,
    collect_bidirectional=False,
    num_top_iter=20,
)

Retrieve only the top candidate:

best_smiles, best_scores = get_median_mols(smi_start, smi_target, num_top_iter=1)
median_molecule = best_smiles[0]
print(f"Best median: {median_molecule}  (score: {best_scores[0]:.4f})")

get_local_chemical_subspace

Constructs a large, diverse set of molecules in the local chemical neighbourhood of a given structure. This is useful for exhaustive analogue enumeration, property landscape mapping, and building training sets for molecular machine learning models.

from tyche_tools import get_local_chemical_subspace

smiles_list, scores = get_local_chemical_subspace(smiles)

Parameters

Parameter | Type | Default | Description
smiles | str | - | Input SMILES string for the center molecule
num_random_samples | int | 1_000_000 | Number of unique randomized SMILES orderings to generate before mutation
num_mutation_ls | list of int | [1, 2, 3, 4, 5] | Mutation depths to explore; results from all depths are pooled
fp_type | str | "ECFP4" | Fingerprint type for similarity scoring
top_n | int or None | None | If set, return only the top n highest-scoring molecules
min_score | float or None | None | If set, discard molecules with Tanimoto similarity below this threshold
output_file | str or None | None | If set, write the final sorted and filtered SMILES to this file after computation finishes

Returns

  • smiles_list (list[str]): Unique canonical SMILES, sorted by descending Tanimoto similarity to the input. Filtered by min_score and truncated to top_n if specified.
  • scores (list[float]): Tanimoto similarity of each molecule to the input, in the same order as smiles_list.

How it works

  1. Randomization — Generates num_random_samples unique randomized SMILES orderings of the input molecule. Each ordering encodes the same structure but with a different atom traversal order, producing a distinct SELFIES string and a distinct starting point for mutation. A larger value explores a wider variety of encodings before any mutations are applied.

  2. SELFIES encoding — Each randomized SMILES is converted to a SELFIES string, a robust molecular representation guaranteed to decode to a valid molecule.

  3. Mutation — For each depth d in num_mutation_ls, every SELFIES string undergoes d sequential random mutations (insert, replace, or delete a single SELFIES token). Depth 1 produces close structural neighbours — molecules differing by approximately one atom or bond from the input. Depth 5 allows larger structural departures while remaining within the same general chemical neighbourhood. All depths are explored and their outputs pooled together.

  4. Filtering and scoring — Mutated SELFIES are decoded to SMILES, canonicalized, and deduplicated. Each unique structure is scored by Tanimoto similarity to the original molecule.
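The mutation step (step 3) can be illustrated on plain token lists. The sketch below applies d sequential random insert, replace, or delete operations for each depth d. The tiny `ALPHABET` and the helpers `mutate_once` and `mutate` are hypothetical; a real run would draw from a full SELFIES alphabet and decode the result afterwards.

```python
import random

# Hypothetical, tiny mutation alphabet for illustration only.
ALPHABET = ["[C]", "[N]", "[O]", "[F]", "[=C]", "[Branch1]", "[Ring1]"]


def mutate_once(tokens, rng):
    """Apply one random insert, replace, or delete to a token list."""
    tokens = list(tokens)
    ops = ["insert", "replace"]
    if len(tokens) > 1:  # never delete the last remaining token
        ops.append("delete")
    op = rng.choice(ops)
    if op == "insert":
        tokens.insert(rng.randint(0, len(tokens)), rng.choice(ALPHABET))
    elif op == "replace":
        tokens[rng.randrange(len(tokens))] = rng.choice(ALPHABET)
    else:
        del tokens[rng.randrange(len(tokens))]
    return tokens


def mutate(tokens, depth, rng):
    """Apply `depth` sequential random mutations."""
    for _ in range(depth):
        tokens = mutate_once(tokens, rng)
    return tokens


rng = random.Random(42)
parent = ["[C]", "[C]", "[O]"]
pool = {d: mutate(parent, d, rng) for d in [1, 2, 3, 4, 5]}
for depth, tokens in pool.items():
    print(depth, "".join(tokens))
```

Because each operation changes the length by at most one token, a depth-d mutant can differ from the parent by at most d tokens in length, which is why shallow depths stay close to the input.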

Supported fingerprint types

fp_type | Description
ECFP4 | Extended connectivity, radius 2 (default)
ECFP6 | Extended connectivity, radius 3
FCFP4 | Feature-based Morgan, radius 2
FCFP6 | Feature-based Morgan, radius 3
AP | Atom pair fingerprint
PATH | RDKit path-based fingerprint
PHCO | 2D pharmacophore (Gobbi)
BPF | Binding-property pairs fingerprint
BTF | Binding-property torsions fingerprint
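Whichever fingerprint is chosen, the score is a Tanimoto similarity. For set-like (bit-vector) fingerprints this is the number of shared on-bits divided by the number of bits set in either fingerprint. The sketch below shows that definition on plain Python sets; it is not how TYCHE computes it internally, and the value returned for two empty fingerprints is a convention chosen here.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


print(tanimoto({1, 4, 7, 9}, {1, 4, 8}))  # 2 shared / 5 total = 0.4
```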

Examples

Quick exploration with reduced compute:

from tyche_tools import get_local_chemical_subspace

smi = "CC(C)(C)NCC(c1ccc(O)c(CO)c1)O"  # albuterol

smiles_list, scores = get_local_chemical_subspace(
    smi,
    num_random_samples=50000,
    num_mutation_ls=[1, 2, 3],
    fp_type="ECFP4",
)

# Results are sorted best to worst automatically
print(f"Generated {len(smiles_list)} unique molecules")

Exhaustive enumeration (default settings):

smiles_list, scores = get_local_chemical_subspace(smi)
# num_random_samples=1_000_000, num_mutation_ls=[1,2,3,4,5]

print(f"Generated {len(smiles_list)} unique molecules")

Return only the top 100 closest analogues:

smiles_list, scores = get_local_chemical_subspace(
    smi,
    num_random_samples=50000,
    top_n=100,
)

print(f"Top 100 scores: {scores[0]:.4f} to {scores[-1]:.4f}")

Filter by minimum similarity threshold:

smiles_list, scores = get_local_chemical_subspace(
    smi,
    num_random_samples=100000,
    min_score=0.4,
)

print(f"Molecules with Tanimoto >= 0.4: {len(smiles_list)}")

Combine top-n and min-score (min_score is applied first, then top_n):

smiles_list, scores = get_local_chemical_subspace(
    smi,
    num_random_samples=100000,
    min_score=0.3,
    top_n=50,
)
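The order of operations matters when both options are set. A minimal sketch of the documented behaviour (sort by descending score, apply min_score, then truncate to top_n) on made-up data; the helper `select` and the labels are hypothetical.

```python
def select(scored, min_score=None, top_n=None):
    """Sort by descending score, apply min_score, then truncate to top_n."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    if min_score is not None:
        ranked = [pair for pair in ranked if pair[1] >= min_score]
    if top_n is not None:
        ranked = ranked[:top_n]
    return ranked


# Made-up (label, similarity) pairs standing in for scored molecules.
scored = [("A", 0.25), ("B", 0.9), ("C", 0.35), ("D", 0.6)]
print(select(scored, min_score=0.3, top_n=2))  # [('B', 0.9), ('D', 0.6)]
```

If fewer than top_n molecules pass the threshold, the result is simply shorter than top_n.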

Write the sorted, filtered results to a file:

smiles_list, scores = get_local_chemical_subspace(
    smi,
    num_random_samples=50000,
    min_score=0.4,
    output_file="albuterol_analogues.smi",
)
# albuterol_analogues.smi contains one canonical SMILES per line, best first
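Since the output file holds one SMILES per line, it can be read back with a few lines of standard-library Python. The helper `read_smiles_file` is hypothetical, and the demo writes its own temporary file rather than relying on a real TYCHE run.

```python
import tempfile
from pathlib import Path


def read_smiles_file(path):
    """Return the non-empty lines of a one-SMILES-per-line file, in file order."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


# Demo on a temporary file with the same one-per-line layout.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.smi"
    demo.write_text("CCO\nCCN\n\n", encoding="utf-8")
    loaded = read_smiles_file(demo)

print(loaded)  # ['CCO', 'CCN']
```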

Use with a different fingerprint for scoring:

smiles_list, scores = get_local_chemical_subspace(
    smi,
    num_random_samples=50000,
    fp_type="FCFP4",
)

optimize_molecules

Evolves a population of molecules toward a user-defined property objective using a genetic algorithm over SELFIES string representations. The algorithm alternates between two phases each generation:

  • Exploration: mutates and crosses over the current population to discover structurally diverse new candidates. Crossover uses get_median_mols to generate chemically intermediate structures between parent molecules. Optionally, a neural network classifier (trained on all previously evaluated molecules) biases selection toward high-predicted-fitness candidates.
  • Exploitation: performs an intensive local search around the current best molecule(s) using get_local_chemical_subspace, then injects the top results back into the main population.

from tyche_tools import optimize_molecules

results = optimize_molecules(fitness_function, start_population)
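The alternation between the two phases can be sketched as a small loop. This is a simplified skeleton under stated assumptions, not the library's implementation: `evolve`, `toy_explore`, `toy_exploit`, and `point_mutants` are hypothetical, the "molecules" are plain strings, and there is no crossover or classifier.

```python
import random


def evolve(fitness, population, explore, exploit, generations, num_exchanges, rng):
    """Alternate exploration and exploitation; return best (item, fitness) per generation."""
    size = len(population)

    def rank(items):
        # Deduplicate, then sort by descending fitness (ties broken alphabetically).
        return sorted(set(items), key=lambda s: (-fitness(s), s))

    best_per_generation = []
    for _ in range(generations):
        # Exploration: propose new candidates and keep the fittest `size` overall.
        population = rank(list(population) + explore(population, rng))[:size]
        # Exploitation: local search around the best, inject the top results.
        local = rank(exploit(population[0], rng))[:num_exchanges]
        population = rank(population + local)[:size]
        best_per_generation.append((population[0], fitness(population[0])))
    return best_per_generation


def point_mutants(item, rng, n):
    """Return n copies of `item`, each with one random character replaced."""
    out = []
    for _ in range(n):
        i = rng.randrange(len(item))
        out.append(item[:i] + rng.choice("CNO") + item[i + 1:])
    return out


def toy_explore(population, rng):
    return [m for item in population for m in point_mutants(item, rng, 2)]


def toy_exploit(best, rng):
    return point_mutants(best, rng, 20)


history = evolve(
    fitness=lambda s: s.count("O"),  # toy objective: count of "O" characters
    population=["CCCC", "CCCN", "NCCC"],
    explore=toy_explore,
    exploit=toy_exploit,
    generations=5,
    num_exchanges=2,
    rng=random.Random(0),
)
for generation, (item, score) in enumerate(history, start=1):
    print(generation, item, score)
```

Because the previous population is always among the candidates, the best fitness in this sketch never decreases from one generation to the next.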

Required parameters

Parameter | Type | Description
fitness_function | callable | Maps a SMILES string to a float. Higher is better.
start_population | list[str] or str | List of SMILES, or path to a file with one SMILES per line. Must contain at least generation_size valid molecules.

Optional parameters

Parameter | Default | Description
work_dir | "tyche_output" | Directory for output files. Created automatically.
verbose_out | False | Save per-generation sub-directories with population and fitness files.
custom_filter | None | Optional callable (SMILES → bool). Molecules returning False are discarded.
alphabet | None | Custom SELFIES tokens for the mutation alphabet. Combined with fragment tokens when use_fragments=True.
use_gpu | True | Use CUDA for neural network training if available.
num_workers | CPU count | Parallel worker processes for fragment generation and mutations.
generations | 200 | Number of evolutionary iterations.
generation_size | 5000 | Molecules maintained in the exploration population.
num_exchanges | 5 | Top local-search molecules injected into exploration each generation.
use_fragments | True | Extend mutation alphabet with SELFIES fragments from the starting population (radius-3 atom environments).
num_sample_frags | 200 | Fragment tokens sampled from the extended alphabet per mutation step.
use_classifier | True | Use a neural network classifier to bias exploration selection. Requires PyTorch; falls back to random sampling if unavailable.
explr_num_random_samples | 5 | Randomized SMILES orderings per molecule during exploration mutation.
explr_num_mutations | 5 | Sequential mutations per ordering during exploration.
crossover_num_random_samples | 1 | SMILES orderings used by get_median_mols per crossover pair.
exploit_num_random_samples | 400 | Randomized SMILES orderings used by get_local_chemical_subspace during exploitation.
exploit_num_mutations | 400 | Mutation depth during exploitation. 400 × 400 = 160,000 candidates around the best molecule per generation.
top_mols | 1 | Number of top molecules subjected to local search each generation.

Returns

A dict with three keys:

  • best_per_generation (list[(str, float)]): the best (SMILES, fitness) at the end of each generation.
  • final_population (list[(str, float)]): the exploration population from the last generation, sorted by descending fitness.
  • smiles_collector (dict): maps every evaluated SMILES to [fitness, eval_count].

Output files

All files are written to work_dir:

File | Contents
hparams.yml | All hyperparameter values (requires PyYAML)
init_mols.txt | Initial population after fitness sorting
generation_all_best.txt | Best molecule and fitness appended each generation
fitness_explore.txt | Exploration fitness values (overwritten each generation)
population_explore.txt | Exploration SMILES (overwritten each generation)
fitness_local_search.txt | Exploitation fitness values (overwritten each generation)
population_local_search.txt | Exploitation SMILES (overwritten each generation)

When verbose_out=True, per-generation sub-directories (0_DATA/, 1_DATA/, …) are created, preserving every generation's population and fitness files.
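When post-processing a verbose run, the per-generation directories need to be visited in numeric rather than alphabetical order (otherwise 10_DATA sorts before 2_DATA). A small sketch, assuming only the `<N>_DATA` naming described above; the helper `generation_dirs` is hypothetical and the demo builds its own empty directories.

```python
import re
import tempfile
from pathlib import Path


def generation_dirs(work_dir):
    """Return (generation_index, path) pairs for `<N>_DATA` sub-directories, in numeric order."""
    found = []
    for child in Path(work_dir).iterdir():
        match = re.fullmatch(r"(\d+)_DATA", child.name)
        if child.is_dir() and match:
            found.append((int(match.group(1)), child))
    return sorted(found, key=lambda pair: pair[0])


# Demo on a temporary directory that mimics the layout.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["10_DATA", "0_DATA", "2_DATA"]:
        (Path(tmp) / name).mkdir()
    (Path(tmp) / "hparams.yml").write_text("", encoding="utf-8")
    indices = [index for index, _ in generation_dirs(tmp)]

print(indices)  # [0, 2, 10]
```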

Examples

Minimize synthetic accessibility (SA) score:

from rdkit.Chem import RDConfig
import os, sys
sys.path.append(os.path.join(RDConfig.RDContribDir, 'SA_Score'))
import sascorer

from tyche_tools import optimize_molecules

def fitness(smi):
    from rdkit import Chem
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        return -10.0  # worst possible SA score (scale is 1 to 10), so invalid SMILES never rank first
    sa = sascorer.calculateScore(mol)
    return -sa  # minimize SA score → maximize negative SA

results = optimize_molecules(
    fitness_function=fitness,
    start_population="starting_molecules.smi",  # one SMILES per line
    work_dir="sa_optimization",
    generations=50,
    generation_size=100,
)

best_smiles, best_score = results['best_per_generation'][-1]
print(f"Best molecule: {best_smiles}  (SA score: {-best_score:.2f})")

Maximize logP with a molecular weight filter:

from rdkit.Chem import Descriptors
from tyche_tools import optimize_molecules

def logp_fitness(smi):
    from rdkit import Chem
    mol = Chem.MolFromSmiles(smi)
    return Descriptors.MolLogP(mol) if mol else 0.0

def mw_filter(smi):
    from rdkit import Chem
    mol = Chem.MolFromSmiles(smi)
    return mol is not None and Descriptors.MolWt(mol) <= 500

results = optimize_molecules(
    fitness_function=logp_fitness,
    start_population=my_smiles_list,    # list of SMILES strings
    work_dir="logp_run",
    custom_filter=mw_filter,
    generations=100,
    generation_size=500,
    use_classifier=True,                # NN-guided selection (requires PyTorch)
)

# Print best molecule per generation
for i, (smi, score) in enumerate(results['best_per_generation']):
    print(f"Gen {i + 1}: {score:.3f}  {smi}")

Quick test run (small population, few generations):

results = optimize_molecules(
    fitness_function=logp_fitness,
    start_population=my_smiles_list,
    generations=5,
    generation_size=50,
    use_classifier=False,               # skip NN (no PyTorch needed)
    exploit_num_random_samples=50,
    exploit_num_mutations=50,
    explr_num_random_samples=3,
    explr_num_mutations=3,
)

Inspect all evaluated molecules:

collector = results['smiles_collector']
# Sort all evaluated molecules by fitness
ranked = sorted(collector.items(), key=lambda x: x[1][0], reverse=True)
for smi, (fitness, count) in ranked[:10]:
    print(f"{fitness:.4f}  (evaluated {count}x)  {smi}")

Background

TYCHE is a package for randomizing SMILES and SELFIES strings. Randomization happens at the spanning tree, starting node, branch priorities, kekulization, and stereochemical labels. The underlying algorithm operates both on the graph and at the string level. SMILES randomization is a core building block for molecular data augmentation and generative model training, while the optimization framework enables guided exploration of chemical space toward any user-defined property objective. For chemical space exploration, TYCHE builds on the STONED algorithm (Nigam et al., 2021) and the genetic algorithm JANUS (Nigam et al., 2022), which demonstrated that mutating and interpolating through SELFIES space produces chemically valid, diverse molecular structures.


Support

If you encounter problems, please open an issue, describe your Python environment, and provide detailed steps that allow the problem to be reproduced.

Version History

The version history is detailed in the CHANGELOG.

Credits

No additional credits at this point in time.

License

MIT License
