Fast approximate Tree Edit Distance prediction for RNA secondary structures

These details have not been verified by PyPI

Project links

Project description

predTED

Fast approximate prediction of pairwise Tree Edit Distances (TED) for RNA secondary structures in dot-bracket notation. Uses 144 structural features with a LightGBM regression model to avoid the expensive exact O(n^4) computation.

Up to 100x faster than RNAdistance for pairwise distance matrices, with multi-threaded computation and SIMD acceleration. Designed for large-scale RNA structure comparisons with 100K+ structures.

Metric	Value
R^2	0.927
MAE	4.9
Pearson r	0.963

Predicted vs True TED

Installation

From PyPI (recommended)

pip install predted

This automatically compiles a C extension for ~28x faster feature computation. Falls back to pure Python if compilation fails.

Requirements: Python >= 3.9, numpy, lightgbm

From source

git clone https://github.com/syseitz/predTED.git
cd predTED
pip install .

C command-line tool

For high-throughput batch processing on clusters (multi-threaded, SIMD-accelerated):

# Build the CLI binary (requires LightGBM C library + OpenMP)
make cli

On macOS, the Makefile auto-detects LightGBM from the Python package and OpenMP from Homebrew (brew install libomp). On Linux, LightGBM must be available via pkg-config.

Usage

Python

import predted

# Single pair prediction
ted = predted.predict("((..))", "(())..")
print(ted)  # 20

# Raw float prediction (before rounding)
ted_float = predted.predict_float("((..))", "(())..")
print(ted_float)  # 19.6176

# Pairwise distance matrix
structures = ["((..))", "(())..", "...((..)).."]
matrix = predted.predict_matrix(structures)
print(matrix)
# [[ 0 20 22]
#  [20  0 21]
#  [22 21  0]]

# Float matrix
matrix_f = predted.predict_matrix(structures, dtype=float)

C command-line tool

Reads dot-bracket structures from stdin (one per line), outputs the pairwise distance matrix:

echo -e "((..))\n(())..\n...((..)).." | bin/predted

Options:

--threads, -t N        Set number of threads (default: auto)
--upper-only, -u       Output only upper triangle (j > i)
--float, -f            Output float values instead of integers
--binary, -b           Output raw binary bytes instead of text
--topk K, -k K         KNN mode: keep only K closest neighbours per row
--tau T, -T T          Distance threshold for KNN (default: 300)
--knn-prefix P, -K P   File prefix for KNN output (.idx + .dst memmaps)
--max-len-diff L, -L L Skip pairs differing by >L in length (prefilter)
--subsample S, -s S    Predict only ~1/S pairs (deterministic subsampling)

How it works

For each RNA structure, 36 features are extracted from the dot-bracket notation:

Structural counts: internal loops, multiloops, bulges, hairpin loops, stacked pairs
Depth statistics: mean/variance of nesting depth (overall, paired, unpaired), peaks
Stem properties: count, max/mean/variance of stem length
Loop properties: mean/variance of loop size, max loop size
Size features: structure length, paired/unpaired base counts
Base pair distances: mean and max
Hairpin/internal loop sizes: mean and max
Bigram frequencies: 8 dinucleotide pattern frequencies from {(, ), .}
Graph properties: centrality, tree depth

For each pair of structures, these 36 base features are combined into 144 pairwise features (|diff|, sum, min, max), which are fed into a LightGBM regression model to predict the TED.

Project structure

predted/              Python package
  __init__.py           API: predict(), predict_float(), predict_matrix()
  features.py           Feature computation (C extension with Python fallback)
c_src/                C source code
  predted_features.c    Feature computation (single source of truth, StructureContext)
  predted_features.h    Header (StructureContext typedef, feature functions)
  _features_module.c    Python C extension wrapper
  predTED.c             CLI binary (multi-threaded, SIMD, chunk-based parallelism)
  model.h               Embedded model for CLI (generated from model.txt)
model.txt             LightGBM model file
data/                 Training data
  structures.txt        1500 RNA structures
  ted_matrix.txt        Ground truth TED matrix
weights/              Training scripts and evaluation results
benchmarks/           Speed benchmarks and batch-size tuning

Training your own model

# Generate training data (requires ViennaRNA)
cd data && python generate_train_data.py

# Train with rich features (144)
python train_rich_model.py

# Regenerate C header for CLI
xxd -i model.txt c_src/model.h

Performance

Multi-threading (CLI)

The CLI uses OpenMP for chunk-based parallel computation. Each thread gets its own LightGBM booster instance for lock-free prediction. Measured on 1500 structures (M4 Mac, 1.1M pairs):

Threads	Time	Speedup
1	11.8s	1.0x
2	6.2s	1.9x
4	3.4s	3.5x
8	2.2s	5.4x

On SLURM clusters with 32+ cores, expect near-linear scaling.

vs RNAdistance

N	Pairs	predted (1 thread)	RNAdistance	Speedup
50	1,225	0.01s	0.6s	60x
100	4,950	0.03s	2.4s	80x
500	124,750	0.43s	44.1s	100x

Memory usage

The CLI streams results in chunks (ROW_CHUNK, default 256 rows), keeping RAM bounded:

N structures	Feature array	Chunk buffer	Total
10,000	2.7 MB	5 MB	~10 MB
100,000	27 MB	49 MB	~80 MB
500,000	137 MB	244 MB	~400 MB

The Python API uses row-by-row streaming with NumPy vectorisation (O(N) memory per row).

Running tests

pip install .[dev]
make cli       # optional, for CLI tests
make test

Batch-size tuning

The CLI processes LightGBM predictions in batches (default 8192). To find the optimal batch size for your system:

make bench-batch

Override at compile time: make cli CFLAGS="-O2 -march=native -DBATCH_SIZE=16384"

Licence

This project is licensed under the MIT Licence. See the LICENSE file for details.

Contact

For questions or issues, please open an issue on the GitHub repository.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.2.0

Apr 5, 2026

0.1.0

Feb 8, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

predted-0.2.0.tar.gz (16.1 MB view details)

Uploaded Apr 5, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

predted-0.2.0-cp310-cp310-macosx_11_0_arm64.whl (905.9 kB view details)

Uploaded Apr 5, 2026 CPython 3.10macOS 11.0+ ARM64

File details

Details for the file predted-0.2.0.tar.gz.

File metadata

Download URL: predted-0.2.0.tar.gz
Upload date: Apr 5, 2026
Size: 16.1 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.6

File hashes

Hashes for predted-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`04456ba5146b0fac18a6dd73f12684c6fc741b370e3d75d9fbc818a758c7243f`
MD5	`251d5ffcfe7c255724562146e3ab2156`
BLAKE2b-256	`90e50bb30b8d4bdd61edc81d94b955e3a91149427c627a6fdbea5373e1889aa1`

See more details on using hashes here.

File details

Details for the file predted-0.2.0-cp310-cp310-macosx_11_0_arm64.whl.

File metadata

Download URL: predted-0.2.0-cp310-cp310-macosx_11_0_arm64.whl
Upload date: Apr 5, 2026
Size: 905.9 kB
Tags: CPython 3.10, macOS 11.0+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.6

File hashes

Hashes for predted-0.2.0-cp310-cp310-macosx_11_0_arm64.whl
Algorithm	Hash digest
SHA256	`4ae1791059f0194b80b76a3ee62ffdeedeccc406c5249862cb6e8aaa3e4ff436`
MD5	`0bac1cbe8f88771c778015b011d28145`
BLAKE2b-256	`a0b5c77f895c305f541d96fae01cd1e128f70b2e8e563b37e932f929183c89fd`

See more details on using hashes here.

predted 0.2.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

predTED

Installation

From PyPI (recommended)

From source

C command-line tool

Usage

Python

C command-line tool

How it works

Project structure

Training your own model

Performance

Multi-threading (CLI)

vs RNAdistance

Memory usage

Running tests

Batch-size tuning

Licence

Contact

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes