Suffix smoothing classifier: three research-backed methods, conformal prediction, streaming.

These details have not been verified by PyPI

Project links

Project description

suffix-smoother

A lightweight, production-ready sequence classifier using recursive suffix smoothing.

Zero neural networks. Zero model files. Zero corpus downloads. Handles any unseen input via progressive backoff — the same technique that powered the TnT POS tagger (Brants 2000) before deep learning.

Why

Most sequence classifiers fail silently on out-of-vocabulary inputs. A neural tagger trained on English news text will output garbage when it sees a neologism, a domain-specific term, or a malformed token. This library degrades gracefully: if the full context is unseen, it backs off to a shorter suffix, then shorter still, until it reaches the uniform prior. It always returns a calibrated probability.

The same algorithm works across domains because the math is domain-agnostic. You provide (context_tuple, label_id) pairs — the library doesn't care whether those represent characters in a word, nucleotides in a genome, or event codes in a server log.

Install

pip install suffix-smoother

Quick Start

from suffix_smoother import SuffixSmoother, SuffixConfig

config = SuffixConfig(max_suffix_length=5, n_classes=2)
smoother = SuffixSmoother(config)

# Training: (context_tuple, label_id) pairs
smoother.train([
    ((101, 102, 103), 0),   # Normal sequence
    ((404, 404, 500), 1),   # Anomaly sequence
])

# Predict
label, confidence = smoother.predict((101, 102, 103))
# → (0, 0.87)

# Full distribution
dist = smoother.predict_distribution((101, 102))
# → {0: 0.72, 1: 0.28}

# Uncertainty in bits (0 = certain, log2(n_classes) = random)
bits = smoother.uncertainty((101, 102))

# Fraction of maximum uncertainty eliminated
reduction = smoother.uncertainty_reduction((101, 102))

The Math

P(label | seq_k) = λ · P_MLE(label | seq_k) + (1-λ) · P(label | seq_{k-1})

Base case: P(label | ∅) = 1 / n_classes

Where seq_k is the last k symbols of the input sequence. The recursion blends the maximum likelihood estimate at each suffix level with the estimate from shorter contexts, all the way back to a uniform prior. This is Jelinek-Mercer smoothing applied to suffix trees.

Use Cases

NLP — POS Tagging

from suffix_smoother import SuffixSmoother, SuffixConfig

TAGS = {"NOUN": 0, "VERB": 1, "ADJ": 2, "ADV": 3}

config = SuffixConfig(max_suffix_length=6, n_classes=len(TAGS))
smoother = SuffixSmoother(config)

# Encode suffix as char codes
def encode(word, maxlen=6):
    return tuple(ord(c) % 32 for c in word[-maxlen:])

# Train on (suffix_encoding, tag_id) pairs
smoother.train([
    (encode("running"), TAGS["VERB"]),
    (encode("quickly"), TAGS["ADV"]),
    (encode("creation"), TAGS["NOUN"]),
    # ... more training pairs
])

# Predict — works on any word, including OOV
label, conf = smoother.predict(encode("antidisestablishmentarianism"))
# Backs off: "ism" → suffix known → NOUN

Benchmark (UD English-EWT corpus):

Overall accuracy: 81.12%
OOV accuracy: 78.57% — nearly identical to in-vocabulary performance

Log Anomaly Detection

# Event codes: 101=LOGIN, 102=VIEW, 103=LOGOUT, 404=ERROR, 500=CRASH
config = SuffixConfig(max_suffix_length=4, n_classes=2)
smoother = SuffixSmoother(config)

smoother.train([
    ((101, 102, 103), 0),        # Normal
    ((101, 102, 102, 103), 0),   # Normal variant
    ((101, 404, 404, 404), 1),   # Anomaly: repeated errors
    ((102, 102, 500), 1),        # Anomaly: crash after double view
])

# Novel sequence — unseen but suffix matches
label, conf = smoother.predict((999, 102, 103))
# Backs off from full context to (102, 103) → NORMAL

Advantage: handles new service names, new error codes, and new event IDs without retraining.

Genomics — Pathogenicity Prediction

BASE = {"A": 0, "T": 1, "G": 2, "C": 3, "N": 4}

def encode_kmer(seq, k=6):
    return tuple(BASE.get(b.upper(), 4) for b in seq[:k])

# Classes: 0=BENIGN, 1=LIKELY_BENIGN, ... 4=PATHOGENIC
config = SuffixConfig(max_suffix_length=6, n_classes=8)
smoother = SuffixSmoother(config)

smoother.train(clinvar_training_data)  # (kmer_tuple, class_id) pairs

label, conf = smoother.predict(encode_kmer("TGCGAT"))

Benchmark (ClinVar, real hg38 flanking sequences via Ensembl REST API):

Pathogenic recall: 69.23% vs 0% naive baseline (always predict BENIGN)

API Reference

`SuffixConfig`

Parameter	Default	Description
`max_suffix_length`	`5`	Maximum context length
`smoothing_lambda`	`0.7`	λ weight: MLE vs backoff
`n_classes`	`16`	Number of output labels
`min_count`	`1`	Minimum observations to trust a suffix level

`SuffixSmoother`

Method	Returns	Description
`train(data)`	`dict`	Train on list of `(tuple, int)` pairs
`predict(seq)`	`(int, float)`	Best label and confidence
`predict_distribution(seq)`	`dict[int, float]`	Full probability distribution
`uncertainty(seq)`	`float`	Entropy in bits
`uncertainty_reduction(seq)`	`float`	Fraction of max entropy eliminated
`max_uncertainty()`	`float`	`log2(n_classes)` — theoretical maximum
`n_nodes`	`int`	Number of suffix nodes built

Performance

Inference: < 2ms per sequence (Python, no JIT)
Training throughput: > 50,000 samples/second
Memory: scales with unique suffix patterns observed, not total vocabulary size
Dependencies: numpy only

License

MIT

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.4.0

Mar 12, 2026

0.3.0

Mar 12, 2026

0.2.1

Mar 11, 2026

This version

0.2.0

Mar 11, 2026

0.1.0

Mar 10, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

suffix_smoother-0.2.0.tar.gz (14.9 kB view details)

Uploaded Mar 11, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

suffix_smoother-0.2.0-py3-none-any.whl (12.2 kB view details)

Uploaded Mar 11, 2026 Python 3

File details

Details for the file suffix_smoother-0.2.0.tar.gz.

File metadata

Download URL: suffix_smoother-0.2.0.tar.gz
Upload date: Mar 11, 2026
Size: 14.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for suffix_smoother-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`4f210b4f877ad25b6f0ced2f3b60ed7eb06671d34c8027415048c94057d60dd3`
MD5	`eecb443f7e9f9f2c20af3b1ede022667`
BLAKE2b-256	`ba0c2a5581b2771b5e093ca9a06fefcf20d87a0755f3e5318ff08e52c4f6a20f`

See more details on using hashes here.

File details

Details for the file suffix_smoother-0.2.0-py3-none-any.whl.

File metadata

Download URL: suffix_smoother-0.2.0-py3-none-any.whl
Upload date: Mar 11, 2026
Size: 12.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for suffix_smoother-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5f6ea9a4613d602cf7b4158360a5761511c9c11ca66314e5bc90460d56a23ef0`
MD5	`93a0798de5b5d5cc8e0d96840e3a518b`
BLAKE2b-256	`28c884b27addf7c8130331c843d6877a349500917686dd333e6ea7958f1bc96f`

See more details on using hashes here.

suffix-smoother 0.2.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

suffix-smoother

Why

Install

Quick Start

The Math

Use Cases

NLP — POS Tagging

Log Anomaly Detection

Genomics — Pathogenicity Prediction

API Reference

`SuffixConfig`

`SuffixSmoother`

Performance

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes