Skip to main content

Suffix smoothing classifier: three research-backed methods, conformal prediction, streaming.

Project description

suffix-smoother

A lightweight, production-ready sequence classifier using recursive suffix smoothing.

Zero neural networks. Zero model files. Zero corpus downloads. Handles any unseen input via progressive backoff — the same technique that powered the TnT POS tagger (Brants 2000) before deep learning.


Why

Most sequence classifiers fail silently on out-of-vocabulary inputs. A neural tagger trained on English news text will output garbage when it sees a neologism, a domain-specific term, or a malformed token. This library degrades gracefully: if the full context is unseen, it backs off to a shorter suffix, then shorter still, until it reaches the uniform prior. It always returns a calibrated probability.

The same algorithm works across domains because the math is domain-agnostic. You provide (context_tuple, label_id) pairs — the library doesn't care whether those represent characters in a word, nucleotides in a genome, or event codes in a server log.


Install

pip install suffix-smoother

Quick Start

from suffix_smoother import SuffixSmoother, SuffixConfig

config = SuffixConfig(max_suffix_length=5, n_classes=2)
smoother = SuffixSmoother(config)

# Training: (context_tuple, label_id) pairs
smoother.train([
    ((101, 102, 103), 0),   # Normal sequence
    ((404, 404, 500), 1),   # Anomaly sequence
])

# Predict
label, confidence = smoother.predict((101, 102, 103))
# → (0, 0.87)

# Full distribution
dist = smoother.predict_distribution((101, 102))
# → {0: 0.72, 1: 0.28}

# Uncertainty in bits (0 = certain, log2(n_classes) = random)
bits = smoother.uncertainty((101, 102))

# Fraction of maximum uncertainty eliminated
reduction = smoother.uncertainty_reduction((101, 102))

The Math

P(label | seq_k) = λ · P_MLE(label | seq_k) + (1-λ) · P(label | seq_{k-1})

Base case: P(label | ∅) = 1 / n_classes

Where seq_k is the last k symbols of the input sequence. The recursion blends the maximum likelihood estimate at each suffix level with the estimate from shorter contexts, all the way back to a uniform prior. This is Jelinek-Mercer smoothing applied to suffix trees.


Use Cases

NLP — POS Tagging

from suffix_smoother import SuffixSmoother, SuffixConfig

TAGS = {"NOUN": 0, "VERB": 1, "ADJ": 2, "ADV": 3}

config = SuffixConfig(max_suffix_length=6, n_classes=len(TAGS))
smoother = SuffixSmoother(config)

# Encode suffix as char codes
def encode(word, maxlen=6):
    return tuple(ord(c) % 32 for c in word[-maxlen:])

# Train on (suffix_encoding, tag_id) pairs
smoother.train([
    (encode("running"), TAGS["VERB"]),
    (encode("quickly"), TAGS["ADV"]),
    (encode("creation"), TAGS["NOUN"]),
    # ... more training pairs
])

# Predict — works on any word, including OOV
label, conf = smoother.predict(encode("antidisestablishmentarianism"))
# Backs off: "ism" → suffix known → NOUN

Benchmark (UD English-EWT corpus):

  • Overall accuracy: 81.12%
  • OOV accuracy: 78.57% — nearly identical to in-vocabulary performance

Log Anomaly Detection

# Event codes: 101=LOGIN, 102=VIEW, 103=LOGOUT, 404=ERROR, 500=CRASH
config = SuffixConfig(max_suffix_length=4, n_classes=2)
smoother = SuffixSmoother(config)

smoother.train([
    ((101, 102, 103), 0),        # Normal
    ((101, 102, 102, 103), 0),   # Normal variant
    ((101, 404, 404, 404), 1),   # Anomaly: repeated errors
    ((102, 102, 500), 1),        # Anomaly: crash after double view
])

# Novel sequence — unseen but suffix matches
label, conf = smoother.predict((999, 102, 103))
# Backs off from full context to (102, 103) → NORMAL

Advantage: handles new service names, new error codes, and new event IDs without retraining.

Genomics — Pathogenicity Prediction

BASE = {"A": 0, "T": 1, "G": 2, "C": 3, "N": 4}

def encode_kmer(seq, k=6):
    return tuple(BASE.get(b.upper(), 4) for b in seq[:k])

# Classes: 0=BENIGN, 1=LIKELY_BENIGN, ... 4=PATHOGENIC
config = SuffixConfig(max_suffix_length=6, n_classes=8)
smoother = SuffixSmoother(config)

smoother.train(clinvar_training_data)  # (kmer_tuple, class_id) pairs

label, conf = smoother.predict(encode_kmer("TGCGAT"))

Benchmark (ClinVar, real hg38 flanking sequences via Ensembl REST API):

  • Pathogenic recall: 69.23% vs 0% naive baseline (always predict BENIGN)

API Reference

SuffixConfig

Parameter Default Description
max_suffix_length 5 Maximum context length
smoothing_lambda 0.7 λ weight: MLE vs backoff
n_classes 16 Number of output labels
min_count 1 Minimum observations to trust a suffix level

SuffixSmoother

Method Returns Description
train(data) dict Train on list of (tuple, int) pairs
predict(seq) (int, float) Best label and confidence
predict_distribution(seq) dict[int, float] Full probability distribution
uncertainty(seq) float Entropy in bits
uncertainty_reduction(seq) float Fraction of max entropy eliminated
max_uncertainty() float log2(n_classes) — theoretical maximum
n_nodes int Number of suffix nodes built

Performance

  • Inference: < 2ms per sequence (Python, no JIT)
  • Training throughput: > 50,000 samples/second
  • Memory: scales with unique suffix patterns observed, not total vocabulary size
  • Dependencies: numpy only

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

suffix_smoother-0.2.0.tar.gz (14.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

suffix_smoother-0.2.0-py3-none-any.whl (12.2 kB view details)

Uploaded Python 3

File details

Details for the file suffix_smoother-0.2.0.tar.gz.

File metadata

  • Download URL: suffix_smoother-0.2.0.tar.gz
  • Upload date:
  • Size: 14.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for suffix_smoother-0.2.0.tar.gz
Algorithm Hash digest
SHA256 4f210b4f877ad25b6f0ced2f3b60ed7eb06671d34c8027415048c94057d60dd3
MD5 eecb443f7e9f9f2c20af3b1ede022667
BLAKE2b-256 ba0c2a5581b2771b5e093ca9a06fefcf20d87a0755f3e5318ff08e52c4f6a20f

See more details on using hashes here.

File details

Details for the file suffix_smoother-0.2.0-py3-none-any.whl.

File metadata

File hashes

Hashes for suffix_smoother-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 5f6ea9a4613d602cf7b4158360a5761511c9c11ca66314e5bc90460d56a23ef0
MD5 93a0798de5b5d5cc8e0d96840e3a518b
BLAKE2b-256 28c884b27addf7c8130331c843d6877a349500917686dd333e6ea7958f1bc96f

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page