
Fast C-backed trigram language model for word prediction and sentence completion

trigram-llm 🧠

A fast, production-ready Python library for next-word prediction and sentence completion, powered by a hand-written C engine using a Prefix Trie, DJB2 HashMap, and Stupid Backoff smoothing.

Sub-millisecond predictions · Zero dependencies · Pure ctypes · Thread-safe
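
The description above names a DJB2 HashMap as one of the engine's building blocks. As a rough illustration only, the classic DJB2 string hash can be sketched in Python as follows. The 32-bit truncation, the `bucket_index` helper, and the bucket count are assumptions made for this example; the C engine's actual table layout is not described in this README.

```python
def djb2(key: str) -> int:
    """Classic DJB2 string hash, truncated to 32 bits (assumed width)."""
    h = 5381
    for byte in key.encode("utf-8"):
        # h * 33 + byte, kept within an unsigned 32-bit range
        h = (h * 33 + byte) & 0xFFFFFFFF
    return h


def bucket_index(key: str, num_buckets: int) -> int:
    """Map a key to a bucket (hypothetical helper for illustration)."""
    return djb2(key) % num_buckets


print(djb2("the quick"))
print(bucket_index("the quick", 1024))
```

Multiplying by 33 and adding each byte is cheap to compute in C, which is one reason DJB2 is a common choice for small string-keyed tables.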


Features

Feature                                           Description
train_from_text(text)                             Train from any Python string
train_from_file(path)                             Train from a text file (incremental)
train_from_list(words)                            Train from a pre-tokenised word list
predict_next(w1, w2)                              Greedy single-word prediction (< 1 ms)
predict_top_n(w1, w2, n, temperature)             Top-N predictions with probabilities
complete_sentence(prompt, num_words, beam_width)  Beam search sentence generation
greedy_generate(prompt, num_words)                Fastest sentence completion
perplexity(text)                                  Evaluate model quality on held-out text
vocabulary()                                      Returns all known words as a Python set
get_stats()                                       Dict with trigram count, vocab size, etc.
save(path) / TrigramModel.load(path)              Binary model persistence
reset()                                           Clear the model so it can be retrained from scratch
"the quick" in model                              Check if a bigram context was seen
len(model)                                        Total number of stored trigrams
Thread-safe                                       All predictions guarded by a threading.Lock
Context manager                                   with TrigramModel.load(path) as m:

Installation

Prerequisites

  • Python 3.8+
  • GCC (macOS: xcode-select --install, Ubuntu: sudo apt install gcc)

Install (one command)

cd /path/to/Trigrams
pip install -e .

This compiles the C engine into trigram/_trigram_c.dylib (or .so on Linux) and installs the package in editable mode.
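
As a sketch of how a ctypes binding might pick the right shared-library name per platform: the `engine_filename` and `engine_path` helpers below are hypothetical and only mirror the file names mentioned above. How `trigram/_lib.py` actually locates the library is not shown in this README.

```python
import sys
from pathlib import Path


def engine_filename(platform: str = sys.platform) -> str:
    """Return the compiled engine's file name for a platform (assumed naming)."""
    suffix = ".dylib" if platform == "darwin" else ".so"
    return "_trigram_c" + suffix


def engine_path(package_dir: Path, platform: str = sys.platform) -> Path:
    """Join the package directory with the platform-specific file name."""
    return package_dir / engine_filename(platform)


# A binding module could then load it with ctypes.CDLL(str(path)).
print(engine_path(Path("trigram"), "darwin"))
print(engine_path(Path("trigram"), "linux"))
```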


Quickstart

from trigram import TrigramModel

# 1. Create and train
model = TrigramModel()
model.train_from_text("""
    The quick brown fox jumps over the lazy dog.
    The quick brown fox was nimble and swift.
    The lazy dog slept peacefully under the old oak tree.
""")

# 2. Predict next word (greedy)
word = model.predict_next("the", "quick")
print(word)  # → "brown"

# 3. Top-N predictions with probabilities
preds = model.predict_top_n("the", "quick", n=3, temperature=1.0)
# Each entry has this shape (values below are illustrative, not computed from the corpus above):
# [{"word": "brown", "probability": 0.75, "count": 2},
#  {"word": "red",   "probability": 0.25, "count": 1}]

# 4. Sentence completion (beam search)
completions = model.complete_sentence("the quick", num_words=4, beam_width=3)
# [{"sentence": "the quick brown fox jumps", "probability": 0.012}, ...]

# 5. Greedy generation (fastest)
sentence = model.greedy_generate("the quick", num_words=3)
# "the quick brown fox"

# 6. Evaluate quality
ppl = model.perplexity("the quick brown fox")
print(f"Perplexity: {ppl:.2f}")

# 7. Inspect model
print(len(model))            # → total trigrams
print("the quick" in model)  # → True
print(model.vocabulary())    # → {"the", "quick", "brown", ...}
print(model.get_stats())     # → {"total_trigrams": ..., "unique_first_words": ..., "vocabulary_size": ...}

Training from a File

model = TrigramModel()
model.train_from_file("path/to/my_corpus.txt")

# Incremental training: add more data later
model.train_from_file("path/to/more_data.txt")

Saving and Loading Models

# Save
model.save("my_model.bin")

# Load (class method)
model2 = TrigramModel.load("my_model.bin")

# Context manager (auto-frees on exit)
with TrigramModel.load("my_model.bin") as m:
    print(m.predict_next("the", "quick"))

Temperature Sampling

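The temperature parameter reshapes the probability distribution over candidate words: low values concentrate probability on the most frequent continuation, high values flatten the distribution.

# Low temperature: probability mass concentrates on the most common word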
model.predict_top_n("the", "quick", temperature=0.1)

# Standard probability distribution
model.predict_top_n("the", "quick", temperature=1.0)

# More diverse / creative
model.predict_top_n("the", "quick", temperature=2.0)
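
To make the effect of temperature concrete, here is a minimal sketch of one common formulation: raise each probability to the power 1/T and renormalise. This formula and the `apply_temperature` helper are assumptions for illustration; the README does not state exactly how the C engine applies the parameter.

```python
def apply_temperature(probs, temperature=1.0):
    """Rescale a {word: probability} dict with p ** (1 / T), then renormalise."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {w: p ** (1.0 / temperature) for w, p in probs.items() if p > 0}
    total = sum(scaled.values())
    if total == 0:
        return {}
    return {w: s / total for w, s in scaled.items()}


base = {"brown": 0.75, "red": 0.25}  # illustrative values only
for t in (0.1, 1.0, 2.0):
    print(t, apply_temperature(base, t))
```

Under this formulation a temperature of 1.0 leaves the distribution unchanged, values below 1.0 sharpen it, and values above 1.0 move it toward uniform.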

Advanced Usage

Train from a word list (custom tokenisation)

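import nltk
from trigram import TrigramModel

# word_tokenize requires NLTK's Punkt tokenizer data to be downloaded first
tokens = nltk.word_tokenize("The quick brown fox")
# Lowercase and keep alphabetic tokens only
tokens = [t.lower() for t in tokens if t.isalpha()]

model = TrigramModel()
model.train_from_list(tokens)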

Thread-safe batch prediction

import threading
from trigram import TrigramModel

def worker(model, results, idx):
    # Each thread writes to its own slot, so results needs no extra locking
    results[idx] = model.predict_top_n("the", "quick", n=5)

model = TrigramModel.load("model.bin")
results = [None] * 10
threads = [threading.Thread(target=worker, args=(model, results, i)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
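
The feature list says predictions are guarded by a threading.Lock. The pattern can be sketched with a stand-in class; `LockedPredictor` and its lookup table are hypothetical and exist only to show the locking shape, not how TrigramModel is implemented.

```python
import threading


class LockedPredictor:
    """Toy predictor whose lookups are serialised by a single lock."""

    def __init__(self, table):
        self._table = table
        self._lock = threading.Lock()

    def predict_next(self, w1, w2):
        # Only one thread at a time may enter the guarded section
        with self._lock:
            return self._table.get((w1, w2))


predictor = LockedPredictor({("the", "quick"): "brown"})
results = [None] * 10


def worker(idx):
    results[idx] = predictor.predict_next("the", "quick")


threads = [threading.Thread(target=worker, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```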

Check if a context exists before predicting

if "the quick" in model:
    result = model.predict_next("the", "quick")

API Reference

TrigramModel()

Creates a new empty model.

train_from_text(text: str) → int

Train on a raw text string. Returns the number of trigrams inserted.

train_from_file(path) → int

Train from a text file. Returns the number of trigrams inserted.

train_from_list(words: list) → int

Train from a pre-tokenised word list. Returns the number of trigrams inserted.
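
For intuition, trigram extraction from a token list amounts to sliding a window of three over the words. The sketch below is a pure-Python stand-in; `count_trigrams` and `windows_in` are hypothetical names, and whether the library's return value counts every window or only newly seen trigrams is not stated here.

```python
from collections import Counter


def count_trigrams(words):
    """Count every consecutive (w1, w2, w3) window in a token list."""
    return Counter(zip(words, words[1:], words[2:]))


def windows_in(words):
    """Number of three-word windows in a list of tokens."""
    return max(len(words) - 2, 0)


tokens = "the quick brown fox jumps".split()
counts = count_trigrams(tokens)
print(counts[("the", "quick", "brown")])
print(windows_in(tokens))
```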

predict_next(w1, w2) → str | None

Return the single most likely next word, or None if no prediction is available.

predict_top_n(w1, w2, n=5, temperature=1.0) → list[dict]

Return up to N predictions sorted by probability descending. Each dict: {"word": str, "probability": float, "count": int}.
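
The project description names Stupid Backoff as the smoothing scheme. A minimal sketch of that scoring rule follows, using the commonly cited backoff factor of 0.4. The `BackoffCounts` class, the factor, and the toy corpus are assumptions for illustration; the C engine's constants and data structures may differ.

```python
from collections import Counter

ALPHA = 0.4  # commonly used backoff factor (assumed, not taken from the library)


class BackoffCounts:
    """Unigram, bigram, and trigram counts with Stupid Backoff scoring."""

    def __init__(self, words):
        self.uni = Counter(words)
        self.bi = Counter(zip(words, words[1:]))
        self.tri = Counter(zip(words, words[1:], words[2:]))
        self.total = len(words)

    def score(self, w1, w2, w3):
        # Trigram seen: relative frequency of w3 after (w1, w2)
        if self.tri[(w1, w2, w3)] > 0:
            return self.tri[(w1, w2, w3)] / self.bi[(w1, w2)]
        # Otherwise back off to the bigram (w2, w3), discounted once
        if self.bi[(w2, w3)] > 0:
            return ALPHA * self.bi[(w2, w3)] / self.uni[w2]
        # Otherwise back off to the unigram w3, discounted twice
        if self.total == 0:
            return 0.0
        return ALPHA * ALPHA * self.uni[w3] / self.total


counts = BackoffCounts("the quick brown fox the quick red fox".split())
print(counts.score("the", "quick", "brown"))
print(counts.score("lazy", "quick", "brown"))
print(counts.score("a", "b", "fox"))
```

Stupid Backoff scores are not normalised probabilities, so a library built on it may renormalise them before reporting a "probability" field.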

complete_sentence(prompt, num_words=5, beam_width=3) → list[dict]

Generate sentence completions via beam search. Each dict: {"sentence": str, "probability": float}.
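
As an illustration of beam search over a trigram table, the sketch below keeps the `beam_width` highest-scoring partial sentences at each step. The `beam_search` function, the toy table, lowercasing the prompt, and carrying dead-end beams forward unchanged are all assumptions for this example, not the library's documented behaviour.

```python
import math


def beam_search(table, prompt, num_words=5, beam_width=3):
    """Extend the prompt word by word, keeping the best beam_width candidates."""
    beams = [(0.0, prompt.lower().split())]  # (log-probability, words so far)
    for _ in range(num_words):
        candidates = []
        for logp, words in beams:
            options = table.get(tuple(words[-2:]), {})
            if not options:
                # No known continuation: keep the beam as it is
                candidates.append((logp, words))
                continue
            for word, p in options.items():
                candidates.append((logp + math.log(p), words + [word]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return [{"sentence": " ".join(w), "probability": math.exp(lp)} for lp, w in beams]


# Toy next-word distributions keyed by the two preceding words (illustrative values)
table = {
    ("the", "quick"): {"brown": 0.75, "red": 0.25},
    ("quick", "brown"): {"fox": 1.0},
    ("quick", "red"): {"fox": 1.0},
    ("brown", "fox"): {"jumps": 0.5, "was": 0.5},
}
for result in beam_search(table, "the quick", num_words=3, beam_width=2):
    print(result)
```

In this formulation, a beam width of 1 reduces to greedy decoding, which always follows the single best continuation.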

greedy_generate(prompt, num_words=5) → str

Fastest sentence completion using greedy decoding.

perplexity(text) → float

Compute per-token perplexity on held-out text. Lower is better.
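
Per-token perplexity is conventionally the exponential of the average negative log-probability the model assigns to each token. The sketch below computes it from a list of per-token probabilities; the `perplexity_from_probs` helper is hypothetical, and how the library scores unseen tokens is not described here.

```python
import math


def perplexity_from_probs(token_probs):
    """exp(-(1/N) * sum(log p_i)) over the per-token probabilities."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    if any(p <= 0 for p in token_probs):
        raise ValueError("probabilities must be positive")
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / len(token_probs))


# A model that gives every token probability 0.25 behaves like a
# uniform choice among four words.
print(perplexity_from_probs([0.25, 0.25, 0.25, 0.25]))
```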

vocabulary() → set[str]

All words seen in the first-word position of training trigrams.

get_stats() → dict

{"total_trigrams": int, "unique_first_words": int, "vocabulary_size": int}.

save(path) → None

Save model to binary file. Compatible with the C CLI tool.

TrigramModel.load(path) → TrigramModel (classmethod)

Load a pre-trained binary model. Supports context manager protocol.

reset() → None

Clear all training data.

len(model) → int

Total stored trigrams.

"w1 w2" in model / ("w1", "w2") in model → bool

Check if a bigram context exists.
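
Supporting both the string form and the tuple form of the membership test comes down to normalising the key inside `__contains__`. The `ContextIndex` class below is a hypothetical stand-in showing one way to do that; the library's own normalisation (for example, whether it lowercases) is not specified here.

```python
class ContextIndex:
    """Toy container that answers membership tests for bigram contexts."""

    def __init__(self, contexts):
        self._contexts = set(contexts)

    def __contains__(self, key):
        # Accept "w1 w2" strings as well as ("w1", "w2") tuples
        if isinstance(key, str):
            key = tuple(key.split())
        else:
            key = tuple(key)
        return len(key) == 2 and key in self._contexts


index = ContextIndex({("the", "quick"), ("quick", "brown")})
print("the quick" in index)
print(("quick", "brown") in index)
print("lazy dog" in index)
```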

repr(model)

TrigramModel(trigrams=11,062,203, vocab=97,277)


Performance

Operation                         Latency
Single word prediction            < 1 ms
Top-5 predictions                 1–2 ms
Beam search (5 words, width 3)    5–10 ms
Training (1M words)               ~30 s

Running Tests

pip install pytest
pytest tests/ -v

Project Structure

Trigrams/
├── trigram/                  # Python library
│   ├── __init__.py
│   ├── _lib.py               # ctypes bindings
│   ├── model.py              # TrigramModel class
│   ├── utils.py              # Text preprocessing
│   └── _trigram_c.dylib      # Compiled C engine (auto-generated)
├── trigram_llm/
│   ├── src/                  # C source files
│   └── include/              # C headers
├── tests/                    # pytest test suite
├── setup.py                  # Build script
└── pyproject.toml

License

MIT License. Feel free to use, modify, and distribute.
