Fast C-backed trigram language model for word prediction and sentence completion
trigram-llm
A fast, production-ready Python library for next-word prediction and sentence completion, powered by a hand-written C engine using a Prefix Trie, DJB2 HashMap, and Stupid Backoff smoothing.
Sub-millisecond predictions · Zero dependencies · Pure ctypes · Thread-safe
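For readers unfamiliar with DJB2, the hash named above is the classic "multiply by 33 and add the next byte" string hash. The sketch below is a pure-Python illustration of that general algorithm, not the library's C code; the 32-bit truncation and the `bucket_index` helper are assumptions made for the example.

```python
def djb2(key: str) -> int:
    """Classic DJB2 string hash: h = h * 33 + byte, starting from 5381.

    Truncating to 32 bits is an assumption; a C engine may use a wider type.
    """
    h = 5381
    for byte in key.encode("utf-8"):
        h = ((h << 5) + h + byte) & 0xFFFFFFFF  # (h << 5) + h == h * 33
    return h


def bucket_index(key: str, num_buckets: int) -> int:
    """Hypothetical helper: map a key to a slot in a table of num_buckets slots."""
    return djb2(key) % num_buckets


print(djb2("the"), bucket_index("the", 1024))
```

A hash map built on this would store each word (or word pair) in the bucket that `bucket_index` selects and resolve collisions within the bucket.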
Features
| Feature | Description |
|---|---|
| train_from_text(text) | Train from any Python string |
| train_from_file(path) | Train from a text file (incremental) |
| train_from_list(words) | Train from a pre-tokenised word list |
| predict_next(w1, w2) | Greedy single-word prediction (< 1 ms) |
| predict_top_n(w1, w2, n, temperature) | Top-N predictions with probabilities |
| complete_sentence(prompt, num_words, beam_width) | Beam search sentence generation |
| greedy_generate(prompt, num_words) | Fastest sentence completion |
| perplexity(text) | Evaluate model quality on held-out text |
| vocabulary() | Returns all known words as a Python set |
| get_stats() | Dict with trigram count, vocab size, etc. |
| save(path) / TrigramModel.load(path) | Binary model persistence |
| reset() | Clear model and retrain from scratch |
| "the quick" in model | Check if a bigram context was seen |
| len(model) | Total number of stored trigrams |
| Thread-safe | All predictions guarded by a threading.Lock |
| Context manager | with TrigramModel.load(path) as m: |
Installation
Prerequisites
- Python 3.8+
- GCC (macOS: xcode-select --install; Ubuntu: sudo apt install gcc)
Install (one command)
cd /path/to/Trigrams
pip install -e .
This compiles the C engine into trigram/_trigram_c.dylib (or .so on Linux) and installs the package in editable mode.
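Because the install step compiles C code, it can fail if a prerequisite is missing. The following is a small optional check, written as a sketch; `build_prerequisites_ok` is a hypothetical helper and not part of the package.

```python
import shutil
import sys


def build_prerequisites_ok(min_python=(3, 8), compiler="gcc") -> bool:
    """Return True if the interpreter is new enough and the compiler is on PATH."""
    python_ok = sys.version_info >= min_python
    compiler_ok = shutil.which(compiler) is not None
    return python_ok and compiler_ok


if __name__ == "__main__":
    print("Ready to build" if build_prerequisites_ok() else "Missing Python 3.8+ or gcc")
```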
Quickstart
from trigram import TrigramModel
# 1. Create and train
model = TrigramModel()
model.train_from_text("""
The quick brown fox jumps over the lazy dog.
The quick brown fox was nimble and swift.
The lazy dog slept peacefully under the old oak tree.
""")
# 2. Predict next word (greedy)
word = model.predict_next("the", "quick")
print(word) # → "brown"
# 3. Top-N predictions with probabilities
preds = model.predict_top_n("the", "quick", n=3, temperature=1.0)
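# Illustrative output shape only (the sample text above contains no "red";
# actual words and values depend on your corpus):
# [{"word": "brown", "probability": 0.75, "count": 3},
#  {"word": "red", "probability": 0.25, "count": 1}]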
# 4. Sentence completion (beam search)
completions = model.complete_sentence("the quick", num_words=4, beam_width=3)
# [{"sentence": "the quick brown fox jumps", "probability": 0.012}, ...]
# 5. Greedy generation (fastest)
sentence = model.greedy_generate("the quick", num_words=3)
# "the quick brown fox"
# 6. Evaluate quality
ppl = model.perplexity("the quick brown fox")
print(f"Perplexity: {ppl:.2f}")
# 7. Inspect model
print(len(model)) # → total trigrams
print("the quick" in model) # → True
print(model.vocabulary()) # → {"the", "quick", "brown", ...}
print(model.get_stats()) # → {"total_trigrams": 7, "unique_first_words": 3, ...}
Training from a File
model = TrigramModel()
model.train_from_file("path/to/my_corpus.txt")
# Incremental training: add more data later
model.train_from_file("path/to/more_data.txt")
Saving and Loading Models
# Save
model.save("my_model.bin")
# Load (class method)
model2 = TrigramModel.load("my_model.bin")
# Context manager (auto-frees on exit)
with TrigramModel.load("my_model.bin") as m:
    print(m.predict_next("the", "quick"))
Temperature Sampling
The temperature parameter controls how creative predictions are:
# Near-deterministic: probability mass concentrates on the most common word
model.predict_top_n("the", "quick", temperature=0.1)
# Standard probability distribution
model.predict_top_n("the", "quick", temperature=1.0)
# More diverse / creative
model.predict_top_n("the", "quick", temperature=2.0)
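To make the effect of temperature concrete, here is a minimal sketch of one common way to apply it: divide log-probabilities by the temperature and renormalise. This is an assumption about the general technique, not a description of what the C engine does; `apply_temperature` and the toy counts are hypothetical.

```python
import math


def apply_temperature(counts, temperature=1.0):
    """Turn positive raw counts into probabilities reshaped by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    total = sum(counts.values())
    # Scale log-probabilities by 1/T, then apply a numerically stable softmax.
    logits = {w: math.log(c / total) / temperature for w, c in counts.items()}
    peak = max(logits.values())
    weights = {w: math.exp(v - peak) for w, v in logits.items()}
    norm = sum(weights.values())
    return {w: v / norm for w, v in weights.items()}


toy_counts = {"brown": 3, "red": 1}  # made-up counts for illustration
for t in (0.1, 1.0, 2.0):
    print(t, apply_temperature(toy_counts, t))
```

At temperature 1.0 the counts map to their plain relative frequencies; lower values sharpen the distribution toward the top word, and higher values flatten it.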
Advanced Usage
Train from a word list (custom tokenisation)
import nltk
tokens = nltk.word_tokenize("The quick brown fox")
tokens = [t.lower() for t in tokens if t.isalpha()]
model = TrigramModel()
model.train_from_list(tokens)
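NLTK is a third-party package and needs its tokenizer data downloaded before word_tokenize works. If you would rather stay dependency-free, a regular expression can produce a comparable word list. The sketch below is one possible tokenizer; `simple_tokenize` and its pattern are illustrative choices, not part of the library.

```python
import re

# Lowercase ASCII words, optionally with one internal apostrophe (e.g. "isn't").
_WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def simple_tokenize(text: str) -> list:
    """Lowercase the text and return its alphabetic words in order."""
    return _WORD_RE.findall(text.lower())


tokens = simple_tokenize("The quick brown fox, isn't it?")
print(tokens)
# The result can then be passed to model.train_from_list(tokens).
```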
Thread-safe batch prediction
import threading
def worker(model, results, idx):
    results[idx] = model.predict_top_n("the", "quick", n=5)
model = TrigramModel.load("model.bin")
results = [None] * 10
threads = [threading.Thread(target=worker, args=(model, results, i)) for i in range(10)]
for t in threads: t.start()
for t in threads: t.join()
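The same batch pattern can be written with the standard library's thread pool, which handles thread start-up and result ordering for you. This is a sketch: `predict_batch` is a hypothetical helper that accepts any two-argument callable, such as model.predict_next.

```python
from concurrent.futures import ThreadPoolExecutor


def predict_batch(predict, contexts, max_workers=4):
    """Call predict(w1, w2) for each (w1, w2) context on a thread pool.

    Results come back in the same order as the input contexts.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda ctx: predict(*ctx), contexts))


# Example with a trained model (assumed to exist):
# results = predict_batch(model.predict_next, [("the", "quick"), ("quick", "brown")])
```

If every prediction holds the same lock, as the Features table states, the calls are likely to run one at a time, so threads here buy safety and convenience rather than a parallel speed-up.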
Check if a context exists before predicting
if "the quick" in model:
    result = model.predict_next("the", "quick")
API Reference
TrigramModel()
Creates a new empty model.
train_from_text(text: str) → int
Train on a raw text string. Returns the number of trigrams inserted.
train_from_file(path) → int
Train from a text file. Returns the number of trigrams inserted.
train_from_list(words: list) → int
Train from a pre-tokenised word list. Returns the number of trigrams inserted.
predict_next(w1, w2) → str | None
Return the single most-likely next word or None.
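The project description credits Stupid Backoff smoothing for scoring candidates like this one. As a rough illustration of that general scheme (not the engine's actual code), the sketch below scores a word with the trigram relative frequency when the trigram was seen, and otherwise backs off to the bigram and then the unigram, multiplying by a fixed factor each time. The factor 0.4 is the value suggested in the original Stupid Backoff paper; whether this library uses it is an assumption, and both function names are hypothetical.

```python
from collections import Counter

ALPHA = 0.4  # backoff factor from the Stupid Backoff paper (assumed here)


def ngram_counts(tokens):
    """Count unigrams, bigrams, and trigrams in a token list."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return uni, bi, tri


def stupid_backoff_score(w1, w2, w3, uni, bi, tri, alpha=ALPHA):
    """Score w3 after (w1, w2). Scores are relative, not normalised probabilities."""
    if tri[(w1, w2, w3)] > 0:
        return tri[(w1, w2, w3)] / bi[(w1, w2)]
    if bi[(w2, w3)] > 0:
        return alpha * bi[(w2, w3)] / uni[w2]
    return alpha * alpha * uni[w3] / sum(uni.values())


uni, bi, tri = ngram_counts("the quick brown fox the quick red fox".split())
print(stupid_backoff_score("the", "quick", "brown", uni, bi, tri))
```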
predict_top_n(w1, w2, n=5, temperature=1.0) → list[dict]
Return up to N predictions sorted by probability descending.
Each dict: {"word": str, "probability": float, "count": int}.
complete_sentence(prompt, num_words=5, beam_width=3) → list[dict]
Generate sentence completions via beam search.
Each dict: {"sentence": str, "probability": float}.
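For intuition about what beam search does here, the sketch below runs it over a toy trigram table built in pure Python. It is an illustration of the general technique under simple assumptions (plain relative frequencies, no backoff, dead-end beams carried forward unchanged); `build_table` and `beam_search` are hypothetical names, and the real engine may differ in every detail.

```python
import math
from collections import defaultdict


def build_table(tokens):
    """Map each (w1, w2) context to {next_word: relative frequency}."""
    counts = defaultdict(lambda: defaultdict(int))
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        counts[(w1, w2)][w3] += 1
    table = {}
    for ctx, nxt in counts.items():
        total = sum(nxt.values())
        table[ctx] = {w: c / total for w, c in nxt.items()}
    return table


def beam_search(table, prompt, num_words=5, beam_width=3):
    """Keep the beam_width most probable partial sentences at each step."""
    start = prompt.lower().split()
    if len(start) < 2:
        raise ValueError("prompt needs at least two words")
    beams = [(0.0, start)]  # (log-probability, words so far)
    for _ in range(num_words):
        candidates = []
        for logp, words in beams:
            options = table.get((words[-2], words[-1]))
            if not options:
                candidates.append((logp, words))  # unseen context: keep as is
                continue
            for word, p in options.items():
                candidates.append((logp + math.log(p), words + [word]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return [{"sentence": " ".join(w), "probability": math.exp(lp)} for lp, w in beams]


corpus = "the quick brown fox jumps the quick brown dog sleeps the quick red fox".split()
print(beam_search(build_table(corpus), "the quick", num_words=1, beam_width=2))
```

A wider beam explores more alternatives per step at proportionally higher cost; a beam width of 1 reduces to greedy decoding.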
greedy_generate(prompt, num_words=5) → str
Fastest sentence completion using greedy decoding.
perplexity(text) → float
Compute per-token perplexity on held-out text. Lower = better.
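Per-token perplexity is conventionally the exponential of the average negative log-probability the model assigns to each token. The sketch below shows that standard formula on a list of per-token probabilities; `perplexity_from_probs` is a hypothetical helper, and how the library obtains those probabilities (or handles unseen tokens) is not specified here.

```python
import math


def perplexity_from_probs(probs):
    """exp(-(1/N) * sum(log p_i)) over the per-token probabilities p_i."""
    if not probs:
        raise ValueError("need at least one token probability")
    if any(p <= 0 for p in probs):
        return math.inf  # a zero-probability token makes perplexity unbounded
    avg_neg_log = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(avg_neg_log)


print(perplexity_from_probs([0.25, 0.25, 0.25, 0.25]))
```

A model that assigns every token probability 1/k has perplexity k, which is why lower is better. Note that Stupid Backoff scores are not normalised probabilities, so a perplexity derived from them is best used to compare models of the same kind.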
vocabulary() → set[str]
All words seen in the first-word position of training trigrams.
get_stats() → dict
{"total_trigrams": int, "unique_first_words": int, "vocabulary_size": int}.
save(path) → None
Save model to binary file. Compatible with the C CLI tool.
TrigramModel.load(path) → TrigramModel (classmethod)
Load a pre-trained binary model. Supports context manager protocol.
reset() → None
Clear all training data.
len(model) → int
Total stored trigrams.
"w1 w2" in model / ("w1", "w2") in model → bool
Check if a bigram context exists.
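Supporting both key forms typically comes down to normalising the key before the lookup. The sketch below shows one way that could work; `normalize_context` is a hypothetical helper, and the lowercasing step is an assumption based on the Quickstart, where text containing "The quick" is queried as "the quick".

```python
def normalize_context(key):
    """Accept "w1 w2" or ("w1", "w2") and return a lowercase (w1, w2) tuple."""
    if isinstance(key, str):
        parts = key.split()
    else:
        parts = list(key)
    if len(parts) != 2:
        raise ValueError("context must contain exactly two words")
    return parts[0].lower(), parts[1].lower()


print(normalize_context("The quick"))
print(normalize_context(("the", "quick")))
```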
repr(model)
Returns a summary string, for example: TrigramModel(trigrams=11,062,203, vocab=97,277)
Performance
| Operation | Latency |
|---|---|
| Single word prediction | < 1ms |
| Top-5 predictions | 1-2 ms |
| Beam search (5 words, width 3) | 5-10 ms |
| Training (1M words) | ~30s |
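Latency depends on hardware and corpus size, so it is worth measuring on your own machine. The harness below is a minimal sketch using only the standard library; `measure_latency_ms` is a hypothetical helper that times any callable.

```python
import statistics
import time


def measure_latency_ms(fn, *args, repeats=1000, **kwargs):
    """Call fn repeatedly and report median, 95th percentile, and max latency in ms."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return {
        "median_ms": statistics.median(samples),
        "p95_ms": samples[int(0.95 * (len(samples) - 1))],
        "max_ms": samples[-1],
    }


# Example with a trained model (assumed to exist):
# print(measure_latency_ms(model.predict_next, "the", "quick"))
```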
Running Tests
pip install pytest
pytest tests/ -v
Project Structure
Trigrams/
├── trigram/               # Python library
│   ├── __init__.py
│   ├── _lib.py            # ctypes bindings
│   ├── model.py           # TrigramModel class
│   ├── utils.py           # Text preprocessing
│   └── _trigram_c.dylib   # Compiled C engine (auto-generated)
├── trigram_llm/
│   ├── src/               # C source files
│   └── include/           # C headers
├── tests/                 # pytest test suite
├── setup.py               # Build script
└── pyproject.toml
License
MIT License. Feel free to use, modify, and distribute.