# Infinigram: Variable-Length N-gram Language Models
Efficient variable-length n-gram language models using suffix arrays for O(m log n) pattern matching. Byte-level (UTF-8), works with any text.
## Overview
Infinigram is a corpus-based language model where the corpus IS the model — no gradient descent, no training loop. It uses suffix arrays to find variable-length patterns and predict continuations.
Key properties:

- **Instant training**: build a model by pointing it at a text corpus
- **Variable-length patterns**: automatically uses the longest matching context
- **Exact matching**: every prediction traces to actual corpus occurrences
- **O(m log n) queries**: binary search over a suffix array
- **Byte-level**: fixed 256-token vocabulary (UTF-8 bytes 0-255)
- **Mmap-backed**: memory-mapped files scale from 1 KB to 100+ GB corpora
## Quick Start

### Installation

```bash
pip install py-infinigram
```

Or install from source:

```bash
git clone https://github.com/queelius/infinigram.git
cd infinigram
pip install -e .
```
### Basic Usage

```python
from infinigram import Infinigram

# Create a model from text (byte-level, automatic UTF-8 encoding)
model = Infinigram(b"the cat sat on the mat the cat ran")

# Predict next-byte probabilities
probs = model.predict(b"the cat", top_k=5)
# {32: 1.0}  (a space, byte 32, follows every occurrence of b"the cat")

# Find the longest matching suffix
position, length = model.longest_suffix(b"the cat")
print(f"Matched {length} bytes at position {position}")

# String input works too (auto-encoded to UTF-8)
probs = model.predict("the cat", top_k=5)
```
### Persistent Models

```python
# Build and save a model
model = Infinigram.build("The quick brown fox jumps over the lazy dog.", "my_model")

# Load it later (instant: mmap-backed, not read into RAM)
model = Infinigram.load("my_model")
probs = model.predict("The quick")
```
## Key Features

### Variable-Length Pattern Matching

Infinigram automatically finds the longest matching suffix in the corpus:

```python
model = Infinigram(b"abcdef abcdef abcxyz")
pos, length = model.longest_suffix(b"abcd")  # length=4
pos, length = model.longest_suffix(b"abc")   # length=3
pos, length = model.longest_suffix(b"zzz")   # length=0 (no match)
```
### Multiple Prediction Strategies

```python
# Longest suffix match
probs = model.predict(b"the cat", top_k=10)
```
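Conceptually, prediction backs off from the full context to ever-shorter suffixes until one occurs in the corpus, then distributes probability over the bytes that follow it; the `smoothing` parameter on `predict()` suggests additive smoothing over the 256-byte vocabulary. A rough plain-Python sketch of that idea (not the library's implementation, which uses suffix-array search rather than linear scans; all names here are illustrative):

```python
# Toy longest-suffix backoff with additive smoothing, no dependencies.
from collections import Counter

def predict_longest_suffix(corpus: bytes, context: bytes, smoothing: float = 0.0):
    """Back off to shorter suffixes of `context` until one occurs in
    `corpus`, then return a distribution over the next byte."""
    for start in range(len(context) + 1):
        suffix = context[start:]
        # Count the byte that follows each occurrence of `suffix`.
        counts, i = Counter(), corpus.find(suffix)
        while i != -1:
            nxt = i + len(suffix)
            if nxt < len(corpus):
                counts[corpus[nxt]] += 1
            i = corpus.find(suffix, i + 1)
        if counts:
            total = sum(counts.values()) + smoothing * 256
            return {b: (c + smoothing) / total for b, c in counts.items()}
    return {}

probs = predict_longest_suffix(b"the cat sat on the mat", b"the cat")
# {32: 1.0}: a space follows the only occurrence of b"the cat"
```

With `smoothing=0.0` this reduces to the empirical distribution of observed continuations, matching the `{32: 1.0}` output in Basic Usage.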
### Dynamic Updates

```python
model = Infinigram(b"hello")
model.update(b" world")  # Appends and rebuilds the suffix array
print(model.n)           # 11
```
## API Reference

### Infinigram Class

```python
Infinigram(
    corpus_or_path,          # bytes, List[int], str, or Path
    max_length: int = None,  # Max suffix length (None = unlimited)
    min_count: int = 1,      # Min frequency threshold
)
```
### Methods

| Method | Description |
|---|---|
| `predict(context, top_k=50, smoothing=0.0)` | Next-token probability distribution |
| `predict_logprobs(context, top_k=50)` | Log-probabilities |
| `continuations(context, limit=10000)` | Raw continuation counts (`limit=None` for all) |
| `longest_suffix(context)` | `(position, length)` of longest match |
| `find_all_suffix_matches(context)` | All matches at all suffix lengths |
| `search(pattern, limit=None)` | All occurrence positions |
| `count(pattern)` | Number of occurrences |
| `update(new_tokens)` | Append tokens and rebuild index |
| `clear_cache()` | Clear prediction caches |
| `get_context(position, window=50)` | Corpus text around a position |
### Class Methods

| Method | Description |
|---|---|
| `Infinigram.build(corpus, path, ...)` | Build a persistent model |
| `Infinigram.load(path)` | Load an existing model (mmap, instant) |
| `Infinigram.list_models()` | List models in `~/.infinigram/models/` |
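Why `load()` can be "instant" regardless of corpus size: a memory-mapped file is mapped into the address space at load time, and pages are faulted in from disk only as queries touch them. A self-contained sketch of that mechanism using `numpy.memmap` (the two-file layout here is invented for illustration; it is not Infinigram's on-disk format):

```python
# Sketch: persist a corpus plus its suffix array as flat files, then
# "load" them by mapping rather than reading.  Layout is hypothetical.
import numpy as np, tempfile, os

corpus = b"the cat sat on the mat the cat ran"
# Toy suffix array: corpus positions sorted by the suffix starting there.
sa = np.array(sorted(range(len(corpus)), key=lambda i: corpus[i:]), dtype=np.int64)

d = tempfile.mkdtemp()
with open(os.path.join(d, "corpus.bin"), "wb") as f:
    f.write(corpus)
sa.tofile(os.path.join(d, "sa.bin"))

# Loading maps the files; no bytes are copied until pages are touched.
corpus2 = np.memmap(os.path.join(d, "corpus.bin"), dtype=np.uint8, mode="r")
sa2 = np.memmap(os.path.join(d, "sa.bin"), dtype=np.int64, mode="r")
```

With 64-bit offsets, the suffix array costs 8 bytes per corpus position, which is consistent with the ~9 bytes per corpus byte figure in the Performance section below.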
## Performance
Byte-level model with mmap-backed suffix arrays:
| Corpus Size | Construction | Query Latency |
|---|---|---|
| 1 KB | <1 ms | <0.1 ms |
| 1 MB | ~100 ms | <1 ms |
| 1 GB | ~30 s | <10 ms |
Memory: ~9 bytes per corpus byte (corpus + suffix array on disk, mmap'd).
Performance optimizations (v0.5.0):

- Vectorized continuation counting with `np.bincount` (3-10x speedup)
- LRU cache for repeated `continuations()` queries (5-20x for hot paths)
- Direct bytes comparison in binary search (2-5x vs Python loop)
- Batch SQLite inserts for dataset index rebuild
## Architecture

- Python API: `Infinigram` class (`predict`, `search`, `continuations`)
- Suffix Array Engine: `MmapSuffixArray`, `ChunkedMmapSuffixArray` (built on pydivsufsort)
See ARCHITECTURE.md for details.
## Testing

508 tests, 79% coverage:

```bash
pytest tests/                                     # All tests
pytest tests/ -v                                  # Verbose
pytest tests/ --cov=infinigram --cov-report=html  # Coverage
pytest tests/test_infinigram.py -k "cache"        # Keyword filter
```
## Thread Safety

Infinigram instances are not thread-safe. Give each thread its own model instance, or guard a shared instance with external synchronization.
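If a single shared instance is unavoidable, one generic option is a lock-guarded wrapper that serializes all queries. A sketch of that pattern (`LockedModel` and `FakeModel` are illustrative names, not part of the library; `FakeModel` stands in for an `Infinigram` instance):

```python
# Serialize access to one shared model with a threading.Lock.
import threading

class LockedModel:
    def __init__(self, model):
        self._model = model
        self._lock = threading.Lock()

    def predict(self, context, **kw):
        with self._lock:  # one query at a time
            return self._model.predict(context, **kw)

class FakeModel:
    """Stand-in for Infinigram so the sketch is self-contained."""
    def predict(self, context, **kw):
        return {32: 1.0}

m = LockedModel(FakeModel())
results = []
threads = [threading.Thread(target=lambda: results.append(m.predict(b"the cat")))
           for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
```

The lock trades throughput for safety; per-thread instances avoid that cost because mmap-backed models share the same on-disk pages anyway.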
## License
MIT License - see LICENSE for details.