
Infinigram: Variable-Length N-gram Language Models

Python 3.8+ · Tests · License

Efficient variable-length n-gram language models using suffix arrays for O(m log n) pattern matching.

🎯 Overview

Infinigram is a language model that uses suffix arrays to efficiently find and score variable-length patterns in a corpus. Unlike traditional fixed-order n-grams, Infinigram automatically finds the longest matching context (up to a configurable maximum) and uses it for prediction.

Key advantages over traditional n-grams:

  • 🚀 Variable-length patterns: automatically uses as much context as is available
  • 💾 Memory efficient: O(n) space vs. O(V^n) for hash-based n-grams
  • 🎯 Accurate: longer matched contexts yield better predictions
  • 🔄 Dynamic updates: new data can be appended to the corpus
  • ⚡ Fast queries: O(m log n) suffix array search

🚀 Quick Start

Installation

pip install infinigram

Or install from source:

git clone https://github.com/yourusername/infinigram.git
cd infinigram
pip install -e .

Interactive REPL

The easiest way to explore Infinigram is through the interactive REPL:

# Launch the REPL
infinigram-repl

# Or using Python
python -m infinigram.repl

Quick REPL Tour:

# Create a dataset (datasets are models!)
infinigram> /dataset demo
infinigram [demo]> /load the cat sat on the mat

# Make predictions
infinigram [demo]> /predict the cat
# Shows probability distribution for next byte

# Generate completions
infinigram [demo]> /complete the cat --max 20

# Apply augmentations for case-insensitive matching
infinigram [demo]> /augment lowercase uppercase

# Create multiple datasets and compare
infinigram [demo]> /dataset copy demo demo_backup
infinigram [demo]> /datasets
infinigram [demo]> /use demo_backup

# Use bash commands for data inspection
infinigram [demo]> !ls -lh data/
infinigram [demo]> !head data/sample.txt

# Type /help for all commands
infinigram [demo]> /help

See REPL_GUIDE.md for comprehensive documentation.

Basic Usage (Python API)

from infinigram import Infinigram

# Create a model from a corpus
corpus = [1, 2, 3, 4, 2, 3, 5, 6, 2, 3, 4]
model = Infinigram(corpus, max_length=10)

# Predict next token
context = [2, 3]
probs = model.predict(context)
print(probs)
# {4: 0.6569, 5: 0.3301, 1: 0.0033, ...}

# Find longest matching suffix
position, length = model.longest_suffix(context)
print(f"Matched {length} tokens at position {position}")

# Get confidence score (0.0 to 1.0)
confidence = model.confidence(context)
print(f"Confidence: {confidence:.4f}")

# Update with new data
model.update([2, 3, 7, 8, 9])

Text Example

from infinigram import Infinigram

# Prepare text corpus
sentences = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat ran on the mat"
]

# Build vocabulary and tokenize
vocab = {}
corpus = []
for sent in sentences:
    for word in sent.split():
        if word not in vocab:
            vocab[word] = len(vocab)
        corpus.append(vocab[word])

# Create model
model = Infinigram(corpus, max_length=5)

# Predict after "the cat"
context = [vocab["the"], vocab["cat"]]
probs = model.predict(context)

# Convert predictions back to words
id_to_word = {v: k for k, v in vocab.items()}
for token_id, prob in sorted(probs.items(), key=lambda x: x[1], reverse=True)[:3]:
    print(f"{id_to_word[token_id]}: {prob:.3f}")

🔑 Key Features

Variable-Length Pattern Matching

Infinigram automatically finds the longest matching suffix in the corpus:

corpus = [1, 2, 3, 4, 5, 1, 2, 3, 6, 1, 2, 3, 4, 7]
model = Infinigram(corpus, max_length=10)

# Short context - finds exact match
model.longest_suffix([1, 2])      # Returns (0, 2)

# Longer context - finds longer match
model.longest_suffix([1, 2, 3, 4]) # Returns (0, 4)

Confidence Scoring

Longer matches provide higher confidence:

model.confidence([1])           # Low confidence (~0.07)
model.confidence([1, 2])        # Medium confidence (~0.15)
model.confidence([1, 2, 3, 4])  # High confidence (~0.29)

Dynamic Corpus Updates

Add new data without rebuilding from scratch:

model = Infinigram([1, 2, 3, 4, 5])
print(len(model.corpus))  # 5

model.update([6, 7, 8, 9])
print(len(model.corpus))  # 9

# Predictions automatically reflect new data

Efficient Suffix Arrays

The underlying suffix array provides:

  • O(n log n) construction time
  • O(m log n) query time (m = pattern length, n = corpus size)
  • O(n) space complexity
  • Binary search for pattern matching
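To make the data structure concrete, here is an illustrative pure-Python sketch of suffix-array construction and binary-search pattern matching. This is not the package's implementation: it uses naive sorting for clarity, whereas O(n log n) construction requires algorithms such as prefix-doubling or SA-IS.

```python
from typing import List, Tuple

def build_suffix_array(corpus: List[int]) -> List[int]:
    # Sort suffix start positions by the suffix each one begins.
    # Naive construction for clarity; real implementations avoid
    # materializing the suffixes.
    return sorted(range(len(corpus)), key=lambda i: corpus[i:])

def find_range(corpus: List[int], sa: List[int],
               pattern: List[int]) -> Tuple[int, int]:
    # Two binary searches locate the half-open range of suffixes that
    # start with `pattern` -- O(m log n) total, m = len(pattern).
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # lower bound: first suffix >= pattern
        mid = (lo + hi) // 2
        if corpus[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # upper bound: first suffix whose prefix > pattern
        mid = (lo + hi) // 2
        if corpus[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo  # sa[start:lo] are all occurrence positions

corpus = [1, 2, 3, 4, 2, 3, 5, 6, 2, 3, 4]
sa = build_suffix_array(corpus)
start, end = find_range(corpus, sa, [2, 3])
print(end - start)            # number of occurrences of [2, 3]
print(sorted(sa[start:end]))  # their positions in the corpus
```

Because the suffix array groups all suffixes sharing a prefix into one contiguous block, occurrence counting and continuation lookup both reduce to these two binary searches.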

📊 Performance

From benchmarks on various corpus sizes:

| Corpus Size | Construction | Avg Prediction | Avg Suffix Search |
|-------------|--------------|----------------|-------------------|
| 100 tokens  | 0.07 ms      | 0.043 ms       | 0.014 ms          |
| 1K tokens   | 6.09 ms      | 0.390 ms       | 0.184 ms          |
| 10K tokens  | 718 ms       | 4.370 ms       | 2.373 ms          |

Memory efficiency: For a 1B token corpus:

  • Hash-based 5-gram: ~34 GB
  • Infinigram suffix array: ~1 GB
  • 34x reduction in memory usage

🛠️ API Reference

Infinigram Class

class Infinigram:
    def __init__(
        self,
        corpus: List[int],
        max_length: Optional[int] = None,
        min_count: int = 1,
        smoothing: float = 0.01
    )

Parameters:

  • corpus: List of integer token IDs
  • max_length: Maximum pattern length to consider (None = unlimited)
  • min_count: Minimum occurrences for a pattern to be included
  • smoothing: Laplace smoothing factor for unseen tokens

Methods

predict(context, top_k=50)

Predict probability distribution over next tokens.

Returns: Dict[int, float] - Token ID to probability mapping

longest_suffix(context)

Find longest matching suffix in corpus.

Returns: Tuple[int, int] - (position, length) of match

confidence(context)

Get confidence score based on match length.

Returns: float - Confidence score (0.0 to 1.0)

continuations(context, position, length)

Get all tokens that follow a matched pattern.

Returns: Counter[int] - Token counts

update(new_tokens)

Add new tokens to corpus and rebuild suffix array.

Parameters:

  • new_tokens: List of new token IDs to add
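As a hedged sketch of how these methods fit together, the following pure-Python code mirrors what continuations() and predict() compute: count the tokens following each occurrence of the matched pattern, then apply additive smoothing. It uses a linear scan rather than the library's suffix array, and the exact smoothing formula is an assumption based on the smoothing parameter (though it reproduces the distribution shown in Basic Usage above).

```python
from collections import Counter
from typing import Dict, List

def continuation_counts(corpus: List[int], pattern: List[int]) -> Counter:
    # Count tokens immediately following each occurrence of `pattern`.
    # Linear scan for clarity; the library locates occurrences with
    # the suffix array instead.
    m = len(pattern)
    counts = Counter()
    for i in range(len(corpus) - m):
        if corpus[i:i + m] == pattern:
            counts[corpus[i + m]] += 1
    return counts

def smoothed_probs(counts: Counter, vocab: List[int],
                   smoothing: float = 0.01) -> Dict[int, float]:
    # Additive (Laplace-style) smoothing over the vocabulary.
    total = sum(counts.values()) + smoothing * len(vocab)
    return {t: (counts[t] + smoothing) / total for t in vocab}

corpus = [1, 2, 3, 4, 2, 3, 5, 6, 2, 3, 4]
counts = continuation_counts(corpus, [2, 3])       # tokens seen after [2, 3]
probs = smoothed_probs(counts, sorted(set(corpus)))
```

With this corpus, [2, 3] is followed by 4 twice and 5 once, which smoothing turns into roughly {4: 0.6569, 5: 0.3301, ...}, matching the Basic Usage example.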

🧪 Testing

The package includes 36 comprehensive tests:

# Run all tests
pytest tests/

# Run with coverage
pytest tests/ --cov=infinigram --cov-report=html

# Run specific test categories
pytest tests/test_suffix_array.py    # Suffix array tests
pytest tests/test_infinigram.py      # Infinigram tests
pytest tests/test_integration.py     # Integration tests

Test coverage:

  • ✅ Suffix array construction and queries
  • ✅ Longest suffix matching
  • ✅ Continuation probability computation
  • ✅ Prediction with smoothing
  • ✅ Confidence scoring
  • ✅ Dynamic corpus updates
  • ✅ Edge cases (empty corpus, long contexts)
  • ✅ Integration scenarios (Wikipedia, code completion)

📚 Documentation

🔬 Use Cases

Code Completion

# Train on source code
code_corpus = tokenize_source_files("src/**/*.py")
model = Infinigram(code_corpus, max_length=50)

# Complete code
context = tokenize("def factorial(n):")
suggestions = model.predict(context, top_k=5)

Text Generation

# Train on text corpus
text_corpus = tokenize_books(["book1.txt", "book2.txt"])
model = Infinigram(text_corpus, max_length=20)

# Generate text
context = tokenize("Once upon a time")
next_word = sample_from_distribution(model.predict(context))
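The snippet above leaves sample_from_distribution undefined. One hedged way to implement it over the {token_id: probability} dict that predict() returns is weighted sampling; the temperature parameter here is an extra convenience, not something implied by the original.

```python
import random
from typing import Dict, Optional

def sample_from_distribution(probs: Dict[int, float],
                             temperature: float = 1.0,
                             rng: Optional[random.Random] = None) -> int:
    # Draw a token ID from a {token_id: probability} mapping.
    # temperature < 1 sharpens the distribution; > 1 flattens it.
    rng = rng or random.Random()
    tokens = list(probs)
    weights = [p ** (1.0 / temperature) for p in probs.values()]
    return rng.choices(tokens, weights=weights)[0]

# Example with a fixed distribution and a seeded generator:
token = sample_from_distribution({7: 0.7, 8: 0.3}, rng=random.Random(0))
```

Repeatedly calling predict() on the growing context and sampling with this helper yields a simple generation loop.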

Autocomplete

# Train on user queries
query_corpus = tokenize_queries(user_queries)
model = Infinigram(query_corpus, max_length=10)

# Suggest completions
partial_query = tokenize("how to make")
completions = model.predict(partial_query, top_k=10)

🤝 Integration

Infinigram can be used standalone or integrated with other systems:

With LangCalc

# Use as a language model in the LangCalc framework
from langcalc.models import LanguageModel
from infinigram import Infinigram

# Infinigram implements the LanguageModel interface
wiki = Infinigram(wikipedia_corpus, max_length=20)
news = Infinigram(news_corpus, max_length=15)

# Compose with other models (llm here stands for any other LanguageModel)
model = 0.7 * llm + 0.2 * wiki + 0.1 * news

Custom Tokenization

from infinigram import Infinigram

class TextInfinigram:
    def __init__(self, texts, max_length=20):
        # Build vocabulary
        self.vocab = {}
        corpus = []
        for text in texts:
            for token in self.tokenize(text):
                if token not in self.vocab:
                    self.vocab[token] = len(self.vocab)
                corpus.append(self.vocab[token])

        self.id_to_token = {v: k for k, v in self.vocab.items()}
        self.model = Infinigram(corpus, max_length=max_length)

    def tokenize(self, text):
        return text.lower().split()

    def predict_text(self, context_text):
        context = [self.vocab[t] for t in self.tokenize(context_text) if t in self.vocab]
        probs = self.model.predict(context)
        return {self.id_to_token[tid]: p for tid, p in probs.items()}

🔄 Comparison with Traditional N-grams

| Feature        | Traditional N-gram        | Infinigram                           |
|----------------|---------------------------|--------------------------------------|
| Pattern Length | Fixed (n)                 | Variable (1 to max_length)           |
| Memory         | O(V^n) exponential        | O(corpus_size) linear                |
| Query Time     | O(1) hash lookup          | O(m log n) suffix search             |
| Context Usage  | Last n-1 tokens           | Longest matching suffix              |
| Updates        | Fast (hash insert)        | Slow (rebuild suffix array)          |
| Best For       | Frequent updates, small n | Large patterns, static/batch updates |

📈 Roadmap

  • Incremental suffix array updates (avoid full rebuild)
  • Compressed suffix arrays for large corpora
  • Parallel suffix array construction
  • GPU acceleration for batch predictions
  • Integration with popular NLP libraries
  • Pre-trained models for common domains
  • Support for character-level and subword tokenization
  • Streaming corpus updates

🤝 Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

📄 License

MIT License - see LICENSE for details.

📞 Contact

🙏 Acknowledgments

Infinigram was originally developed as part of the LangCalc project, which explores algebraic frameworks for language model composition.

📚 References

  • Suffix Arrays: Manber, U., & Myers, G. (1993). "Suffix arrays: a new method for on-line string searches." SIAM Journal on Computing, 22(5), 935–948.
  • Variable-length n-grams in language modeling
  • Efficient text indexing and retrieval

Built with ❤️ for efficient language modeling

Project details

Download files

Source Distribution

py_infinigram-0.4.1.tar.gz (88.1 kB)

  • Uploaded via: twine/6.1.0 CPython/3.12.3
  • SHA256: 780c5b68252398ddc536fc3dcc45335ff6c4f6e20e466a68dbdfebfbe93268a2

Built Distribution

py_infinigram-0.4.1-py3-none-any.whl (58.1 kB)

  • Uploaded via: twine/6.1.0 CPython/3.12.3
  • SHA256: 246594ff7c74886d071ff5cc80c799976f9c93adb18d3e7be124661d03e3ac7a
