Skip to main content

Variable-length n-gram language models using suffix arrays

Project description

Infinigram: Variable-Length N-gram Language Models

Python 3.8+ Tests Coverage License

Efficient variable-length n-gram language models using suffix arrays for O(m log n) pattern matching. Byte-level (UTF-8), works with any text.

Overview

Infinigram is a corpus-based language model where the corpus IS the model — no gradient descent, no training loop. It uses suffix arrays to find variable-length patterns and predict continuations.

Key properties:

  • Instant training: Build a model by pointing at text
  • Variable-length patterns: Automatically uses longest matching context
  • Exact matching: Every prediction traces to actual corpus occurrences
  • O(m log n) queries: Binary search on suffix arrays
  • Byte-level: Fixed 256-token vocabulary (UTF-8 bytes 0-255)
  • Mmap-backed: Memory-mapped files for corpora from 1KB to 100GB+

Quick Start

Installation

pip install py-infinigram

Or install from source:

git clone https://github.com/queelius/infinigram.git
cd infinigram
pip install -e .

Basic Usage

from infinigram import Infinigram

# Create model from text (byte-level, automatic UTF-8 encoding)
model = Infinigram(b"the cat sat on the mat the cat ran")

# Predict next byte probabilities
probs = model.predict(b"the cat", top_k=5)
# {32: 1.0}  # space byte (32) has highest probability

# Find longest matching suffix
position, length = model.longest_suffix(b"the cat")
print(f"Matched {length} bytes at position {position}")

# String input works too (auto-encoded to UTF-8)
probs = model.predict("the cat", top_k=5)

Persistent Models

# Build and save a model
model = Infinigram.build("The quick brown fox jumps over the lazy dog.", "my_model")

# Load it later (instant — uses mmap, not RAM)
model = Infinigram.load("my_model")
probs = model.predict("The quick")

Key Features

Variable-Length Pattern Matching

Infinigram automatically finds the longest matching suffix in the corpus:

model = Infinigram(b"abcdef abcdef abcxyz")

pos, length = model.longest_suffix(b"abcd")   # length=4
pos, length = model.longest_suffix(b"abc")    # length=3
pos, length = model.longest_suffix(b"zzz")    # length=0 (no match)

Multiple Prediction Strategies

# Longest suffix match
probs = model.predict(b"the cat", top_k=10)

Dynamic Updates

model = Infinigram(b"hello")
model.update(b" world")  # Appends and rebuilds suffix array
print(model.n)  # 11

API Reference

Infinigram Class

Infinigram(
    corpus_or_path,           # bytes, List[int], str, or Path
    max_length: int = None,   # Max suffix length (None = unlimited)
    min_count: int = 1,       # Min frequency threshold
)

Methods

Method Description
predict(context, top_k=50, smoothing=0.0) Next-token probability distribution
predict_logprobs(context, top_k=50) Log-probabilities
continuations(context, limit=10000) Raw continuation counts (limit=None for all)
longest_suffix(context) (position, length) of longest match
find_all_suffix_matches(context) All matches at all suffix lengths
search(pattern, limit=None) All occurrence positions
count(pattern) Number of occurrences
update(new_tokens) Append tokens and rebuild index
clear_cache() Clear prediction caches
get_context(position, window=50) Corpus text around a position

Class Methods

Method Description
Infinigram.build(corpus, path, ...) Build persistent model
Infinigram.load(path) Load existing model (mmap, instant)
Infinigram.list_models() List models in ~/.infinigram/models/

Performance

Byte-level model with mmap-backed suffix arrays:

Corpus Size Construction Query Latency
1 KB <1 ms <0.1 ms
1 MB ~100 ms <1 ms
1 GB ~30 s <10 ms

Memory: ~9 bytes per corpus byte (corpus + suffix array on disk, mmap'd).

Performance optimizations (v0.5.0):

  • Vectorized continuation counting with np.bincount (3-10x speedup)
  • LRU cache for repeated continuations() queries (5-20x for hot paths)
  • Direct bytes comparison in binary search (2-5x vs Python loop)
  • Batch SQLite inserts for dataset index rebuild

Architecture

Python API              Infinigram class (predict, search, continuations)
Suffix Array Engine     MmapSuffixArray, ChunkedMmapSuffixArray (pydivsufsort)

See ARCHITECTURE.md for details.

Testing

508 tests, 79% coverage:

pytest tests/                                    # All tests
pytest tests/ -v                                 # Verbose
pytest tests/ --cov=infinigram --cov-report=html # Coverage
pytest tests/test_infinigram.py -k "cache"       # Keyword filter

Thread Safety

Infinigram instances are not thread-safe. Each thread should use its own model instance, or external synchronization must be used.

License

MIT License - see LICENSE for details.

Links

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

py_infinigram-0.7.0.tar.gz (43.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

py_infinigram-0.7.0-py3-none-any.whl (25.4 kB view details)

Uploaded Python 3

File details

Details for the file py_infinigram-0.7.0.tar.gz.

File metadata

  • Download URL: py_infinigram-0.7.0.tar.gz
  • Upload date:
  • Size: 43.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for py_infinigram-0.7.0.tar.gz
Algorithm Hash digest
SHA256 abbf4a8a5ce073ed0ff6d5f86ebaebd0121a308087ae39326ce7daf393117b01
MD5 73a727e920274fb6ed4802467dcd245e
BLAKE2b-256 bd9f0942c53e5904aa05b00573f95b30437a46cad2a86f4716cb216a545f7a6b

See more details on using hashes here.

File details

Details for the file py_infinigram-0.7.0-py3-none-any.whl.

File metadata

  • Download URL: py_infinigram-0.7.0-py3-none-any.whl
  • Upload date:
  • Size: 25.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for py_infinigram-0.7.0-py3-none-any.whl
Algorithm Hash digest
SHA256 fb3854dade8ed05da3163864095ea7d88a87244cdee39d372a883ed78b7c5b0c
MD5 7b1642b1bf52f1e6c98f9aeabfa7891a
BLAKE2b-256 ff5075898138244afaee2dcbb8f98f87682a6cfd3442f867f2b05ac33f410ec1

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page