Infinigram: Variable-Length N-gram Language Models
Efficient variable-length n-gram language models using suffix arrays for O(m log n) pattern matching.
🎯 Overview
Infinigram is a language model that uses suffix arrays to efficiently find and score variable-length patterns in a corpus. Unlike traditional fixed-order n-grams, Infinigram automatically finds the longest matching context (up to a configurable maximum) and uses it for prediction.
Key advantages over traditional n-grams:
- 🚀 Variable-length patterns: Automatically uses as much context as available
- 💾 Memory efficient: O(n) space in the corpus size, versus up to O(V^n) for hash-based n-gram tables
- 🎯 Accurate: longer matched contexts generally yield better predictions
- 🔄 Dynamic updates: Can add new data to the corpus
- ⚡ Fast queries: O(m log n) suffix array search
🚀 Quick Start
Installation
pip install infinigram
Or install from source:
git clone https://github.com/yourusername/infinigram.git
cd infinigram
pip install -e .
Interactive REPL
The easiest way to explore Infinigram is through the interactive REPL:
# Launch the REPL
infinigram-repl
# Or using Python
python -m infinigram.repl
Quick REPL Tour:
# Create a dataset (datasets are models!)
infinigram> /dataset demo
infinigram [demo]> /load the cat sat on the mat
# Make predictions
infinigram [demo]> /predict the cat
# Shows probability distribution for next byte
# Generate completions
infinigram [demo]> /complete the cat --max 20
# Apply augmentations for case-insensitive matching
infinigram [demo]> /augment lowercase uppercase
# Create multiple datasets and compare
infinigram [demo]> /dataset copy demo demo_backup
infinigram [demo]> /datasets
infinigram [demo]> /use demo_backup
# Use bash commands for data inspection
infinigram [demo]> !ls -lh data/
infinigram [demo]> !head data/sample.txt
# Type /help for all commands
infinigram [demo]> /help
See REPL_GUIDE.md for comprehensive documentation.
Basic Usage (Python API)
from infinigram import Infinigram
# Create a model from a corpus
corpus = [1, 2, 3, 4, 2, 3, 5, 6, 2, 3, 4]
model = Infinigram(corpus, max_length=10)
# Predict next token
context = [2, 3]
probs = model.predict(context)
print(probs)
# {4: 0.6569, 5: 0.3301, 1: 0.0033, ...}
# Find longest matching suffix
position, length = model.longest_suffix(context)
print(f"Matched {length} tokens at position {position}")
# Get confidence score (0.0 to 1.0)
confidence = model.confidence(context)
print(f"Confidence: {confidence:.4f}")
# Update with new data
model.update([2, 3, 7, 8, 9])
Text Example
from infinigram import Infinigram
# Prepare text corpus
sentences = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat ran on the mat"
]
# Build vocabulary and tokenize
vocab = {}
corpus = []
for sent in sentences:
    for word in sent.split():
        if word not in vocab:
            vocab[word] = len(vocab)
        corpus.append(vocab[word])
# Create model
model = Infinigram(corpus, max_length=5)
# Predict after "the cat"
context = [vocab["the"], vocab["cat"]]
probs = model.predict(context)
# Convert predictions back to words
id_to_word = {v: k for k, v in vocab.items()}
for token_id, prob in sorted(probs.items(), key=lambda x: x[1], reverse=True)[:3]:
    print(f"{id_to_word[token_id]}: {prob:.3f}")
🔑 Key Features
Variable-Length Pattern Matching
Infinigram automatically finds the longest matching suffix in the corpus:
corpus = [1, 2, 3, 4, 5, 1, 2, 3, 6, 1, 2, 3, 4, 7]
model = Infinigram(corpus, max_length=10)
# Short context - finds exact match
model.longest_suffix([1, 2]) # Returns (0, 2)
# Longer context - finds longer match
model.longest_suffix([1, 2, 3, 4]) # Returns (0, 4)
Confidence Scoring
Longer matches provide higher confidence:
model.confidence([1]) # Low confidence (~0.07)
model.confidence([1, 2]) # Medium confidence (~0.15)
model.confidence([1, 2, 3, 4]) # High confidence (~0.29)
Dynamic Corpus Updates
Add new data without rebuilding from scratch:
model = Infinigram([1, 2, 3, 4, 5])
print(len(model.corpus)) # 5
model.update([6, 7, 8, 9])
print(len(model.corpus)) # 9
# Predictions automatically reflect new data
Efficient Suffix Arrays
The underlying suffix array provides:
- O(n log n) construction time
- O(m log n) query time (m = pattern length, n = corpus size)
- O(n) space complexity
- Binary search for pattern matching
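As a rough illustration of the binary-search lookup listed above, here is a minimal, self-contained sketch of the idea (not the package's internal implementation, which uses a faster construction):
from bisect import bisect_left, bisect_right

corpus = [1, 2, 3, 4, 2, 3, 5]

# Naive construction for clarity: sort suffix start positions by the suffix they begin
suffix_array = sorted(range(len(corpus)), key=lambda i: corpus[i:])

def find_occurrences(pattern):
    # Compare only the first len(pattern) tokens of each suffix, then binary-search the range.
    # (Materializing the prefix list is for readability; a real implementation compares in place.)
    prefixes = [corpus[i:i + len(pattern)] for i in suffix_array]
    lo = bisect_left(prefixes, pattern)
    hi = bisect_right(prefixes, pattern)
    return sorted(suffix_array[lo:hi])

print(find_occurrences([2, 3]))  # [1, 4]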
📊 Performance
From benchmarks on various corpus sizes:
| Corpus Size | Construction | Avg Prediction | Avg Suffix Search |
|---|---|---|---|
| 100 tokens | 0.07 ms | 0.043 ms | 0.014 ms |
| 1K tokens | 6.09 ms | 0.390 ms | 0.184 ms |
| 10K tokens | 718 ms | 4.370 ms | 2.373 ms |
Memory efficiency: For a 1B token corpus:
- Hash-based 5-gram: ~34 GB
- Infinigram suffix array: ~1 GB
- 34x reduction in memory usage
🛠️ API Reference
Infinigram Class
class Infinigram:
    def __init__(
        self,
        corpus: List[int],
        max_length: Optional[int] = None,
        min_count: int = 1,
        smoothing: float = 0.01
    )
Parameters:
- corpus: List of integer token IDs
- max_length: Maximum pattern length to consider (None = unlimited)
- min_count: Minimum occurrences for a pattern to be included
- smoothing: Laplace smoothing factor for unseen tokens
Methods
predict(context, top_k=50)
Predict probability distribution over next tokens.
Returns: Dict[int, float] - Token ID to probability mapping
longest_suffix(context)
Find longest matching suffix in corpus.
Returns: Tuple[int, int] - (position, length) of match
confidence(context)
Get confidence score based on match length.
Returns: float - Confidence score (0.0 to 1.0)
continuations(context, position, length)
Get all tokens that follow a matched pattern.
Returns: Counter[int] - Token counts
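A brief usage sketch, reusing the corpus from Basic Usage above (the printed counts are illustrative):
from infinigram import Infinigram

model = Infinigram([1, 2, 3, 4, 2, 3, 5, 6, 2, 3, 4], max_length=10)
context = [2, 3]
position, length = model.longest_suffix(context)
counts = model.continuations(context, position, length)
print(counts)  # e.g. Counter({4: 2, 5: 1}) -- tokens observed after the matched pattern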
update(new_tokens)
Add new tokens to corpus and rebuild suffix array.
Parameters:
new_tokens: List of new token IDs to add
🧪 Testing
The package includes 36 comprehensive tests:
# Run all tests
pytest tests/
# Run with coverage
pytest tests/ --cov=infinigram --cov-report=html
# Run specific test categories
pytest tests/test_suffix_array.py # Suffix array tests
pytest tests/test_infinigram.py # Infinigram tests
pytest tests/test_integration.py # Integration tests
Test coverage:
- ✅ Suffix array construction and queries
- ✅ Longest suffix matching
- ✅ Continuation probability computation
- ✅ Prediction with smoothing
- ✅ Confidence scoring
- ✅ Dynamic corpus updates
- ✅ Edge cases (empty corpus, long contexts)
- ✅ Integration scenarios (Wikipedia, code completion)
📚 Documentation
- API Reference - Complete API documentation
- Design Document - Architecture and implementation details
- Benchmarks - Performance analysis
🔬 Use Cases
Code Completion
# Train on source code
code_corpus = tokenize_source_files("src/**/*.py")
model = Infinigram(code_corpus, max_length=50)
# Complete code
context = tokenize("def factorial(n):")
suggestions = model.predict(context, top_k=5)
Text Generation
# Train on text corpus
text_corpus = tokenize_books(["book1.txt", "book2.txt"])
model = Infinigram(text_corpus, max_length=20)
# Generate text
context = tokenize("Once upon a time")
next_word = sample_from_distribution(model.predict(context))
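tokenize, tokenize_books, and sample_from_distribution above are placeholders rather than package functions; a minimal sampling helper using only the standard library might look like this:
import random

def sample_from_distribution(probs):
    # probs: {token_id: probability} mapping as returned by model.predict()
    tokens = list(probs.keys())
    weights = list(probs.values())
    return random.choices(tokens, weights=weights, k=1)[0]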
Autocomplete
# Train on user queries
query_corpus = tokenize_queries(user_queries)
model = Infinigram(query_corpus, max_length=10)
# Suggest completions
partial_query = tokenize("how to make")
completions = model.predict(partial_query, top_k=10)
🤝 Integration
Infinigram can be used standalone or integrated with other systems:
With LangCalc
# Use as a language model in the LangCalc framework
from langcalc.models import LanguageModel
from infinigram import Infinigram
# Infinigram implements the LanguageModel interface
wiki = Infinigram(wikipedia_corpus, max_length=20)
news = Infinigram(news_corpus, max_length=15)
# Compose with other models
model = 0.7 * llm + 0.2 * wiki + 0.1 * news
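Without LangCalc, a similar weighted composition can be approximated by hand by linearly interpolating the predicted distributions. A sketch continuing the snippet above (llm is omitted and the weights are illustrative):
from collections import defaultdict

def interpolate(distributions, weights):
    # Linearly mix several {token_id: probability} distributions
    mixed = defaultdict(float)
    for probs, weight in zip(distributions, weights):
        for token_id, p in probs.items():
            mixed[token_id] += weight * p
    return dict(mixed)

context = [2, 3]  # any token-ID context
mixed = interpolate([wiki.predict(context), news.predict(context)], [0.6, 0.4])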
Custom Tokenization
from infinigram import Infinigram
class TextInfinigram:
    def __init__(self, texts, max_length=20):
        # Build vocabulary
        self.vocab = {}
        corpus = []
        for text in texts:
            for token in self.tokenize(text):
                if token not in self.vocab:
                    self.vocab[token] = len(self.vocab)
                corpus.append(self.vocab[token])
        self.id_to_token = {v: k for k, v in self.vocab.items()}
        self.model = Infinigram(corpus, max_length=max_length)

    def tokenize(self, text):
        return text.lower().split()

    def predict_text(self, context_text):
        context = [self.vocab[t] for t in self.tokenize(context_text) if t in self.vocab]
        probs = self.model.predict(context)
        return {self.id_to_token[tid]: p for tid, p in probs.items()}
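Hypothetical usage of the wrapper above:
texts = [
    "the cat sat on the mat",
    "the dog sat on the rug"
]
text_model = TextInfinigram(texts, max_length=10)
print(text_model.predict_text("the cat"))  # distribution over next words; "sat" should rank highest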
🔄 Comparison with Traditional N-grams
| Feature | Traditional N-gram | Infinigram |
|---|---|---|
| Pattern Length | Fixed (n) | Variable (1 to max_length) |
| Memory | O(V^n) exponential | O(corpus_size) linear |
| Query Time | O(1) hash lookup | O(m log n) suffix search |
| Context Usage | Last n-1 tokens | Longest matching suffix |
| Updates | Fast (hash insert) | Slow (rebuild suffix array) |
| Best For | Frequent updates, small n | Large patterns, static/batch updates |
📈 Roadmap
- Incremental suffix array updates (avoid full rebuild)
- Compressed suffix arrays for large corpora
- Parallel suffix array construction
- GPU acceleration for batch predictions
- Integration with popular NLP libraries
- Pre-trained models for common domains
- Support for character-level and subword tokenization
- Streaming corpus updates
🤝 Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
📄 License
MIT License - see LICENSE for details.
📞 Contact
- GitHub: yourusername/infinigram
- Issues: GitHub Issues
- Email: your.email@example.com
🙏 Acknowledgments
Infinigram was originally developed as part of the LangCalc project, which explores algebraic frameworks for language model composition.
📚 References
- Suffix Arrays: Manber, U., & Myers, G. (1993). "Suffix arrays: a new method for on-line string searches"
- Variable-length n-grams in language modeling
- Efficient text indexing and retrieval
Built with ❤️ for efficient language modeling