arpabo
Build ARPA format statistical language models with multiple smoothing methods.
Features
Core Features
- Multiple smoothing methods (Good-Turing, Kneser-Ney, Katz backoff)
- Support for arbitrary n-gram orders
- Standard ARPA format output
- Binary format conversion (PocketSphinx, Kaldi)
- Corpus normalization tool
- Interactive debug mode
- Zero runtime dependencies (pure Python)
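For readers new to the format, the layout of an ARPA file can be sketched in a few lines of Python. This is an illustration of the file structure only, not arpabo's writer; the `format_arpa` helper and the toy probabilities are hypothetical. Each entry stores a log10 probability, the n-gram, and (except at the highest order) a log10 backoff weight.

```python
import math

def format_arpa(ngrams_by_order):
    """Render {order: [(words, prob, backoff_or_None), ...]} as ARPA text."""
    orders = sorted(ngrams_by_order)
    lines = ["\\data\\"]
    lines += [f"ngram {n}={len(ngrams_by_order[n])}" for n in orders]
    for n in orders:
        lines += ["", f"\\{n}-grams:"]
        for words, prob, backoff in ngrams_by_order[n]:
            # ARPA stores base-10 logarithms, tab-separated
            fields = [f"{math.log10(prob):.4f}", " ".join(words)]
            if backoff is not None:  # highest order carries no backoff weight
                fields.append(f"{math.log10(backoff):.4f}")
            lines.append("\t".join(fields))
    lines += ["", "\\end\\"]
    return "\n".join(lines)

# Toy numbers chosen for readability, not estimated from data
toy_model = {
    1: [(("hello",), 0.5, 0.1), (("world",), 0.5, 0.1)],
    2: [(("hello", "world"), 1.0, None)],
}
arpa_text = format_arpa(toy_model)
print(arpa_text)
```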
New: First-Pass ASR Optimization
- Multi-order training - Train multiple n-gram orders efficiently (--orders 1-4)
- Perplexity evaluation - Evaluate model quality on test data (--eval test.txt)
- Model statistics - Analyze backoff rates and model behavior (--stats --backoff test.txt)
- Presets - Pre-configured settings for common use cases (--preset first-pass)
- Smoothing comparison - Automatically compare smoothing methods (--compare-smoothing)
- Vocabulary pruning - Reduce model size for mobile (--prune-vocab topk:10000)
- ModelComparison API - High-level Python API for complete workflows
- Uniform baseline - Maximum entropy models for comparison
- Cross-validation - K-fold CV for robust model selection
- Model interpolation - Alternative probability mixing strategy
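As a rough illustration of the last item, linear interpolation mixes two probability distributions with a weight instead of backing off from one to the other. The sketch below is generic Python and does not use arpabo's interpolation API; the `interpolate` function and the toy distributions are assumptions made for the example. Mixing with a uniform distribution also shows how a uniform baseline gives every word some probability.

```python
def interpolate(p_a, p_b, weight_a):
    """Mix two word distributions: weight_a * p_a + (1 - weight_a) * p_b."""
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    vocab = set(p_a) | set(p_b)
    return {
        w: weight_a * p_a.get(w, 0.0) + (1.0 - weight_a) * p_b.get(w, 0.0)
        for w in vocab
    }

# A peaked distribution that has never seen "fish", and a uniform one
peaked = {"cat": 0.7, "dog": 0.3}
uniform = {"cat": 1 / 3, "dog": 1 / 3, "fish": 1 / 3}

mixed = interpolate(peaked, uniform, weight_a=0.8)
print(mixed["fish"])        # small but non-zero
print(sum(mixed.values()))  # still sums to 1 (up to rounding)
```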
Installation
pip install arpabo
This installs two commands:
- arpabo - Build language models
- arpabo-normalize - Normalize text corpora
Quick Start
# Quick demo
arpabo --demo -o model.arpa
# Build from your corpus
arpabo corpus.txt -o model.arpa
# With binary conversion
arpabo corpus.txt -o model.arpa --to-bin
# Two-stage: normalize then build
arpabo-normalize corpus.txt -o normalized.txt -c lower -n
arpabo normalized.txt -o model.arpa
Python API
Basic Usage
from arpabo import ArpaBoLM
# Build a language model
lm = ArpaBoLM(max_order=3, smoothing_method="good_turing")
with open("corpus.txt") as f:
    lm.read_corpus(f)
lm.compute()
lm.write_file("model.arpa")
Use Presets (New!)
from arpabo import ArpaBoLM
# No need to pick parameters - use a preset!
lm = ArpaBoLM.from_preset("first-pass") # For first-pass ASR
with open("corpus.txt") as f:
    lm.read_corpus(f)
lm.compute()
lm.write_file("model.arpa")
High-Level API (New!)
from arpabo import ModelComparison
# Complete optimization workflow
comparison = ModelComparison(corpus_file="train.txt")
comparison.train_orders([1, 2, 3, 4])
comparison.add_uniform_baseline()
comparison.evaluate(test_file="test.txt")
comparison.print_comparison()
# Get recommendation
best = comparison.recommend(goal="first-pass")
print(f"Best model: {best}-gram")
# Export for deployment
comparison.export_for_optimization("experiments/", convert_to_binary=True)
Evaluation & Analysis (New!)
# Evaluate model quality
with open("test.txt") as f:
    results = lm.perplexity(f)
print(f"Perplexity: {results['perplexity']:.1f}")
# Analyze backoff behavior
with open("test.txt") as f:
    backoff = lm.backoff_rate(f)
print(f"Backoff rate: {backoff['overall_backoff_rate']*100:.1f}%")
# Get model statistics
stats = lm.get_statistics()
print(f"Vocabulary: {stats['vocab_size']:,} words")
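To help interpret the perplexity figure, here is a minimal sketch of the standard definition: ten raised to the negative mean log10 probability per token. It is not arpabo's implementation, and the `perplexity` helper below is hypothetical. A useful sanity check is that a uniform model over V words has perplexity V, which is why a uniform baseline makes a natural reference point.

```python
import math

def perplexity(log10_probs):
    """Perplexity from per-token log10 probabilities: 10 ** (-mean log10 p)."""
    if not log10_probs:
        raise ValueError("need at least one scored token")
    return 10 ** (-sum(log10_probs) / len(log10_probs))

# A uniform model assigns 1/V to every token, so its perplexity is V
vocab_size = 1000
uniform_scores = [math.log10(1 / vocab_size)] * 50
uniform_ppl = perplexity(uniform_scores)
print(f"Uniform baseline perplexity: {uniform_ppl:.1f}")
```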
Smoothing Methods
- good_turing (default) - Best for sparse data
- kneser_ney - Best for larger corpora
- auto - Automatically optimizes discount mass
- fixed - Fixed discount mass (use -d 0.0 for MLE)
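To give a feel for what Good-Turing smoothing does, the sketch below computes the textbook adjusted count c* = (c + 1) * N(c+1) / N(c), where N(c) is the number of distinct items seen exactly c times. This is the plain formula on a toy corpus, not necessarily how arpabo implements good_turing; the function name and the fallback to the raw count when N(c+1) is zero are assumptions made for the example.

```python
from collections import Counter

def good_turing_adjusted_counts(counts):
    """Return {c: c*} with c* = (c + 1) * N[c+1] / N[c].

    Falls back to the raw count when no item was seen c + 1 times.
    """
    freq_of_freq = Counter(counts.values())
    adjusted = {}
    for c, n_c in freq_of_freq.items():
        n_next = freq_of_freq.get(c + 1, 0)
        adjusted[c] = (c + 1) * n_next / n_c if n_next else float(c)
    return adjusted

# Six singletons, two words seen twice, one word seen three times
counts = Counter("a a a b b c c d e f g h i".split())
adjusted = good_turing_adjusted_counts(counts)

# Probability mass reserved for unseen events: N1 / total tokens
singletons = sum(1 for c in counts.values() if c == 1)
unseen_mass = singletons / sum(counts.values())
print(adjusted, unseen_mass)
```

On sparse data the singleton count is large, so a sizeable share of probability mass moves to unseen events.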
Common Workflows
Basic Usage
# Simple model
arpabo corpus.txt -o model.arpa
# Use a preset (easiest!)
arpabo corpus.txt -o model.arpa --preset balanced
# List available presets
arpabo --list-presets
Multi-Order Training (New!)
# Train multiple orders efficiently
arpabo corpus.txt -o models/ --orders 1-4 --to-bin
# Creates: 1gram.arpa, 2gram.arpa, 3gram.arpa, 4gram.arpa (+ .lm.bin files)
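One reason several orders can be trained together is that a single pass over the corpus yields counts for every order at once. The following sketch shows that idea in plain Python; it is an illustration under that assumption, not a description of arpabo's internals, and `count_ngrams` is a hypothetical helper.

```python
from collections import Counter

def count_ngrams(sentences, max_order):
    """Count n-grams of every order from 1 to max_order in one pass."""
    counts = {n: Counter() for n in range(1, max_order + 1)}
    for sentence in sentences:
        # Sentence boundary markers as used in ARPA models
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for n in range(1, max_order + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts

counts = count_ngrams(["the cat sat", "the cat ran"], max_order=3)
print(counts[2][("the", "cat")])  # 2
print(len(counts[3]))             # 5 distinct trigrams
```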
Model Evaluation (New!)
# Train and evaluate
arpabo corpus.txt -o model.arpa --eval test.txt
# Evaluate existing model
arpabo --eval-only model.arpa test.txt
# With statistics and backoff analysis
arpabo corpus.txt -o model.arpa --stats --backoff test.txt
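As a rough guide to what a backoff rate measures, the sketch below reports the fraction of test n-grams at a given order that never occurred in training, which is when a backoff model has to fall back to a shorter context. The exact definition arpabo uses for --backoff is not stated here, so treat this `backoff_rate` function as an assumption for illustration.

```python
def backoff_rate(train_tokens, test_tokens, order):
    """Fraction of test n-grams of the given order not seen in training."""
    def ngrams(tokens):
        return [tuple(tokens[i:i + order]) for i in range(len(tokens) - order + 1)]

    seen = set(ngrams(train_tokens))
    queries = ngrams(test_tokens)
    if not queries:
        return 0.0
    misses = sum(1 for gram in queries if gram not in seen)
    return misses / len(queries)

train = "the cat sat on the mat".split()
test = "the cat sat on the rug".split()
rate = backoff_rate(train, test, order=2)
print(f"Bigram backoff rate: {rate * 100:.1f}%")  # 1 of 5 bigrams is unseen
```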
Advanced: Compare & Optimize (New!)
# Compare smoothing methods
arpabo corpus.txt --compare-smoothing --eval test.txt
# Prune for mobile deployment
arpabo corpus.txt -o mobile.arpa --prune-vocab topk:10000 --to-bin
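The idea behind top-k vocabulary pruning can be sketched as keeping the k most frequent words and replacing the rest with a placeholder. Mapping pruned words to an `<unk>` token is an assumption made for this example; how arpabo treats pruned words is not specified above, and `prune_vocab_topk` is a hypothetical helper.

```python
from collections import Counter

def prune_vocab_topk(tokens, k, unk="<unk>"):
    """Keep the k most frequent words; replace all others with unk."""
    counts = Counter(tokens)
    keep = {word for word, _ in counts.most_common(k)}
    return [t if t in keep else unk for t in tokens]

tokens = "a a a b b c d".split()
pruned = prune_vocab_topk(tokens, k=2)
print(pruned)  # ['a', 'a', 'a', 'b', 'b', '<unk>', '<unk>']
```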
Traditional Options
# 4-gram with Kneser-Ney smoothing
arpabo corpus.txt -o model.arpa -m 4 -s kneser_ney
# Lowercase normalization
arpabo corpus.txt -o model.arpa -c lower -v
# Token normalization (strip punctuation)
arpabo corpus.txt -o model.arpa -n
Corpus Preprocessing
# Normalize separately
arpabo-normalize corpus.txt -o clean.txt -c lower -n
# Build model
arpabo clean.txt -o model.arpa
# Or pipeline
cat corpus.txt | arpabo-normalize -c lower -n | arpabo -o model.arpa
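For a sense of what lowercasing plus token normalization does to a line of text, here is a small approximation in plain Python. The real arpabo-normalize rules may differ; in particular, stripping punctuation only from token edges (so "it's" survives) is an assumption of this sketch, and `normalize_line` is a hypothetical helper.

```python
import string

def normalize_line(line, lowercase=True, strip_punct=True):
    """Lowercase a line and strip punctuation from the edges of each token."""
    if lowercase:
        line = line.lower()
    tokens = line.split()
    if strip_punct:
        tokens = [t.strip(string.punctuation) for t in tokens]
    # Drop tokens that were punctuation only
    return " ".join(t for t in tokens if t)

cleaned = normalize_line("Hello, World! It's -- fine.")
print(cleaned)  # hello world it's fine
```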
Binary Conversion (Optional)
ARPA files work directly with PocketSphinx. Binary conversion is optional for better performance:
# Use ARPA directly (works as-is)
arpabo corpus.txt -o model.arpa
# Optional: Convert to binary for faster loading
arpabo corpus.txt -o model.arpa --to-bin
# Optional: Kaldi FST format
arpabo corpus.txt -o model.arpa --to-fst
# Or convert manually later
pocketsphinx_lm_convert -i model.arpa -o model.lm.bin
Compatibility
Produces standard ARPA format models that work directly with:
- PocketSphinx - Use ARPA directly (optional binary conversion for speed)
- Kaldi - Use ARPA directly or convert to FST
- SphinxTrain - Use ARPA directly
- NVIDIA Riva - ARPA format supported
- Julius, HTK - ARPA compatible
Binary conversion is optional and only improves loading speed.
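Before handing a model to a decoder, it can be handy to check the n-gram counts declared in its header. The sketch below reads the \data\ section of ARPA text using only the standard library; `read_arpa_counts` is a hypothetical helper, not part of arpabo, and it assumes a well-formed header.

```python
def read_arpa_counts(text):
    """Return {order: count} from the \\data\\ header of ARPA text."""
    counts = {}
    in_header = False
    for raw in text.splitlines():
        line = raw.strip()
        if line == "\\data\\":
            in_header = True
        elif in_header and line.startswith("ngram "):
            order, count = line[len("ngram "):].split("=")
            counts[int(order)] = int(count)
        elif in_header and line.startswith("\\"):
            break  # first section marker such as \1-grams: ends the header
    return counts

sample = "\\data\\\nngram 1=3\nngram 2=2\n\n\\1-grams:\n-0.4771\ta\n\\end\\\n"
header_counts = read_arpa_counts(sample)
print(header_counts)  # {1: 3, 2: 2}
```

For a file on disk, pass the result of reading it as text, for example the contents of model.arpa.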
Documentation
Guides
- Multi-Order Training - Train multiple models efficiently
- Perplexity Evaluation - Evaluate model quality
- ModelComparison API - High-level workflow API
Examples
- examples/model_comparison_example.py - Complete workflow example
Feature Summaries
See PHASE_1_COMPLETE.md, PHASE_2_COMPLETE.md, and PHASE_3_COMPLETE.md for detailed feature documentation.
Development
git clone https://github.com/lenzo-ka/arpabo.git
cd arpabo
make venv
source venv/bin/activate
make test
See CONTRIBUTING.md for details.
License
MIT