Build ARPA format statistical language models with multiple smoothing methods
arpabo
Features
- Multiple smoothing methods (Good-Turing, Kneser-Ney, Katz backoff)
- Support for arbitrary n-gram orders
- Standard ARPA format output
- Binary format conversion (PocketSphinx, Kaldi)
- Corpus normalization tool
- Interactive debug mode
- Zero runtime dependencies (pure Python)
Installation
pip install arpabo
This installs two commands:
- arpabo - Build language models
- arpabo-normalize - Normalize text corpora
Quick Start
# Quick demo
arpabo --demo -o model.arpa
# Build from your corpus
arpabo corpus.txt -o model.arpa
# With binary conversion
arpabo corpus.txt -o model.arpa --to-bin
# Two-stage: normalize then build
arpabo-normalize corpus.txt -o normalized.txt -c lower -n
arpabo normalized.txt -o model.arpa
Python API
from arpabo import ArpaBoLM
# Build a language model
lm = ArpaBoLM(max_order=3, smoothing_method="good_turing")
with open("corpus.txt") as f:
    lm.read_corpus(f)
lm.compute()
lm.write_file("model.arpa")
Smoothing Methods
- good_turing (default) - Best for sparse data
- kneser_ney - Best for larger corpora
- auto - Automatically optimizes discount mass
- fixed - Fixed discount mass (use -d 0.0 for MLE)
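To illustrate why a discount mass of 0.0 corresponds to MLE, here is a minimal sketch of one common reading of a fixed discount. It assumes each seen n-gram keeps a (1 - d) share of its maximum-likelihood estimate and the remaining d is held back for backoff. The function name discounted_bigram_probs and this exact formula are assumptions for illustration; arpabo's internal computation may differ.

```python
from collections import Counter


def discounted_bigram_probs(tokens, discount_mass):
    """Return {(w1, w2): p(w2 | w1)} with a fixed share of mass held back.

    Assumption: each seen bigram keeps (1 - discount_mass) of its
    maximum-likelihood estimate; the rest is reserved for backoff.
    """
    if not 0.0 <= discount_mass < 1.0:
        raise ValueError("discount_mass must be in [0, 1)")
    bigrams = Counter(zip(tokens, tokens[1:]))
    # Count each word only where it starts a bigram, so every context's
    # probabilities sum to (1 - discount_mass).
    contexts = Counter(tokens[:-1])
    return {
        (w1, w2): (1.0 - discount_mass) * count / contexts[w1]
        for (w1, w2), count in bigrams.items()
    }


tokens = "a b a b a c".split()
mle = discounted_bigram_probs(tokens, 0.0)   # plain relative frequencies
half = discounted_bigram_probs(tokens, 0.5)  # half the mass left for backoff
```

With d = 0.0 the probabilities for each context sum to 1, so unseen continuations receive nothing. Any larger d leaves that share of the mass available to lower-order estimates.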
Common Workflows
Basic Usage
arpabo corpus.txt -o model.arpa
With Options
# 4-gram with Kneser-Ney smoothing
arpabo corpus.txt -o model.arpa -m 4 -s kneser_ney
# Lowercase normalization
arpabo corpus.txt -o model.arpa -c lower -v
# Token normalization (strip punctuation)
arpabo corpus.txt -o model.arpa -n
Corpus Preprocessing
# Normalize separately
arpabo-normalize corpus.txt -o clean.txt -c lower -n
# Build model
arpabo clean.txt -o model.arpa
# Or pipeline
cat corpus.txt | arpabo-normalize -c lower -n | arpabo -o model.arpa
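For readers who want to see roughly what lowercasing and punctuation stripping do to a line before it reaches the model builder, here is a minimal Python sketch. It is not the arpabo-normalize implementation: the function name normalize_line and the choice to delete all ASCII punctuation (including apostrophes) are assumptions made for illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def normalize_line(line, lowercase=True, strip_punctuation=True):
    """Lowercase a line, drop ASCII punctuation, and collapse whitespace."""
    if lowercase:
        line = line.lower()
    if strip_punctuation:
        line = line.translate(_STRIP_PUNCT)
    # split() with no arguments also trims leading and trailing whitespace.
    return " ".join(line.split())


normalized = [normalize_line(text) for text in ["Hello, World!", "  It's   9 a.m.  "]]
```

Normalizing before counting matters because "Hello," and "hello" would otherwise be counted as distinct unigrams, which fragments the statistics on small corpora.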
Binary Conversion
# Automatic PocketSphinx binary
arpabo corpus.txt -o model.arpa --to-bin
# Kaldi FST format
arpabo corpus.txt -o model.arpa --to-fst
# Manual conversion
pocketsphinx_lm_convert -i model.arpa -o model.lm.bin
Compatibility
arpabo produces standard ARPA-format models compatible with:
- Kaldi - Convert with arpa2fst
- PocketSphinx - Convert with pocketsphinx_lm_convert
- SphinxTrain - Use ARPA directly
- NVIDIA Riva - ARPA format supported
- Julius, HTK - ARPA compatible
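Before handing a model to one of these tools, it can help to confirm that the file has a well-formed ARPA header. The sketch below reads the n-gram counts from the \data\ section, which the ARPA format places at the top of the file. The helper read_arpa_counts is a hypothetical name and is not part of arpabo, and the sample model is made up for the example.

```python
import re

# Header lines look like "ngram 2=1534".
_NGRAM_LINE = re.compile(r"^ngram\s+(\d+)\s*=\s*(\d+)$")


def read_arpa_counts(lines):
    """Return {order: n-gram count} from the header of an ARPA file."""
    counts = {}
    in_header = False
    for raw in lines:
        line = raw.strip()
        if line == "\\data\\":
            in_header = True
        elif in_header:
            match = _NGRAM_LINE.match(line)
            if match:
                counts[int(match.group(1))] = int(match.group(2))
            elif line.startswith("\\"):
                break  # the first section marker (e.g. \1-grams:) ends the header
    return counts


sample = """\\data\\
ngram 1=3
ngram 2=2

\\1-grams:
-0.4771 a -0.3010
-0.4771 b -0.3010
-0.4771 </s>

\\2-grams:
-0.3010 a b
-0.3010 b </s>

\\end\\
"""
counts = read_arpa_counts(sample.splitlines())
```

The function accepts any iterable of lines, so an open file handle for model.arpa can be passed in directly. An empty result suggests the file is not in ARPA format or was truncated.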
Development
git clone https://github.com/lenzo-ka/arpabo.git
cd arpabo
make venv
source venv/bin/activate
make test
See CONTRIBUTING.md for details.
License
MIT