arpabo

Build ARPA format statistical language models with multiple smoothing methods.

Supports Python 3.9+. License: MIT.

Features

  • Multiple smoothing methods (Good-Turing, Kneser-Ney, Katz backoff)
  • Support for arbitrary n-gram orders
  • Standard ARPA format output
  • Binary format conversion (PocketSphinx, Kaldi)
  • Corpus normalization tool
  • Interactive debug mode
  • Zero runtime dependencies (pure Python)
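
To make "arbitrary n-gram orders" concrete, here is a minimal sketch of the counting step that any n-gram model builder performs before smoothing. It is an illustration only, not arpabo's code: the function name count_ngrams and the use of <s> and </s> as sentence markers are assumptions made for this example.

```python
from collections import Counter


def count_ngrams(sentences, max_order=3):
    """Count n-grams of every order from 1 to max_order.

    Each sentence is split on whitespace and wrapped in <s> ... </s>
    (an assumed convention for this sketch).
    """
    counts = {n: Counter() for n in range(1, max_order + 1)}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for n in range(1, max_order + 1):
            # Slide a window of width n across the padded sentence.
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts


counts = count_ngrams(["the cat sat", "the cat ran"], max_order=2)
print(counts[1][("the",)])
print(counts[2][("the", "cat")])
```

Raising max_order only adds more windows per sentence, which is why the order can be chosen freely; the practical limit is how sparse the higher-order counts become.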

Installation

pip install arpabo

This installs two commands:

  • arpabo - Build language models
  • arpabo-normalize - Normalize text corpora

Quick Start

# Quick demo
arpabo --demo -o model.arpa

# Build from your corpus
arpabo corpus.txt -o model.arpa

# With binary conversion
arpabo corpus.txt -o model.arpa --to-bin

# Two-stage: normalize then build
arpabo-normalize corpus.txt -o normalized.txt -c lower -n
arpabo normalized.txt -o model.arpa

Python API

from arpabo import ArpaBoLM

# Build a language model
lm = ArpaBoLM(max_order=3, smoothing_method="good_turing")
with open("corpus.txt") as f:
    lm.read_corpus(f)
lm.compute()
lm.write_file("model.arpa")
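
A quick way to sanity-check a file written by write_file is to read it back. The sketch below parses the general ARPA layout: a \data\ header with one "ngram N=count" line per order, a \N-grams: section per order whose lines hold a log10 probability, the words, and an optional backoff weight, and a closing \end\. The reader function read_arpa and the sample text are hypothetical; the numbers are illustrative and were not produced by arpabo.

```python
import io

# Illustrative ARPA text (made-up values, not arpabo output).
SAMPLE_ARPA = r"""\data\
ngram 1=4
ngram 2=3

\1-grams:
-0.6021 </s>
-99.0000 <s> -0.3010
-0.6021 hello -0.3010
-0.6021 world -0.3010

\2-grams:
-0.3010 <s> hello
-0.3010 hello world
-0.3010 world </s>

\end\
"""


def read_arpa(stream):
    """Return (declared, entries).

    declared maps order -> count from the header;
    entries maps order -> {word_tuple: (log10_prob, backoff_or_None)}.
    """
    declared, entries, order = {}, {}, None
    for raw in stream:
        line = raw.strip()
        if not line or line == "\\data\\":
            continue
        if line == "\\end\\":
            break
        if line.startswith("ngram "):
            n, count = line[len("ngram "):].split("=")
            declared[int(n)] = int(count)
        elif line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:line.index("-")])
            entries[order] = {}
        elif order is not None:
            fields = line.split()
            words = tuple(fields[1:1 + order])
            # A trailing field after the words is the backoff weight.
            backoff = float(fields[1 + order]) if len(fields) > 1 + order else None
            entries[order][words] = (float(fields[0]), backoff)
    return declared, entries


declared, entries = read_arpa(io.StringIO(SAMPLE_ARPA))
for n, count in declared.items():
    # The header count should match the number of entries in each section.
    assert len(entries[n]) == count
print(declared)
```

To check a real model, pass an open file handle for model.arpa instead of the StringIO sample.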

Smoothing Methods

  • good_turing (default) - Best for sparse data
  • kneser_ney - Best for larger corpora
  • auto - Automatically optimizes discount mass
  • fixed - Fixed discount mass (use -d 0.0 for maximum-likelihood estimates, i.e. no smoothing)
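
The relationship between a fixed discount and MLE can be sketched as follows. This shows one common reading of "discount mass" (scale every maximum-likelihood estimate by 1 - d and hold back d for unseen events); it is an assumption for illustration and is not taken from arpabo's source. The function name discounted_unigram_probs is hypothetical.

```python
from collections import Counter


def discounted_unigram_probs(tokens, discount=0.5):
    """Unigram probabilities with a fixed fraction of mass held back.

    Assumed scheme: p(w) = (1 - discount) * count(w) / total.
    With discount=0.0 this reduces to the plain MLE count(w) / total.
    """
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: (1.0 - discount) * c / total for w, c in counts.items()}


tokens = "a a b c".split()
mle = discounted_unigram_probs(tokens, discount=0.0)
smoothed = discounted_unigram_probs(tokens, discount=0.5)

# Mass left over for events that were never observed.
reserved = 1.0 - sum(smoothed.values())
print(mle["a"], smoothed["a"], reserved)
```

With a discount of 0.0 nothing is reserved, so any n-gram absent from the corpus gets zero probability; that is the practical cost of pure MLE.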

Common Workflows

Basic Usage

arpabo corpus.txt -o model.arpa

With Options

# 4-gram with Kneser-Ney smoothing
arpabo corpus.txt -o model.arpa -m 4 -s kneser_ney

# Lowercase normalization
arpabo corpus.txt -o model.arpa -c lower -v

# Token normalization (strip punctuation)
arpabo corpus.txt -o model.arpa -n

Corpus Preprocessing

# Normalize separately
arpabo-normalize corpus.txt -o clean.txt -c lower -n

# Build model
arpabo clean.txt -o model.arpa

# Or pipeline
cat corpus.txt | arpabo-normalize -c lower -n | arpabo -o model.arpa
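
For a sense of what this kind of preprocessing does, here is a rough Python approximation of lowercasing plus punctuation stripping. It is a sketch under assumptions: the exact rules applied by -c lower and -n are not documented here, and normalize_line is a hypothetical helper, not part of arpabo.

```python
import string


def normalize_line(line, lowercase=True, strip_punctuation=True):
    """Lowercase a line and trim punctuation from the edges of each token.

    Punctuation inside a token (the apostrophe in "don't") is kept;
    tokens that consist only of punctuation are dropped.
    """
    if lowercase:
        line = line.lower()
    tokens = []
    for token in line.split():
        if strip_punctuation:
            token = token.strip(string.punctuation)
        if token:
            tokens.append(token)
    return " ".join(tokens)


print(normalize_line("Hello, World! Don't -- stop."))
```

Normalizing before counting merges variants such as "Hello," and "hello" into one vocabulary entry, which reduces sparsity in small corpora.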

Binary Conversion

# Automatic PocketSphinx binary
arpabo corpus.txt -o model.arpa --to-bin

# Kaldi FST format
arpabo corpus.txt -o model.arpa --to-fst

# Manual conversion
pocketsphinx_lm_convert -i model.arpa -o model.lm.bin
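
The manual conversion can also be driven from Python. The sketch below only wraps the pocketsphinx_lm_convert command line shown above with the standard subprocess module; the helper names lm_convert_command and convert_to_bin are hypothetical, and the default output name (model.arpa becomes model.lm.bin) is an assumption modelled on that example.

```python
import shutil
import subprocess
from pathlib import Path


def lm_convert_command(arpa_path, bin_path=None):
    """Build the argument list for pocketsphinx_lm_convert."""
    arpa_path = Path(arpa_path)
    if bin_path is None:
        # Assumed naming convention: replace .arpa with .lm.bin
        bin_path = arpa_path.with_suffix(".lm.bin")
    return ["pocketsphinx_lm_convert", "-i", str(arpa_path), "-o", str(bin_path)]


def convert_to_bin(arpa_path, bin_path=None):
    """Run the converter; fail early if the tool is not installed."""
    cmd = lm_convert_command(arpa_path, bin_path)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH")
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return cmd[-1]


# Show the command without running it.
print(" ".join(lm_convert_command("model.arpa")))
```

Passing the arguments as a list rather than a shell string avoids quoting problems when paths contain spaces.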

Compatibility

arpabo produces standard ARPA-format models compatible with:

  • Kaldi - Convert with arpa2fst
  • PocketSphinx - Convert with pocketsphinx_lm_convert
  • SphinxTrain - Use ARPA directly
  • NVIDIA Riva - ARPA format supported
  • Julius, HTK - ARPA compatible

Development

git clone https://github.com/lenzo-ka/arpabo.git
cd arpabo
make venv
source venv/bin/activate
make test

See CONTRIBUTING.md for details.

License

MIT
