A toolkit for phonotactic probability calculation and analysis.
UCI Phonotactic Calculator
This repository contains the source code for the UCI Phonotactic Calculator website, as well as a flexible, extensible CLI for phonotactic modeling and scoring.
🚀 Quick Start
Requirements
- Python 3.8+
- (Optional) Create a virtual environment:
python -m venv venv
source venv/bin/activate # or venv\Scripts\activate on Windows
- Install dependencies:
pip install -r requirements.txt
Running a Model
python -m src.main <train_file> <test_file> <output_file> [OPTIONS]
Example:
python -m src.main data/english.csv data/sample_test_data/english_test_data.csv output.csv --boundary-mode both --aggregate sum --weight-mode legacy_log
🛠️ CLI Options
The CLI supports a wide range of options. Run python -m src.main --help for a full list.
🔎 Filtering and Discovering Keys
You can restrict the grid search to specific model/configuration variants using one or more --filter flags:
python -m src.main ... --filter KEY=VAL [--filter KEY2=VAL2 ...]
- Each filter restricts the grid to configs where Config.<KEY> == <VAL>.
- Repeat the flag to combine filters (logical AND).
- Examples:
--filter smoothing=laplace --filter n=2 --filter aggregate=logsumexp
- Both long and short key aliases are accepted (e.g., n for ngram_order, prob for prob_mode).
- For a full list of accepted keys and aliases, run:
python -m src.main --list-filters
This will print all canonical keys and their available aliases.
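The filter semantics described above (alias resolution, then a logical AND across all KEY=VAL pairs) can be sketched as follows. This is a minimal illustration, not the project's implementation: the Config fields, the ALIASES table, and the helper names parse_filters and apply_filters are assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical alias table; the real one is printed by --list-filters.
ALIASES = {"n": "ngram_order", "prob": "prob_mode"}


@dataclass(frozen=True)
class Config:
    """Stand-in for the project's Config; the field names are assumptions."""
    ngram_order: int
    smoothing: str
    prob_mode: str


def parse_filters(pairs):
    """Turn ["n=2", "smoothing=laplace"] into {"ngram_order": "2", ...}."""
    parsed = {}
    for pair in pairs:
        key, _, value = pair.partition("=")
        parsed[ALIASES.get(key, key)] = value  # resolve short aliases
    return parsed


def apply_filters(configs, filters):
    """Keep only configs that match every filter (logical AND)."""
    return [
        cfg for cfg in configs
        if all(str(getattr(cfg, key)) == value for key, value in filters.items())
    ]


# A toy grid of 2 x 2 x 2 = 8 configurations.
grid = [
    Config(n, s, p)
    for n, s, p in product((1, 2), ("laplace", "none"), ("conditional", "joint"))
]
selected = apply_filters(grid, parse_filters(["n=2", "smoothing=laplace"]))
```

Values are compared as strings here because command-line arguments arrive as text; the real tool may coerce types differently.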
🐍 Scripting & CI: Disabling Progress Bars
To make scripting and automation easier (e.g., in Docker or CI), you can suppress all progress bars globally by setting the environment variable:
NO_PROGRESS=1
This disables all Rich progress bars, regardless of CLI flags. You can also use the --no-progress CLI flag for one-off runs.
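The precedence described here (the environment variable wins regardless of CLI flags) could look like the sketch below. The function name progress_enabled is hypothetical, and treating only the exact value "1" as "disabled" is an assumption; the actual tool may accept other values.

```python
import os


def progress_enabled(no_progress_flag=False, environ=None):
    """Return True if progress bars should be shown.

    Assumption: NO_PROGRESS=1 disables bars no matter what the CLI says;
    otherwise the --no-progress flag decides.
    """
    env = os.environ if environ is None else environ
    if env.get("NO_PROGRESS") == "1":
        return False
    return not no_progress_flag
```

Passing the environment in as a parameter keeps the check easy to test without mutating os.environ.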
Key flags:
- --boundary-mode: both (default), prefix, suffix, none
- --aggregate: sum, mean, min, max, none
- --weight-mode: none, raw, log, legacy_log
- --position-strategy: absolute, relative, none (default is None)
- --smoothing-scheme: laplace, none, kn
- --count-strategy: ngram (default), others as available
- --prob-mode: conditional, joint
- --filter KEY=VAL: Filter variants by config
- --no-color: Disable colored CLI output
Example:
python -m src.main data/english.csv data/sample_test_data/english_test_data.csv output.csv \
--boundary-mode both --aggregate sum --weight-mode legacy_log --position-strategy absolute
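To make the --aggregate choices concrete, here is a rough sketch of how a list of per-n-gram scores might be reduced to a single word score. It assumes the scores are log-probabilities, which this README does not state, and the AGGREGATORS table is purely illustrative. The logsumexp entry appears because the filter examples mention it; the none mode is omitted because its behavior is not described here.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of log-scores."""
    m = max(values)
    if m == -math.inf:  # every score is -inf, so the sum is zero
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


# Illustrative lookup table; not the project's registry.
AGGREGATORS = {
    "sum": sum,
    "mean": lambda v: sum(v) / len(v),
    "min": min,
    "max": max,
    "logsumexp": logsumexp,
}

ngram_scores = [-1.0, -3.0]  # made-up log-probabilities for one word
word_score = AGGREGATORS["sum"](ngram_scores)
```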
📋 Output CSV & Header Logic
- Headers: Output CSV headers are now always unique and schema-driven. All configuration axes are included, and legacy aliasing is removed for clarity.
- No duplicate headers: The system guarantees that every configuration generates a unique header.
- Debugging: Set the environment variable DEBUG_VARIANTS=1 to log all generated headers and configurations for troubleshooting.
⚖️ Weighting & Smoothing
- Weighting modes:
  - none: Unweighted (1.0)
  - raw: Raw frequency
  - log: log(freq + 1)
  - legacy_log: 2018 behavior (log(freq) if freq > 0, -inf if freq == 0)
- Smoothing:
  - Laplace smoothing (laplace) now automatically zeroes negative or -inf counts (from legacy_log) before smoothing, ensuring legacy compatibility.
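The four weighting modes and the pre-smoothing clamp can be written out as a short sketch. The function names weight and clamp_for_laplace are hypothetical, and the natural logarithm is an assumption, since the README does not specify the base.

```python
import math


def weight(freq, mode):
    """Map a raw token frequency to a weight under the given mode."""
    if mode == "none":
        return 1.0
    if mode == "raw":
        return float(freq)
    if mode == "log":
        return math.log(freq + 1)
    if mode == "legacy_log":
        # 2018 behavior: zero-frequency items get -inf.
        return math.log(freq) if freq > 0 else -math.inf
    raise ValueError(f"unknown weight mode: {mode!r}")


def clamp_for_laplace(count):
    """Zero out negative or -inf counts before smoothing is applied."""
    return max(count, 0.0)
```

The clamp matters because legacy_log yields -inf at freq == 0 and negative values for frequencies below 1; without it, those values would carry into the smoothed counts.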
🧑‍💻 Extending the CLI
You can extend the CLI by registering your own argument injectors using the @register_cli_ext decorator from src.cli_ext. This allows you to add custom flags or argument groups from external packages or add-ons.
Example: Adding a Custom CLI Extension
Suppose you want to add a new CLI flag --my-flag that prints a custom message. You can do this in an external Python file or package:
# my_cli_plugin.py
import argparse

from src.cli_ext import register_cli_ext


@register_cli_ext("myplugin")
class MyPlugin:
    def flags(self):
        action = argparse.Action(
            option_strings=["--my-flag"],
            dest="my_flag",
            nargs=0,
            help="Print a custom message from my plugin.",
        )
        return [action]
When you run the main CLI, this flag will appear automatically if your plugin is imported before CLI construction.
🧩 Writing a Custom Model or Strategy
The system is fully registry-driven. You can add new models, weighting, smoothing, or aggregation strategies without changing core code.
Example: Registering a Custom Aggregator
# my_aggregators.py
from src.plugins.core import register


def my_custom_agg(scores):
    # Your aggregation logic here
    return sum(scores) / (len(scores) + 1)


register('aggregate_mode', 'my_custom_agg')(my_custom_agg)
Now you can use --aggregate my_custom_agg in the CLI.
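For readers unfamiliar with registry-driven designs, the sketch below shows one way a register(category, name) decorator can work. It is a self-contained stand-in, not the code in src.plugins.core: the _REGISTRY dictionary and the lookup helper are assumptions made for illustration.

```python
from collections import defaultdict

# Stand-in registry: category -> {name -> registered object}.
_REGISTRY = defaultdict(dict)


def register(category, name):
    """Return a decorator that stores the decorated object under (category, name)."""
    def decorator(obj):
        if name in _REGISTRY[category]:
            raise KeyError(f"{category}/{name} is already registered")
        _REGISTRY[category][name] = obj
        return obj  # hand the object back unchanged
    return decorator


def lookup(category, name):
    """Fetch a previously registered object."""
    return _REGISTRY[category][name]


def my_custom_agg(scores):
    return sum(scores) / (len(scores) + 1)


# Same call shape as the example above: register(...)(function).
register("aggregate_mode", "my_custom_agg")(my_custom_agg)
result = lookup("aggregate_mode", "my_custom_agg")([1.0, 2.0, 3.0])
```

Because the decorator returns the object unchanged, the same function works both as @register(...) above a definition and as a plain call after it.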
Example: Registering a Custom Model
# my_model.py
from src.plugins.core import register, BaseModel


@register('model', 'my_ngram_model')
class MyNGramModel(BaseModel):
    def fit(self, corpus):
        # Custom training logic
        ...

    def score(self, token):
        # Custom scoring logic
        ...
Your model will now be available as a --model my_ngram_model option.
🐞 Error Handling & UX
- Interrupts: Pressing Ctrl+C exits gracefully with a yellow, bold [Interrupted by user] message.
- Warnings: Deprecated use of the string 'none' for position_strategy is now eliminated; use Python None or omit the flag.
- Helpful errors: Permission and file errors are reported clearly.
🧪 Testing & Debugging
- Use DEBUG_VARIANTS=1 to log all headers/configs for debugging duplicate header issues.
- All errors and warnings are designed to be clear and actionable.
📖 Citing the UCI Phonotactic Calculator
If you publish work that uses the UCI Phonotactic Calculator, please cite this repository:
Mayer, C., Kondur, A., & Sundara, M. (2022). UCI Phonotactic Calculator (Version 0.1.0) [Computer software]. https://doi.org/10.5281/zenodo.7443706
🤝 Contributing
Contributions and suggestions are welcome! Please open issues or pull requests for bugfixes, improvements, or new features.
🔗 Resources
from src.cli_ext import register_cli_ext, CLIExtension


@register_cli_ext('my_plugin')
class MyCLIExt:
    def inject(self, parser):
        parser.add_argument('--my-flag', action='store_true', help='Enable my feature')
This will automatically inject your flag into the CLI when your extension is imported.
Adding a new aggregator
When implementing a custom aggregator or counter, ensure your class uses the ABC-friendly accumulate signature:
from src.plugins.strategies.base import BaseCounter


class MyCounter(BaseCounter):
    def accumulate(self, token, weight, **kwargs):
        # Your accumulation logic here
        pass
The **kwargs ensures compatibility with the abstract base class and the CLI, allowing extra arguments like boundary to be passed without error.
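A fuller, self-contained sketch of the same pattern follows. The BaseCounter here is a stand-in abstract class, not the one in src.plugins.strategies.base, and treating boundary as a padding symbol is an assumption made only to show an extra keyword argument being consumed.

```python
from abc import ABC, abstractmethod
from collections import Counter


class BaseCounter(ABC):
    """Stand-in for the project's abstract base class."""

    @abstractmethod
    def accumulate(self, token, weight, **kwargs):
        ...


class BigramCounter(BaseCounter):
    """Accumulates weighted bigram counts for each token (a sequence of symbols)."""

    def __init__(self):
        self.counts = Counter()

    def accumulate(self, token, weight, **kwargs):
        symbols = list(token)
        boundary = kwargs.get("boundary")  # optional extra, ignored if absent
        if boundary is not None:
            symbols = [boundary] + symbols + [boundary]
        for pair in zip(symbols, symbols[1:]):
            self.counts[pair] += weight


counter = BigramCounter()
counter.accumulate(["k", "a", "t"], 2.0, boundary="#")
counter.accumulate(["k", "a"], 1.0)  # no boundary passed; still valid
```

Callers that pass boundary and callers that do not both work against the same signature, which is the compatibility the **kwargs parameter provides.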
For more details, see the source code or run python -m src.ngram_calculator --help to view extensions and available flags.
All CLI examples and documentation now use the new flag names and header tokens for consistency and easier integration with tools like Pandas.