A toolkit for phonotactic probability calculation and analysis.

UCI Phonotactic Calculator

This repository contains the source code for the UCI Phonotactic Calculator website, as well as a flexible, extensible CLI for phonotactic modeling and scoring.

Source code: src/
Example datasets: data/

🚀 Quick Start

Requirements

Python 3.8+

(Optional) Create a virtual environment:

python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows

Install dependencies:
```
pip install -r requirements.txt
```

Running a Model

python -m src.main <train_file> <test_file> <output_file> [OPTIONS]

Example:

python -m src.main data/english.csv data/sample_test_data/english_test_data.csv output.csv --boundary-mode both --aggregate sum --weight-mode legacy_log

🛠️ CLI Options

The CLI supports a wide range of options. Run python -m src.main --help for a full list.

🔎 Filtering and Discovering Keys

You can restrict the grid search to specific model/configuration variants using one or more --filter flags:

python -m src.main ... --filter KEY=VAL [--filter KEY2=VAL2 ...]

Each filter restricts the grid to configs where Config.<KEY> == <VAL>.
Repeat the flag to combine filters (logical AND).

Examples:

--filter smoothing=laplace --filter n=2
--filter aggregate=logsumexp

Both canonical keys and their short aliases are accepted (e.g., n for ngram_order, prob for prob_mode).
For a full list of accepted keys and aliases, run:
```
python -m src.main --list-filters
```
This will print all canonical keys and their available aliases.
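The filter semantics described above (KEY=VAL pairs, alias resolution, logical AND) can be sketched in a few lines. This is an illustrative stand-in, not the calculator's implementation: the `ALIASES` table, the `parse_filters` and `apply_filters` helpers, and the plain-dict configs are assumptions made for the example. Only the `n`/`ngram_order` and `prob`/`prob_mode` aliases come from this README.

```python
# Hypothetical alias table; only these two aliases are named in this README.
ALIASES = {"n": "ngram_order", "prob": "prob_mode"}


def parse_filters(raw_filters):
    """Turn ["KEY=VAL", ...] into {canonical_key: value}."""
    parsed = {}
    for item in raw_filters:
        key, sep, value = item.partition("=")
        if not sep or not key or not value:
            raise ValueError(f"expected KEY=VAL, got {item!r}")
        # Resolve a short alias to its canonical key, if one exists.
        parsed[ALIASES.get(key, key)] = value
    return parsed


def apply_filters(configs, filters):
    """Keep only the configs that match every filter (logical AND)."""
    return [
        cfg for cfg in configs
        if all(str(cfg.get(key)) == value for key, value in filters.items())
    ]


# A toy grid of configurations, represented as plain dicts for the sketch.
grid = [
    {"ngram_order": 1, "prob_mode": "joint"},
    {"ngram_order": 2, "prob_mode": "joint"},
    {"ngram_order": 2, "prob_mode": "conditional"},
]

filters = parse_filters(["n=2", "prob=conditional"])
selected = apply_filters(grid, filters)
print(selected)
```

Values are compared as strings here because command-line arguments arrive as text; the real tool may coerce types differently.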

🐍 Scripting & CI: Disabling Progress Bars

To make scripting and automation easier (e.g., in Docker or CI), you can suppress all progress bars globally by setting the environment variable:

NO_PROGRESS=1

This disables all Rich progress bars, regardless of CLI flags. You can also use the --no-progress CLI flag for one-off runs.
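For automation, one way to apply this from a Python driver script is to build the command and a copy of the environment with NO_PROGRESS set. The sketch below only constructs the invocation; `build_invocation` is a hypothetical helper, not part of the package, and the file paths are the ones used in the examples in this README.

```python
import os
import sys


def build_invocation(train_file, test_file, output_file, extra_args=()):
    """Return (command, env) for a run with progress bars suppressed."""
    command = [
        sys.executable, "-m", "src.main",
        train_file, test_file, output_file,
        *extra_args,
    ]
    # Copy the current environment and switch progress bars off.
    env = dict(os.environ, NO_PROGRESS="1")
    return command, env


command, env = build_invocation(
    "data/english.csv",
    "data/sample_test_data/english_test_data.csv",
    "output.csv",
    extra_args=["--aggregate", "sum"],
)
print(" ".join(command[1:]))

# To actually run it from the repository root:
#   import subprocess
#   subprocess.run(command, env=env, check=True)
```

Passing a copied environment keeps the setting local to the child process instead of changing the parent script's environment.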

Key flags:

--boundary-mode: both (default), prefix, suffix, none
--aggregate: sum, mean, min, max, none
--weight-mode: none, raw, log, legacy_log
--position-strategy: absolute, relative (default: None; omit the flag to leave it unset)
--smoothing-scheme: laplace, none, kn
--count-strategy: ngram (default), others as available
--prob-mode: conditional, joint
--filter KEY=VAL: Filter variants by config
--no-color: Disable colored CLI output

Example:

python -m src.main data/english.csv data/sample_test_data/english_test_data.csv output.csv \
    --boundary-mode both --aggregate sum --weight-mode legacy_log --position-strategy absolute

📋 Output CSV & Header Logic

- Headers: Output CSV headers are always unique and schema-driven. Every configuration axis is included, and legacy aliases have been removed for clarity.
- No duplicate headers: Every configuration generates its own unique header.
- Debugging: Set the environment variable DEBUG_VARIANTS=1 to log all generated headers and configurations for troubleshooting.
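A downstream script can confirm the uniqueness guarantee before handing the file to other tools. The following is a minimal sketch using only the standard library; `duplicate_headers` is a hypothetical helper and the column names in the sample data are made up for illustration.

```python
import csv
import io
from collections import Counter


def duplicate_headers(csv_file):
    """Return the header names that occur more than once in the first row."""
    header = next(csv.reader(csv_file), [])
    counts = Counter(header)
    return sorted(name for name, count in counts.items() if count > 1)


# In-memory stand-ins for an output file; the column names are invented.
good = io.StringIO("word,score_a,score_b\nkat,-1.2,-3.4\n")
bad = io.StringIO("word,score_a,score_a\nkat,-1.2,-3.4\n")

print(duplicate_headers(good))
print(duplicate_headers(bad))
```

For a real file, open it with `open("output.csv", newline="")` and pass the file object to the same function; an empty list means every header is unique.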

⚖️ Weighting & Smoothing

Weighting modes:
- none: Unweighted (1.0)
- raw: Raw frequency
- log: log(freq + 1)
- legacy_log: 2018 behavior (log(freq) if freq > 0, -inf if freq == 0)

Smoothing:
- Laplace smoothing (laplace) now automatically zeroes negative or -inf counts (from legacy_log) before smoothing, ensuring legacy compatibility.
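The weighting formulas in the list above, and the clean-up step that Laplace smoothing applies, can be written out as follows. This is a sketch under stated assumptions rather than the package's code: the function names are hypothetical, and `laplace` uses textbook add-one smoothing with the number of keys as the vocabulary size, which may differ from how the calculator combines weights and counts internally.

```python
import math


def weight(freq, mode):
    """Map a raw frequency to a weight, following the list above."""
    if mode == "none":
        return 1.0
    if mode == "raw":
        return float(freq)
    if mode == "log":
        return math.log(freq + 1)
    if mode == "legacy_log":
        return math.log(freq) if freq > 0 else float("-inf")
    raise ValueError(f"unknown weight mode: {mode!r}")


def zero_invalid(count):
    """Replace negative or -inf counts with 0.0 before smoothing."""
    return count if count > 0 else 0.0


def laplace(counts):
    """Add-one smoothing over a {key: count} mapping (sketch only)."""
    cleaned = {key: zero_invalid(value) for key, value in counts.items()}
    denominator = sum(cleaned.values()) + len(cleaned)
    return {key: (value + 1) / denominator for key, value in cleaned.items()}


# Made-up frequencies for three bigrams.
freqs = {"ka": 0, "at": 1, "ta": 100}
weighted = {key: weight(freq, "legacy_log") for key, freq in freqs.items()}
print(weighted["ka"])  # -inf, because the frequency is 0

probs = laplace(weighted)
print(probs)
```

Without the zeroing step, the -inf produced by legacy_log for a zero frequency would propagate into the denominator and make every smoothed probability meaningless.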

🧑‍💻 Extending the CLI

You can extend the CLI by registering your own argument injectors using the @register_cli_ext decorator from src.cli_ext. This allows you to add custom flags or argument groups from external packages or add-ons.

Example: Adding a Custom CLI Extension

Suppose you want to add a new CLI flag --my-flag that prints a custom message. You can do this in an external Python file or package:

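# my_cli_plugin.py
import argparse

from src.cli_ext import register_cli_ext


class PrintMessageAction(argparse.Action):
    """Print a message when --my-flag is given."""

    def __call__(self, parser, namespace, values, option_string=None):
        # The base argparse.Action raises NotImplementedError when invoked,
        # so the behavior has to live in a subclass.
        print("Hello from my plugin!")
        setattr(namespace, self.dest, True)


@register_cli_ext("myplugin")
class MyPlugin:
    def flags(self):
        action = PrintMessageAction(
            option_strings=["--my-flag"],
            dest="my_flag",
            nargs=0,
            help="Print a custom message from my plugin.",
        )
        return [action]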

When you run the main CLI, this flag will appear automatically if your plugin is imported before CLI construction.

🧩 Writing a Custom Model or Strategy

The system is fully registry-driven. You can add new models, weighting, smoothing, or aggregation strategies without changing core code.

Example: Registering a Custom Aggregator

# my_aggregators.py
from src.plugins.core import register

def my_custom_agg(scores):
    # Your aggregation logic here
    return sum(scores) / (len(scores) + 1)

register('aggregate_mode', 'my_custom_agg')(my_custom_agg)

Now you can use --aggregate my_custom_agg in the CLI.
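To make the `register(category, name)(obj)` pattern concrete, here is a self-contained stand-in for a registry of this kind. It is not the code in src.plugins.core: `_REGISTRY`, `lookup`, the duplicate-name check, and the `total` aggregator are all assumptions made for the sketch. It shows why the call form used above and the decorator form are equivalent.

```python
# Stand-in registry: maps (category, name) to the registered object.
_REGISTRY = {}


def register(category, name):
    """Return a decorator that stores its target under (category, name)."""
    def decorator(obj):
        key = (category, name)
        if key in _REGISTRY:
            raise KeyError(f"{name!r} is already registered under {category!r}")
        _REGISTRY[key] = obj
        return obj  # hand the object back so the original name stays usable
    return decorator


def lookup(category, name):
    """Fetch a previously registered object."""
    return _REGISTRY[(category, name)]


def my_custom_agg(scores):
    return sum(scores) / (len(scores) + 1)


# Call form, as in the example above ...
register("aggregate_mode", "my_custom_agg")(my_custom_agg)


# ... and the equivalent decorator form.
@register("aggregate_mode", "total")
def total(scores):
    return sum(scores)


agg = lookup("aggregate_mode", "my_custom_agg")
print(agg([-1.0, -2.0, -3.0]))  # -6.0 / 4 = -1.5
```

Because the decorator returns the object unchanged, registering a function does not alter how it behaves when called directly.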

Example: Registering a Custom Model

# my_model.py
from src.plugins.core import register, BaseModel

@register('model', 'my_ngram_model')
class MyNGramModel(BaseModel):
    def fit(self, corpus):
        # Custom training logic
        ...
    def score(self, token):
        # Custom scoring logic
        ...

Your model will now be available as a --model my_ngram_model option.

🐞 Error Handling & UX

- Interrupts: Pressing Ctrl+C exits gracefully with a yellow, bold [Interrupted by user] message.
- Warnings: The deprecated string 'none' is no longer used for position_strategy; use Python None or omit the flag.
- Helpful errors: Permission and file errors are reported clearly.

🧪 Testing & Debugging

Use DEBUG_VARIANTS=1 to log all headers/configs for debugging duplicate header issues.
All errors and warnings are designed to be clear and actionable.

📖 Citing the UCI Phonotactic Calculator

If you publish work that uses the UCI Phonotactic Calculator, please cite this repository:

Mayer, C., Kondur, A., & Sundara, M. (2022). UCI Phonotactic Calculator (Version 0.1.0) [Computer software]. https://doi.org/10.5281/zenodo.7443706

🤝 Contributing

Contributions and suggestions are welcome! Please open issues or pull requests for bugfixes, improvements, or new features.

🔗 Resources

from src.cli_ext import register_cli_ext, CLIExtension

@register_cli_ext('my_plugin')
class MyCLIExt:
    def inject(self, parser):
        parser.add_argument('--my-flag', action='store_true', help='Enable my feature')

This will automatically inject your flag into the CLI when your extension is imported.

Adding a new aggregator

When implementing a custom aggregator or counter, ensure your class uses the ABC-friendly accumulate signature:

from src.plugins.strategies.base import BaseCounter

class MyCounter(BaseCounter):
    def accumulate(self, token, weight, **kwargs):
        # Your accumulation logic here
        pass

The **kwargs ensures compatibility with the abstract base class and the CLI, allowing extra arguments like boundary to be passed without error.
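The effect of `**kwargs` can be demonstrated without the package installed. In the sketch below, `CounterBase` is a stand-in for BaseCounter, and the `boundary="#"` value is a made-up example of an extra keyword argument a caller might pass.

```python
from abc import ABC, abstractmethod


class CounterBase(ABC):
    """Stand-in for BaseCounter, used only for this illustration."""

    @abstractmethod
    def accumulate(self, token, weight, **kwargs):
        ...


class TolerantCounter(CounterBase):
    def __init__(self):
        self.total = 0.0

    def accumulate(self, token, weight, **kwargs):
        # Extra keyword arguments such as boundary are accepted and ignored.
        self.total += weight


class StrictCounter(CounterBase):
    def __init__(self):
        self.total = 0.0

    def accumulate(self, token, weight):  # no **kwargs
        self.total += weight


tolerant = TolerantCounter()
tolerant.accumulate(["k", "a", "t"], 1.0, boundary="#")
print(tolerant.total)  # 1.0

strict = StrictCounter()
try:
    strict.accumulate(["k", "a", "t"], 1.0, boundary="#")
except TypeError as exc:
    # The narrower signature rejects the unexpected keyword argument.
    print("TypeError:", exc)
```

Note that Python's ABC machinery does not catch the narrower signature at class-definition time; the failure only appears when a caller passes the extra argument.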

For more details, see the source code or run python -m src.ngram_calculator --help to view extensions and available flags.

All CLI examples and documentation now use the new flag names and header tokens for consistency and easier integration with tools like Pandas.
