
A toolkit for phonotactic probability calculation and analysis.


UCI Phonotactic Calculator


This repository contains the source code for the UCI Phonotactic Calculator website, as well as a flexible, extensible CLI for phonotactic modeling and scoring.


🚀 Quick Start

Requirements

  • Python 3.8+
  • (Optional) Create a virtual environment:
    python -m venv venv
    source venv/bin/activate  # or venv\Scripts\activate on Windows
    
  • Install dependencies:
    pip install -r requirements.txt
    

Running a Model

python -m src.main <train_file> <test_file> <output_file> [OPTIONS]

Example:

python -m src.main data/english.csv data/sample_test_data/english_test_data.csv output.csv --boundary-mode both --aggregate sum --weight-mode legacy_log

🛠️ CLI Options

The CLI supports a wide range of options. Run python -m src.main --help for a full list.

🔎 Filtering and Discovering Keys

You can restrict the grid search to specific model/configuration variants using one or more --filter flags:

python -m src.main ... --filter KEY=VAL [--filter KEY2=VAL2 ...]
  • Each filter restricts the grid to configs where Config.<KEY> == <VAL>.
  • Repeat the flag to combine filters (logical AND).
  • Examples:
    --filter smoothing=laplace --filter n=2
    --filter aggregate=logsumexp
    
  • Both long and short key aliases are accepted (e.g., n for ngram_order, prob for prob_mode).
  • For a full list of accepted keys and aliases, run:
    python -m src.main --list-filters
    
    This will print all canonical keys and their available aliases.
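To make the AND semantics concrete, here is a minimal sketch of how repeated KEY=VAL filters could narrow a configuration grid. The Config fields, the ALIASES table, and both helper functions are hypothetical stand-ins written for illustration; they are not the project's actual classes, and the real key list comes from --list-filters.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical alias table; the real one is printed by --list-filters.
ALIASES = {"n": "ngram_order", "prob": "prob_mode"}


@dataclass(frozen=True)
class Config:
    # Illustrative subset of configuration axes (assumed names).
    ngram_order: int
    smoothing: str
    prob_mode: str


def parse_filters(pairs):
    """Turn ["n=2", "smoothing=laplace"] into {"ngram_order": "2", ...}."""
    parsed = {}
    for pair in pairs:
        key, _, value = pair.partition("=")
        parsed[ALIASES.get(key, key)] = value
    return parsed


def apply_filters(grid, filters):
    """Keep only the configs that match every filter (logical AND)."""
    return [
        cfg
        for cfg in grid
        if all(str(getattr(cfg, key)) == value for key, value in filters.items())
    ]


grid = [
    Config(n, smoothing, prob)
    for n, smoothing, prob in product(
        (1, 2), ("laplace", "none"), ("conditional", "joint")
    )
]
selected = apply_filters(grid, parse_filters(["smoothing=laplace", "n=2"]))
for cfg in selected:
    print(cfg)
```

Each additional filter can only shrink the selection, which is why repeating the flag is a convenient way to isolate a single variant.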

🐍 Scripting & CI: Disabling Progress Bars

To make scripting and automation easier (e.g., in Docker or CI), you can suppress all progress bars globally by setting the environment variable:

NO_PROGRESS=1

This disables all Rich progress bars, regardless of CLI flags. You can also use the --no-progress CLI flag for one-off runs.
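For automation driven from Python, a small wrapper along these lines could set the variable for a single child process without touching the parent environment. The helper names build_invocation and run are hypothetical; only the python -m src.main entry point and the NO_PROGRESS variable come from this README.

```python
import os
import subprocess
import sys


def build_invocation(train, test, output, extra_args=()):
    """Return (command, environment) for a run with progress bars disabled."""
    cmd = [sys.executable, "-m", "src.main", train, test, output, *extra_args]
    # Copy the current environment and add NO_PROGRESS=1 for the child only.
    env = dict(os.environ, NO_PROGRESS="1")
    return cmd, env


def run(train, test, output, extra_args=()):
    """Run the CLI; call this from the repository root."""
    cmd, env = build_invocation(train, test, output, extra_args)
    subprocess.run(cmd, env=env, check=True)


cmd, env = build_invocation(
    "data/english.csv",
    "data/sample_test_data/english_test_data.csv",
    "output.csv",
    ["--aggregate", "sum"],
)
print(" ".join(cmd))
```

Passing check=True makes a failed run raise an exception, which is usually what a CI job wants.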

Key flags:

  • --boundary-mode: both (default), prefix, suffix, none
  • --aggregate: sum, mean, min, max, none
  • --weight-mode: none, raw, log, legacy_log
  • --position-strategy: absolute, relative (default: None; omit the flag for no positional strategy)
  • --smoothing-scheme: laplace, none, kn
  • --count-strategy: ngram (default), others as available
  • --prob-mode: conditional, joint
  • --filter KEY=VAL: Filter variants by config
  • --no-color: Disable colored CLI output

Example:

python -m src.main data/english.csv data/sample_test_data/english_test_data.csv output.csv \
    --boundary-mode both --aggregate sum --weight-mode legacy_log --position-strategy absolute
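As an illustration of what --boundary-mode plausibly controls, the sketch below pads a word with boundary symbols before extracting n-grams. This is an assumption about the semantics, not a description of the project's code: the "#" symbol, the padding width of n - 1, and the function names are all hypothetical.

```python
def pad(symbols, n, boundary_mode="both", boundary="#"):
    """Pad a symbol sequence with n - 1 boundary markers (assumed behavior)."""
    if boundary_mode not in ("both", "prefix", "suffix", "none"):
        raise ValueError(f"unknown boundary mode: {boundary_mode!r}")
    fill = [boundary] * (n - 1)
    left = fill if boundary_mode in ("both", "prefix") else []
    right = fill if boundary_mode in ("both", "suffix") else []
    return left + list(symbols) + right


def ngrams(symbols, n):
    """Return all contiguous n-grams of a sequence as tuples."""
    return [tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]


word = ["k", "ae", "t"]
for mode in ("both", "prefix", "suffix", "none"):
    print(mode, ngrams(pad(word, 2, mode), 2))
```

Under this reading, prefix and suffix let a model attend to word-initial or word-final phonotactics only, while none scores word-internal sequences alone.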

📋 Output CSV & Header Logic

  • Headers: Output CSV headers are now always unique and schema-driven. All configuration axes are included, and legacy aliasing is removed for clarity.
  • No duplicate headers: The system guarantees that every configuration generates a unique header.
  • Debugging: Set the environment variable DEBUG_VARIANTS=1 to log all generated headers and configurations for troubleshooting.
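If you want to verify the uniqueness guarantee on your own output files, a check like the following is enough. It uses only the standard library; the column names in the sample are invented for illustration and are not the calculator's actual header tokens.

```python
import csv
import io
from collections import Counter


def duplicate_headers(csv_file):
    """Return the header names that occur more than once in an open CSV file."""
    header = next(csv.reader(csv_file))
    return [name for name, count in Counter(header).items() if count > 1]


# Invented sample; replace with open("output.csv", newline="") for a real file.
sample = io.StringIO("word,score_a,score_b\nk ae t,-3.2,-4.1\n")
print(duplicate_headers(sample))
```

An empty list means every column can be addressed by name, for example when loading the file into a data frame.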

⚖️ Weighting & Smoothing

  • Weighting modes:
    • none: Unweighted (1.0)
    • raw: Raw frequency
    • log: log(freq + 1)
    • legacy_log: 2018 behavior (log(freq) if freq > 0, -inf if freq == 0)
  • Smoothing:
    • Laplace smoothing (laplace) now automatically zeroes negative or -inf counts (from legacy_log) before smoothing, ensuring legacy compatibility.
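The weighting formulas and the zeroing step can be sketched as follows. This is a minimal illustration, assuming natural logarithms and plain add-one smoothing over the observed keys; the function names weight and laplace are hypothetical and do not mirror the project's internals.

```python
import math


def weight(freq, mode):
    """Map a raw frequency to a weight under the given weighting mode."""
    if mode == "none":
        return 1.0
    if mode == "raw":
        return float(freq)
    if mode == "log":
        return math.log(freq + 1)
    if mode == "legacy_log":
        return math.log(freq) if freq > 0 else float("-inf")
    raise ValueError(f"unknown weight mode: {mode!r}")


def laplace(counts):
    """Add-one smoothing after clamping negative or -inf counts to zero."""
    cleaned = {key: max(value, 0.0) for key, value in counts.items()}
    total = sum(cleaned.values()) + len(cleaned)
    return {key: (value + 1.0) / total for key, value in cleaned.items()}


# A zero-frequency item weighted with legacy_log contributes -inf to its count.
counts = {"a": weight(0, "legacy_log"), "b": 1.0}
print(laplace(counts))
```

Without the clamping step, a single -inf count would make the total -inf and leave the smoothed values meaningless, which is why the zeroing matters for legacy_log.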

🧑‍💻 Extending the CLI

You can extend the CLI by registering your own argument injectors using the @register_cli_ext decorator from src.cli_ext. This allows you to add custom flags or argument groups from external packages or add-ons.

Example: Adding a Custom CLI Extension

Suppose you want to add a new CLI flag --my-flag that prints a custom message. You can do this in an external Python file or package:

# my_cli_plugin.py
from src.cli_ext import register_cli_ext
import argparse

@register_cli_ext("myplugin")
class MyPlugin:
    def flags(self):
        action = argparse.Action(
            option_strings=["--my-flag"],
            dest="my_flag",
            nargs=0,
            help="Print a custom message from my plugin."
        )
        return [action]

When you run the main CLI, this flag will appear automatically if your plugin is imported before CLI construction.


🧩 Writing a Custom Model or Strategy

The system is fully registry-driven. You can add new models, weighting, smoothing, or aggregation strategies without changing core code.

Example: Registering a Custom Aggregator

# my_aggregators.py
from src.plugins.core import register

def my_custom_agg(scores):
    # Your aggregation logic here
    return sum(scores) / (len(scores) + 1)

register('aggregate_mode', 'my_custom_agg')(my_custom_agg)

Now you can use --aggregate my_custom_agg in the CLI.

Example: Registering a Custom Model

# my_model.py
from src.plugins.core import register, BaseModel

@register('model', 'my_ngram_model')
class MyNGramModel(BaseModel):
    def fit(self, corpus):
        # Custom training logic
        ...
    def score(self, token):
        # Custom scoring logic
        ...

Your model will now be available as a --model my_ngram_model option.


🐞 Error Handling & UX

  • Interrupts: Pressing Ctrl+C exits gracefully with a yellow, bold [Interrupted by user] message.
  • Warnings: The deprecated string 'none' is no longer accepted for position_strategy; pass Python None in code, or omit the --position-strategy flag on the command line.
  • Helpful errors: Permission and file errors are reported clearly.

🧪 Testing & Debugging

  • Use DEBUG_VARIANTS=1 to log all headers/configs for debugging duplicate header issues.
  • All errors and warnings are designed to be clear and actionable.

📖 Citing the UCI Phonotactic Calculator

If you publish work that uses the UCI Phonotactic Calculator, please cite this repository:

Mayer, C., Kondur, A., & Sundara, M. (2022). UCI Phonotactic Calculator (Version 0.1.0) [Computer software]. https://doi.org/10.5281/zenodo.7443706


🤝 Contributing

Contributions and suggestions are welcome! Please open issues or pull requests for bugfixes, improvements, or new features.


🔗 Resources

from src.cli_ext import register_cli_ext, CLIExtension

@register_cli_ext('my_plugin')
class MyCLIExt:
    def inject(self, parser):
        parser.add_argument('--my-flag', action='store_true', help='Enable my feature')

This will automatically inject your flag into the CLI when your extension is imported.

Adding a new aggregator or counter

When implementing a custom aggregator or counter, ensure your class uses the ABC-friendly accumulate signature:

from src.plugins.strategies.base import BaseCounter

class MyCounter(BaseCounter):
    def accumulate(self, token, weight, **kwargs):
        # Your accumulation logic here
        pass

The **kwargs ensures compatibility with the abstract base class and the CLI, allowing extra arguments like boundary to be passed without error.
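The effect of **kwargs can be seen in a standalone sketch. The BaseCounter below is a simplified stand-in defined here so the example runs on its own; it is not the class from src.plugins.strategies.base, and SymbolCounter is a hypothetical subclass.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


class BaseCounter(ABC):
    """Simplified stand-in for the project's abstract counter (assumption)."""

    @abstractmethod
    def accumulate(self, token, weight, **kwargs):
        ...


class SymbolCounter(BaseCounter):
    """Counts individual symbols, weighted by the token's weight."""

    def __init__(self):
        self.counts = defaultdict(float)

    def accumulate(self, token, weight, **kwargs):
        # Extra keyword arguments such as boundary are accepted and ignored.
        for symbol in token:
            self.counts[symbol] += weight


counter = SymbolCounter()
counter.accumulate(["k", "ae", "t"], 2.0, boundary="#")  # extra kwarg is fine
counter.accumulate(["t", "ae"], 1.0)
print(dict(counter.counts))
```

Had accumulate been declared without **kwargs, the first call would raise a TypeError as soon as a caller supplied boundary.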

For more details, see the source code or run python -m src.ngram_calculator --help to view extensions and available flags.


All CLI examples and documentation now use the new flag names and header tokens for consistency and easier integration with tools like Pandas.
