Runtime: lexicons, PocketSphinx alignment, CLI. Train-time (SphinxTrain/ARCTIC, future CTC) uses the same package editable in-repo — see docs/PACKAGING_RUNTIME_AND_TRAIN.md.

phonetic-decoding (Python package)

Installable library: phone inventories, lexicon ingestion (phonetic_decoding.phones.lexicon), directed phoneset maps (phonetic_decoding.phones.mapping), PocketSphinx alignment helpers (phonetic_decoding.align), and phonetic transcripts (phonetic_decoding.align.phonetic_transcript: phone strings built from the project .dic plus the words, usually IPA for ARCTIC). The ph-* console scripts are thin wrappers around library code; python -m phonetic_decoding matches ph-dec. See CLI entry points below.

Runtime vs train-time: pip install phonetic-decoding is the deployable runtime (minimal dependencies). Building models (SphinxTrain / CMU ARCTIC under st-hmm/, and later CTC / neural heads) uses this same repo with an editable install and [dev] (tests, lint); optional [train] will carry train-only Python deps (e.g. torch) without pulling dev tools into every environment. Downstream train projects can depend on the published runtime wheel and add their own trainers. Full layout, model wheels, extras, and JSON / library conventions: docs/PACKAGING_RUNTIME_AND_TRAIN.md.

CLI entry points

| Script | Role |
| --- | --- |
| ph-dec | Runtime / decode: align, fetch-lexicon, process-lexicon, pt (alias phonetic-transcript), … Global -C / --directory, --config, --workspace go before subcommands. |
| ph-init | Workspace scaffold: phonetic-decoding.toml, optional Makefile (make train-check runs ph-train -C . check), [init] metadata, seed data/lexicons/processed/en_ipa.tsv, creates build/sphinx_projects/, optional --preset arctic, optional --audio-dir / --convert-audio-ffmpeg (writes data/audio_16k/<stem>.wav; ffmpeg must be on PATH only when transcoding is needed; see LEXICONS_WORKSPACE_AND_WHEEL.md). Project directory: positional DIR or -C / --directory DIR (default .; do not pass both). With --prompts-file and --wav-dir (both required together), also prepares build/sphinx_projects/<db-name>/ for a custom corpus (see docs/TURNKEY_ARCTIC_ALIGN_AND_COMPARE.md). |
| ph-train | Train-time entry point: ph-train check verifies the project directory (data/build layout, processed lexicon hints). SphinxTrain / CMU ARCTIC builds use st-hmm/ Makefiles from the package repo (see docs/TURNKEY_ARCTIC_ALIGN_AND_COMPARE.md); more subcommands may move here later. |
| ph-arctic-corpus | CMU ARCTIC 0.95-release download / extract / ensure under build/arctic/ (same CLI as the library module; prefer this over python -m phonetic_decoding.sphinxtrain.arctic_corpus). |
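
The ph-dec row says that global options must come before the subcommand. When driving the CLI from a script, one way to keep that ordering right is to build the argument list in a single place. The sketch below is only an illustration: ph_dec_argv is a hypothetical helper, not part of the package, and it uses only the flags and the process-lexicon invocation named in this README.

```python
import shlex
from typing import List, Optional, Sequence


def ph_dec_argv(
    subcommand: str,
    sub_args: Sequence[str] = (),
    *,
    directory: Optional[str] = None,
    config: Optional[str] = None,
    workspace: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: put ph-dec global options ahead of the subcommand."""
    argv = ["ph-dec"]
    if directory is not None:
        argv += ["-C", directory]
    if config is not None:
        argv += ["--config", config]
    if workspace is not None:
        argv += ["--workspace", workspace]
    # Subcommand and its own arguments always come last.
    return argv + [subcommand, *sub_args]


argv = ph_dec_argv("process-lexicon", ["-l", "en", "--phoneset", "ipa"], directory="/proj")
print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run(argv, check=True) once the package is installed and /proj (a placeholder path) is a real workspace.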

Library: structured results for scripting — e.g. phonetic_decoding.init_workspace.ph_init_dict returns a JSON-serializable dict (paths as strings, messages for human-oriented lines). Use phonetic_decoding.json_util.json_dumps for emission; it always sets ensure_ascii=False so IPA and other Unicode stay readable.
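
To see why ensure_ascii=False matters for IPA, the sketch below compares the two settings using only the standard library json module. It does not call phonetic_decoding.json_util.json_dumps, and the dict is made-up sample data rather than the real shape of the ph_init_dict result.

```python
import json

# Illustrative payload only; the real ph_init_dict keys are not shown here.
result = {"word": "thought", "phones": "θ ɔ t", "messages": ["wrote en_ipa.tsv"]}

escaped = json.dumps(result)                       # default: non-ASCII becomes \uXXXX escapes
readable = json.dumps(result, ensure_ascii=False)  # IPA characters are emitted as written

print(escaped)
print(readable)
```

Both strings decode to the same object; only the second is legible when IPA appears in logs or terminal output.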

Default dev setup (from this directory — creates .venv/ here, same path uv sync / uv run use):

make bootstrap          # .venv/ + pip install -e ".[dev]"
make test && make lint

Lexicons (recommended for deploy and offline use): set PHONETIC_DECODING_LEXICONS_DIR if the lexicon tree should not live under data/lexicons/. Under the lexicons root, group files by kind:

  • pronunciation_api/: raw tab-separated API dumps (e.g. en_ipa.tsv). Legacy flat files at lexicons/<lang>_<phoneset>.tsv are still resolved by resolve_pronunciation_api_tsv_path when present.
  • processed/: phonolog-style TSVs (same naming as raw, e.g. en_ipa.tsv). Xenophone pronunciations are removed using unmarked phone counts and a per-language frequency floor (phonsim-aligned defaults). Output phones are stress/tone/length-stripped (IPA suprasegmentals, same as counting; see phones/ipa_suprasegmentals.py). Plain word<TAB>phones lines only (no # metadata rows). Produce with ph-dec process-lexicon -l en --phoneset ipa (reads raw pronunciation_api/, writes processed/).
  • cmudict/: original CMUdict text (default name cmudict.txt; copy or symlink from whatever upstream ships) and PocketSphinx cmudict.dict (same line format as iter_cmudict_raw_lines). ARCTIC prep uses lexicons/cmudict/cmudict.dict when that file exists, otherwise downloads under the build dir.
  • other/: any additional dictionaries you ship (same repo-relative layout idea).
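
The processed/ files are described above as plain word<TAB>phones lines. A minimal reader for that shape might look like the sketch below. read_processed_lexicon is a hypothetical helper, not the package API (the real logic lives in phonetic_decoding.phones.processed_lexicon). It assumes that a word may repeat across lines to list alternative pronunciations, and the sample lines are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def read_processed_lexicon(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Hypothetical helper: group word<TAB>phones lines by word."""
    entries: Dict[str, List[str]] = defaultdict(list)
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # tolerate blank lines
        word, sep, phones = line.partition("\t")
        if not sep:
            raise ValueError(f"expected word<TAB>phones, got: {line!r}")
        # Assumption: repeated words are alternative pronunciations.
        entries[word].append(phones)
    return dict(entries)


# Invented sample lines in the word<TAB>phones shape.
sample = ["cat\tk æ t\n", "either\ti ð ɚ\n", "either\taɪ ð ɚ\n"]
lexicon = read_processed_lexicon(sample)
print(lexicon["either"])
```

With a real file, pass an open handle instead of the sample list, for example open(path, encoding="utf-8").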

You can copy artifacts from CI or storage—no API client beyond this package’s declared dependencies.

Step-by-step fetch, process, optional bundled wheel, workspace, and transcripts: docs/LEXICONS_WORKSPACE_AND_WHEEL.md.

Wheel / pip install: a small en_ipa.tsv ships inside the package under phonetic_decoding.bundled.processed (see pyproject.toml package-data). When data/lexicons/processed/en_ipa.tsv is absent, phonetic_decoding.data_paths.resolve_processed_lexicon_path (and CLI expansion without --dic) use that bundled file, materialized under the user cache directory. Replace or grow the bundled TSV in your release process when you need full coverage; local development still prefers the on-disk tree when present.

Optional (dev only): fetch from the Duolingo pronunciation API via the sibling pronunciation-client—install with make install-pronunciation-client, then e.g. ph-dec fetch-lexicon -l en --phoneset arpabet (writes …/lexicons/pronunciation_api/en_arpabet.tsv; JWT via duo jwt / PRONUNCIATION_JWT). Omit -l / --language to fetch all languages returned by the API; process-lexicon without -l processes every raw pronunciation_api/*_<phoneset>.tsv for --phoneset (default ipa). make lexicons runs install-pronunciation-client, fetch-lexicon --skip-existing (skips API calls when a non-empty raw TSV is already on disk), and process-lexicon; use make fetch-lexicon alone to force a full re-download. Leave LEX_LANG empty for all-language defaults (LEX_PHONESET defaults to ipa). make clean-lexicons removes the lexicons directory with rm -rf (defaults to data/lexicons/; honors PHONETIC_DECODING_LEXICONS_DIR / PHONETIC_DECODING_DATA_DIR). ph-dec clean-lexicons does the same path resolution via Python (including phonetic-decoding.toml). Skip all of this in minimal deployments. fetch-lexicon -l zh uses the zh-CN lexicon on the API but still writes zh_<phoneset>.tsv. Not every supported language exposes a full dump; failures are a backend/data limitation.

Or install by hand:

pip install -e ".[dev]"          # library + tests (arpabo + pocketsphinx are core dependencies)

This directory is the package root (pyproject.toml, src/phonetic_decoding/, tests/). The library does not import st-hmm/; st-hmm/ scripts import phonetic_decoding (runtime as a dependency in the train-time workspace).

Persistent data (lexicons, corpora, builds)

Large artifacts should live outside git. Same idea as phonolog’s data/lexicons, but paths are configurable (no hardcoded machine paths).

| Env var | Default (under repo) | Role |
| --- | --- | --- |
| PHONETIC_DECODING_DATA_DIR | data/ | Lab data root; default lexicons live at data/lexicons/ unless overridden below. |
| PHONETIC_DECODING_LEXICONS_DIR | (unset → <data_dir>/lexicons) | Entire lexicon tree: pronunciation_api/, processed/, cmudict/, other/, … |
| PHONETIC_DECODING_BUILD_DIR | build/ | CMU ARCTIC trees, SphinxTrain projects, cached cmudict.dict when not bundled under lexicons |
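
The defaults in the table can be restated as a small function. This is a sketch under stated assumptions, not the code in phonetic_decoding.data_paths: resolve_dirs is a hypothetical name, it ignores phonetic-decoding.toml entirely, and it treats an empty variable the same as an unset one.

```python
import os
from pathlib import Path
from typing import Dict, Mapping


def resolve_dirs(repo_root: Path, env: Mapping[str, str] = os.environ) -> Dict[str, Path]:
    """Hypothetical mirror of the table above (environment variables only)."""
    data_dir = Path(env.get("PHONETIC_DECODING_DATA_DIR") or repo_root / "data")
    # Lexicons default to <data_dir>/lexicons, so they follow an overridden data dir.
    lexicons_dir = Path(env.get("PHONETIC_DECODING_LEXICONS_DIR") or data_dir / "lexicons")
    build_dir = Path(env.get("PHONETIC_DECODING_BUILD_DIR") or repo_root / "build")
    return {"data": data_dir, "lexicons": lexicons_dir, "build": build_dir}


# Placeholder paths; only the data dir is overridden here.
dirs = resolve_dirs(Path("/repo"), env={"PHONETIC_DECODING_DATA_DIR": "/big/pd-data"})
print(dirs["lexicons"])
print(dirs["build"])
```

Note that overriding only the data dir moves the lexicons but leaves the build dir under the repo, which is why a separate PHONETIC_DECODING_BUILD_DIR exists.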

Wheel / any working directory: the CLI does not require a git checkout. Workspace resolution uses phonetic_decoding.app_config.effective_workspace: --workspace, then $PHONETIC_DECODING_WORKSPACE, then the directory of the loaded config file, then a phonetic-decoding.toml found walking upward from cwd, then a pyproject.toml (dev trees), else cwd. Run ph-init /path/to/myproject (or ph-init .) to add phonetic-decoding.toml, a minimal Makefile, [init] metadata, and (by default) data/lexicons/processed/en_ipa.tsv from the bundled wheel; use --no-makefile / --no-lexicon when you want less. TOML keys data_dir, build_dir, lexicons_dir (optional) override the defaults in the table above; environment variables still take precedence when set. Use ph-dec --config /path/to/phonetic-decoding.toml … (or $PHONETIC_DECODING_CONFIG) when the config is not discoverable from cwd. Global -C / --directory, --config, and --workspace must appear before the subcommand (e.g. ph-dec -C /proj align … or ph-dec --workspace /proj align …).
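
The workspace resolution order described above can be sketched as a chain of fallbacks. This is a hypothetical restatement for illustration, not the body of phonetic_decoding.app_config.effective_workspace. The function names and parameters are invented, and the sketch assumes that pyproject.toml is also searched upward from cwd.

```python
from pathlib import Path
from typing import Mapping, Optional


def find_upward(start: Path, filename: str) -> Optional[Path]:
    """Return the nearest directory at or above start that contains filename."""
    for directory in (start, *start.parents):
        if (directory / filename).is_file():
            return directory
    return None


def sketch_effective_workspace(
    cwd: Path,
    env: Mapping[str, str],
    workspace_flag: Optional[str] = None,
    loaded_config: Optional[Path] = None,
) -> Path:
    """Hypothetical restatement of the documented precedence order."""
    if workspace_flag:                                  # 1. --workspace
        return Path(workspace_flag)
    if env.get("PHONETIC_DECODING_WORKSPACE"):          # 2. environment variable
        return Path(env["PHONETIC_DECODING_WORKSPACE"])
    if loaded_config is not None:                       # 3. directory of the loaded config
        return loaded_config.parent
    for marker in ("phonetic-decoding.toml", "pyproject.toml"):  # 4. and 5. upward search
        found = find_upward(cwd, marker)
        if found is not None:
            return found
    return cwd                                          # 6. fall back to cwd


print(sketch_effective_workspace(Path.cwd(), {}, workspace_flag="/proj"))
```

Each step returns as soon as it matches, so an explicit --workspace always wins over anything discovered on disk.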

Recommended on a big disk: one persistent root and nest builds next to lexicons:

export PHONETIC_DECODING_DATA_DIR=/path/to/persistent/phonetic-decoding-data
export PHONETIC_DECODING_BUILD_DIR="${PHONETIC_DECODING_DATA_DIR}/build"

Helpers: phonetic_decoding.data_paths (resolve_lexicons_root, lexicon_group_dir, default_lexicon_tsv_path, default_processed_lexicon_tsv_path, default_cmudict_pocketsphinx_dict_path, resolve_pronunciation_api_tsv_path, …) match the layout above. Processing logic: phonetic_decoding.phones.processed_lexicon and phones.ipa_suprasegmentals.

st-hmm/ is optional: SphinxTrain / CMU ARCTIC Makefile and scripts. After make bootstrap and processed lexicons (see below), run make st-hmm-help or make arctic-combined-ipa-multipron-and-report from this directory (the top-level Makefile forwards to st-hmm with BUILD_DIR=../build). Details: st-hmm/README.md.

Base setup + lexicons: docs/LEXICONS_WORKSPACE_AND_WHEEL.md. ARCTIC / SphinxTrain / align / compare (after lexicons): docs/TURNKEY_ARCTIC_ALIGN_AND_COMPARE.md, which covers corpus inference (forced alignment phone strings vs dictionary-only), not lexicon fetch/process.
