XPF Corpus grapheme-to-phoneme transcriber

xpfcorpus

A Python package for grapheme-to-phoneme transcription based on the XPF Corpus.

Documentation: xpfcorpus.readthedocs.io

XPF Corpus Resources:

Installation

pip install xpfcorpus

Python API

from xpfcorpus import Transcriber, available_languages

# Basic usage - languages with a default script
es = Transcriber("es")
es.transcribe("ejemplo")  # ['e', 'x', 'e', 'm', 'p', 'l', 'o']

# Languages with multiple scripts require explicit script choice
tt_latin = Transcriber("tt", "latin")
tt_cyrillic = Transcriber("tt", "cyrillic")

# BCP-47 style language codes with script/region
es_es = Transcriber("es-ES")  # Region treated as a variant; uses default script
yi = Transcriber("yi-Latn")   # Script extracted from code
tt = Transcriber("tt-cyrillic")  # Script name in code
zh = Transcriber("zh-Hans-CN")  # Script extracted; region treated as a variant

# Explicit script parameter overrides code
yi = Transcriber("yi-Latn", script="hebrew")  # Uses hebrew, not latin

# Skip verification on load
es = Transcriber("es", verify=False)

# List available languages
available_languages()
# {"es": {"scripts": ["latin"], "default": "latin"},
#  "tt": {"scripts": ["latin", "cyrillic"], "default": None}, ...}

Language Code Format

The package supports BCP-47 style language codes:

  • Simple codes: "es", "tt", "yi"
  • With region (variants): "es-ES", "en-US" → treated as language variants
  • With script (extracted): "yi-Latn", "tt-Cyrl" → extracts script
  • Script names (extracted): "tt-cyrillic", "yi-latin" → extracts script
  • Complex codes: "zh-Hans-CN" → extracts "hans" script, treats "CN" as a region variant

When both a script in the code and an explicit script parameter are provided, the explicit parameter takes precedence.
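
The parsing rules above can be sketched in a few lines. This is an illustrative sketch only, not the package's actual parser: parse_code, ParsedCode, SCRIPT_ALIASES, and KNOWN_SCRIPT_NAMES are hypothetical names, and the alias table is an assumption based on the examples in this section.

```python
from typing import NamedTuple, Optional

# Hypothetical tables (assumptions for illustration, not taken from xpfcorpus).
SCRIPT_ALIASES = {"latn": "latin", "cyrl": "cyrillic", "hebr": "hebrew"}
KNOWN_SCRIPT_NAMES = {"latin", "cyrillic", "hebrew", "syllabics"}


class ParsedCode(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]


def parse_code(code: str, script: Optional[str] = None) -> ParsedCode:
    """Split a BCP-47 style code into language, script, and region."""
    parts = code.split("-")
    language = parts[0].lower()
    found_script = None
    region = None
    for part in parts[1:]:
        lowered = part.lower()
        if lowered in KNOWN_SCRIPT_NAMES:
            # Full script name, e.g. "tt-cyrillic"
            found_script = lowered
        elif len(part) == 4 and part.isalpha():
            # Four-letter script subtag, e.g. "Latn" or "Hans"
            found_script = SCRIPT_ALIASES.get(lowered, lowered)
        elif len(part) == 2 and part.isalpha():
            # Two-letter region subtag, e.g. "ES" or "CN"
            region = part.upper()
    # An explicit script argument takes precedence over the code.
    return ParsedCode(language, script or found_script, region)


print(parse_code("zh-Hans-CN"))
print(parse_code("yi-Latn", script="hebrew"))
```

The sketch ignores subtags it does not recognize (for example, three-digit region codes), which may differ from how the package behaves.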

Language Variants

Region codes are treated as language variants. The package will:

  1. Try to load variant-specific data (e.g., es-ES.json) if available
  2. Fall back to the base language (e.g., es.json) with a warning if the variant is not found
  3. Store the region code in the variant property (only if the variant file exists)

# Base language
es = Transcriber("es")
print(es.variant)  # None

# Variant request (falls back to base with warning if es-ES.json doesn't exist)
es_es = Transcriber("es-ES")
print(es_es.variant)  # None (because es-ES.json doesn't exist, fell back to es.json)

# If es-ES.json existed, then:
# es_es.variant would be "ES"

Behavior:

  • If variant file exists: variant property returns the region code (e.g., "ES")
  • If variant file doesn't exist: falls back to base language, variant is None

To create a variant, add a JSON file like es-ES.json to the xpfcorpus/data/languages/ directory with variant-specific rules.
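
The fallback behavior described in this section can be approximated as follows. This is a minimal sketch under stated assumptions, not the package's implementation: resolve_language_file is a hypothetical helper, and the demo uses a temporary directory rather than the real xpfcorpus/data/languages/ directory.

```python
import tempfile
import warnings
from pathlib import Path
from typing import Optional, Tuple


def resolve_language_file(
    data_dir: Path, language: str, region: Optional[str]
) -> Tuple[Path, Optional[str]]:
    """Return (rules file, variant), preferring a variant file if one exists."""
    if region:
        variant_file = data_dir / f"{language}-{region}.json"
        if variant_file.exists():
            return variant_file, region
        # No variant file: warn and fall through to the base language.
        warnings.warn(
            f"No variant data for {language}-{region}; using {language}.json"
        )
    return data_dir / f"{language}.json", None


# Demo: only es.json exists, so a request for es-ES falls back with a warning.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "es.json").write_text("{}")
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        path, variant = resolve_language_file(data_dir, "es", "ES")
    fallback_name, fallback_variant, n_warnings = path.name, variant, len(caught)

    # Once es-ES.json is added, the variant is picked up and recorded.
    (data_dir / "es-ES.json").write_text("{}")
    path, variant = resolve_language_file(data_dir, "es", "ES")
    variant_name, variant_value = path.name, variant
```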

Command-Line Interface

# Transcribe words
xpfcorpus transcribe es ejemplo hola mundo

# Transcribe from file (extracts first word from each line)
xpfcorpus transcribe es -f words.txt

# Transcribe from stdin
echo -e "mundo\nbueno" | xpfcorpus transcribe es
cat words.txt | xpfcorpus transcribe es -f -

# Combine command-line words and file
xpfcorpus transcribe es ejemplo hola -f more_words.txt

# List available languages
xpfcorpus list
xpfcorpus list --json

# Export language rules as YAML
xpfcorpus export es
xpfcorpus export es -o spanish.yaml

# Verify language rules
xpfcorpus verify es -v
xpfcorpus verify --all
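
When preparing a word list for the -f option, it can help to preview which words would be picked up. The Python sketch below mirrors the "first word from each line" behavior noted above; first_words is a hypothetical helper, and splitting on whitespace and skipping blank lines are assumptions rather than documented behavior.

```python
from typing import Iterable, List


def first_words(lines: Iterable[str]) -> List[str]:
    """Return the first whitespace-delimited token of each non-empty line."""
    words = []
    for line in lines:
        tokens = line.split()
        if tokens:  # skip blank lines
            words.append(tokens[0])
    return words


# Example input resembling a frequency list: word, then extra columns.
sample_lines = ["mundo 1520\n", "\n", "bueno\tadj\n"]
print(first_words(sample_lines))
```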

Supported Languages

The package includes rules for 201 languages with 203 language/script combinations. Some languages have multiple scripts:

  • iu (Inuktitut): latin, syllabics
  • tt (Tatar): latin, cyrillic

Use xpfcorpus list or available_languages() for the full list.
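
As a rough illustration, a dictionary shaped like the available_languages() output shown in the Python API section can be inspected to find languages that need an explicit script and to count language/script combinations. The helper names below are hypothetical, and the sample data is only the two entries shown earlier, not the full list.

```python
from typing import Any, Dict, List


def needs_explicit_script(languages: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return codes of languages that have no default script."""
    return sorted(
        code for code, info in languages.items() if info["default"] is None
    )


def count_combinations(languages: Dict[str, Dict[str, Any]]) -> int:
    """Count language/script pairs across all languages."""
    return sum(len(info["scripts"]) for info in languages.values())


# Sample shaped like the documented return value (two entries only).
sample = {
    "es": {"scripts": ["latin"], "default": "latin"},
    "tt": {"scripts": ["latin", "cyrillic"], "default": None},
}

print(needs_explicit_script(sample))
print(count_combinations(sample))
```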

Citation

If you use this package in your research, please cite the XPF Corpus:

@Manual{XPF2021manual,
  author = {Cohen Priva, Uriel and Strand, Emily and Yang, Shiying and Mizgerd, William and Creighton, Abigail and Bai, Justin and Mathew, Rebecca and Shao, Allison and Schuster, Jordan and Wiepert, Daniela},
  title = {The Cross-linguistic Phonological Frequencies (XPF) Corpus manual},
  year = {2021},
  note = {Accessible online, \url{https://cohenpr-xpf.github.io/XPF/manual/xpf_manual.pdf}}
}

License

MIT
