XPF Corpus grapheme-to-phoneme transcriber

xpfcorpus

A Python package for grapheme-to-phoneme transcription based on the XPF Corpus.

Documentation: xpfcorpus.readthedocs.io

Installation

pip install xpfcorpus

Python API

from xpfcorpus import Transcriber, available_languages

# Basic usage - languages with a default script
es = Transcriber("es")
es.transcribe("ejemplo")  # ['e', 'x', 'e', 'm', 'p', 'l', 'o']

# Languages with multiple scripts require explicit script choice
tt_latin = Transcriber("tt", "latin")
tt_cyrillic = Transcriber("tt", "cyrillic")

# BCP-47 style language codes with script/region
es_es = Transcriber("es-ES")  # Region treated as a variant, uses default script
yi = Transcriber("yi-Latn")   # Script extracted from code
tt = Transcriber("tt-cyrillic")  # Script name in code
zh = Transcriber("zh-Hans-CN")  # Script extracted, region treated as a variant

# Explicit script parameter overrides code
yi = Transcriber("yi-Latn", script="hebrew")  # Uses hebrew, not latin

# Skip verification on load
es = Transcriber("es", verify=False)

# List available languages
available_languages()
# {"es": {"scripts": ["latin"], "default": "latin"},
#  "tt": {"scripts": ["latin", "cyrillic"], "default": None}, ...}
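
The dictionary shape shown above can be used to find the languages that need an explicit script. The following is a minimal sketch that works on a hand-written sample copied from the comment above rather than on the real return value; the helper name needs_explicit_script is hypothetical and not part of the package.

```python
# Sample shaped like the available_languages() output shown above.
# It is written by hand so the sketch runs without the package installed.
catalog = {
    "es": {"scripts": ["latin"], "default": "latin"},
    "tt": {"scripts": ["latin", "cyrillic"], "default": None},
}


def needs_explicit_script(catalog):
    """Return the codes whose entry has no default script (hypothetical helper)."""
    return sorted(code for code, info in catalog.items() if info["default"] is None)


print(needs_explicit_script(catalog))  # ['tt']
```

With the real package, the same helper would presumably accept the result of available_languages() directly, assuming it has the shape documented above.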

Language Code Format

The package supports BCP-47 style language codes:

  • Simple codes: "es", "tt", "yi"
  • With region (variants): "es-ES", "en-US" → treated as language variants
  • With script (extracted): "yi-Latn", "tt-Cyrl" → extracts script
  • Script names (extracted): "tt-cyrillic", "yi-latin" → extracts script
  • Complex codes: "zh-Hans-CN" → extracts "hans" script, preserves "CN" region

When both a script in the code and an explicit script parameter are provided, the explicit parameter takes precedence.
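
The rules above can be illustrated with a small parser. This is a sketch under stated assumptions, not the package's actual implementation: the function parse_language_code and the SCRIPT_ALIASES table are hypothetical, and the table only covers script names that appear in this README.

```python
# Hypothetical alias table: maps script subtags to script names.
# Only names mentioned in this README are included.
SCRIPT_ALIASES = {
    "latn": "latin",
    "latin": "latin",
    "cyrl": "cyrillic",
    "cyrillic": "cyrillic",
    "hans": "hans",
}


def parse_language_code(code, script=None):
    """Split a BCP-47 style code into (language, script, region)."""
    language, *subtags = code.split("-")
    found_script = None
    region = None
    for tag in subtags:
        key = tag.lower()
        if key in SCRIPT_ALIASES and found_script is None:
            found_script = SCRIPT_ALIASES[key]
        elif len(tag) == 2 and tag.isalpha():
            # Assumption: a two-letter subtag is a region code.
            region = tag.upper()
    # An explicit script argument takes precedence over the script in the code.
    return language.lower(), (script or found_script), region


print(parse_language_code("zh-Hans-CN"))                # ('zh', 'hans', 'CN')
print(parse_language_code("yi-Latn", script="hebrew"))  # ('yi', 'hebrew', None)
```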

Language Variants

Region codes are treated as language variants. The package will:

  1. Try to load variant-specific data (e.g., es-ES.json) if available
  2. Fall back to base language (e.g., es.json) with a warning if variant not found
  3. Store the variant information in the variant property (only if variant file exists)

# Base language
es = Transcriber("es")
print(es.variant)  # None

# Variant request (falls back to base with warning if es-ES.json doesn't exist)
es_es = Transcriber("es-ES")
print(es_es.variant)  # None (because es-ES.json doesn't exist, fell back to es.json)

# If es-ES.json existed, then:
# es_es.variant would be "ES"

Behavior:

  • If variant file exists: variant property returns the region code (e.g., "ES")
  • If variant file doesn't exist: falls back to base language, variant is None

To create a variant, add a JSON file with variant-specific rules, such as es-ES.json, to the xpfcorpus/data/languages/ directory.
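
The fallback order described in this section can be sketched as follows. The function resolve_rules_file and its available_files argument are hypothetical and only mirror the documented behavior; the package's own loading code may differ.

```python
import warnings


def resolve_rules_file(language, region, available_files):
    """Return (filename, variant) using the variant-then-base fallback order."""
    if region is not None:
        candidate = f"{language}-{region}.json"
        if candidate in available_files:
            # Variant file exists: use it and report the region as the variant.
            return candidate, region
        # Variant file is missing: warn, then fall back to the base language.
        warnings.warn(
            f"No variant data for {language}-{region}; using {language}.json"
        )
    return f"{language}.json", None


print(resolve_rules_file("es", "ES", {"es.json", "es-ES.json"}))  # ('es-ES.json', 'ES')
```

Calling the same function with only es.json available returns ('es.json', None) and emits a warning, which matches the variant property being None after a fallback.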

Command-Line Interface

# Transcribe words
xpfcorpus transcribe es ejemplo hola mundo

# Transcribe from file (extracts first word from each line)
xpfcorpus transcribe es -f words.txt

# Transcribe from stdin
printf 'mundo\nbueno\n' | xpfcorpus transcribe es
cat words.txt | xpfcorpus transcribe es -f -

# Combine command-line words and file
xpfcorpus transcribe es ejemplo hola -f more_words.txt

# List available languages
xpfcorpus list
xpfcorpus list --json

# Export language rules as YAML
xpfcorpus export es
xpfcorpus export es -o spanish.yaml

# Verify language rules
xpfcorpus verify es -v
xpfcorpus verify --all
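
The file mode above is described as taking the first word from each line. A rough Python equivalent of that preprocessing step is sketched below; the helper first_words is hypothetical, and splitting on whitespace and skipping blank lines are assumptions rather than documented CLI behavior.

```python
def first_words(lines):
    """Return the first whitespace-separated token of each non-empty line."""
    words = []
    for line in lines:
        parts = line.split()
        if parts:  # Assumption: blank lines are skipped.
            words.append(parts[0])
    return words


print(first_words(["mundo\n", "bueno good\n", "\n"]))  # ['mundo', 'bueno']
```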

Supported Languages

The package includes rules for 201 languages with 203 language/script combinations. Some languages have multiple scripts:

  • iu (Inuktitut): latin, syllabics
  • tt (Tatar): latin, cyrillic

Use xpfcorpus list or available_languages() for the full list.
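
The two counts differ because each multi-script language contributes one combination per script: 199 single-script languages plus 2 languages with 2 scripts each gives 199 + 2 × 2 = 203. A minimal sketch of enumerating the combinations from a dictionary shaped like the available_languages() output follows; the helper script_pairs and the two-entry sample are illustrative only.

```python
def script_pairs(catalog):
    """List every (language, script) combination in sorted order."""
    return sorted(
        (code, script)
        for code, info in catalog.items()
        for script in info["scripts"]
    )


# Hand-written sample in the documented shape, not the full language list.
sample = {
    "es": {"scripts": ["latin"], "default": "latin"},
    "tt": {"scripts": ["latin", "cyrillic"], "default": None},
}

print(script_pairs(sample))
# [('es', 'latin'), ('tt', 'cyrillic'), ('tt', 'latin')]
```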

Citation

If you use this package in your research, please cite the XPF Corpus:

@misc{xpf_corpus,
  title={The Cross-linguistic Phonological Frequencies (XPF) Corpus},
  author={Cohen Priva, Uriel and Gleason, Emily},
  year={2022},
  url={https://cohenpr-xpf.github.io/XPF/}
}

License

MIT
