XPF Corpus grapheme-to-phoneme transcriber
xpfcorpus
A Python package for grapheme-to-phoneme transcription based on the XPF Corpus.
Documentation: xpfcorpus.readthedocs.io
XPF Corpus Resources:
Installation
pip install xpfcorpus
Python API
from xpfcorpus import Transcriber, available_languages
# Basic usage - languages with a default script
es = Transcriber("es")
es.transcribe("ejemplo") # ['e', 'x', 'e', 'm', 'p', 'l', 'o']
# Languages with multiple scripts require explicit script choice
tt_latin = Transcriber("tt", "latin")
tt_cyrillic = Transcriber("tt", "cyrillic")
# BCP-47 style language codes with script/region
es_es = Transcriber("es-ES") # Region handled as a variant (falls back to es if no es-ES data), uses default script
yi = Transcriber("yi-Latn") # Script extracted from code
tt = Transcriber("tt-cyrillic") # Script name in code
zh = Transcriber("zh-Hans-CN") # Script extracted, region stripped
# Explicit script parameter overrides code
yi = Transcriber("yi-Latn", script="hebrew") # Uses hebrew, not latin
# Skip verification on load
es = Transcriber("es", verify=False)
# List available languages
available_languages()
# {"es": {"scripts": ["latin"], "default": "latin"},
# "tt": {"scripts": ["latin", "cyrillic"], "default": None}, ...}
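Because languages such as "tt" report a default of None, it can help to resolve the script before constructing a Transcriber. The following is a minimal sketch, not part of the package: pick_script and SAMPLE_CATALOG are hypothetical names, and the dictionary shape is assumed to match the available_languages() output shown above.

```python
from typing import Optional

# Sample shaped like the available_languages() output shown above;
# the real catalog comes from the package.
SAMPLE_CATALOG = {
    "es": {"scripts": ["latin"], "default": "latin"},
    "tt": {"scripts": ["latin", "cyrillic"], "default": None},
}


def pick_script(catalog: dict, language: str, script: Optional[str] = None) -> str:
    """Hypothetical helper: choose a script before constructing a Transcriber."""
    if language not in catalog:
        raise KeyError(f"unknown language: {language!r}")
    entry = catalog[language]
    if script is not None:
        # An explicit request must be one of the listed scripts.
        if script not in entry["scripts"]:
            raise ValueError(f"{language!r} has no {script!r} script")
        return script
    if entry["default"] is None:
        # Multi-script languages need an explicit choice.
        raise ValueError(
            f"{language!r} has no default script; choose one of {entry['scripts']}"
        )
    return entry["default"]


print(pick_script(SAMPLE_CATALOG, "es"))              # latin
print(pick_script(SAMPLE_CATALOG, "tt", "cyrillic"))  # cyrillic
```

Calling pick_script(SAMPLE_CATALOG, "tt") without a script raises ValueError in this sketch, mirroring the requirement that multi-script languages need an explicit choice.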
Language Code Format
The package supports BCP-47 style language codes:
- Simple codes: "es", "tt", "yi"
- With region (variants): "es-ES", "en-US" → treated as language variants
- With script (extracted): "yi-Latn", "tt-Cyrl" → extracts script
- Script names (extracted): "tt-cyrillic", "yi-latin" → extracts script
- Complex codes: "zh-Hans-CN" → extracts "hans" script, preserves "CN" region
When both a script in the code and an explicit script parameter are provided, the explicit parameter takes precedence.
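The rules above can be sketched as a small parser. This is an illustration only, not the package's implementation: parse_language_code, ParsedCode, and SCRIPT_ALIASES are hypothetical names, and the sketch assumes that a two-letter or three-digit subtag is a region and that any other subtag names a script.

```python
from typing import NamedTuple, Optional

# Assumed aliases from four-letter script subtags to script names;
# the package's real table may differ.
SCRIPT_ALIASES = {"latn": "latin", "cyrl": "cyrillic"}


class ParsedCode(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]


def parse_language_code(code: str, script: Optional[str] = None) -> ParsedCode:
    """Hypothetical parser illustrating the rules above."""
    language, *subtags = code.split("-")
    found_script = None
    region = None
    for tag in subtags:
        # Assumption: two letters or three digits mark a region subtag.
        if (len(tag) == 2 and tag.isalpha()) or (len(tag) == 3 and tag.isdigit()):
            region = tag.upper()
        elif found_script is None:
            lowered = tag.lower()
            found_script = SCRIPT_ALIASES.get(lowered, lowered)
    # An explicit script argument takes precedence over the code.
    if script is not None:
        found_script = script.lower()
    return ParsedCode(language.lower(), found_script, region)


print(parse_language_code("zh-Hans-CN"))
# ParsedCode(language='zh', script='hans', region='CN')
print(parse_language_code("yi-Latn", script="hebrew"))
# ParsedCode(language='yi', script='hebrew', region=None)
```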
Language Variants
Region codes are treated as language variants. The package will:
- Try to load variant-specific data (e.g., es-ES.json) if available
- Fall back to the base language (e.g., es.json) with a warning if the variant is not found
- Store the variant information in the variant property (only if the variant file exists)
# Base language
es = Transcriber("es")
print(es.variant) # None
# Variant request (falls back to base with warning if es-ES.json doesn't exist)
es_es = Transcriber("es-ES")
print(es_es.variant) # None (because es-ES.json doesn't exist, fell back to es.json)
# If es-ES.json existed, then:
# es_es.variant would be "ES"
Behavior:
- If the variant file exists: the variant property returns the region code (e.g., "ES")
- If the variant file doesn't exist: falls back to the base language, and variant is None
To create a variant, add a JSON file like es-ES.json to the xpfcorpus/data/languages/ directory with variant-specific rules.
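The fallback described in this section can be sketched with the standard library alone. The function below is a hypothetical stand-in, not the package's loader: resolve_rules_file is an invented name, and the sketch only assumes the file naming shown above (es.json, es-ES.json).

```python
import tempfile
import warnings
from pathlib import Path
from typing import Optional, Tuple


def resolve_rules_file(
    data_dir: Path, language: str, region: Optional[str] = None
) -> Tuple[Path, Optional[str]]:
    """Hypothetical lookup: return (rules file, variant) with the fallback above."""
    if region:
        variant_file = data_dir / f"{language}-{region}.json"
        if variant_file.exists():
            return variant_file, region
        # Variant data is missing: warn and fall back to the base language.
        warnings.warn(f"No rules for {language}-{region}; falling back to {language}")
    return data_dir / f"{language}.json", None


# Demonstration in a temporary directory that holds only base Spanish rules.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "es.json").write_text("{}")
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        path, variant = resolve_rules_file(data_dir, "es", "ES")
    print(path.name, variant, len(caught))  # es.json None 1
```

If es-ES.json were present in the directory, the same call would return that file together with "ES" and emit no warning.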
Command-Line Interface
# Transcribe words
xpfcorpus transcribe es ejemplo hola mundo
# Transcribe from file (extracts first word from each line)
xpfcorpus transcribe es -f words.txt
# Transcribe from stdin
echo -e "mundo\nbueno" | xpfcorpus transcribe es
cat words.txt | xpfcorpus transcribe es -f -
# Combine command-line words and file
xpfcorpus transcribe es ejemplo hola -f more_words.txt
# List available languages
xpfcorpus list
xpfcorpus list --json
# Export language rules as YAML
xpfcorpus export es
xpfcorpus export es -o spanish.yaml
# Verify language rules
xpfcorpus verify es -v
xpfcorpus verify --all
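The -f option is described above as taking the first word from each line, which suits inputs such as frequency lists. A rough Python equivalent of that extraction is sketched here. It is an assumption about the behavior, not the CLI's code: first_words is a hypothetical helper, and splitting on whitespace and skipping blank lines are guesses.

```python
from typing import Iterable, List


def first_words(lines: Iterable[str]) -> List[str]:
    """Hypothetical helper: keep the first whitespace-separated token per line."""
    words = []
    for line in lines:
        tokens = line.split()
        if tokens:  # skip blank lines
            words.append(tokens[0])
    return words


print(first_words(["mundo 1520", "", "bueno\t987"]))  # ['mundo', 'bueno']
```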
Supported Languages
The package includes rules for 201 languages with 203 language/script combinations. Some languages have multiple scripts:
- iu (Inuktitut): latin, syllabics
- tt (Tatar): latin, cyrillic
Use xpfcorpus list or available_languages() for the full list.
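To see how language and language/script totals relate, the catalog can be summarized with a few lines of Python. This is a sketch under the assumption that the catalog has the shape shown for available_languages(); summarize_catalog and the three-entry sample are illustrative, not part of the package.

```python
from typing import Dict, List, Tuple


def summarize_catalog(catalog: dict) -> Tuple[int, int, Dict[str, List[str]]]:
    """Hypothetical helper: count languages, combinations, and multi-script entries."""
    languages = len(catalog)
    combinations = sum(len(entry["scripts"]) for entry in catalog.values())
    multi_script = {
        code: entry["scripts"]
        for code, entry in catalog.items()
        if len(entry["scripts"]) > 1
    }
    return languages, combinations, multi_script


# Small sample; the full catalog would come from available_languages().
sample = {
    "es": {"scripts": ["latin"], "default": "latin"},
    "iu": {"scripts": ["latin", "syllabics"], "default": None},
    "tt": {"scripts": ["latin", "cyrillic"], "default": None},
}
print(summarize_catalog(sample))
# (3, 5, {'iu': ['latin', 'syllabics'], 'tt': ['latin', 'cyrillic']})
```

Passing the result of available_languages() instead of the sample should give totals comparable to those quoted above, provided the return shape matches the earlier example.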
Citation
If you use this package in your research, please cite the XPF Corpus:
@misc{xpf_corpus,
title={The Cross-linguistic Phonological Frequencies (XPF) Corpus},
author={Cohen Priva, Uriel and Gleason, Emily},
year={2022},
url={https://cohenpr-xpf.github.io/XPF/}
}
License
MIT
File details
Details for the file xpfcorpus-0.1.0.tar.gz.
File metadata
- Download URL: xpfcorpus-0.1.0.tar.gz
- Upload date:
- Size: 306.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 316c7559adc7d8b2f6bc3b9c3420e6fbb8b055a7423ebb627251759bd65a99bd |
| MD5 | 77fea832f6a4ce1814b1404b5116c435 |
| BLAKE2b-256 | af3521adbbb3878864d0ecfd5a2f158ffe89f611e4742d0a33cb377dac5a6620 |
File details
Details for the file xpfcorpus-0.1.0-py3-none-any.whl.
File metadata
- Download URL: xpfcorpus-0.1.0-py3-none-any.whl
- Upload date:
- Size: 417.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b10ff34a42b2ea520bc8a990522d428e85badfc5b438c9a09a91756d11fbeea5 |
| MD5 | c5de2d119e0cbf6a0f4a1d79991fe2c6 |
| BLAKE2b-256 | 8f420c36b40bcc62f1bc0c4550ac4d198df6586a3ccea55c19515aedd65d92da |