Linguistically motivated grapheme-to-IPA and allophone mappings for 350+ language codes

orthography2ipa

Linguistically motivated grapheme→IPA and allophone mappings for 350+ language codes across 20+ language families — pure data, a maximal-munch IPA tokenizer, and a family of phonological/script distance metrics, with no trained weights to ship.

Only mappings grounded in official orthography and documented grammar are included. Arbitrary substring rules are excluded.

Why two maps

The central distinction the package enforces:

  • A grapheme map tells you which phonemes a spelling can represent. English ⟨th⟩ → ['θ', 'ð'].
  • An allophone map tells you how a phoneme surfaces in context. English /t/ → ['t', 'tʰ', 'ɾ', 'ʔ', 't̚'].

Keeping these separate lets you go from text to phoneme candidates (transcription) and from phonemes to surface realisations (pronunciation modelling) without conflating the two.
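
A minimal sketch of the distinction, using plain dictionaries rather than the package's own objects. The data is limited to the two English entries quoted above, plus an assumed ⟨t⟩ → /t/ grapheme entry; the function name surface_candidates is hypothetical and not part of the library.

```python
# Toy data based on the two English examples above; not the full en-GB spec.
# The "t" grapheme entry is an assumption added so the two maps can be chained.
GRAPHEMES = {"th": ["θ", "ð"], "t": ["t"]}
ALLOPHONES = {"t": ["t", "tʰ", "ɾ", "ʔ", "t̚"]}


def surface_candidates(grapheme: str) -> dict[str, list[str]]:
    """Map a spelling to phonemes, then each phoneme to its surface forms.

    Phonemes without an allophone entry fall back to themselves.
    """
    return {
        phoneme: ALLOPHONES.get(phoneme, [phoneme])
        for phoneme in GRAPHEMES.get(grapheme, [])
    }


print(surface_candidates("th"))  # two phoneme candidates, no allophone data
print(surface_candidates("t"))   # one phoneme, expanded into its allophones
```

The first lookup answers a transcription question (which phonemes can this spelling be?), the second a pronunciation-modelling question (how can this phoneme surface?).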

What each language carries

Every LanguageSpec provides:

  1. Graphemes — orthographic units (characters, digraphs, trigraphs) mapped to canonical IPA phonemes.
  2. Allophones — each phoneme mapped to its positional/contextual surface realisations.
  3. Positional graphemes — context-sensitive overrides (word-initial, intervocalic, before /i/, …).
  4. Ancestry — weighted multi-ancestor lineage (parent, substrate, superstrate, adstrate, …) for dialect trees.
  5. Sandhi rules — cross-word phonological processes.
  6. Tone inventory — tone marks → labels, where applicable.
  7. Provenance: QualityTier (stub → skeleton → research → production), ScriptType, and bibliographic sources.

Regional varieties get their own LanguageSpec objects linked through ancestry, and JSON data files support graphemes_base/allophones_base inheritance so a dialect only declares what differs from its parent.
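
The inheritance idea can be sketched as a recursive dictionary merge. Everything here except the graphemes_base key is an assumption: the JSON layout, the choice of pt-PT as the base of pt-BR, the child-overrides-parent merge rule, and the helper name resolve_graphemes are all illustrative, and the real schema is the one in SCHEMA.md.

```python
import json

# Hypothetical file contents; only the "graphemes_base" key comes from the text.
PARENT_JSON = '{"code": "pt-PT", "graphemes": {"t": ["t"], "lh": ["ʎ"]}}'
DIALECT_JSON = (
    '{"code": "pt-BR", "graphemes_base": "pt-PT", '
    '"graphemes": {"t": ["t", "t͡ʃ"]}}'
)


def resolve_graphemes(code: str, files: dict[str, dict]) -> dict[str, list[str]]:
    """Walk the graphemes_base chain, letting each child override its parent."""
    spec = files[code]
    base = spec.get("graphemes_base")
    merged = resolve_graphemes(base, files) if base else {}
    merged.update(spec.get("graphemes", {}))
    return merged


files = {d["code"]: d for d in map(json.loads, (PARENT_JSON, DIALECT_JSON))}
print(resolve_graphemes("pt-BR", files))
```

The dialect file declares a single key, yet the resolved map still carries the parent's ⟨lh⟩ entry.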

Installation

pip install orthography2ipa

For the optional Arabic G2P backend:

pip install orthography2ipa[arabic]

Quick start

Python API

import orthography2ipa

# Get a language spec
en = orthography2ipa.get("en-GB")

# Grapheme → IPA candidates
en.graphemes["th"]    # ['θ', 'ð']

# Allophone map: how /t/ surfaces
en.allophones["t"]    # ['t', 'tʰ', 'ɾ', 'ʔ', 't̚']

# Metadata
en.name               # 'British English (RP)'
en.family             # 'Germanic'
en.script             # 'Latin'

# Regional variants share ancestry but diverge where pronunciation does
pt_br = orthography2ipa.get("pt-BR")
pt_br.graphemes["t"]  # ['t', 't͡ʃ']   — palatalisation before /i/

# ISO 639-3 aliases resolve to BCP-47 codes
orthography2ipa.get("eng").name   # 'British English (RP)'

# Discover what's available
orthography2ipa.available_codes()
orthography2ipa.available_families()

IPA tokenizer

PhonetokTokenizer performs maximal-munch grapheme tokenization with beam-search IPA expansion, ranking candidate transcriptions when a spelling is ambiguous:

from orthography2ipa import get
from orthography2ipa.phonetok import PhonetokTokenizer

tok = PhonetokTokenizer(get("en-GB"))

tok.ipa_best("through")              # 'θɹɔː'
for path in tok.ipa_beam("through", beam_width=8):
    print(path.ipa, path.score)      # θɹɔː 0.0, ðɹɔː 1.0, θɹoʊ 1.0, …
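
For readers unfamiliar with the two techniques named above, here is a self-contained sketch of maximal-munch splitting followed by a beam-limited expansion. It does not use PhonetokTokenizer and is not its implementation: the toy inventory, the function names, and the cost model (a candidate costs its position in the list) are assumptions made for illustration.

```python
import heapq

# Toy inventory for illustration only; these candidate lists are assumptions,
# not the package's en-GB data.
GRAPHEMES = {
    "th": ["θ", "ð"],
    "r": ["ɹ"],
    "ough": ["uː", "oʊ"],
    "t": ["t"],
    "h": ["h"],
}


def maximal_munch(word: str, graphemes: dict[str, list[str]]) -> list[str]:
    """Split a word by always taking the longest grapheme that matches."""
    longest = max(map(len, graphemes))
    tokens, i = [], 0
    while i < len(word):
        for size in range(min(longest, len(word) - i), 0, -1):
            if word[i:i + size] in graphemes:
                tokens.append(word[i:i + size])
                i += size
                break
        else:
            raise ValueError(f"no grapheme matches at position {i} in {word!r}")
    return tokens


def beam_expand(tokens, graphemes, beam_width=8):
    """Keep only the beam_width cheapest partial transcriptions per token.

    Assumed cost model: a phoneme costs its index in the candidate list,
    so first-listed phonemes are preferred.
    """
    beam = [(0.0, "")]
    for token in tokens:
        extended = [
            (score + rank, ipa + phoneme)
            for score, ipa in beam
            for rank, phoneme in enumerate(graphemes[token])
        ]
        beam = heapq.nsmallest(beam_width, extended)
    return beam


tokens = maximal_munch("through", GRAPHEMES)
print(tokens)
for score, ipa in beam_expand(tokens, GRAPHEMES):
    print(ipa, score)
```

Greedy longest-match keeps ⟨th⟩ and ⟨ough⟩ intact instead of splitting them into single letters, and the beam bounds how many ambiguous readings are carried forward at each step.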

Distance metrics

Compare two languages across inventory, grapheme, allophone, and ancestry dimensions:

from orthography2ipa import get
from orthography2ipa.distance import phonological_distance

d = phonological_distance(get("pt-BR"), get("pt-PT"))
d.combined                    # 0.04 — near-identical
d.inventory.feature_mean      # phoneme-inventory distance
d.grapheme.mean_ipa_distance  # grapheme-mapping divergence
d.allophone_sim               # allophone-overlap similarity

Script-level distance and feature vectors are provided by the script_distance.py and feats.py modules.
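
To give a feel for what an inventory-level comparison measures, the sketch below computes a plain Jaccard distance between two phoneme sets. This is a deliberately simpler stand-in, not the feature-based metric behind d.inventory.feature_mean; the helper names and both toy grapheme maps are invented for the example.

```python
def inventory(graphemes: dict[str, list[str]]) -> set[str]:
    """Collect every phoneme that any grapheme can produce."""
    return {phoneme for candidates in graphemes.values() for phoneme in candidates}


def jaccard_distance(a: set[str], b: set[str]) -> float:
    """Return 1 - |a ∩ b| / |a ∪ b|; 0.0 means identical inventories."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Toy grapheme maps, invented for the example.
lang_a = {"t": ["t", "t͡ʃ"], "s": ["s"], "lh": ["ʎ"]}
lang_b = {"t": ["t"], "s": ["s", "ʃ"], "lh": ["ʎ"]}

print(jaccard_distance(inventory(lang_a), inventory(lang_b)))
```

Here the two inventories share three phonemes out of five in total. A set-based measure treats /t͡ʃ/ and /ʃ/ as simply different, whereas a feature-based distance can credit them for being phonetically close.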

Command-line interface

After installation the orthography2ipa command is available. Every subcommand accepts --json for machine-readable output.

# List languages and families
orthography2ipa list
orthography2ipa list --families
orthography2ipa list --family Romance

# Inspect a language
orthography2ipa info pt-BR
orthography2ipa info pt-BR --graphemes
orthography2ipa info pt-BR --json

# Transcribe text to IPA (beam-ranked candidates)
orthography2ipa transcribe pt-BR "chuva"
orthography2ipa transcribe en-GB "through" --beam 8

# Phonological distance between two languages
orthography2ipa distance pt-BR pt-PT
orthography2ipa distance es-ES it-IT --json
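
Because every subcommand accepts --json, the CLI can be driven from a script. The sketch below uses only the distance subcommand and --json flag shown above; the helper names are hypothetical, and since the shape of the JSON output is not documented here, the code decodes it without assuming any keys.

```python
import json
import shutil
import subprocess


def distance_command(code_a: str, code_b: str) -> list[str]:
    """Build the argv for the distance subcommand with --json."""
    return ["orthography2ipa", "distance", code_a, code_b, "--json"]


def run_json(argv: list[str]):
    """Run a command and decode its stdout as JSON; None if it is not on PATH."""
    if shutil.which(argv[0]) is None:
        return None
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


report = run_json(distance_command("pt-BR", "pt-PT"))
print(report)  # None when the CLI is not installed
```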

Languages

Family      Examples
Romance     pt-PT, pt-BR, es-ES, es-AR, ca, fr-FR, it-IT, ro-RO, gl, oc, sc, an
Germanic    en-GB, de-DE, nl-NL, sv-SE, da-DK, no-NO, af
Slavic      ru-RU, uk-UA, pl-PL, cs-CZ, sr-RS, hr-HR, bg-BG
Celtic      cy, ga, gd, br, kw, gv
Indo-Aryan  hi-IN, bn-BD, ur-PK, ne-NP, pa, gu, mr
Semitic     arb, he-IL, mt
Turkic      tr-TR, az, kk, uz
Hellenic    el-GR
Uralic      fi-FI, hu-HU, et-EE
Japonic     ja
Sinitic     zh
Koreanic    ko

350+ codes across 40+ family groupings, including reconstructed proto-languages and fine-grained regional dialects.

Data structure

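from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Ancestor, QualityTier, ScriptType, SandhiRule and LinguisticSource are assumed
# to be the package's own types, defined elsewhere; Dict[...] is left elided.
@dataclass(frozen=True)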
class LanguageSpec:
    code: str                              # 'pt-BR'
    name: str                              # 'Brazilian Portuguese'
    family: str                            # 'Romance'
    script: str                            # 'Latin'
    graphemes: Dict[str, List[str]]        # 'th' → ['θ', 'ð']
    allophones: Dict[str, List[str]]       # 't' → ['t', 'tʰ', 'ɾ', 'ʔ', 't̚']
    positional_graphemes: Dict[...]        # context-sensitive overrides
    parent: Optional[str]                  # primary parent code
    ancestors: Tuple[Ancestor, ...]        # weighted multi-ancestor lineage
    quality: QualityTier                   # stub | skeleton | research | production
    script_type: ScriptType                # alphabet | abjad | abugida | ...
    sandhi_rules: Tuple[SandhiRule, ...]   # cross-word rules
    tone_inventory: Optional[Dict]         # tone marks → labels
    sources: Tuple[LinguisticSource, ...]  # bibliographic references

When a spec declares graphemes but no explicit allophone map, a baseline identity allophone map is derived: every phoneme a grapheme can produce is, at minimum, its own surface realisation.
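
The derivation rule can be sketched in a few lines. The function name baseline_allophones is hypothetical and the input is a toy map; the logic only illustrates the rule stated above, not the package's actual code.

```python
def baseline_allophones(graphemes: dict[str, list[str]]) -> dict[str, list[str]]:
    """Give each phoneme reachable from any grapheme itself as its only allophone."""
    return {
        phoneme: [phoneme]
        for candidates in graphemes.values()
        for phoneme in candidates
    }


print(baseline_allophones({"th": ["θ", "ð"], "t": ["t"]}))
```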

Design principles

  • Linguistically motivated only — digraphs like English ⟨th⟩, Portuguese ⟨lh⟩, or German ⟨sch⟩ are included because they are standard orthographic units; arbitrary substrings are not.
  • Graphemes ≠ allophones — spelling-to-phoneme and phoneme-to-surface are modelled separately.
  • Regional variants — where pronunciation diverges systematically, a separate LanguageSpec is provided with ancestry links.
  • Multi-ancestor inheritance: graphemes_base/allophones_base let dialect trees declare only their differences.
  • Pure data, pluggable logic — mappings are declarative JSON; algorithmic G2P (e.g. Arabic) uses the plugin system.

Plugins

Algorithmic G2P backends register under the orthography2ipa.g2p entry-point group. The bundled Arabic plugin (plugins/arabic_g2p.py) handles consonant mapping, harakat vowels, sun-letter assimilation, hamzat al-wasl elision, and tanwin forms.

A neural Arabic diacritizer (plugins/tashkeel.py) is wired as an optional ONNX backend but ships as a documented stub: with no model loaded it returns input unchanged, and the rule-based plugin transcribes whatever diacritics are present. Bundling a tashkeel model is planned future work.

Contributing

To add a language, create orthography2ipa/data/{code}.json following orthography2ipa/data/SCHEMA.md. For dialects, use graphemes_base/allophones_base to inherit from the parent.

License

Apache 2.0

Download files

Source Distribution

orthography2ipa-0.4.0a1.tar.gz (529.6 kB)

Uploaded Source

Built Distribution

orthography2ipa-0.4.0a1-py3-none-any.whl (566.3 kB)

Uploaded Python 3

File details

Details for the file orthography2ipa-0.4.0a1.tar.gz.

File metadata

  • Download URL: orthography2ipa-0.4.0a1.tar.gz
  • Size: 529.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for orthography2ipa-0.4.0a1.tar.gz
Algorithm    Hash digest
SHA256       44838970cda3c6e01ea0922d98aec64b0329cee79fda56b75c081643fd68c00b
MD5          82d0648d5b3f9611ac70b553c52aeee7
BLAKE2b-256  112bee44cb0b83040df30acbbb9a3c2473d2e975bba68a9f1a19950fd8156287

File details

Details for the file orthography2ipa-0.4.0a1-py3-none-any.whl.

File hashes

Hashes for orthography2ipa-0.4.0a1-py3-none-any.whl
Algorithm    Hash digest
SHA256       efe8a98e424f5c7cd4f616313953c3102171e7fa98ce4eb94515c1186d48ba1c
MD5          d07f40983b6eb1db610e5560bb7f6ee1
BLAKE2b-256  4e686da6718e0c3684c4156ec9c2cefdfdaf77515c7e470e4d11f8824a1ff36f
