
Normalize any language identifier to canonical ISO 639-3 + ISO 15924 form


babelcode

Normalize any language identifier — ISO 639-1 (en), ISO 639-3 (eng), BCP-47 (zh-Hans), NLLB-style (eng_Latn), WikiPron filenames (wp_eng_latn_us), CHILDES corpus names (EnglishNA), or plain English (German) — into a single canonical form:

{iso_639_3}_{iso_15924}

For example: eng_Latn, arb_Arab, cmn_Hans, jpn_Jpan.

Review the mapping decisions used in this project.

Installation

pip install babelcode

Quick start

from babelcode import BabelCode

bc = BabelCode()

bc.normalize("en")          # → "eng_Latn"
bc.normalize("zh-Hans")     # → "cmn_Hans"
bc.normalize("ar")          # → "arb_Arab"  (macrolanguage → preferred individual)
bc.normalize("wp_deu_latn") # → "deu_Latn"  (WikiPron format)
bc.normalize("Farsi")       # → "pes_Arab"  (English name)
bc.normalize("EnglishNA")   # → "eng_Latn"  (CHILDES corpus name)

Singleton shortcut

from babelcode import get_instance

bc = get_instance()  # cached singleton — same instance every call

Accessors

bc.iso639_3("eng_Latn")        # → "eng"
bc.script("eng_Latn")          # → "Latn"
bc.bcp47("eng_Latn")           # → "en"
bc.name("arb")                 # → "Standard Arabic"
bc.scripts("srp")              # → ["Cyrl", "Latn"]
bc.is_macrolanguage("ara")     # → True
bc.macro_members("ara")        # → ["arb", "arz", ...]

Script detection from text

from babelcode import detect_script

detect_script("مرحبا")     # → "Arab"
detect_script("Привет")    # → "Cyrl"
detect_script("こんにちは")  # → "Jpan"
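
As a rough illustration of what Unicode-range script detection involves, here is a minimal sketch. It is not babelcode's implementation: the function name guess_script, the range table, and the majority-vote rule are all assumptions made for this example, and the table covers only a handful of scripts.

```python
from collections import Counter
from typing import Optional

# Illustrative code-point ranges only (hypothetical table, far from complete).
_RANGES = [
    (0x0041, 0x024F, "Latn"),  # Basic Latin letters through Latin Extended-B
    (0x0400, 0x04FF, "Cyrl"),  # Cyrillic block
    (0x0600, 0x06FF, "Arab"),  # Arabic block
    (0x3040, 0x30FF, "Jpan"),  # Hiragana and Katakana blocks
]


def guess_script(text: str) -> Optional[str]:
    """Return the ISO 15924 code of the most frequent known script, or None."""
    counts: Counter = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # ignore digits, punctuation, and whitespace
        cp = ord(ch)
        for low, high, script in _RANGES:
            if low <= cp <= high:
                counts[script] += 1
                break
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

With this table, guess_script("Привет") yields "Cyrl" and a string with no letters yields None. Mixed-script text is settled by a simple majority of letters.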

Batch operations

bc.normalize_list(["en", "de", "unknown", "fr"])
# → ["eng_Latn", "deu_Latn", None, "fra_Latn"]

bc.build_mapping(
    source_codes=["en", "de"],
    target_codes=["eng_Latn", "deu_Latn", "fra_Latn"],
)
# → {"en": "eng_Latn", "de": "deu_Latn"}
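
One plausible reading of build_mapping, inferred only from the output shown above, is "normalize each source code and keep it when the result appears among the targets". The sketch below illustrates that reading with a stand-in normalizer. The names build_mapping_sketch and toy_lookup are hypothetical, and the real method may behave differently (for example, it may also normalize the target codes).

```python
from typing import Callable, Dict, Iterable, Optional

# Stand-in for a real normalizer: a three-entry lookup table (hypothetical).
_TOY_TABLE = {"en": "eng_Latn", "de": "deu_Latn", "fr": "fra_Latn"}


def toy_lookup(code: str) -> Optional[str]:
    """Return a canonical code, or None for unknown input."""
    return _TOY_TABLE.get(code)


def build_mapping_sketch(
    source_codes: Iterable[str],
    target_codes: Iterable[str],
    normalize: Callable[[str], Optional[str]],
) -> Dict[str, str]:
    """Map each source code to its canonical form if that form is a target."""
    targets = set(target_codes)
    mapping: Dict[str, str] = {}
    for code in source_codes:
        canonical = normalize(code)
        if canonical is not None and canonical in targets:
            mapping[code] = canonical
    return mapping
```

Under this reading, unknown source codes and codes whose canonical form is not a target are simply left out of the result.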

GlotLID Compatibility Mode

To follow the same mapping decisions as the GlotLID project, pass glotlid=True when calling normalize.

from babelcode import BabelCode

bc = BabelCode()

bc.normalize("Farsi", glotlid=True)             # → "fas_Arab"
bc.normalize("ktu_Latn", glotlid=True)          # → "kng_Latn"
bc.normalize("tgl_Latn", glotlid=True)          # → "fil_Latn"

Why iso3_Script instead of BCP-47?

BCP-47 tags (en, zh-Hans) are familiar, but they conflate macrolanguages with individual languages, vary in length, and omit the script when it is "obvious", which is ambiguous for multi-script languages such as Serbian.

The {iso_639_3}_{iso_15924} canonical form:

  • Always has exactly two components — easy to split and compare.
  • Uses ISO 639-3 — one code per individual language, no macrolanguage ambiguity.
  • Always includes the ISO 15924 script — no guessing for Serbian (srp_Cyrl vs srp_Latn) or Chinese (cmn_Hans vs cmn_Hant).
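
Because the canonical form always has exactly two components, downstream code can split and compare it with plain string operations. The helpers below are a small sketch of that idea; LangScript, split_canonical, and same_language are names invented for this example and are not part of babelcode.

```python
from typing import NamedTuple


class LangScript(NamedTuple):
    """The two components of a canonical code such as 'eng_Latn'."""
    iso639_3: str
    script: str


def split_canonical(code: str) -> LangScript:
    """Split '{iso_639_3}_{iso_15924}' into language and script."""
    lang, sep, script = code.partition("_")
    # A canonical code has a 3-letter language and a 4-letter script.
    if not sep or len(lang) != 3 or len(script) != 4:
        raise ValueError(f"not a canonical code: {code!r}")
    return LangScript(lang, script)


def same_language(a: str, b: str) -> bool:
    """True when two canonical codes share a language, whatever the script."""
    return split_canonical(a).iso639_3 == split_canonical(b).iso639_3
```

For instance, same_language("srp_Cyrl", "srp_Latn") is True, while the two codes remain distinct as strings.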

Data sources & methodology

babelcode is pre-built from LinguaMeta, a comprehensive open dataset from Google Research that aggregates:

| Source | What it provides |
| --- | --- |
| ISO 639-3 (SIL) | Three-letter codes for 7 800+ languages |
| ISO 15924 (Unicode) | Four-letter script codes (Latn, Arab, …) |
| IETF BCP 47 | Standard language tags (en, zh-Hans, …) |
| LinguaMeta JSON files (~7 500) | Canonical script, macrolanguage membership, English names |

Build process

  1. Raw data: ~7500 per-language JSON files from LinguaMeta, each containing script associations, name data, and macrolanguage membership.
  2. Cache compilation (babelcode-build-cache): Reads all JSON files and produces a single linguameta_cache.json (~550 KB) with pre-resolved BCP-47 → ISO 639-3 mappings, canonical scripts, English names, and macrolanguage membership tables.
  3. Runtime: BabelCode loads the cache once and resolves any input format through a cascade of regex matchers and lookup tables.

The cache is distributed inside the package — no network calls at runtime, zero dependencies.
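
To make "a cascade of regex matchers and lookup tables" concrete, here is a toy sketch of the general technique. It is not the package's code: the patterns, the tiny tables, and the name normalize_sketch are assumptions for illustration, and the real cache covers thousands of languages.

```python
import re
from typing import Optional

# Tiny illustrative tables (hypothetical stand-ins for the compiled cache).
_TO_ISO3 = {"en": "eng", "de": "deu", "eng": "eng", "deu": "deu"}
_CANONICAL_SCRIPT = {"eng": "Latn", "deu": "Latn"}
_NAMES = {"english": "eng", "german": "deu"}

# Matchers are tried in order; each captures (language, optional script).
_NLLB = re.compile(r"^([a-z]{3})_([A-Za-z]{4})$")                  # eng_Latn
_WIKIPRON = re.compile(r"^wp_([a-z]{3})_([a-z]{4})(?:_[a-z]+)*$")  # wp_eng_latn_us
_BCP47 = re.compile(r"^([a-z]{2,3})(?:-([A-Za-z]{4}))?$")          # en, de-Latn


def normalize_sketch(tag: str) -> Optional[str]:
    """Resolve a tag to '{iso_639_3}_{iso_15924}', or None if unknown."""
    tag = tag.strip()
    for pattern in (_NLLB, _WIKIPRON, _BCP47):
        match = pattern.match(tag)
        if match:
            lang = _TO_ISO3.get(match.group(1))
            if lang is None:
                return None
            # Use the script from the tag, else the table, else Latn.
            script = match.group(2) or _CANONICAL_SCRIPT.get(lang, "Latn")
            return f"{lang}_{script.title()}"
    # No pattern matched: fall back to an English-name lookup.
    lang = _NAMES.get(tag.lower())
    if lang is None:
        return None
    return f"{lang}_{_CANONICAL_SCRIPT.get(lang, 'Latn')}"
```

Ordering the matchers from most to least specific keeps a structured tag such as wp_eng_latn_us from being mistaken for a plain name.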

Macrolanguage resolution

By default, macrolanguages resolve to their preferred individual language (e.g. ar → arb Standard Arabic, zh → cmn Mandarin). This can be disabled with resolve_macro=False.
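
The effect of the flag can be pictured as an optional table lookup. The sketch below assumes a hypothetical _PREFERRED_MEMBER table and a helper named resolve_language; only the resolve_macro parameter name and the two example pairs come from this README.

```python
# Macrolanguage -> preferred individual language (ISO 639-3 codes).
# Two illustrative entries, matching the examples in this README.
_PREFERRED_MEMBER = {"ara": "arb", "zho": "cmn"}


def resolve_language(iso3: str, resolve_macro: bool = True) -> str:
    """Swap a macrolanguage for its preferred member unless disabled."""
    if resolve_macro:
        return _PREFERRED_MEMBER.get(iso3, iso3)
    return iso3
```

Codes that are not macrolanguages pass through unchanged in either mode.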

Script inference

When a script is not explicit in the input, babelcode uses this cascade:

  1. Script from the input tag itself (e.g. zh-Hans → Hans)
  2. text_hint parameter — runs Unicode-range script detection on sample text
  3. Canonical script from the LinguaMeta cache
  4. Fallback to Latn
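
The four steps above can be sketched as a chain of early returns. This is an illustration of the described order only, not the package's code: infer_script, its parameters other than text_hint, and the injected detect callable are assumptions for this example.

```python
from typing import Callable, Mapping, Optional


def infer_script(
    explicit_script: Optional[str],
    iso3: str,
    canonical_scripts: Mapping[str, str],
    text_hint: Optional[str] = None,
    detect: Optional[Callable[[str], Optional[str]]] = None,
) -> str:
    """Pick a script following the four-step cascade described above."""
    # 1. A script given in the input tag always wins.
    if explicit_script:
        return explicit_script
    # 2. Otherwise, run script detection on the sample text, if provided.
    if text_hint and detect is not None:
        detected = detect(text_hint)
        if detected:
            return detected
    # 3. Otherwise, use the canonical script recorded for the language.
    canonical = canonical_scripts.get(iso3)
    if canonical:
        return canonical
    # 4. Last resort.
    return "Latn"
```

Passing the detector in as a callable keeps the sketch independent of any particular script-detection routine.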

Development

git clone https://github.com/omneity-labs/babelcode
cd babelcode
pip install -e ".[dev]"
pytest

Rebuilding the cache

If you update the LinguaMeta data files under src/babelcode/data/_url_nlp_repo/, rebuild the cache:

babelcode-build-cache

License

MIT, Omar Kamali / Omneity Labs
