Normalize any language identifier to canonical ISO 639-3 + ISO 15924 form
babelcode
Normalize any language identifier — ISO 639-1 (en), ISO 639-3 (eng), BCP-47 (zh-Hans), NLLB-style (eng_Latn), WikiPron filenames (wp_eng_latn_us), CHILDES corpus names (EnglishNA), or plain English (German) — into a single canonical form:
{iso_639_3}_{iso_15924}
For example: eng_Latn, arb_Arab, cmn_Hans, jpn_Jpan.
Review the mapping decisions used in this project.
Installation
pip install babelcode
Quick start
from babelcode import BabelCode
bc = BabelCode()
bc.normalize("en") # → "eng_Latn"
bc.normalize("zh-Hans") # → "cmn_Hans"
bc.normalize("ar") # → "arb_Arab" (macrolanguage → preferred individual)
bc.normalize("wp_deu_latn") # → "deu_Latn" (WikiPron format)
bc.normalize("Farsi") # → "pes_Arab" (English name)
bc.normalize("EnglishNA") # → "eng_Latn" (CHILDES corpus name)
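The inputs above arrive in several surface shapes. As a rough illustration of how such shapes could be told apart before any table lookup, the sketch below tries a few regular expressions in order. The function name `classify_format` and the patterns are assumptions made for this example, not babelcode's internals.

```python
import re

# Hypothetical patterns, tried in order; the first match wins.
_PATTERNS = [
    ("wikipron", re.compile(r"^wp_([a-z]{3})_([a-z]{4})(?:_[a-z0-9]+)*$")),
    ("nllb", re.compile(r"^([a-z]{3})_([A-Z][a-z]{3})$")),
    ("bcp47", re.compile(r"^([a-z]{2,3})(?:-([A-Z][a-z]{3}))?(?:-[A-Za-z0-9]+)*$")),
]


def classify_format(code: str):
    """Return (format_name, language_part, script_part_or_None)."""
    for name, pattern in _PATTERNS:
        match = pattern.match(code)
        if match:
            script = match.group(2)
            # ISO 15924 codes are written title-cased (e.g. "Latn").
            return name, match.group(1), script.title() if script else None
    # Anything else (e.g. "Farsi", "EnglishNA") would need a name lookup table.
    return "name", code, None


print(classify_format("wp_deu_latn"))
print(classify_format("zh-Hans"))
```

Ordering matters here: the WikiPron pattern is checked before the NLLB-style one because both contain underscores, and the more specific prefix should win.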
Singleton shortcut
from babelcode import get_instance
bc = get_instance() # cached singleton — same instance every call
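For readers unfamiliar with the pattern, a cached singleton can be sketched with the standard library alone. The class `Normalizer` and the function `get_shared_normalizer` below are hypothetical stand-ins; how babelcode implements `get_instance` is not stated here.

```python
from functools import lru_cache


class Normalizer:
    """Stand-in for an object that is expensive to construct."""

    def __init__(self):
        self.table = {"en": "eng_Latn"}  # toy data, loaded once


@lru_cache(maxsize=None)
def get_shared_normalizer() -> Normalizer:
    # lru_cache on a zero-argument function stores the first result
    # and hands back that same object on every later call.
    return Normalizer()


first = get_shared_normalizer()
second = get_shared_normalizer()
print(first is second)
```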
Accessors
bc.iso639_3("eng_Latn") # → "eng"
bc.script("eng_Latn") # → "Latn"
bc.bcp47("eng_Latn") # → "en"
bc.name("arb") # → "Standard Arabic"
bc.scripts("srp") # → ["Cyrl", "Latn"]
bc.is_macrolanguage("ara") # → True
bc.macro_members("ara") # → ["arb", "arz", ...]
Script detection from text
from babelcode import detect_script
detect_script("مرحبا") # → "Arab"
detect_script("Привет") # → "Cyrl"
detect_script("こんにちは") # → "Jpan"
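One simple way to approximate this kind of detection with only the standard library is to look at Unicode character names and take a majority vote. The sketch below is an assumption-laden illustration, not babelcode's algorithm: the function `guess_script` is hypothetical, the prefix table covers only a handful of scripts, and Han characters are deliberately left unmapped because they alone do not distinguish Chinese from Japanese.

```python
import unicodedata
from collections import Counter

# First word of the Unicode character name -> ISO 15924 code (small subset).
_NAME_PREFIX_TO_SCRIPT = {
    "LATIN": "Latn",
    "ARABIC": "Arab",
    "CYRILLIC": "Cyrl",
    "HIRAGANA": "Jpan",
    "KATAKANA": "Jpan",
}


def guess_script(text: str):
    """Return the most frequent known script in text, or None."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace
        prefix = unicodedata.name(ch, "").split(" ", 1)[0]
        script = _NAME_PREFIX_TO_SCRIPT.get(prefix)
        if script:
            counts[script] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


print(guess_script("Привет"))
```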
Batch operations
bc.normalize_list(["en", "de", "unknown", "fr"])
# → ["eng_Latn", "deu_Latn", None, "fra_Latn"]
bc.build_mapping(
source_codes=["en", "de"],
target_codes=["eng_Latn", "deu_Latn", "fra_Latn"],
)
# → {"en": "eng_Latn", "de": "deu_Latn"}
GlotLID Compatibility Mode
To follow the same decisions as the GlotLID project, pass glotlid=True when calling normalize.
from babelcode import BabelCode
bc = BabelCode()
bc.normalize("Farsi", glotlid=True) # → "fas_Arab"
bc.normalize("ktu_Latn", glotlid=True) # → "kng_Latn"
bc.normalize("tgl_Latn", glotlid=True) # → "fil_Latn"
Why iso3_Script instead of BCP-47?
BCP-47 tags (en, zh-Hans) are familiar but they conflate
macrolanguages with individual languages, have variable length, and omit
the script when it is "obvious" — which is ambiguous for multi-script
languages like Serbian.
The {iso_639_3}_{iso_15924} canonical form:
- Always has exactly two components — easy to split and compare.
- Uses ISO 639-3 — one code per individual language, no macrolanguage ambiguity.
- Always includes the ISO 15924 script: no guessing for Serbian (srp_Cyrl vs srp_Latn) or Chinese (cmn_Hans vs cmn_Hant).
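Because the canonical form always has exactly two components, splitting and comparing takes a few lines of plain Python. The helpers below (`split_canonical`, `same_language`) are illustrative names, not part of babelcode.

```python
from typing import NamedTuple


class LangScript(NamedTuple):
    language: str  # ISO 639-3 code, e.g. "srp"
    script: str    # ISO 15924 code, e.g. "Cyrl"


def split_canonical(code: str) -> LangScript:
    # Unpacking raises ValueError if there are not exactly two parts.
    language, script = code.split("_")
    return LangScript(language, script)


def same_language(a: str, b: str) -> bool:
    """True when two canonical codes differ at most in script."""
    return split_canonical(a).language == split_canonical(b).language


print(split_canonical("srp_Cyrl"))
print(same_language("srp_Cyrl", "srp_Latn"))
```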
Data sources & methodology
babelcode is pre-built from LinguaMeta, a comprehensive open dataset from Google Research that aggregates:
| Source | What it provides |
|---|---|
| ISO 639-3 (SIL) | Three-letter codes for 7 800+ languages |
| ISO 15924 (Unicode) | Four-letter script codes (Latn, Arab, …) |
| IETF BCP 47 | Standard language tags (en, zh-Hans, …) |
| LinguaMeta JSON files (~7 500) | Canonical script, macrolanguage membership, English names |
Build process
- Raw data: ~7500 per-language JSON files from LinguaMeta, each containing script associations, name data, and macrolanguage membership.
- Cache compilation (babelcode-build-cache): Reads all JSON files and produces a single linguameta_cache.json (~550 KB) with pre-resolved BCP-47 → ISO 639-3 mappings, canonical scripts, English names, and macrolanguage membership tables.
- Runtime: BabelCode loads the cache once and resolves any input format through a cascade of regex matchers and lookup tables.
The cache is distributed inside the package — no network calls at runtime, zero dependencies.
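The compile step described above, many small JSON files folded into one lookup file, can be sketched as follows. Everything specific here is an assumption made for illustration: the function `compile_cache`, the record field names (`iso_639_3`, `bcp47`, `script`), and the cache layout are hypothetical and do not reflect the actual LinguaMeta schema or the babelcode-build-cache implementation.

```python
import json
from pathlib import Path


def compile_cache(data_dir: Path, cache_path: Path) -> dict:
    """Merge per-language JSON records into one cache file."""
    cache = {"bcp47_to_iso3": {}, "canonical_script": {}}
    for json_file in sorted(data_dir.glob("*.json")):
        record = json.loads(json_file.read_text(encoding="utf-8"))
        iso3 = record["iso_639_3"]  # hypothetical field name
        cache["bcp47_to_iso3"][record["bcp47"]] = iso3
        cache["canonical_script"][iso3] = record["script"]
    # Write the cache outside data_dir so a later run does not read it back in.
    cache_path.write_text(json.dumps(cache, ensure_ascii=False), encoding="utf-8")
    return cache
```

Resolving the mappings once at build time means the runtime only has to load a single file and do dictionary lookups.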
Macrolanguage resolution
By default, macrolanguages resolve to their preferred individual language (e.g. ar → arb Standard Arabic, zh → cmn Mandarin). This can be disabled with resolve_macro=False.
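The rule can be pictured as a table lookup guarded by a flag. The sketch below uses a two-entry toy table built from the examples in this README (ara and zho are the ISO 639-3 macrolanguage codes for Arabic and Chinese); `resolve_language` is a hypothetical helper, not the library's API.

```python
# Toy table: macrolanguage -> preferred individual language.
_PREFERRED_INDIVIDUAL = {"ara": "arb", "zho": "cmn"}


def resolve_language(iso3: str, resolve_macro: bool = True) -> str:
    """Swap a macrolanguage for its preferred member unless disabled."""
    if resolve_macro:
        # Codes that are not macrolanguages pass through unchanged.
        return _PREFERRED_INDIVIDUAL.get(iso3, iso3)
    return iso3


print(resolve_language("ara"))
print(resolve_language("ara", resolve_macro=False))
```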
Script inference
When a script is not explicit in the input, babelcode uses this cascade:
- Script from the input tag itself (e.g. zh-Hans → Hans)
- text_hint parameter: runs Unicode-range script detection on sample text
- Canonical script from the LinguaMeta cache
- Fallback to Latn
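The cascade above can be read as a chain of early returns. The following sketch mirrors the four steps in order; the function `infer_script`, its parameter names other than text_hint, and the pluggable `detect` callable are assumptions for illustration rather than babelcode's real signature.

```python
def infer_script(tag_script=None, text_hint=None, canonical_script=None, detect=None):
    """Pick a script by walking the four-step cascade."""
    # 1. A script stated in the input tag always wins.
    if tag_script:
        return tag_script
    # 2. Otherwise try detecting the script from sample text.
    if text_hint and detect is not None:
        detected = detect(text_hint)
        if detected:
            return detected
    # 3. Otherwise use the language's canonical script, if known.
    if canonical_script:
        return canonical_script
    # 4. Last resort.
    return "Latn"


# A trivial stand-in detector that only recognises Cyrillic text.
def toy_detect(text):
    return "Cyrl" if any("\u0400" <= ch <= "\u04ff" for ch in text) else None


print(infer_script(text_hint="Привет", canonical_script="Latn", detect=toy_detect))
```

Note that a failed detection in step 2 falls through to step 3 instead of stopping the cascade.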
Development
git clone https://github.com/omneity-labs/babelcode
cd babelcode
pip install -e ".[dev]"
pytest
Rebuilding the cache
If you update the LinguaMeta data files under src/babelcode/data/_url_nlp_repo/, rebuild the cache:
babelcode-build-cache
License