Normalize any language identifier to canonical ISO 639-3 + ISO 15924 form

babelcode

Normalize any language identifier — ISO 639-1 (en), ISO 639-3 (eng), BCP-47 (zh-Hans), NLLB-style (eng_Latn), WikiPron filenames (wp_eng_latn_us), CHILDES corpus names (EnglishNA), or plain English (German) — into a single canonical form:

{iso_639_3}_{iso_15924}

For example: eng_Latn, arb_Arab, cmn_Hans, jpn_Jpan.
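As a rough illustration of the canonical shape, the sketch below checks and builds strings of the form {iso_639_3}_{iso_15924} using only the standard library. The helper names `is_canonical` and `make_canonical` are hypothetical and are not part of babelcode; the check covers the shape only, not whether the codes are registered.

```python
import re

# Three lowercase letters, an underscore, then a title-cased four-letter script code.
CANONICAL_RE = re.compile(r"[a-z]{3}_[A-Z][a-z]{3}")


def is_canonical(code: str) -> bool:
    """Return True if `code` has the shape {iso_639_3}_{iso_15924}."""
    return CANONICAL_RE.fullmatch(code) is not None


def make_canonical(lang: str, script: str) -> str:
    """Join a language code and a script code, fixing letter case first."""
    code = f"{lang.lower()}_{script.capitalize()}"
    if not is_canonical(code):
        raise ValueError(f"not a well-formed canonical code: {code!r}")
    return code


print(is_canonical("eng_Latn"))       # shape matches
print(is_canonical("zh-Hans"))        # BCP-47 tag, shape does not match
print(make_canonical("ENG", "latn"))  # case is normalized before joining
```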

Installation

pip install babelcode

Quick start

from babelcode import BabelCode

bc = BabelCode()

bc.normalize("en")          # → "eng_Latn"
bc.normalize("zh-Hans")     # → "cmn_Hans"
bc.normalize("ar")          # → "arb_Arab"  (macrolanguage → preferred individual)
bc.normalize("wp_deu_latn") # → "deu_Latn"  (WikiPron format)
bc.normalize("Farsi")       # → "pes_Arab"  (English name)
bc.normalize("EnglishNA")   # → "eng_Latn"  (CHILDES corpus name)
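The inputs above arrive in several shapes. The following sketch shows one way an identifier might be classified by format before lookup, trying an ordered list of regexes and taking the first full match. The patterns, labels, and the `classify` function are assumptions made for illustration and are not babelcode's actual matchers.

```python
import re

# Ordered (label, pattern) pairs; the first full match wins.
# These patterns are illustrative guesses, not the library's own.
FORMAT_PATTERNS = [
    ("wikipron", re.compile(r"wp_[a-z]{3}_[a-z]{4}(_[a-z0-9]+)*")),
    ("nllb", re.compile(r"[a-z]{3}_[A-Z][a-z]{3}")),
    # Bare two- or three-letter codes also match this pattern.
    ("bcp47", re.compile(r"[a-z]{2,3}(-[A-Z][a-z]{3})?(-[A-Z]{2})?")),
]


def classify(tag: str) -> str:
    """Return a format label for `tag`, or "name" if no pattern matches."""
    for label, pattern in FORMAT_PATTERNS:
        if pattern.fullmatch(tag):
            return label
    # Anything else would fall through to a name lookup table.
    return "name"


for tag in ["en", "zh-Hans", "eng_Latn", "wp_eng_latn_us", "Farsi", "EnglishNA"]:
    print(tag, "->", classify(tag))
```

Ordering matters in a cascade like this: the most specific pattern (the `wp_` prefix) is tried first so that looser patterns do not shadow it.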

Singleton shortcut

from babelcode import get_instance

bc = get_instance()  # cached singleton — same instance every call

Accessors

bc.iso639_3("eng_Latn")        # → "eng"
bc.script("eng_Latn")          # → "Latn"
bc.bcp47("eng_Latn")           # → "en"
bc.name("arb")                 # → "Standard Arabic"
bc.scripts("srp")              # → ["Cyrl", "Latn"]
bc.is_macrolanguage("ara")     # → True
bc.macro_members("ara")        # → ["arb", "arz", ...]

Script detection from text

from babelcode import detect_script

detect_script("مرحبا")     # → "Arab"
detect_script("Привет")    # → "Cyrl"
detect_script("こんにちは")  # → "Jpan"
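For intuition about how script detection from text can work, here is a minimal standard-library sketch that votes on the script using Unicode character names. It is not babelcode's implementation: the function name `guess_script` and the prefix table are hypothetical, and the table covers only a handful of scripts.

```python
import unicodedata
from collections import Counter
from typing import Optional

# Hypothetical mapping from the first word of a Unicode character name
# to an ISO 15924 code. Kana is mapped to Jpan to mirror the example above.
NAME_PREFIX_TO_SCRIPT = {
    "LATIN": "Latn",
    "ARABIC": "Arab",
    "CYRILLIC": "Cyrl",
    "HIRAGANA": "Jpan",
    "KATAKANA": "Jpan",
    "CJK": "Hani",
}


def guess_script(text: str) -> Optional[str]:
    """Return the most frequent known script among the letters of `text`."""
    counts: Counter = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, and whitespace
        prefix = unicodedata.name(ch, "").split(" ")[0]
        script = NAME_PREFIX_TO_SCRIPT.get(prefix)
        if script:
            counts[script] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


print(guess_script("Привет"))
```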

Batch operations

bc.normalize_list(["en", "de", "unknown", "fr"])
# → ["eng_Latn", "deu_Latn", None, "fra_Latn"]

bc.build_mapping(
    source_codes=["en", "de"],
    target_codes=["eng_Latn", "deu_Latn", "fra_Latn"],
)
# → {"en": "eng_Latn", "de": "deu_Latn"}
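One plausible reading of the batch helpers is sketched below: normalize every element, and for the mapping, join source and target codes on their canonical form. This is an assumption inferred from the example outputs above, not the library's code; `toy_normalize`, `TOY_TABLE`, and the `_sketch` functions are hypothetical stand-ins.

```python
from typing import Callable, Dict, List, Optional

# Toy lookup table standing in for a real normalizer.
TOY_TABLE = {"en": "eng_Latn", "de": "deu_Latn", "fr": "fra_Latn"}


def toy_normalize(code: str) -> Optional[str]:
    """Return the canonical form of `code`, or None if it is unknown."""
    if code in TOY_TABLE.values():
        return code  # already canonical
    return TOY_TABLE.get(code)


def normalize_list_sketch(
    codes: List[str], normalize: Callable[[str], Optional[str]] = toy_normalize
) -> List[Optional[str]]:
    """Normalize each code, keeping None in place for unknown inputs."""
    return [normalize(code) for code in codes]


def build_mapping_sketch(
    source_codes: List[str],
    target_codes: List[str],
    normalize: Callable[[str], Optional[str]] = toy_normalize,
) -> Dict[str, str]:
    """Map each source code to the target code with the same canonical form."""
    by_canonical: Dict[str, str] = {}
    for target in target_codes:
        canonical = normalize(target)
        if canonical is not None:
            by_canonical.setdefault(canonical, target)
    mapping: Dict[str, str] = {}
    for source in source_codes:
        canonical = normalize(source)
        if canonical in by_canonical:
            mapping[source] = by_canonical[canonical]
    return mapping


print(normalize_list_sketch(["en", "de", "unknown", "fr"]))
print(build_mapping_sketch(["en", "de"], ["eng_Latn", "deu_Latn", "fra_Latn"]))
```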

Why iso3_Script instead of BCP-47?

BCP-47 tags (en, zh-Hans) are familiar, but they conflate macrolanguages with individual languages, vary in length, and omit the script when it is "obvious", which is ambiguous for multi-script languages like Serbian.

The {iso_639_3}_{iso_15924} canonical form:

  • Always has exactly two components — easy to split and compare.
  • Uses ISO 639-3 — one code per individual language, no macrolanguage ambiguity.
  • Always includes the ISO 15924 script — no guessing for Serbian (srp_Cyrl vs srp_Latn) or Chinese (cmn_Hans vs cmn_Hant).
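Because the canonical form always has two components, splitting and comparing needs no special parsing. The sketch below groups a list of canonical codes by language to show which languages appear in more than one script. The names `LangScript`, `split_code`, and `scripts_by_language` are hypothetical helpers, not part of babelcode.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class LangScript(NamedTuple):
    lang: str    # ISO 639-3 code, e.g. "srp"
    script: str  # ISO 15924 code, e.g. "Cyrl"


def split_code(code: str) -> LangScript:
    """Split a canonical code into its two components."""
    lang, script = code.split("_")
    return LangScript(lang, script)


def scripts_by_language(codes: Iterable[str]) -> Dict[str, List[str]]:
    """Group canonical codes by language, listing each language's scripts."""
    grouped = defaultdict(set)
    for code in codes:
        parsed = split_code(code)
        grouped[parsed.lang].add(parsed.script)
    return {lang: sorted(scripts) for lang, scripts in grouped.items()}


print(scripts_by_language(["srp_Cyrl", "srp_Latn", "cmn_Hans"]))
```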

Data sources & methodology

babelcode is pre-built from LinguaMeta, a comprehensive open dataset from Google Research that aggregates:

| Source | What it provides |
|---|---|
| ISO 639-3 (SIL) | Three-letter codes for 7 800+ languages |
| ISO 15924 (Unicode) | Four-letter script codes (Latn, Arab, …) |
| IETF BCP 47 | Standard language tags (en, zh-Hans, …) |
| LinguaMeta JSON files (~7 500) | Canonical script, macrolanguage membership, English names |

Build process

  1. Raw data: ~7 500 per-language JSON files from LinguaMeta, each containing script associations, name data, and macrolanguage membership.
  2. Cache compilation (babelcode-build-cache): Reads all JSON files and produces a single linguameta_cache.json (~550 KB) with pre-resolved BCP-47 → ISO 639-3 mappings, canonical scripts, English names, and macrolanguage membership tables.
  3. Runtime: BabelCode loads the cache once and resolves any input format through a cascade of regex matchers and lookup tables.

The cache is distributed inside the package — no network calls at runtime, zero dependencies.
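The compile-once, load-once pattern described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual babelcode-build-cache tool: the record field names (`iso3`, `bcp47`, `script`, `name`), the cache layout, and both function names are hypothetical and do not reflect the real LinguaMeta schema.

```python
import json
from functools import lru_cache
from pathlib import Path


def compile_cache(src_dir: Path, out_path: Path) -> int:
    """Merge per-language JSON files into one cache file.

    Returns the number of records that were merged.
    """
    cache = {"bcp47_to_iso3": {}, "scripts": {}, "names": {}}
    merged = 0
    for path in sorted(src_dir.glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        iso3 = record.get("iso3")  # hypothetical field name
        if not iso3:
            continue  # skip records without a language code
        if record.get("bcp47"):
            cache["bcp47_to_iso3"][record["bcp47"]] = iso3
        if record.get("script"):
            cache["scripts"][iso3] = record["script"]
        if record.get("name"):
            cache["names"][iso3] = record["name"]
        merged += 1
    out_path.write_text(json.dumps(cache, ensure_ascii=False), encoding="utf-8")
    return merged


@lru_cache(maxsize=None)
def load_cache(path: str) -> dict:
    """Read the compiled cache from disk once; later calls reuse the result."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Resolving everything at build time keeps the runtime path to a single file read followed by dictionary lookups, which is consistent with the no-network, zero-dependency goal stated above.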

Macrolanguage resolution

By default, macrolanguages resolve to their preferred individual language (e.g. ar → arb Standard Arabic, zh → cmn Mandarin). This can be disabled with resolve_macro=False.
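The behavior can be pictured as a table lookup with an opt-out flag. The sketch below is a toy model: `PREFERRED_INDIVIDUAL` holds only the two examples from the text (keyed by the ISO 639-3 macrolanguage codes ara and zho), and `resolve_language` is a hypothetical function, not babelcode's API.

```python
# Toy table built from the two examples in the text; the real table is larger.
PREFERRED_INDIVIDUAL = {"ara": "arb", "zho": "cmn"}


def resolve_language(iso3: str, resolve_macro: bool = True) -> str:
    """Map a macrolanguage code to its preferred individual language.

    Codes that are not in the table, and all codes when
    resolve_macro is False, are returned unchanged.
    """
    if resolve_macro:
        return PREFERRED_INDIVIDUAL.get(iso3, iso3)
    return iso3


print(resolve_language("ara"))
print(resolve_language("ara", resolve_macro=False))
```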

Script inference

When a script is not explicit in the input, babelcode uses this cascade:

  1. Script from the input tag itself (e.g. zh-Hans → Hans)
  2. text_hint parameter — runs Unicode-range script detection on sample text
  3. Canonical script from the LinguaMeta cache
  4. Fallback to Latn
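The four steps above can be sketched as a function that returns the first answer it finds. This is a simplified model, not the library's code: `infer_script`, its `canonical_scripts` and `detect` parameters, and the subtag regex are assumptions made for illustration, and the regex will accept any four-letter subtag whether or not it is a registered script.

```python
import re
from typing import Callable, Dict, Optional

# A four-letter subtag set off by "-" or "_", e.g. "Hans" in "zh-Hans".
SCRIPT_SUBTAG_RE = re.compile(r"[-_]([A-Za-z]{4})(?:[-_]|$)")


def infer_script(
    tag: str,
    text_hint: Optional[str] = None,
    canonical_scripts: Optional[Dict[str, str]] = None,
    detect: Optional[Callable[[str], Optional[str]]] = None,
) -> str:
    """Pick a script for `tag` using a four-step cascade."""
    # 1. Script written in the tag itself.
    match = SCRIPT_SUBTAG_RE.search(tag)
    if match:
        return match.group(1).capitalize()
    # 2. Script detected from sample text, if a detector is supplied.
    if text_hint and detect is not None:
        detected = detect(text_hint)
        if detected:
            return detected
    # 3. Canonical script for the language part of the tag.
    lang = re.split(r"[-_]", tag)[0].lower()
    if canonical_scripts and lang in canonical_scripts:
        return canonical_scripts[lang]
    # 4. Last resort.
    return "Latn"


print(infer_script("zh-Hans"))
print(infer_script("ar", canonical_scripts={"ar": "Arab"}))
```

Placing the explicit subtag first means user-supplied information always wins over anything inferred, and the Latn fallback is reached only when every other source is silent.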

Development

git clone https://github.com/omneity-labs/babelcode
cd babelcode
pip install -e ".[dev]"
pytest

Rebuilding the cache

If you update the LinguaMeta data files under src/babelcode/data/_url_nlp_repo/, rebuild the cache:

babelcode-build-cache

License

MIT — Omar Kamali / Omneity Labs
