
Robust Language Identification using an ensemble of 5-7 LID backends


robust-lid


Robust language identification that ensembles multiple LID backends into a single (language_script, confidence) prediction. Designed for short/noisy text where any single classifier is unreliable.

Install

pip install robust-lid                 # core ensemble (5 backends)
pip install 'robust-lid[cld3]'         # +CLD3 (requires system protoc)
pip install 'robust-lid[dev]'          # +dev tooling (ruff, mypy, pytest, …)

Quick start

Python

from robust_lid import RobustLID

lid = RobustLID()
code, confidence = lid.predict("The quick brown fox jumps over the lazy dog.")
# ('eng_Latn', 0.91)

CLI

Two entry points are registered: rlid (short) and robust-lid (long alias).

rlid "The quick brown fox jumps over the lazy dog."
# eng_Latn    0.987    The quick brown fox jumps over the lazy dog.

echo "안녕하세요" | rlid --json
# {"text": "안녕하세요", "lang": "kor_Hang", "confidence": 0.94}

rlid --file input.txt --no-text          # one pred per input line, no echo
rlid --models ft176,glotlid "Hello"      # use a subset of backends
rlid --uniform "Hello"                   # disable tuned defaults
rlid --low-memory "Hello"                # load one backend at a time (peak ~1.9 GB)
rlid --no-parallel "Hello"               # sequential predict (default is threaded)
rlid --verbose "Hello"                   # stage-by-stage progress on stderr
rlid --list-backends                     # inventory and exit
rlid --help

First call downloads ~1.5 GB of fastText models to ~/.cache/robust_lid/.

Batch prediction

For multi-text workloads, call predict_batch instead of predict in a loop:

from robust_lid import RobustLID

lid = RobustLID()
results = lid.predict_batch(["Hello world", "안녕하세요", "Bonjour"])
# [('eng_Latn', 0.99), ('kor_Hang', 1.0), ('fra_Latn', 0.97)]

The CLI automatically switches to the batch path when more than one text is provided (via --file or stdin).
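The same batch path is easy to drive from Python for file inputs; a minimal sketch (input.txt is assumed to hold one text per line, matching --file):

from pathlib import Path
from robust_lid import RobustLID

lid = RobustLID()
texts = Path("input.txt").read_text(encoding="utf-8").splitlines()
for text, (lang, conf) in zip(texts, lid.predict_batch(texts)):
    print(f"{lang}\t{conf:.2f}\t{text}")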

Where the speedup actually comes from:

  1. Single thread pool per batch instead of one per text — eliminates N repetitions of the 7-worker pool-construction overhead.
  2. Cached label normalization in FastTextLID.predict_batch — the ISOConverter lookup runs once per distinct fastText label across the batch, not k × N times (see the sketch after this list).
  3. fastText's multilinePredict (C++) — a single call into C++ per backend. The measured benefit is small for already-fast models (lid.176: ~1× vs sequential) but more meaningful for the heavier fasttext-218e / GlotLID on large batches.
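A minimal sketch of the caching idea in point 2 — the helper name and the raw label format are illustrative, not the library's actual internals:

from functools import lru_cache

@lru_cache(maxsize=None)
def _normalize_label(raw_label: str) -> str:
    # Hypothetical: strip fastText's "__label__" prefix, then run the
    # ISOConverter lookup once per distinct label rather than k × N times.
    return raw_label.removeprefix("__label__")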

The unfixable bottleneck (and what fast_mode sidesteps): per-backend wall time on N=200 English sentences, grouped by backend family:

| Family | Backend | Implementation | ms/text |
|-------------|------------|----------------|---------|
| pure Python | langid | Naive Bayes | ~6.0 |
| pure Python | langdetect | Naive Bayes | ~2.2 |
| CLD | cld2 | C binding | ~0.00 |
| CLD | cld3 | C++ via gcld3 | ~0.04 |
| fastText | ft176 | C++ | ~0.01 |
| fastText | ft218e | C++ | ~0.06 |
| fastText | glotlid | C++ | ~0.43 |

langid and langdetect are pure Python, GIL-bound, and have no batch API — together they account for ~95 % of total ensemble wall time regardless of how they are called.
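These numbers are easy to reproduce with the per-backend predict API; a rough sketch using one of the documented backend classes:

import time
from robust_lid.models import LangdetectLID

backend = LangdetectLID()
texts = ["The quick brown fox jumps over the lazy dog."] * 200
start = time.perf_counter()
for text in texts:
    backend.predict(text)
print(f"{(time.perf_counter() - start) * 1000 / len(texts):.2f} ms/text")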

fast_mode (default: on)

RobustLID(fast_mode=True) — the default — drops those two backends from the ensemble, leaving 4-5 all-C/C++ backends (cld2, cld3, ft176, ft218e, glotlid). This yields a large wall-time reduction at a small accuracy cost: fastText-176 alone already covers 176 languages, which is most of what langid + langdetect contribute.

from robust_lid import RobustLID
RobustLID()                   # fast_mode=True, 5-backend ensemble (default)
RobustLID(fast_mode=False)    # all 7 backends — maximum ensemble diversity

CLI equivalents:

rlid "text"                   # fast_mode default
rlid --with-slow "text"       # include langid + langdetect

SLOW_BACKEND_NAMES in robust_lid.ensemble exposes the excluded set (frozenset({"langid", "langdetect"})) for introspection.
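For example:

from robust_lid.ensemble import SLOW_BACKEND_NAMES

print(SLOW_BACKEND_NAMES)  # frozenset({'langid', 'langdetect'})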

Execution modes and memory footprint

| Mode | How | Peak RSS | Per-call latency |
|------|-----|----------|------------------|
| Fast (default) | All backends eagerly loaded; predict calls run on a thread pool | ~3.2 GB | ~30-100 ms |
| Sequential | parallel=False / --no-parallel — no thread pool | ~3.2 GB | ~100-300 ms |
| Low memory | low_memory=True / --low-memory — each predict re-instantiates every backend and releases it when done | ~1.9 GB peak, ~250 MB between calls | seconds (re-loads fastText from disk each call) |

Low-memory mode trades per-call latency for a much smaller resident footprint (useful on CI runners, small VPSes, or embedded-like environments). It disables supported-script gating — backends aren't live between calls, so their supported_scripts attribute can't be inspected. Incompatible with a custom models= list.
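The Python equivalent of --low-memory:

from robust_lid import RobustLID

lid = RobustLID(low_memory=True)        # nothing held between calls
lang, conf = lid.predict("Hola mundo")  # slow: re-loads backends each call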

Backends

| Backend | Bundled | Notes |
|---------|---------|-------|
| langid | yes | pure Python |
| langdetect | yes | pure Python |
| pycld2 (CLD2) | yes | C binding |
| gcld3 (CLD3) | opt-in | requires protoc — see below |
| fastText-176 | yes (downloaded on first use) | 126 MB |
| fastText-218e | yes (downloaded on first use) | 1.2 GB |
| GlotLID v3 | yes (downloaded on first use) | 172 MB, 2,100+ languages |

If gcld3 is not installed, the ensemble simply runs without it (6 of the 7 backends in full mode, 4 of 5 in fast mode) and emits an ImportWarning on package import.

Installation

pip install robust-lid

Optional: enable the CLD3 backend

gcld3 depends on the Protocol Buffers compiler (protoc), which must be installed at the system level before pip install:

| Platform | Command |
|----------|---------|
| RHEL / Fedora / Rocky | sudo dnf install protobuf-compiler protobuf-devel |
| Debian / Ubuntu | sudo apt-get install protobuf-compiler libprotobuf-dev |
| macOS | brew install protobuf |

Then:

pip install 'robust-lid[cld3]'

If you skip this, RobustLID will print:

ImportWarning: gcld3 is not installed; the CLD3 backend will be excluded from the RobustLID ensemble. ...

You can check availability at runtime:

from robust_lid.models import is_cld3_available
print(is_cld3_available())  # True or False

Development

uv sync --extra dev              # lint, mypy, pytest, pre-commit
uv sync --extra dev --extra e2e  # + datasets (WiLi-2018, papluca) + gcld3
uv run pytest                    # unit tests only (no network)
uv run pytest -m "slow and network"  # E2E — downloads ~1.5 GB of models on first run
uv run mypy src/robust_lid
uv run ruff check src/ tests/

LID benchmarks

The ensemble is evaluated on three Hub datasets with different domain characteristics and label granularities. Each dataset is compared at its native granularity:

| Dataset | Domain | Langs | Label format | Comparison | Accuracy (tuned defaults) |
|---------|--------|-------|--------------|------------|---------------------------|
| martinthoma/wili_2018 | Wikipedia, 235 langs | 30 major | ISO 639-3 (eng) | lang only | 98.4 % (886 / 900) |
| papluca/language-identification | reviews + news, 20 langs | 18 overlap | ISO 639-1 (en) | lang only | 99.6 % (538 / 540) |
| openlanguagedata/flores_plus (gated) | translated Wikipedia, 200 langs | 30 major | lang_Script (eng_Latn) | strict lang_Script | 100.0 % (900 / 900) |

Why different comparison granularities? WiLi-2018 and papluca ship only language labels (no script), so we can't verify the script dimension against them. FLORES+ labels carry both, so matches_lang_script (in tests/integration/_common.py) enforces exact language AND exact script (modulo macrolanguage and script-supercode equivalence classes — so cmn_Hant ≡ zho_Hans and arb_Arab ≡ arz_Arab). This catches the class of bugs where a backend nails the language but mis-detects the script.

The 30 major languages tracked (tests/integration/_common.py) cover all commonly used scripts (Latn / Hang / Jpan / Hani / Hira / Kana / Arab / Cyrl / Deva / Beng / Thai / Grek / Hebr). Per-backend numbers are available via scripts/per_backend_accuracy.py --dataset wili papluca.

Running gated datasets (FLORES-200)

  1. Copy the template: cp .env.example .env
  2. Fill in a read-only Hugging Face token from https://huggingface.co/settings/tokens and accept the dataset's terms on its page.
  3. Run: uv run pytest -m "slow and network" tests/integration/test_flores_e2e.py -s

.env is already in .gitignore. The token is loaded by tests/integration/conftest.py (stdlib-only parser — no extra dep). When HF_TOKEN is not set, FLORES tests skip automatically; the ungated benchmarks keep running.
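For reference, a stdlib-only .env parser is roughly this shape (a sketch, not necessarily the exact conftest.py code):

import os
from pathlib import Path

def load_dotenv(path: Path = Path(".env")) -> None:
    """Load KEY=VALUE pairs without overriding existing environment variables."""
    if not path.is_file():
        return
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip().strip('"'))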

Why project root? Every Python/Node tool (python-dotenv, pytest-env, Docker Compose, VS Code's python.envFile, Vercel/Netlify, etc.) searches for .env from the project root. A gitignored .env at root never shows up in git status, so the "clutter" cost is zero in practice.

Injecting fake backends (for tests)

from robust_lid import RobustLID
from robust_lid.models import LID

class FakeLID(LID):
    def predict(self, text: str) -> list[tuple[str, float]]:
        return [("eng", 0.99)]

lid = RobustLID(models=[FakeLID(), FakeLID()])  # no network, no models

FastTextLID(cache_dir=..., download_fn=..., model_loader=...) and ISOConverter(mapping=..., iso639_3_map=..., tsv_path=...) accept injected dependencies for testing.
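A hedged sketch of exercising those seams in a test — the callable signatures, and whether FastTextLID requires further arguments, are assumptions to verify against the actual code:

from robust_lid.models import FastTextLID

class StubModel:
    # Mimics fastText's predict() return shape: (labels, probabilities).
    def predict(self, text, k=1):
        return (["__label__eng_Latn"], [0.99])

ft = FastTextLID(
    cache_dir="/tmp/rlid-test",                 # keep ~/.cache untouched
    download_fn=lambda *a, **kw: None,          # assumed: skip the network
    model_loader=lambda *a, **kw: StubModel(),  # assumed: bypass disk load
)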

Weighted voting

RobustLID combines three multiplicative knobs per backend:

| Knob | Shape | Applies when | Default source |
|------|-------|--------------|----------------|
| weights | list[float] | always (per-model scalar) | default_weights() |
| script_weights | list[dict[script → float]] | when detect_script(text) hits a key | default_script_weights() |
| lang_weights | list[dict[predicted_lang → float]] | when the backend's top-1 matches a key | default_lang_weights() |

Effective contribution of backend i:

weights[i] * script_weights[i].get(script, 1.0) * lang_weights[i].get(pred_lang, 1.0) * prob
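Not the library's actual compute_ensemble_vote, but a minimal sketch of the same arithmetic, including the prob ≤ 0 guard noted below; the final normalization step is an assumption:

from collections import defaultdict

def ensemble_vote(votes, weights, script_weights, lang_weights, script):
    """votes[i] is backend i's top-1 (pred_lang, prob); returns (winner, confidence)."""
    totals = defaultdict(float)
    for i, (pred_lang, prob) in enumerate(votes):
        if prob <= 0:
            continue  # defensive skip, as compute_ensemble_vote does
        totals[pred_lang] += (
            weights[i]
            * script_weights[i].get(script, 1.0)
            * lang_weights[i].get(pred_lang, 1.0)
            * prob
        )
    winner = max(totals, key=totals.get)
    return winner, totals[winner] / sum(totals.values())  # assumed normalization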

Script-based backend gating

Each backend exposes supported_langs (ISO 639-3 codes it can ever emit) and supported_scripts (derived ISO 15924 codes). RobustLID uses the latter to auto-silence any backend whose supported-script set doesn't cover the input's detected script — preventing a backend from dragging the ensemble down with a confidently-wrong guess on text outside its coverage.

from robust_lid.models import LangdetectLID, FastText176LID

LangdetectLID().supported_langs       # frozenset of 55 ISO-639-3 codes
LangdetectLID().supported_scripts     # 41 ISO-15924 codes (Latn, Cyrl, …)
# langdetect has no Khmer, Ethiopic, Tibetan coverage;
# fastText-176 does, so on Amharic text only fastText's vote counts.

Custom backends (models=[MyLID(), ...]) default to frozenset() → gating is disabled. Override supported_langs to opt in.
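Building on the FakeLID pattern above, a sketch of opting in — assuming supported_langs can be overridden as a plain class attribute (supported_scripts is derived from it):

from robust_lid import RobustLID
from robust_lid.models import LID

class GreekOnlyLID(LID):
    supported_langs = frozenset({"ell"})  # opt in to script gating

    def predict(self, text: str) -> list[tuple[str, float]]:
        return [("ell", 0.9)]

lid = RobustLID(models=[GreekOnlyLID()])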

RobustLID also applies two upstream-bug fixes at import time:

  • langid returns raw log-probabilities by default (large negative values). LangidLID constructs the identifier with norm_probs=True so it yields [0, 1] probabilities; otherwise the negative vote totals flip sign during normalization and hand the win to whichever language langid disagreed with.
  • fasttext-wheel 0.9.2 uses np.array(copy=False), which breaks on NumPy 2. _patch_fasttext_for_numpy2() monkey-patches _FastText.predict to use np.asarray instead (illustrated below).
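The NumPy half of that second fix, in miniature (illustrative, not the patch itself):

import numpy as np

probs = [0.9, 0.1]
# On NumPy 2, np.array(probs, copy=False) raises: converting a list always
# requires a copy, and copy=False now means "never copy".
arr = np.asarray(probs)  # copies only when needed; works on NumPy 1.x and 2.x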

compute_ensemble_vote also skips any vote with prob ≤ 0 defensively.

Calling RobustLID() without args auto-applies all three defaults. They were tuned on WiLi-2018 across 14 major languages to patch the known per-backend weak spots:

  • langdetect × Hani → 0.3 (73 % recall on Chinese)
  • langdetect × Jpan → 0.5
  • cld2 × Hani → 0.8
  • glotlid × Deva → 0.5 (confuses Hindi with Marathi)
  • langid × {ltz, kir} → 0.5 / 0.3 (rare mis-labels of German/Turkish)
  • glotlid × mar → 0.7
  • ft176 scalar → 1.3, ft218e scalar → 1.2 (the two strongest backends)

Override or disable selectively:

from robust_lid import RobustLID
from robust_lid.ensemble import default_backend_order

order = default_backend_order()
# ['langid', 'langdetect', 'cld2', 'cld3', 'ft176', 'ft218e', 'glotlid']
# (cld3 is omitted if gcld3 isn't installed)

# Uniform (disable all tuning)
lid = RobustLID(
    weights=[1.0] * len(order),
    script_weights=[{}] * len(order),
    lang_weights=[{}] * len(order),
)

# Custom scalar weights by name
weights_by_name = {
    "langid": 1.0, "langdetect": 0.5, "cld2": 1.0, "cld3": 1.0,
    "ft176": 2.0, "ft218e": 2.0, "glotlid": 1.5,
}
lid = RobustLID(weights=[weights_by_name[name] for name in order])

For custom models (RobustLID(models=[...])) defaults are not applied because the tuning is keyed by backend name.

To measure per-backend accuracy on your own data and re-tune:

uv run python scripts/per_backend_accuracy.py --lang por deu tur --n 50

Release process

Releases are fully automated via python-semantic-release + PyPI Trusted Publishing. Every merge to main runs the release workflow, which:

  1. Parses Conventional Commits since the last tag.
  2. Decides the next semver (fix:/perf: → patch, feat: → minor, BREAKING CHANGE or type!: → major).
  3. Updates pyproject.toml's project.version, appends to CHANGELOG.md, tags vX.Y.Z, and pushes a release commit.
  4. Builds with uv build and uploads the wheel + sdist to PyPI via OIDC (no token secret — configured once as a PyPI trusted publisher).
  5. Attaches the artifacts to the GitHub Release.

If no commit since the last release qualifies for a bump (only chore/docs/etc.), the workflow is a no-op.

Commit prefix cheat sheet

| Prefix | Version bump | Example |
|--------|--------------|---------|
| feat: | minor | feat: add predict_batch |
| fix: / perf: | patch | fix: handle empty input |
| feat!: / fix!: / BREAKING CHANGE: footer | major (once we're on 1.x) | feat!: drop Python 3.11 support |
| chore: / docs: / style: / refactor: / test: / build: / ci: | none | docs: add FAQ |

While major_on_zero = false (0.x lifecycle), breaking changes still bump minor rather than promoting to 1.0.
