
Robust Language Identification using an ensemble of 5-7 LID backends


robust-lid


Robust language identification that ensembles multiple LID backends into a single (language_script, confidence) prediction. Designed for short/noisy text where any single classifier is unreliable.

Install

pip install robust-lid                 # core ensemble (5 backends)
pip install 'robust-lid[cld3]'         # +CLD3 (requires system protoc)
pip install 'robust-lid[dev]'          # +dev tooling (ruff, mypy, pytest, …)

Quick start

Python

from robust_lid import RobustLID

lid = RobustLID()
code, confidence = lid.predict("The quick brown fox jumps over the lazy dog.")
# ('eng_Latn', 0.91)

CLI

Two entry points are registered: rlid (short) and robust-lid (long alias).

rlid "The quick brown fox jumps over the lazy dog."
# eng_Latn    0.987    The quick brown fox jumps over the lazy dog.

echo "안녕하세요" | rlid --json
# {"text": "안녕하세요", "lang": "kor_Hang", "confidence": 0.94}

rlid --file input.txt --no-text          # one pred per input line, no echo
rlid --models ft176,glotlid "Hello"      # use a subset of backends
rlid --uniform "Hello"                   # disable tuned defaults
rlid --low-memory "Hello"                # load one backend at a time (peak ~1.9 GB)
rlid --no-parallel "Hello"               # sequential predict (default is threaded)
rlid --verbose "Hello"                   # stage-by-stage progress on stderr
rlid --list-backends                     # inventory and exit
rlid --help

First call downloads ~1.5 GB of fastText models to ~/.cache/robust_lid/.

Batch prediction

For multi-text workloads, call predict_batch instead of predict in a loop:

from robust_lid import RobustLID

lid = RobustLID()
results = lid.predict_batch(["Hello world", "안녕하세요", "Bonjour"])
# [('eng_Latn', 0.99), ('kor_Hang', 1.0), ('fra_Latn', 0.97)]

The CLI automatically switches to the batch path when more than one text is provided (via --file or stdin).
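The same batch path is easy to drive from Python for file inputs; a minimal sketch (input.txt is assumed to hold one text per line, matching --file):

from pathlib import Path
from robust_lid import RobustLID

lid = RobustLID()
texts = Path("input.txt").read_text(encoding="utf-8").splitlines()
for text, (lang, conf) in zip(texts, lid.predict_batch(texts)):
    print(f"{lang}\t{conf:.2f}\t{text}")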

Where the speedup actually comes from:

  1. Single thread pool per batch instead of one per text — eliminates N repetitions of the 7-worker pool-construction overhead.
  2. Cached label normalization in FastTextLID.predict_batch — the ISOConverter lookup runs once per distinct fastText label across the batch, not k × N times (see the sketch after this list).
  3. fastText's multilinePredict (C++) — a single call into C++ per backend. The measured benefit is small for already-fast models (lid.176: ~1× vs sequential) but more meaningful for the heavier fasttext-218e / GlotLID on large batches.
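A minimal sketch of the caching idea in point 2 — the helper name and the raw label format are illustrative, not the library's actual internals:

from functools import lru_cache

@lru_cache(maxsize=None)
def _normalize_label(raw_label: str) -> str:
    # Hypothetical: strip fastText's "__label__" prefix, then run the
    # ISOConverter lookup once per distinct label rather than k × N times.
    return raw_label.removeprefix("__label__")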

The unfixable bottleneck (and what fast_mode sidesteps): per-backend wall time on N=200 English sentences, grouped by backend family:

| Family | Backend | Implementation | ms/text |
|-------------|------------|----------------|---------|
| pure Python | langid | Naive Bayes | ~6.0 |
| pure Python | langdetect | Naive Bayes | ~2.2 |
| CLD | cld2 | C binding | ~0.00 |
| CLD | cld3 | C++ via gcld3 | ~0.04 |
| fastText | ft176 | C++ | ~0.01 |
| fastText | ft218e | C++ | ~0.06 |
| fastText | glotlid | C++ | ~0.43 |

langid and langdetect are pure Python, GIL-bound, and have no batch API — together they account for ~95 % of total ensemble wall time regardless of how they are called.
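These numbers are easy to reproduce with the per-backend predict API; a rough sketch using one of the documented backend classes:

import time
from robust_lid.models import LangdetectLID

backend = LangdetectLID()
texts = ["The quick brown fox jumps over the lazy dog."] * 200
start = time.perf_counter()
for text in texts:
    backend.predict(text)
print(f"{(time.perf_counter() - start) * 1000 / len(texts):.2f} ms/text")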

fast_mode (default: on)

RobustLID(fast_mode=True) — the default — drops those two backends from the ensemble, leaving 4-5 all-C/C++ backends (cld2, cld3, ft176, ft218e, glotlid). This yields a large wall-time reduction at a small accuracy cost: fastText-176 alone already covers 176 languages, which is most of what langid + langdetect contribute.

from robust_lid import RobustLID
RobustLID()                   # fast_mode=True, 5-backend ensemble (default)
RobustLID(fast_mode=False)    # all 7 backends — maximum ensemble diversity

CLI equivalents:

rlid "text"                   # fast_mode default
rlid --with-slow "text"       # include langid + langdetect

SLOW_BACKEND_NAMES in robust_lid.ensemble exposes the excluded set (frozenset({"langid", "langdetect"})) for introspection.
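For example:

from robust_lid.ensemble import SLOW_BACKEND_NAMES

print(SLOW_BACKEND_NAMES)  # frozenset({'langid', 'langdetect'})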

Execution modes and memory footprint

| Mode | How | Peak RSS | Per-call latency |
|------|-----|----------|------------------|
| Fast (default) | All backends eagerly loaded; predict calls run on a thread pool | ~3.2 GB | ~30-100 ms |
| Sequential | parallel=False / --no-parallel — no thread pool | ~3.2 GB | ~100-300 ms |
| Low memory | low_memory=True / --low-memory — each predict re-instantiates every backend and releases it when done | ~1.9 GB peak, ~250 MB between calls | seconds (re-loads fastText from disk each call) |

Low-memory mode trades per-call latency for a much smaller resident footprint (useful on CI runners, small VPSes, or embedded-like environments). It disables supported-script gating — backends aren't live between calls, so their supported_scripts attribute can't be inspected. Incompatible with a custom models= list.
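The Python equivalent of --low-memory:

from robust_lid import RobustLID

lid = RobustLID(low_memory=True)        # nothing held between calls
lang, conf = lid.predict("Hola mundo")  # slow: re-loads backends each call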

Backends

| Backend | Bundled | Notes |
|---------|---------|-------|
| langid | yes | pure Python |
| langdetect | yes | pure Python |
| pycld2 (CLD2) | yes | C binding |
| gcld3 (CLD3) | opt-in | requires protoc — see below |
| fastText-176 | yes (downloaded on first use) | 126 MB |
| fastText-218e | yes (downloaded on first use) | 1.2 GB |
| GlotLID v3 | yes (downloaded on first use) | 172 MB, 2,100+ languages |

If gcld3 is not installed, the ensemble simply runs without it (6 of the 7 backends in full mode, 4 of 5 in fast mode) and emits an ImportWarning on package import.

Installation

pip install robust-lid

Optional: enable the CLD3 backend

gcld3 depends on the Protocol Buffers compiler (protoc), which must be installed at the system level before pip install:

| Platform | Command |
|----------|---------|
| RHEL / Fedora / Rocky | sudo dnf install protobuf-compiler protobuf-devel |
| Debian / Ubuntu | sudo apt-get install protobuf-compiler libprotobuf-dev |
| macOS | brew install protobuf |

Then:

pip install 'robust-lid[cld3]'

If you skip this, RobustLID will print:

ImportWarning: gcld3 is not installed; the CLD3 backend will be excluded from the RobustLID ensemble. ...

You can check availability at runtime:

from robust_lid.models import is_cld3_available
print(is_cld3_available())  # True or False

Development

uv sync --extra dev              # lint, mypy, pytest, pre-commit
uv sync --extra dev --extra e2e  # + datasets (WiLi-2018, papluca) + gcld3
uv run pytest                    # unit tests only (no network)
uv run pytest -m "slow and network"  # E2E — downloads ~1.5 GB of models on first run
uv run mypy src/robust_lid
uv run ruff check src/ tests/

LID benchmarks

The ensemble is evaluated on three Hub datasets with different domain characteristics and label granularities. Each dataset is compared at its native granularity:

| Dataset | Domain | Langs | Label format | Comparison | Accuracy (tuned defaults) |
|---------|--------|-------|--------------|------------|---------------------------|
| martinthoma/wili_2018 | Wikipedia, 235 langs | 30 major | ISO 639-3 (eng) | lang only | 98.4 % (886 / 900) |
| papluca/language-identification | reviews + news, 20 langs | 18 overlap | ISO 639-1 (en) | lang only | 99.6 % (538 / 540) |
| openlanguagedata/flores_plus (gated) | translated Wikipedia, 200 langs | 30 major | lang_Script (eng_Latn) | strict lang_Script | 100.0 % (900 / 900) |

Why different comparison granularities? WiLi-2018 and papluca ship only language labels (no script), so we can't verify the script dimension against them. FLORES+ labels carry both, so matches_lang_script (in tests/integration/_common.py) enforces exact language AND exact script (modulo macrolanguage and script-supercode equivalence classes — so cmn_Hant ≡ zho_Hans and arb_Arab ≡ arz_Arab). This catches the class of bugs where a backend nails the language but mis-detects the script.

The 30 major languages tracked (tests/integration/_common.py) cover all commonly used scripts (Latn / Hang / Jpan / Hani / Hira / Kana / Arab / Cyrl / Deva / Beng / Thai / Grek / Hebr). Per-backend numbers are available via scripts/per_backend_accuracy.py --dataset wili papluca.

Running gated datasets (FLORES-200)

  1. Copy the template: cp .env.example .env
  2. Fill in a read-only Hugging Face token from https://huggingface.co/settings/tokens and accept the dataset's terms on its page.
  3. Run: uv run pytest -m "slow and network" tests/integration/test_flores_e2e.py -s

.env is already in .gitignore. The token is loaded by tests/integration/conftest.py (stdlib-only parser — no extra dep). When HF_TOKEN is not set, FLORES tests skip automatically; the ungated benchmarks keep running.
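For reference, a stdlib-only .env parser is roughly this shape (a sketch, not necessarily the exact conftest.py code):

import os
from pathlib import Path

def load_dotenv(path: Path = Path(".env")) -> None:
    """Load KEY=VALUE pairs without overriding existing environment variables."""
    if not path.is_file():
        return
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip().strip('"'))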

Why project root? Every Python/Node tool (python-dotenv, pytest-env, Docker Compose, VS Code's python.envFile, Vercel/Netlify, etc.) searches for .env from the project root. A gitignored .env at root never shows up in git status, so the "clutter" cost is zero in practice.

Injecting fake backends (for tests)

from robust_lid import RobustLID
from robust_lid.models import LID

class FakeLID(LID):
    def predict(self, text: str) -> list[tuple[str, float]]:
        return [("eng", 0.99)]

lid = RobustLID(models=[FakeLID(), FakeLID()])  # no network, no models

FastTextLID(cache_dir=..., download_fn=..., model_loader=...) and ISOConverter(mapping=..., iso639_3_map=..., tsv_path=...) accept injected dependencies for testing.
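A hedged sketch of exercising those seams in a test — the callable signatures, and whether FastTextLID requires further arguments, are assumptions to verify against the actual code:

from robust_lid.models import FastTextLID

class StubModel:
    # Mimics fastText's predict() return shape: (labels, probabilities).
    def predict(self, text, k=1):
        return (["__label__eng_Latn"], [0.99])

ft = FastTextLID(
    cache_dir="/tmp/rlid-test",                 # keep ~/.cache untouched
    download_fn=lambda *a, **kw: None,          # assumed: skip the network
    model_loader=lambda *a, **kw: StubModel(),  # assumed: bypass disk load
)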

Weighted voting

RobustLID combines three multiplicative knobs per backend:

| Knob | Shape | Applies when | Default source |
|------|-------|--------------|----------------|
| weights | list[float] | always (per-model scalar) | default_weights() |
| script_weights | list[dict[script → float]] | when detect_script(text) hits a key | default_script_weights() |
| lang_weights | list[dict[predicted_lang → float]] | when the backend's top-1 matches a key | default_lang_weights() |

Effective contribution of backend i:

weights[i] * script_weights[i].get(script, 1.0) * lang_weights[i].get(pred_lang, 1.0) * prob
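Not the library's actual compute_ensemble_vote, but a minimal sketch of the same arithmetic, including the prob ≤ 0 guard noted below; the final normalization step is an assumption:

from collections import defaultdict

def ensemble_vote(votes, weights, script_weights, lang_weights, script):
    """votes[i] is backend i's top-1 (pred_lang, prob); returns (winner, confidence)."""
    totals = defaultdict(float)
    for i, (pred_lang, prob) in enumerate(votes):
        if prob <= 0:
            continue  # defensive skip, as compute_ensemble_vote does
        totals[pred_lang] += (
            weights[i]
            * script_weights[i].get(script, 1.0)
            * lang_weights[i].get(pred_lang, 1.0)
            * prob
        )
    winner = max(totals, key=totals.get)
    return winner, totals[winner] / sum(totals.values())  # assumed normalization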

Script-based backend gating

Each backend exposes supported_langs (ISO 639-3 codes it can ever emit) and supported_scripts (derived ISO 15924 codes). RobustLID uses the latter to auto-silence any backend whose supported-script set doesn't cover the input's detected script — preventing a backend from dragging the ensemble down with a confidently-wrong guess on text outside its coverage.

from robust_lid.models import LangdetectLID, FastText176LID

LangdetectLID().supported_langs       # frozenset of 55 ISO-639-3 codes
LangdetectLID().supported_scripts     # 41 ISO-15924 codes (Latn, Cyrl, …)
# langdetect has no Khmer, Ethiopic, Tibetan coverage;
# fastText-176 does, so on Amharic text only fastText's vote counts.

Custom backends (models=[MyLID(), ...]) default to frozenset() → gating is disabled. Override supported_langs to opt in.
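Building on the FakeLID pattern above, a sketch of opting in — assuming supported_langs can be overridden as a plain class attribute (supported_scripts is derived from it):

from robust_lid import RobustLID
from robust_lid.models import LID

class GreekOnlyLID(LID):
    supported_langs = frozenset({"ell"})  # opt in to script gating

    def predict(self, text: str) -> list[tuple[str, float]]:
        return [("ell", 0.9)]

lid = RobustLID(models=[GreekOnlyLID()])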

RobustLID also applies two upstream-bug fixes at import time:

  • langid returns raw log-probabilities by default (large negative values). LangidLID constructs the identifier with norm_probs=True so it yields [0, 1] probabilities; otherwise the negative vote totals flip sign during normalization and hand the win to whichever language langid disagreed with.
  • fasttext-wheel 0.9.2 uses np.array(copy=False), which breaks on NumPy 2. _patch_fasttext_for_numpy2() monkey-patches _FastText.predict to use np.asarray instead (illustrated below).
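The NumPy half of that second fix, in miniature (illustrative, not the patch itself):

import numpy as np

probs = [0.9, 0.1]
# On NumPy 2, np.array(probs, copy=False) raises: converting a list always
# requires a copy, and copy=False now means "never copy".
arr = np.asarray(probs)  # copies only when needed; works on NumPy 1.x and 2.x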

compute_ensemble_vote also skips any vote with prob ≤ 0 defensively.

Calling RobustLID() without args auto-applies all three defaults. They were tuned on WiLi-2018 across 14 major languages to patch the known per-backend weak spots:

  • langdetect × Hani → 0.3 (73 % recall on Chinese)
  • langdetect × Jpan → 0.5
  • cld2 × Hani → 0.8
  • glotlid × Deva → 0.5 (confuses Hindi with Marathi)
  • langid × {ltz, kir} → 0.5 / 0.3 (rare mis-labels of German/Turkish)
  • glotlid × mar → 0.7
  • ft176 scalar → 1.3, ft218e scalar → 1.2 (the two strongest backends)

Override or disable selectively:

from robust_lid import RobustLID
from robust_lid.ensemble import default_backend_order

order = default_backend_order()
# ['langid', 'langdetect', 'cld2', 'cld3', 'ft176', 'ft218e', 'glotlid']
# (cld3 is omitted if gcld3 isn't installed)

# Uniform (disable all tuning)
lid = RobustLID(
    weights=[1.0] * len(order),
    script_weights=[{}] * len(order),
    lang_weights=[{}] * len(order),
)

# Custom scalar weights by name
weights_by_name = {
    "langid": 1.0, "langdetect": 0.5, "cld2": 1.0, "cld3": 1.0,
    "ft176": 2.0, "ft218e": 2.0, "glotlid": 1.5,
}
lid = RobustLID(weights=[weights_by_name[name] for name in order])

For custom models (RobustLID(models=[...])) defaults are not applied because the tuning is keyed by backend name.

To measure per-backend accuracy on your own data and re-tune:

uv run python scripts/per_backend_accuracy.py --lang por deu tur --n 50

Release process

Releases are fully automated via python-semantic-release + PyPI Trusted Publishing. Every merge to main runs the release workflow, which:

  1. Parses Conventional Commits since the last tag.
  2. Decides the next semver (fix:/perf: → patch, feat: → minor, BREAKING CHANGE or type!: → major).
  3. Updates pyproject.toml's project.version, appends to CHANGELOG.md, tags vX.Y.Z, and pushes a release commit.
  4. Builds with uv build and uploads the wheel + sdist to PyPI via OIDC (no token secret — configured once as a PyPI trusted publisher).
  5. Attaches the artifacts to the GitHub Release.

If no commit since the last release qualifies for a bump (only chore/docs/etc.), the workflow is a no-op.

Commit prefix cheat sheet

| Prefix | Version bump | Example |
|--------|--------------|---------|
| feat: | minor | feat: add predict_batch |
| fix: / perf: | patch | fix: handle empty input |
| feat!: / fix!: / BREAKING CHANGE: footer | major (once we're on 1.x) | feat!: drop Python 3.11 support |
| chore: / docs: / style: / refactor: / test: / build: / ci: | none | docs: add FAQ |

While major_on_zero = false (0.x lifecycle), breaking changes still bump minor rather than promoting to 1.0.
