Fast, high-accuracy language detection for Python. Uses ngram classification augmented with a topwords signal for improved short-text accuracy. Supports 80+ languages.
LangIdentify
A fast, lightweight language detection library for Python. LangIdentify detects the language of text using a combination of ngram frequency analysis and whole-word ("topwords") frequency signals, both trained on the Wikipedia corpus. It supports 80+ languages across Latin, Cyrillic, Arabic, CJK, and many other scripts, and runs entirely offline with no network calls.
Most language detection libraries rely solely on character ngram models. Ngrams are an excellent primary signal, but they struggle with short or ambiguous text. LangIdentify augments ngram scoring with a topwords signal that recognizes common whole words from each language, which gives it higher accuracy on short sentences than ngram-only approaches -- even on two-word phrases.
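To make the idea concrete, here is a toy sketch of combining character-trigram scores with a whole-word bonus. It is not LangIdentify's implementation: the sample sentences, the `FLOOR` and `WORD_BONUS` values, and all function names are assumptions chosen for illustration.

```python
import math
from collections import Counter

# Tiny hypothetical training samples; the real models are trained on Wikipedia.
SAMPLES = {
    "en": "the cat and the dog are in the house with the old man",
    "fr": "le chat et le chien sont dans la maison avec le vieil homme",
}

FLOOR = -12.0      # assumed log-probability for ngrams never seen in training
WORD_BONUS = 2.0   # assumed bonus added for each recognized top word


def trigrams(text):
    """Return the character trigrams of text, padded with spaces."""
    padded = f" {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(samples):
    """Build a trigram log-probability table and a top-word set per language."""
    models = {}
    for lang, text in samples.items():
        grams = Counter(trigrams(text))
        total = sum(grams.values())
        ngram_logp = {g: math.log(c / total) for g, c in grams.items()}
        topwords = {w for w, _ in Counter(text.split()).most_common(5)}
        models[lang] = (ngram_logp, topwords)
    return models


def score(text, models):
    """Sum trigram log-probabilities, then add a bonus per matching top word."""
    scores = {}
    for lang, (ngram_logp, topwords) in models.items():
        total = sum(ngram_logp.get(g, FLOOR) for g in trigrams(text))
        total += WORD_BONUS * sum(1 for w in text.lower().split() if w in topwords)
        scores[lang] = total
    return scores


def detect(text, models):
    """Return the language code with the highest combined score."""
    scores = score(text, models)
    return max(scores, key=scores.get)


MODELS = train(SAMPLES)
print(detect("le chien", MODELS))
print(detect("the dog", MODELS))
```

On very short input there are only a handful of trigrams to sum, so a single matching top word such as "le" or "the" shifts the total noticeably. That is the intuition behind adding a word-level signal.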
Quick start
Install
pip install langidentify
For the full (higher accuracy) model:
pip install "langidentify[full]"
Basic usage
from langidentify import Detector, Model, Language
# Load the model for the languages you care about.
languages = Language.from_comma_separated("en,fr,de,es,it")
model = Model.load(languages)
# Create a detector (lightweight, not thread-safe -- use one per thread).
detector = Detector(model)
# Detect.
lang = detector.detect("Bonjour le monde")
print(lang) # Language.FRENCH
print(lang.iso_code) # fr
Inspecting results
After detection, detector.results provides scoring details:
detector.detect("The quick brown fox")
results = detector.results
print(results.result) # Language.ENGLISH
print(results.gap) # confidence gap (0.0 = close, 1.0 = decisive)
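One possible use of the gap is to reject low-confidence detections. The sketch below relies only on `detector.detect()` and `detector.results.gap` as shown above; the helper name `detect_confident`, the `min_gap` threshold of 0.2, and the `FakeDetector` stand-in (used so the snippet runs without the library installed) are assumptions, not part of LangIdentify.

```python
from types import SimpleNamespace


def detect_confident(detector, text, min_gap=0.2):
    """Return the detected language, or None when the gap is below min_gap."""
    lang = detector.detect(text)
    if detector.results.gap < min_gap:
        return None
    return lang


class FakeDetector:
    """Stand-in that mimics detect() and results for demonstration only."""

    def __init__(self, answers):
        self._answers = answers
        self.results = None

    def detect(self, text):
        lang, gap = self._answers[text]
        self.results = SimpleNamespace(result=lang, gap=gap)
        return lang


fake = FakeDetector({"Bonjour le monde": ("fr", 0.9), "message": ("en", 0.05)})
print(detect_confident(fake, "Bonjour le monde"))  # decisive: language returned
print(detect_confident(fake, "message"))           # too close: None
```

With a real detector, the call would be `detect_confident(detector, text)`. A suitable threshold depends on your data and is worth tuning on representative samples.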
Incremental detection
For streaming or multi-part text:
detector.clear_scores()
detector.add_text("Bonjour")
detector.add_text(" le monde")
result = detector.compute_result() # Language.FRENCH
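The three calls above can be wrapped in a small helper when text arrives in chunks. This is a minimal sketch: `detect_stream` is a hypothetical name, and it uses only the `clear_scores()`, `add_text()`, and `compute_result()` methods shown above. What `compute_result()` returns when no text was added is not covered here.

```python
def detect_stream(detector, chunks):
    """Feed an iterable of text chunks to a detector and return its result.

    The detector is duck-typed: it only needs clear_scores(), add_text(),
    and compute_result().
    """
    detector.clear_scores()          # drop scores left over from earlier input
    for chunk in chunks:
        if chunk:                    # skip empty chunks
            detector.add_text(chunk)
    return detector.compute_result()
```

For example, `detect_stream(detector, ["Bonjour", " le monde"])` mirrors the snippet above, and a generator that reads a file or socket piece by piece can be passed in the same way.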
Language boosts
When you have prior context (e.g. an HTTP Accept-Language header), you can bias detection toward expected languages:
boosts = model.build_boost_array({Language.FRENCH: 0.08})
lang = detector.detect("message", boosts) # FRENCH
# Without the boost, "message" is ambiguous between English and French.
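As an illustration of deriving boosts from an Accept-Language header, the following sketch parses the header into a dictionary of ISO codes and boost values using only the standard library. The function name `parse_accept_language`, the linear scaling by q-value, and the 0.08 ceiling (borrowed from the example above) are assumptions; the parser is deliberately simplified and does not cover every case in the HTTP specification.

```python
def parse_accept_language(header, max_boost=0.08):
    """Map an Accept-Language header to {iso_code: boost}.

    Each language's boost is max_boost scaled by its q-value. Region
    subtags are dropped ("fr-CH" becomes "fr") and the "*" wildcard is ignored.
    """
    boosts = {}
    for part in header.split(","):
        tag, _, params = part.strip().partition(";")
        tag = tag.strip().lower()
        if not tag or tag == "*":
            continue
        q = 1.0                                  # q defaults to 1.0 when absent
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                continue                         # skip malformed entries
        q = min(max(q, 0.0), 1.0)
        if q == 0.0:
            continue                             # q=0 means "not acceptable"
        code = tag.split("-")[0]
        boosts[code] = max(boosts.get(code, 0.0), round(max_boost * q, 4))
    return boosts


print(parse_accept_language("fr-CH, fr;q=0.9, en;q=0.5, *;q=0.1"))

# Hypothetical hand-off to LangIdentify, assuming `languages` is iterable
# and yields Language members exposing `iso_code`:
#   by_code = {lang.iso_code: lang for lang in languages}
#   raw = parse_accept_language(header)
#   boosts = model.build_boost_array(
#       {by_code[c]: b for c, b in raw.items() if c in by_code})
```

Only languages that are actually loaded in the model should be passed on, which is why the commented hand-off filters on `by_code`.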
Loading from a filesystem path
If you prefer to point directly at model data files instead of using the bundled package data:
model = Model.load_from_path("/path/to/models/lite", languages)
Choosing languages
Configure only the languages you actually need. Each additional language increases loading time and memory usage. Closely related languages can cross-detect on very short phrases -- for example, adding Luxembourgish when you only need German may cause short German phrases to be misidentified.
Group aliases are supported for convenience:
| Alias | Languages |
|---|---|
| efigs | English, French, Italian, German, Spanish |
| efigsnp | EFIGS + Dutch, Portuguese |
| europe_west_common | EFIGSNP + Nordic languages |
| europe_common | Western + Eastern European + Cyrillic |
| cjk | Chinese (Simplified/Traditional), Japanese, Korean |
| latin_alphabet | All Latin-script languages |
| unique_alphabet | Languages where the script implies the language (e.g. Thai, Greek) |
languages = Language.from_comma_separated("europe_west_common,cjk")
Lite vs. full model
Both models are trained from the same Wikipedia data but cropped at different probability floors:
| | Lite | Full |
|---|---|---|
| Log-probability floor | -12 | -15 |
| Disk size (all languages) | ~17 MB | ~89 MB |
| Best for | Most use cases | Maximum accuracy when memory is not a concern |
By default, Model.load() auto-discovers which model variant is available,
preferring the full model. To force a variant:
model = Model.load_lite(languages) # recommended default
model = Model.load_full(languages) # higher accuracy, more memory
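To illustrate what cropping at a log-probability floor means, the toy sketch below drops every ngram whose log-probability falls below the floor and substitutes the floor at lookup time. The counts are invented, and treating missing entries as "floor value" is an assumption about how such a model might score them, not a description of LangIdentify's internals.

```python
import math

# Invented ngram counts for one language.
COUNTS = {"the": 900_000, "ing": 99_000, "zqx": 999, "xjq": 1}
TOTAL = sum(COUNTS.values())
LOGP = {gram: math.log(count / TOTAL) for gram, count in COUNTS.items()}


def crop(table, floor):
    """Keep only ngrams whose log-probability is at or above the floor."""
    return {gram: lp for gram, lp in table.items() if lp >= floor}


def lookup(table, gram, floor):
    """Score an ngram, falling back to the floor for cropped or unseen ngrams."""
    return table.get(gram, floor)


lite = crop(LOGP, -12.0)   # rare "xjq" (about -13.8) is dropped
full = crop(LOGP, -15.0)   # "xjq" survives the lower floor

print(len(lite), len(full))
print(lookup(lite, "xjq", -12.0), lookup(full, "xjq", -15.0))
```

A lower floor keeps more rare ngrams, which is consistent with the larger disk size of the full model in the table above.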
Getting the full model
The lite model is sufficient for most use cases. If you want the full model for maximum accuracy, install the companion package:
pip install "langidentify[full]"
This installs the langidentify-full-model package, which provides the full
model data. Once installed, Model.load() will automatically prefer the full
model, or you can request it explicitly:
model = Model.load_full(languages)
CJK detection
Chinese/Japanese disambiguation is handled by the cjclassifier package, which is installed automatically as a dependency. Korean uses the distinct Hangul script and is identified by alphabet alone.
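The reason Korean can be identified by alphabet alone, while Chinese and Japanese need a classifier, can be shown with a rough script tally based on Unicode character names. This sketch uses only the standard library `unicodedata` module; the function name `dominant_script` and the bucket labels are assumptions, and it does not reflect how LangIdentify or cjclassifier work internally.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return the most common script bucket among the letters of text.

    Buckets: "hangul", "kana", "han", or "other". Returns None when the
    text contains no letters.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("HANGUL"):
            counts["hangul"] += 1
        elif name.startswith(("HIRAGANA", "KATAKANA")):
            counts["kana"] += 1
        elif name.startswith("CJK UNIFIED"):
            counts["han"] += 1
        else:
            counts["other"] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


print(dominant_script("안녕하세요"))   # Hangul implies Korean
print(dominant_script("こんにちは"))   # kana implies Japanese
print(dominant_script("你好世界"))     # Han alone is ambiguous
```

Han ideographs are shared between Chinese and Japanese text, so a string made only of them cannot be resolved by script and needs a statistical signal instead.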
Thread safety
Model caches loaded data in a module-level dict protected by a lock.
Detector is lightweight to construct and intentionally not thread-safe.
For concurrent detection, use a separate instance per thread:
import threading

model = Model.load(languages)  # shared, thread-safe
local = threading.local()

def get_detector():
    if not hasattr(local, "detector"):
        local.detector = Detector(model)
    return local.detector

# In each thread:
lang = get_detector().detect(text)
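The same per-thread pattern can be combined with a thread pool for batch work. In this sketch, `detect_all` and `make_parallel_detect` are hypothetical helper names, and the detector is created through a caller-supplied factory so that the snippet does not depend on the library being installed.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def make_parallel_detect(make_detector):
    """Return a detect(text) function that lazily creates one detector per thread."""
    local = threading.local()

    def detect(text):
        if not hasattr(local, "detector"):
            local.detector = make_detector()   # runs once in each worker thread
        return local.detector.detect(text)

    return detect


def detect_all(texts, make_detector, workers=4):
    """Detect the language of every text, preserving input order."""
    detect = make_parallel_detect(make_detector)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(detect, texts))
```

With LangIdentify this would be called as `detect_all(texts, lambda: Detector(model))`, sharing the single `model` across all workers.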
Requirements
- Python 3.9+
- cjclassifier (installed automatically)
License
Apache License 2.0 -- see LICENSE.
The bundled models contain statistical parameters derived from Wikipedia text. The models do not contain or reproduce Wikipedia text.