Fast, high-accuracy language detection for Python. Uses ngram classification augmented with a topwords signal for improved short-text accuracy. Supports 80+ languages.

LangIdentify

A fast, lightweight language detection library for Python. LangIdentify detects the language of text using a combination of ngram frequency analysis and whole-word ("topwords") frequency signals, both trained on the Wikipedia corpus. It supports 80+ languages across Latin, Cyrillic, Arabic, CJK, and many other scripts, and runs entirely offline with no network calls.

Most language detection libraries rely solely on character ngram models. While ngrams are an excellent primary signal, they struggle with short or ambiguous text. LangIdentify augments ngram scoring with a topwords signal that identifies common whole words from each language, giving it higher accuracy on short sentences than ngram-only approaches -- even on two-word phrases.
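
To make the idea concrete, here is a toy sketch of combining a character ngram score with a whole-word score. It is not LangIdentify's implementation: the tiny training sentences, the trigram size, the fallback floor, and the word_weight parameter are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Return overlapping character n-grams of the lowercased text, padded with one space on each side."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train(samples, n=3):
    """Build per-language log-probability tables for n-grams and whole words."""
    models = {}
    for lang, text in samples.items():
        grams = Counter(char_ngrams(text, n))
        words = Counter(text.lower().split())
        g_total, w_total = sum(grams.values()), sum(words.values())
        models[lang] = {
            "ngrams": {g: math.log(c / g_total) for g, c in grams.items()},
            "words": {w: math.log(c / w_total) for w, c in words.items()},
        }
    return models

def score(text, model, n=3, floor=-12.0, word_weight=1.0):
    """Sum n-gram log-probabilities, then add a weighted whole-word term.

    Unseen n-grams and words fall back to `floor` (an illustrative value).
    """
    ngram_score = sum(model["ngrams"].get(g, floor) for g in char_ngrams(text, n))
    word_score = sum(model["words"].get(w, floor) for w in text.lower().split())
    return ngram_score + word_weight * word_score

def detect(text, models):
    """Return the language whose combined score is highest."""
    return max(models, key=lambda lang: score(text, models[lang]))

# Toy training data (hypothetical, far too small for real use).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "fr": "le renard brun saute par dessus le chien et le chat dort",
}
MODELS = train(SAMPLES)
print(detect("the dog", MODELS))    # en
print(detect("le chien", MODELS))   # fr
```

On a two-word input there are only a handful of trigrams to score, so the whole-word term contributes a large share of the evidence, which is the intuition behind the topwords signal.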

Quick start

Install

pip install langidentify

For the full (higher accuracy) model:

pip install "langidentify[full]"

Basic usage

from langidentify import Detector, Model, Language

# Load the model for the languages you care about.
languages = Language.from_comma_separated("en,fr,de,es,it")
model = Model.load(languages)

# Create a detector (lightweight, not thread-safe -- use one per thread).
detector = Detector(model)

# Detect.
lang = detector.detect("Bonjour le monde")
print(lang)            # Language.FRENCH
print(lang.iso_code)   # fr

Inspecting results

After detection, detector.results provides scoring details:

detector.detect("The quick brown fox")
results = detector.results
print(results.result)  # Language.ENGLISH
print(results.gap)     # confidence gap (0.0 = close, 1.0 = decisive)
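
One way to use the gap is to reject low-confidence detections. The helper below is a sketch: detect_or_none is a hypothetical name, and the 0.2 threshold is an arbitrary assumption that would need tuning on your own data. It relies only on detector.detect() and detector.results.gap as shown above.

```python
def detect_or_none(detector, text, min_gap=0.2):
    """Return the detected language, or None when the confidence gap is below min_gap.

    min_gap=0.2 is an illustrative threshold, not a library recommendation.
    """
    lang = detector.detect(text)
    if detector.results.gap < min_gap:
        return None
    return lang
```

Callers can then fall back to a default language, or to a boost-assisted second pass, when the helper returns None.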

Incremental detection

For streaming or multi-part text:

detector.clear_scores()
detector.add_text("Bonjour")
detector.add_text(" le monde")
result = detector.compute_result()  # Language.FRENCH
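
The same three calls can be wrapped for any iterable of text chunks, such as lines read from a socket or a file. This is a minimal sketch; detect_chunks is a hypothetical helper name, and skipping empty chunks is a design choice of the sketch rather than a library requirement.

```python
from typing import Iterable

def detect_chunks(detector, chunks: Iterable[str]):
    """Reset the detector, feed it every non-empty chunk, and return the final result."""
    detector.clear_scores()
    for chunk in chunks:
        if chunk:  # nothing to score in an empty chunk
            detector.add_text(chunk)
    return detector.compute_result()
```

For example, detect_chunks(detector, ["Bonjour", " le monde"]) performs the same sequence of calls as the snippet above.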

Language boosts

When you have prior context (e.g. an HTTP Accept-Language header), you can bias detection toward expected languages:

boosts = model.build_boost_array({Language.FRENCH: 0.08})
lang = detector.detect("message", boosts)  # FRENCH
# Without the boost, "message" is ambiguous between English and French.
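
As a sketch of where boost values might come from, the function below parses an Accept-Language header into per-language boosts keyed by ISO code. The function name, the linear scaling by q-value, and the 0.08 ceiling (borrowed from the example above) are assumptions; the result still has to be converted to Language keys before it is passed to model.build_boost_array().

```python
def accept_language_boosts(header, max_boost=0.08):
    """Map primary language subtags to boosts scaled linearly by their q-values.

    The scaling rule and max_boost are illustrative assumptions.
    """
    boosts = {}
    for part in header.split(","):
        pieces = [p.strip() for p in part.split(";")]
        tag = pieces[0]
        if not tag or tag == "*":
            continue  # a wildcard carries no language preference
        quality = 1.0  # a missing q parameter means q=1
        for param in pieces[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        quality = min(max(quality, 0.0), 1.0)
        code = tag.split("-")[0].lower()  # "fr-CH" -> "fr"
        boost = round(max_boost * quality, 4)
        if boost > boosts.get(code, 0.0):
            boosts[code] = boost  # keep the strongest preference per language
    return boosts

print(accept_language_boosts("fr-CH, fr;q=0.9, en;q=0.5, *;q=0.1"))
# {'fr': 0.08, 'en': 0.04}
```

Keeping the boosts small, as in the 0.08 example, means they act as a tie-breaker for ambiguous input rather than overriding clear evidence in the text.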

Loading from a filesystem path

If you prefer to point directly at model data files instead of using the bundled package data:

model = Model.load_from_path("/path/to/models/lite", languages)

Choosing languages

Configure only the languages you actually need. Each additional language increases loading time and memory usage. Closely related languages can cross-detect on very short phrases -- for example, adding Luxembourgish when you only need German may cause short German phrases to be misidentified.

Group aliases are supported for convenience:

Alias	Languages
`efigs`	English, French, Italian, German, Spanish
`efigsnp`	EFIGS + Dutch, Portuguese
`europe_west_common`	EFIGSNP + Nordic languages
`europe_common`	Western + Eastern European + Cyrillic
`cjk`	Chinese (Simplified/Traditional), Japanese, Korean
`latin_alphabet`	All Latin-script languages
`unique_alphabet`	Languages where the script implies the language (e.g. Thai, Greek)

languages = Language.from_comma_separated("europe_west_common,cjk")

Lite vs. full model

Both models are trained from the same Wikipedia data but cropped at different probability floors:

	Lite	Full
Log-probability floor	-12	-15
Disk size (all languages)	~17 MB	~89 MB
Best for	Most use cases	Maximum accuracy when memory is not a concern
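
To illustrate why a lower floor yields a larger model, here is a toy sketch of cropping a probability table at a log-probability floor. It assumes natural logarithms and uses made-up probabilities; the actual training pipeline and log base are not described here.

```python
import math

def crop_at_floor(probabilities, floor):
    """Keep entries whose natural-log probability is at or above the floor."""
    return {
        key: math.log(p)
        for key, p in probabilities.items()
        if p > 0 and math.log(p) >= floor
    }

# Made-up probabilities: ln(1e-2) ~ -4.6, ln(1e-5) ~ -11.5, ln(1e-6) ~ -13.8, ln(1e-7) ~ -16.1
toy = {"the": 1e-2, "qua": 1e-5, "xzq": 1e-6, "zzq": 1e-7}
lite = crop_at_floor(toy, -12)
full = crop_at_floor(toy, -15)
print(len(lite), len(full))  # 2 3
```

Rare entries are the ones that get dropped, which is consistent with the lite model being much smaller while remaining adequate for most input.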

By default, Model.load() auto-discovers which model variant is available, preferring the full model. To force a variant:

model = Model.load_lite(languages)   # smaller, recommended for most use cases
model = Model.load_full(languages)   # higher accuracy, more memory

Getting the full model

The lite model is sufficient for most use cases. If you want the full model for maximum accuracy, install the companion package:

pip install "langidentify[full]"

This installs the langidentify-full-model package, which provides the full model data. Once installed, Model.load() will automatically prefer the full model, or you can request it explicitly:

model = Model.load_full(languages)

CJK detection

Chinese/Japanese disambiguation is handled by the cjclassifier package, which is installed automatically as a dependency. Korean uses the distinct Hangul script and is identified by alphabet alone.
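
Identification by alphabet alone can be sketched with Unicode ranges. The function below is an illustration, not the library's code: it measures what fraction of the letters in a string fall in the Hangul Syllables, Hangul Jamo, or Hangul Compatibility Jamo blocks.

```python
def hangul_ratio(text):
    """Return the fraction of alphabetic characters that are Hangul syllables or jamo."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hangul = sum(
        1 for ch in letters
        if "\uac00" <= ch <= "\ud7a3"    # Hangul Syllables
        or "\u1100" <= ch <= "\u11ff"    # Hangul Jamo
        or "\u3130" <= ch <= "\u318f"    # Hangul Compatibility Jamo
    )
    return hangul / len(letters)

print(hangul_ratio("안녕하세요"))  # 1.0
print(hangul_ratio("hello"))       # 0.0
```

Chinese and Japanese cannot be separated this way because they share Han characters, which is why a dedicated classifier is used for that pair.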

Thread safety

Model caches loaded data in a module-level dict protected by a lock. Detector is lightweight to construct and intentionally not thread-safe. For concurrent detection, use a separate instance per thread:

import threading

model = Model.load(languages)  # shared, thread-safe

local = threading.local()

def get_detector():
    if not hasattr(local, "detector"):
        local.detector = Detector(model)
    return local.detector

# In each thread:
lang = get_detector().detect(text)
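
The same pattern works with a thread pool. The sketch below keeps the detector factory as a parameter so that it does not depend on a loaded model; detect_many and make_detector are hypothetical names, and in practice the factory would be something like lambda: Detector(model).

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def detect_many(texts, make_detector, workers=4):
    """Detect each text on a thread pool, building one detector per worker thread."""
    local = threading.local()

    def detect_one(text):
        # Each worker thread lazily creates and then reuses its own detector.
        if not hasattr(local, "detector"):
            local.detector = make_detector()
        return local.detector.detect(text)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the order of the input texts.
        return list(pool.map(detect_one, texts))
```

Because the detectors live in thread-local storage, no detector is ever shared between threads, while the model itself is shared by all of them.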

Model load time

The language model must be loaded before the first detection. The initial load is the expensive part; subsequent accesses are served from the cache. Load only the languages you need: each additional language adds to both load time and memory, though the "unique_alphabet" languages are mostly free (e.g. Thai or Greek can be deduced from their alphabets).

Load time and memory (lite model)

Measured on a Mac M4 with Python 3.13. Memory figures are from tracemalloc (Python heap only; process RSS will be higher, since resident memory isn't aggressively reclaimed).

Language set	Languages	Load time	Memory	Ngram entries	Topword entries
`efigs`	5	~0.25s	~30 MB	92K	46K
`europe_west_common`	11	~0.7s	~55 MB	165K	91K
`all`	84	~5.5s	~300 MB	986K	604K

The full model uses roughly 5x more memory and takes proportionally longer to load. For many applications the lite model is recommended.
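
To take similar measurements on your own hardware, a small harness around time.perf_counter and tracemalloc is enough. This is a sketch, not the script used for the table above: measure_load is a hypothetical helper, and the loader argument would be something like lambda: Model.load(languages).

```python
import time
import tracemalloc

def measure_load(loader):
    """Call loader() once and report wall time plus traced Python-heap memory."""
    tracemalloc.start()
    try:
        start = time.perf_counter()
        result = loader()
        seconds = time.perf_counter() - start
        # current = memory still held after the call, peak = high-water mark
        current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    stats = {"seconds": seconds, "current_mb": current / 1e6, "peak_mb": peak / 1e6}
    return result, stats
```

Tracing allocations slows execution, so for timing figures it is better to run the loader a second time in a fresh process without tracemalloc. Because loaded data is cached, a repeat load in the same process will not reflect the cost of a cold load.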

Detection throughput

Detection runs at millions of words per second on a single core. Short phrases (1-5 words) are dominated by per-call overhead; longer text approaches peak throughput. Detector is lightweight to construct but not thread-safe, so use a separate instance per thread.
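
Throughput on your own data can be estimated with a simple timing loop. The helper below is a sketch under stated assumptions: words_per_second is a hypothetical name, words are counted by whitespace splitting, and detect is any callable such as detector.detect.

```python
import time

def words_per_second(detect, texts, repeats=100):
    """Time repeated calls to detect(text) and return whitespace-delimited words per second."""
    total_words = sum(len(t.split()) for t in texts) * repeats
    start = time.perf_counter()
    for _ in range(repeats):
        for text in texts:
            detect(text)
    elapsed = time.perf_counter() - start
    return total_words / elapsed if elapsed > 0 else float("inf")
```

Running it once with a list of short phrases and once with long paragraphs makes the per-call overhead described above visible as a difference in words per second.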

Requirements

Python 3.9+
cjclassifier (installed automatically)

License

Apache License 2.0 -- see LICENSE.

The bundled models contain statistical parameters derived from Wikipedia text. The models do not contain or reproduce Wikipedia text.

Release history

Version	Released	Notes
1.0.2	Mar 19, 2026	
1.0.1	Mar 19, 2026	This version
1.0.0	Mar 19, 2026	Yanked

