fast-langdetect 🚀
Overview
fast-langdetect is an ultra-fast and highly accurate language-detection library built on FastText, a library developed by Facebook. It is up to 80x faster than conventional methods and delivers up to 95% accuracy.
- Supports Python 3.9 to 3.14.
- Works offline with the lite model.
- No numpy required (thanks to @dalf).
Background
This project builds upon zafercavdar/fasttext-langdetect with enhancements in packaging. For more information about the underlying model, see the official FastText documentation: Language Identification.
Memory note
The lite model runs offline and is memory-friendly; the full model is larger and offers higher accuracy.
Approximate memory usage (RSS after load):
- Lite: ~45–60 MB
- Full: ~170–210 MB
- Auto: tries full first, falls back to lite only on MemoryError.
Notes:
- Measurements vary by Python version, OS, allocator, and import graph; treat these as practical ranges.
- Validate on your system if constrained; see examples/memory_usage_check.py (credit: script by github@JackyHe398).
- Run memory checks in a clean terminal session. IDEs/REPLs may preload frameworks and inflate peak RSS (ru_maxrss), leading to very large peaks with near-zero deltas.
Choose the model that best fits your constraints.
Installation 💻
To install fast-langdetect, you can use either pip or pdm:
Using pip
pip install fast-langdetect
Using pdm
pdm add fast-langdetect
Usage 🖥️
For higher accuracy, prefer the full model via detect(text, model='full'). For robust behavior under memory pressure, use detect(text, model='auto') which falls back to the lite model only on MemoryError.
Prerequisites
- If the sample is too long (generally over 80 characters) or too short, accuracy will be reduced.
- The model downloads to the system temporary directory by default. You can customize it by:
  - Setting the FTLANG_CACHE environment variable
  - Using LangDetectConfig(cache_dir="your/path")
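For example, the cache location can be pointed at an explicit directory through the environment variable. The path below is illustrative; set it before the library first loads a model:

```python
import os

# Point the model cache somewhere explicit (the path is illustrative).
# Set this before fast_langdetect first loads a model so the download
# lands in the directory you chose.
os.environ["FTLANG_CACHE"] = "/tmp/fast_langdetect_cache"
```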
Quick Start
from fast_langdetect import detect
print(detect("Hello, world!", model="auto", k=1))
print(detect("Hello 世界 こんにちは", model="auto", k=3))
detect always returns a list of candidates ordered by score. Use model="full" for the best accuracy or model="lite" for an offline-only workflow.
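A small helper can make the candidate list easier to consume downstream. Based on the examples in this README, each candidate is assumed to be a dict with "lang" and "score" keys, ordered best-first; the helper name and threshold are illustrative, not part of the library:

```python
def top_language(candidates, min_score=0.5):
    """Return the best language code from a detect()-style result list,
    or None when there is no candidate above the confidence threshold.

    Assumes each candidate is a dict like {"lang": "en", "score": 0.96},
    sorted best-first, as shown in this README's examples."""
    if not candidates:
        return None
    best = candidates[0]
    return best["lang"] if best["score"] >= min_score else None

# Demonstration with a hand-written candidate list (not a real detect() call):
print(top_language([{"lang": "en", "score": 0.96}, {"lang": "de", "score": 0.02}]))  # → en
```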
Custom Configuration
from fast_langdetect import LangDetectConfig, LangDetector
config = LangDetectConfig(cache_dir="/custom/cache", model="lite")
detector = LangDetector(config)
print(detector.detect("Bonjour", k=1))
print(detector.detect("Hola", model="full", k=1))
Each LangDetector instance maintains its own in-memory model cache. Once loaded, models are reused for subsequent calls within the same instance. The global detect() function uses a shared default detector, so it also benefits from automatic caching.
Create a custom LangDetector instance when you need specific configuration (custom cache directory, input limits, etc.) or isolated model management.
🌵 Fallback Policy
Keep It Simple!
- Only MemoryError triggers fallback (in model='auto'): when loading the full model runs out of memory, the library falls back to the lite model.
- I/O/network/permission/path/integrity errors raise standard exceptions (e.g., FileNotFoundError, PermissionError) or library-specific errors where applicable; there is no silent fallback.
- model='lite' and model='full' never fall back by design.
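The 'auto' policy can be pictured as the following sketch. The loader functions are placeholders, not the library's internals:

```python
def load_model_auto(load_full, load_lite):
    """Sketch of the 'auto' policy: only MemoryError triggers the fallback.

    load_full/load_lite are placeholder callables standing in for the
    library's internal model loaders."""
    try:
        return load_full()
    except MemoryError:
        # Out of memory while loading the full model: use the lite model.
        return load_lite()
    # Any other exception (FileNotFoundError, PermissionError, ...) propagates.
```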
Errors
- Base error: FastLangdetectError (library-specific failures).
- Model loading failures: ModelLoadError.
- Standard Python exceptions (e.g., ValueError, TypeError, FileNotFoundError, MemoryError) propagate when they are not library-specific.
Splitting Text by Language 🌐
For text splitting based on language, please refer to the split-lang repository.
Input Handling
You can control log verbosity and input normalization via LangDetectConfig:
from fast_langdetect import LangDetectConfig, LangDetector
config = LangDetectConfig(max_input_length=200)
detector = LangDetector(config)
print(detector.detect("Some very long text..." * 5))
- Newlines are always replaced with spaces to avoid FastText errors (silent, no log).
- When truncation happens, a WARNING is logged because it may reduce accuracy.
- The default max_input_length is 80 characters (optimal for accuracy); increase it if you need longer samples, or set it to None to disable truncation entirely.
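The input handling described above can be sketched as follows. This is an illustration of the documented behavior, not the library's own code:

```python
def normalize_input(text, max_input_length=80):
    """Sketch of the documented input handling (not the library's own code)."""
    # Newlines are always replaced with spaces (silent, no log).
    text = text.replace("\n", " ")
    # Truncation only applies when a limit is set; the library logs a
    # WARNING here because truncation may reduce accuracy.
    if max_input_length is not None and len(text) > max_input_length:
        text = text[:max_input_length]
    return text
```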
Cache Directory Behavior
- Default cache: if cache_dir is not set, models are stored under a system temp-based directory specified by FTLANG_CACHE or an internal default. This directory is created automatically when needed.
- User-provided cache_dir: if you set LangDetectConfig(cache_dir=...) to a path that does not exist, the library raises FileNotFoundError instead of silently creating or using another location. Create the directory yourself if that’s intended.
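A minimal sketch of creating the directory up front (the directory name is illustrative; a temp location is used so the snippet is portable):

```python
import tempfile
from pathlib import Path

# Create the cache directory yourself before handing it to
# LangDetectConfig(cache_dir=...); a missing user-provided cache_dir
# raises FileNotFoundError. The directory name here is illustrative.
cache_dir = Path(tempfile.gettempdir()) / "my_langdetect_cache"
cache_dir.mkdir(parents=True, exist_ok=True)
# config = LangDetectConfig(cache_dir=str(cache_dir))
```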
Advanced Options (Optional)
The constructor exposes a few advanced knobs (proxy, normalize_input, max_input_length). These are rarely needed for typical usage and can be ignored. Prefer detect(..., model=...) unless you know you need them.
Language Codes → English Names
fastText reports BCP-47 style tags such as en, zh-cn, pt-br, yue. The detector keeps those codes so you can decide how to display them. Choose the approach that fits your product:
- Small, fixed list? Maintain a hand-written mapping and fall back to the raw code for anything unexpected.
FASTTEXT_DISPLAY_NAMES = {
"en": "English",
"zh": "Chinese",
"zh-cn": "Chinese (China)",
"zh-tw": "Chinese (Taiwan)",
"pt": "Portuguese",
"pt-br": "Portuguese (Brazil)",
"yue": "Cantonese",
"wuu": "Wu Chinese",
"arz": "Egyptian Arabic",
"ckb": "Central Kurdish",
"kab": "Kabyle",
}
def code_to_display_name(code: str) -> str:
return FASTTEXT_DISPLAY_NAMES.get(code.lower(), code)
print(code_to_display_name("pt-br"))
print(code_to_display_name("de"))
- Need coverage for all 176 fastText languages? Use a language database library that understands subtags and scripts. Two popular libraries are langcodes and pycountry.
# pip install langcodes
from langcodes import Language
LANG_OVERRIDES = {
"pt-br": "Portuguese (Brazil)",
"zh-cn": "Chinese (China)",
"zh-tw": "Chinese (Taiwan)",
"yue": "Cantonese",
}
def fasttext_to_name(code: str) -> str:
normalized = code.replace("_", "-").lower()
if normalized in LANG_OVERRIDES:
return LANG_OVERRIDES[normalized]
try:
return Language.get(normalized).display_name("en")
except Exception:
base = normalized.split("-")[0]
try:
return Language.get(base).display_name("en")
except Exception:
return code
from fast_langdetect import detect
result = detect("Olá mundo", model="full", k=1)
print(fasttext_to_name(result[0]["lang"]))
pycountry works similarly (pip install pycountry). Use pycountry.languages.lookup("pt") for fuzzy matching or pycountry.languages.get(alpha_2="pt") for exact lookups, and pair it with a small override dictionary for non-standard tags such as pt-br, zh-cn, or dialect codes like yue.
# pip install pycountry
import pycountry
FASTTEXT_OVERRIDES = {
"pt-br": "Portuguese (Brazil)",
"zh-cn": "Chinese (China)",
"zh-tw": "Chinese (Taiwan)",
"yue": "Cantonese",
}
def fasttext_to_name_pycountry(code: str) -> str:
normalized = code.replace("_", "-").lower()
if normalized in FASTTEXT_OVERRIDES:
return FASTTEXT_OVERRIDES[normalized]
try:
return pycountry.languages.lookup(normalized).name
except LookupError:
base = normalized.split("-")[0]
try:
return pycountry.languages.lookup(base).name
except LookupError:
return code
from fast_langdetect import detect
result = detect("Olá mundo", model="full", k=1)
print(fasttext_to_name_pycountry(result[0]["lang"]))
Load Custom Models
from importlib import resources
from fast_langdetect import LangDetectConfig, LangDetector
with resources.path("fast_langdetect.resources", "lid.176.ftz") as model_path:
config = LangDetectConfig(custom_model_path=str(model_path))
detector = LangDetector(config)
print(detector.detect("Hello world", k=1))
When using a custom model via custom_model_path, the model parameter in detect() calls is ignored since your custom model file is always loaded directly. The model="lite", model="full", and model="auto" parameters only apply when using the built-in models.
Benchmark 📊
For detailed benchmark results, refer to zafercavdar/fasttext-langdetect#benchmark.
References 📚
[1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification
@article{joulin2016bag,
title={Bag of Tricks for Efficient Text Classification},
author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
journal={arXiv preprint arXiv:1607.01759},
year={2016}
}
[2] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models
@article{joulin2016fasttext,
title={FastText.zip: Compressing text classification models},
author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, H{\'e}rve and Mikolov, Tomas},
journal={arXiv preprint arXiv:1612.03651},
year={2016}
}
License 📄
- Code: Released under the MIT License (see LICENSE).
- Models: This package uses the pre-trained fastText language identification models (lid.176.ftz bundled for offline use and lid.176.bin downloaded as needed). These models are licensed under the Creative Commons Attribution-ShareAlike 3.0 (CC BY-SA 3.0) license.
- Attribution: fastText language identification models by Facebook AI Research. See the fastText docs and license for details.
- Note: If you redistribute or modify the model files, you must comply with CC BY-SA 3.0. Inference usage via this library does not change the license of the model files themselves.