fast-langdetect 🚀
Quickly detect the language of text and segment text by language.
Overview
fast-langdetect is an ultra-fast, highly accurate language detection library built on FastText, a library developed by Facebook. It is 80x faster than conventional methods and delivers up to 95% accuracy.
- Supports Python 3.9 to 3.13.
- Works offline with the lite model.
- No numpy required (thanks to @dalf).
Background
This project builds upon zafercavdar/fasttext-langdetect with enhancements in packaging. For more information about the underlying model, see the official FastText documentation: Language Identification.
Memory note
The lite model runs offline and is memory-friendly; the full model is larger and offers higher accuracy.
Approximate memory usage (RSS after load):
- Lite: ~45–60 MB
- Full: ~170–210 MB
- Auto: tries full first, falls back to lite only on MemoryError.
Notes:
- Measurements vary by Python version, OS, allocator, and import graph; treat these as practical ranges.
- Validate on your system if constrained; see examples/memory_usage_check.py (credit: script by github@JackyHe398).
- Run memory checks in a clean terminal session. IDEs/REPLs may preload frameworks and inflate peak RSS (ru_maxrss), leading to very large peaks with near-zero deltas.
Choose the model that best fits your constraints.
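If you want to reproduce these numbers yourself, here is a minimal stdlib-only sketch of the measurement approach (the model-loading line is a placeholder to fill in; `peak_rss_mb` is an illustrative helper, not a library API):

```python
import resource

def peak_rss_mb() -> float:
    # ru_maxrss is reported in KB on Linux and bytes on macOS; this assumes Linux.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

before = peak_rss_mb()
# ... load a model here, e.g. detect("warm up", model='lite') ...
after = peak_rss_mb()
print(f"peak RSS delta: {after - before:.1f} MB")
```

Run this in a fresh `python` process, not a notebook, so the "before" number reflects only the interpreter.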
Installation 💻
To install fast-langdetect, you can use either pip or pdm:
Using pip
```shell
pip install fast-langdetect
```
Using pdm
```shell
pdm add fast-langdetect
```
Usage 🖥️
For higher accuracy, prefer the full model via detect(text, model='full'). For robust behavior under memory pressure, use detect(text, model='auto') which falls back to the lite model only on MemoryError.
Prerequisites
- If the sample is too long or too short, accuracy will be reduced.
- The model is downloaded to the system temporary directory by default. You can customize this by:
  - Setting the FTLANG_CACHE environment variable
  - Using LangDetectConfig(cache_dir="your/path")
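For example, the environment-variable route can be set up before importing the library (a stdlib-only sketch; the cache path shown is an arbitrary choice):

```python
import os
from pathlib import Path

# Create a writable cache directory and point FTLANG_CACHE at it
# before importing fast_langdetect, so model downloads land there.
cache = Path.home() / ".cache" / "fast_langdetect"
cache.mkdir(parents=True, exist_ok=True)
os.environ["FTLANG_CACHE"] = str(cache)

# from fast_langdetect import detect  # import after setting the variable
```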
Simple Usage (Recommended)
Select a model explicitly for clear, predictable behavior, and pass k to get multiple candidates. The function always returns a list of results:
```python
from fast_langdetect import detect

# Lite model (offline, smaller, faster) — never falls back
print(detect("Hello", model='lite', k=1))  # -> [{'lang': 'en', 'score': ...}]

# Full model (downloaded to cache, higher accuracy) — never falls back
print(detect("Hello", model='full', k=1))  # -> [{'lang': 'en', 'score': ...}]

# Auto mode: try full, fall back to lite only on MemoryError
print(detect("Hello", model='auto', k=1))  # -> [{'lang': 'en', 'score': ...}]

# Multilingual: top 3 candidates (always a list)
print(detect("Hello 世界 こんにちは", model='auto', k=3))
```
If you need a custom cache directory, pass LangDetectConfig:
```python
from fast_langdetect import LangDetectConfig, detect

cfg = LangDetectConfig(cache_dir="/custom/cache/path")
print(detect("Hello", model='full', config=cfg))

# Set a default model via config and let calls omit model
cfg_lite = LangDetectConfig(model="lite")
print(detect("Hello", config=cfg_lite))                # uses lite by default
print(detect("Bonjour", config=cfg_lite))              # uses lite by default
print(detect("Hello", model='full', config=cfg_lite))  # per-call override to full
```
Native API (Recommended)
```python
from fast_langdetect import detect, LangDetector, LangDetectConfig

# Simple detection (uses config default if not provided; defaults to 'auto')
print(detect("Hello, world!", k=1))
# Output: [{'lang': 'en', 'score': 0.98}]

# Using the full model for better accuracy
print(detect("Hello, world!", model='full', k=1))
# Output: [{'lang': 'en', 'score': 0.99}]

# Custom configuration
config = LangDetectConfig(cache_dir="/custom/cache/path", model="auto")  # custom cache + default model
detector = LangDetector(config)

# Omit model to use config.model; pass model to override
result = detector.detect("Hello world", k=1)
print(result)  # [{'lang': 'en', 'score': 0.98}]

# Multiline text is handled automatically (newlines are replaced)
multiline_text = "Hello, world!\nThis is a multiline text."
print(detect(multiline_text, k=1))
# Output: [{'lang': 'en', 'score': 0.85}]

# Multi-language detection
results = detect(
    "Hello 世界 こんにちは",
    model='auto',
    k=3,  # return top 3 languages (auto model loading)
)
print(results)
# Output: [
#   {'lang': 'ja', 'score': 0.4},
#   {'lang': 'zh', 'score': 0.3},
#   {'lang': 'en', 'score': 0.2}
# ]
```
Fallback Policy (Keep It Simple)
- Only MemoryError triggers fallback (in model='auto'): when loading the full model runs out of memory, detection falls back to the lite model.
- I/O, network, permission, path, and integrity errors raise standard exceptions (e.g., FileNotFoundError, PermissionError) or library-specific errors where applicable; there is no silent fallback.
- model='lite' and model='full' never fall back, by design.
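The policy boils down to a single try/except. Here is a stdlib-only sketch of the control flow with stand-in loader functions (`load_with_fallback`, `full_oom`, and `lite_ok` are illustrative, not library APIs):

```python
# Only MemoryError triggers the lite fallback; any other exception propagates.
def load_with_fallback(load_full, load_lite):
    try:
        return load_full()
    except MemoryError:
        return load_lite()

# Stand-in loaders to illustrate the control flow:
def full_oom():
    raise MemoryError("full model too large")

def lite_ok():
    return "lite-model"

print(load_with_fallback(full_oom, lite_ok))  # -> lite-model
```

A FileNotFoundError or PermissionError raised inside `load_full` would pass straight through to the caller, matching the "no silent fallback" rule above.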
Errors
- Base error: FastLangdetectError (library-specific failures).
- Model loading failures: ModelLoadError.
- Standard Python exceptions (e.g., ValueError, TypeError, FileNotFoundError, MemoryError) propagate when they are not library-specific.
Convenient detect_language Function
```python
from fast_langdetect import detect_language

# Single language detection
print(detect_language("Hello, world!"))
# Output: EN

print(detect_language("Привет, мир!"))
# Output: RU

print(detect_language("你好,世界!"))
# Output: ZH
```
Load Custom Models
```python
from fast_langdetect import LangDetectConfig, LangDetector

# Load a model from a local file
config = LangDetectConfig(custom_model_path="/path/to/your/model.bin")
detector = LangDetector(config)
result = detector.detect("Hello world", model='auto', k=1)
```
Splitting Text by Language 🌐
For text splitting based on language, please refer to the split-lang repository.
Input Handling
You can control log verbosity and input normalization via LangDetectConfig:
```python
from fast_langdetect import LangDetectConfig, LangDetector

config = LangDetectConfig(
    max_input_length=80,  # default: auto-truncate long inputs for stable results
)
detector = LangDetector(config)
print(detector.detect("Some very long text..."))
```
- Newlines are always replaced with spaces to avoid FastText errors (silent, no log).
- When truncation happens, a WARNING is logged because it may reduce accuracy.
- max_input_length=80 truncates overly long inputs; set it to None to disable truncation if you prefer.
Cache Directory Behavior
- Default cache: if cache_dir is not set, models are stored under a system temp-based directory specified by FTLANG_CACHE or an internal default. This directory is created automatically when needed.
- User-provided cache_dir: if you set LangDetectConfig(cache_dir=...) to a path that does not exist, the library raises FileNotFoundError instead of silently creating it or using another location. Create the directory yourself if that's intended.
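Because a missing user-provided directory raises FileNotFoundError, create it up front. A stdlib-only sketch (the directory name is an arbitrary example):

```python
import tempfile
from pathlib import Path

# Create the cache directory before handing it to LangDetectConfig;
# mkdir with exist_ok=True is idempotent, so this is safe to re-run.
cache_dir = Path(tempfile.gettempdir()) / "fast_langdetect_cache"
cache_dir.mkdir(parents=True, exist_ok=True)

# cfg = LangDetectConfig(cache_dir=str(cache_dir))  # now guaranteed to exist
```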
Advanced Options (Optional)
The constructor exposes a few advanced knobs (proxy, normalize_input, max_input_length). These are rarely needed for typical usage and can be ignored. Prefer detect(..., model=...) unless you know you need them.
Language Codes → English Names
The detector returns fastText language codes (e.g., en, zh, ja, pt-br). To present user-friendly names, you can map codes to English names using a third-party library. Example using langcodes:
```python
# pip install langcodes
from langcodes import Language

OVERRIDES = {
    # fastText-specific or variant tags commonly used
    "yue": "Cantonese",
    "wuu": "Wu Chinese",
    "arz": "Egyptian Arabic",
    "ckb": "Central Kurdish",
    "kab": "Kabyle",
    "zh-cn": "Chinese (China)",
    "zh-tw": "Chinese (Taiwan)",
    "pt-br": "Portuguese (Brazil)",
}

def code_to_english_name(code: str) -> str:
    code = code.replace("_", "-").lower()
    if code in OVERRIDES:
        return OVERRIDES[code]
    try:
        # Display name in English; e.g. 'Portuguese (Brazil)'
        return Language.get(code).display_name("en")
    except Exception:
        # Try the base language (e.g., 'pt' from 'pt-br')
        base = code.split("-")[0]
        try:
            return Language.get(base).display_name("en")
        except Exception:
            return code

# Usage
from fast_langdetect import detect

result = detect("Olá mundo", model='full', k=1)
print(code_to_english_name(result[0]["lang"]))  # Portuguese (Brazil) or Portuguese
```
Alternatively, pycountry can be used for ISO 639 lookups (install with pip install pycountry), combined with a small override dict for non-standard tags like pt-br, zh-cn, yue, etc.
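The pycountry route can be sketched as follows (`code_to_name` and the override entries are illustrative, not part of fast-langdetect; the lookup is wrapped so the function degrades to returning the raw code if pycountry is missing or the tag is unknown):

```python
OVERRIDES = {
    "pt-br": "Portuguese (Brazil)",
    "zh-cn": "Chinese (China)",
    "yue": "Cantonese",
}

def code_to_name(code: str) -> str:
    code = code.replace("_", "-").lower()
    if code in OVERRIDES:
        return OVERRIDES[code]
    try:
        import pycountry  # optional dependency
        lang = (pycountry.languages.get(alpha_2=code)
                or pycountry.languages.get(alpha_3=code))
        return lang.name if lang else code
    except Exception:
        return code  # fall back to the raw tag

print(code_to_name("pt-br"))  # -> Portuguese (Brazil)
```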
Benchmark 📊
For detailed benchmark results, refer to zafercavdar/fasttext-langdetect#benchmark.
References 📚
[1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification
```bibtex
@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}
```
[2] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models
```bibtex
@article{joulin2016fasttext,
  title={FastText.zip: Compressing text classification models},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, H{\'e}rve and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1612.03651},
  year={2016}
}
```
License 📄
- Code: Released under the MIT License (see LICENSE).
- Models: This package uses the pre-trained fastText language identification models (lid.176.ftz, bundled for offline use, and lid.176.bin, downloaded as needed). These models are licensed under the Creative Commons Attribution-ShareAlike 3.0 (CC BY-SA 3.0) license.
- Attribution: fastText language identification models by Facebook AI Research. See the fastText docs and license for details.
- Note: If you redistribute or modify the model files, you must comply with CC BY-SA 3.0. Inference usage via this library does not change the license of the model files themselves.