
Quickly detect text language and split text by language


fast-langdetect 🚀


Overview

fast-langdetect is an ultra-fast, highly accurate language detection library built on FastText, a library developed by Facebook. It is up to 80x faster than conventional methods and delivers up to 95% accuracy.

  • Supports Python 3.9 to 3.14.
  • Works offline with the lite model.
  • No numpy required (thanks to @dalf).

Background

This project builds upon zafercavdar/fasttext-langdetect with enhancements in packaging. For more information about the underlying model, see the official FastText documentation: Language Identification.

Memory note

The lite model runs offline and is memory-friendly; the full model is larger and offers higher accuracy.

Approximate memory usage (RSS after load):

  • Lite: ~45–60 MB
  • Full: ~170–210 MB
  • Auto: tries full first, falls back to lite only on MemoryError.

Notes:

  • Measurements vary by Python version, OS, allocator, and import graph; treat these as practical ranges.
  • Validate on your system if constrained; see examples/memory_usage_check.py (credit: script by github@JackyHe398).
  • Run memory checks in a clean terminal session. IDEs/REPLs may preload frameworks and inflate peak RSS (ru_maxrss), leading to very large peaks with near-zero deltas.

Choose the model that best fits your constraints.
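The notes above can be checked with a minimal before/after RSS probe. This is only a sketch using the standard resource module (Unix-only); the repository's examples/memory_usage_check.py is the fuller script.

```python
# Minimal RSS probe (Unix-only). Note: ru_maxrss is reported in KiB on Linux
# and in bytes on macOS, so interpret the delta accordingly.
import resource

def peak_rss() -> int:
    """Peak resident set size of the current process, in ru_maxrss units."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

before = peak_rss()
# ... load a model here, e.g. detect("hello", model="lite") ...
after = peak_rss()
print(f"peak RSS delta: {after - before}")
```

Run it in a fresh interpreter, per the notes above; preloaded frameworks inflate the baseline and can hide the model's real footprint.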

Installation 💻

To install fast-langdetect, you can use either pip or pdm:

Using pip

pip install fast-langdetect

Using pdm

pdm add fast-langdetect

Usage 🖥️

For higher accuracy, prefer the full model via detect(text, model='full'). For robust behavior under memory pressure, use detect(text, model='auto') which falls back to the lite model only on MemoryError.

Prerequisites

  • If the sample is too long (generally over 80 characters) or too short, accuracy will be reduced.
  • The model downloads to the system temporary directory by default. You can customize it by:
    • Setting FTLANG_CACHE environment variable
    • Using LangDetectConfig(cache_dir="your/path")
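Either option must be in place before the first model download. A sketch of the environment-variable route (the path shown is just an example, not a required location):

```python
import os

# Hypothetical cache location; any writable directory works.
cache_path = os.path.join(os.path.expanduser("~"), ".cache", "fast_langdetect")
os.environ["FTLANG_CACHE"] = cache_path  # set before the first model download
print(os.environ["FTLANG_CACHE"])
```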

Quick Start

from fast_langdetect import detect

print(detect("Hello, world!", model="auto", k=1))
print(detect("Hello 世界 こんにちは", model="auto", k=3))

detect always returns a list of candidates ordered by score. Use model="full" for the best accuracy or model="lite" for an offline-only workflow.
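As a sketch of consuming that list: the dict keys "lang" and "score" follow the result access used later in this README, and the sample data below stands in for a real detect() call.

```python
def top_language(candidates: list) -> str:
    """Return the language code of the highest-scoring candidate."""
    return max(candidates, key=lambda c: c["score"])["lang"]

# Stand-in for the output of detect("Hello world", model="auto", k=2).
sample = [{"lang": "en", "score": 0.62}, {"lang": "fr", "score": 0.11}]
print(top_language(sample))  # -> en
```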

Custom Configuration

from fast_langdetect import LangDetectConfig, LangDetector

config = LangDetectConfig(cache_dir="/custom/cache", model="lite")
detector = LangDetector(config)
print(detector.detect("Bonjour", k=1))
print(detector.detect("Hola", model="full", k=1))

Each LangDetector instance maintains its own in-memory model cache. Once loaded, models are reused for subsequent calls within the same instance. The global detect() function uses a shared default detector, so it also benefits from automatic caching.

Create a custom LangDetector instance when you need specific configuration (custom cache directory, input limits, etc.) or isolated model management.

🌵 Fallback Policy

Keep It Simple!

  • Only MemoryError triggers fallback (in model='auto'): when loading the full model runs out of memory, it falls back to the lite model.
  • I/O/network/permission/path/integrity errors raise standard exceptions (e.g., FileNotFoundError, PermissionError) or library-specific errors where applicable — no silent fallback.
  • model='lite' and model='full' never fallback by design.

Errors

  • Base error: FastLangdetectError (library-specific failures).
  • Model loading failures: ModelLoadError.
  • Standard Python exceptions (e.g., ValueError, TypeError, FileNotFoundError, MemoryError) propagate when they are not library-specific.
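A hedged sketch of handling that split. Stand-in exception classes are defined locally so the snippet runs without the library installed; in real code, import FastLangdetectError and ModelLoadError from fast_langdetect instead.

```python
# Stand-ins mirroring the error hierarchy described above.
class FastLangdetectError(Exception):
    """Base class for library-specific failures."""

class ModelLoadError(FastLangdetectError):
    """Raised when a model cannot be loaded."""

def safe_detect(text: str, detect_fn):
    try:
        return detect_fn(text)
    except ModelLoadError:
        return None  # library-specific failure: degrade gracefully
    # Standard exceptions (ValueError, FileNotFoundError, ...) propagate as-is.

def failing_detect(text: str):
    raise ModelLoadError("model file corrupted")

print(safe_detect("hola", failing_detect))  # -> None
```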

Splitting Text by Language 🌐

For text splitting based on language, please refer to the split-lang repository.

Input Handling

You can control log verbosity and input normalization via LangDetectConfig:

from fast_langdetect import LangDetectConfig, LangDetector

config = LangDetectConfig(max_input_length=200)
detector = LangDetector(config)
print(detector.detect("Some very long text..." * 5))
  • Newlines are always replaced with spaces to avoid FastText errors (silent, no log).
  • When truncation happens, a WARNING is logged because it may reduce accuracy.
  • The default max_input_length is 80 characters (optimal for accuracy); increase it if you need longer samples, or set None to disable truncation entirely.
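The bullets above can be mirrored in a standalone function. This is not the library's implementation, just the documented behavior restated as code:

```python
from typing import Optional

def normalize_input(text: str, max_input_length: Optional[int] = 80) -> str:
    """Mirror of the documented normalization: strip newlines, then truncate."""
    text = text.replace("\n", " ")  # always done, silently
    if max_input_length is not None and len(text) > max_input_length:
        text = text[:max_input_length]  # the library logs a WARNING here
    return text

print(len(normalize_input("x" * 200)))        # -> 80
print(len(normalize_input("x" * 200, None)))  # -> 200
```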

Cache Directory Behavior

  • Default cache: if cache_dir is not set, models are stored under a system temp-based directory specified by FTLANG_CACHE or an internal default. This directory is created automatically when needed.
  • User-provided cache_dir: if you set LangDetectConfig(cache_dir=...) to a path that does not exist, the library raises FileNotFoundError instead of silently creating or using another location. Create the directory yourself if that’s intended.
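If you want a custom location, create the directory first. A short sketch (the directory name is hypothetical):

```python
import os
import tempfile

# Hypothetical custom cache directory; create it before passing it to
# LangDetectConfig(cache_dir=...) so the library does not raise FileNotFoundError.
cache_dir = os.path.join(tempfile.gettempdir(), "my_langdetect_cache")
os.makedirs(cache_dir, exist_ok=True)
print(os.path.isdir(cache_dir))  # -> True
```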

Advanced Options (Optional)

The constructor exposes a few advanced knobs (proxy, normalize_input, max_input_length). These are rarely needed for typical usage and can be ignored. Prefer detect(..., model=...) unless you know you need them.

Language Codes → English Names

fastText reports BCP-47 style tags such as en, zh-cn, pt-br, yue. The detector keeps those codes so you can decide how to display them. Choose the approach that fits your product:

  • Small, fixed list? Maintain a hand-written mapping and fall back to the raw code for anything unexpected.
FASTTEXT_DISPLAY_NAMES = {
    "en": "English",
    "zh": "Chinese",
    "zh-cn": "Chinese (China)",
    "zh-tw": "Chinese (Taiwan)",
    "pt": "Portuguese",
    "pt-br": "Portuguese (Brazil)",
    "yue": "Cantonese",
    "wuu": "Wu Chinese",
    "arz": "Egyptian Arabic",
    "ckb": "Central Kurdish",
    "kab": "Kabyle",
}

def code_to_display_name(code: str) -> str:
    return FASTTEXT_DISPLAY_NAMES.get(code.lower(), code)

print(code_to_display_name("pt-br"))
print(code_to_display_name("de"))
  • Need coverage for all 176 fastText languages? Use a language database library that understands subtags and scripts. Two popular libraries are langcodes and pycountry.
# pip install langcodes
from langcodes import Language

LANG_OVERRIDES = {
    "pt-br": "Portuguese (Brazil)",
    "zh-cn": "Chinese (China)",
    "zh-tw": "Chinese (Taiwan)",
    "yue": "Cantonese",
}

def fasttext_to_name(code: str) -> str:
    normalized = code.replace("_", "-").lower()
    if normalized in LANG_OVERRIDES:
        return LANG_OVERRIDES[normalized]
    try:
        return Language.get(normalized).display_name("en")
    except Exception:
        base = normalized.split("-")[0]
        try:
            return Language.get(base).display_name("en")
        except Exception:
            return code

from fast_langdetect import detect
result = detect("Olá mundo", model="full", k=1)
print(fasttext_to_name(result[0]["lang"]))

pycountry works similarly (pip install pycountry). Use pycountry.languages.lookup("pt") for fuzzy matching or pycountry.languages.get(alpha_2="pt") for exact lookups, and pair it with a small override dictionary for non-standard tags such as pt-br, zh-cn, or dialect codes like yue.

# pip install pycountry
import pycountry

FASTTEXT_OVERRIDES = {
    "pt-br": "Portuguese (Brazil)",
    "zh-cn": "Chinese (China)",
    "zh-tw": "Chinese (Taiwan)",
    "yue": "Cantonese",
}

def fasttext_to_name_pycountry(code: str) -> str:
    normalized = code.replace("_", "-").lower()
    if normalized in FASTTEXT_OVERRIDES:
        return FASTTEXT_OVERRIDES[normalized]
    try:
        return pycountry.languages.lookup(normalized).name
    except LookupError:
        base = normalized.split("-")[0]
        try:
            return pycountry.languages.lookup(base).name
        except LookupError:
            return code

from fast_langdetect import detect
result = detect("Olá mundo", model="full", k=1)
print(fasttext_to_name_pycountry(result[0]["lang"]))

Load Custom Models

from importlib import resources
from fast_langdetect import LangDetectConfig, LangDetector

with resources.path("fast_langdetect.resources", "lid.176.ftz") as model_path:
    config = LangDetectConfig(custom_model_path=str(model_path))
    detector = LangDetector(config)
    print(detector.detect("Hello world", k=1))

When using a custom model via custom_model_path, the model parameter in detect() calls is ignored since your custom model file is always loaded directly. The model="lite", model="full", and model="auto" parameters only apply when using the built-in models.

Benchmark 📊

For detailed benchmark results, refer to zafercavdar/fasttext-langdetect#benchmark.

References 📚

[1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}

[2] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models

@article{joulin2016fasttext,
  title={FastText.zip: Compressing text classification models},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, H{\'e}rve and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1612.03651},
  year={2016}
}

License 📄

  • Code: Released under the MIT License (see LICENSE).
  • Models: This package uses the pre-trained fastText language identification models (lid.176.ftz bundled for offline use and lid.176.bin downloaded as needed). These models are licensed under the Creative Commons Attribution-ShareAlike 3.0 (CC BY-SA 3.0) license.
  • Attribution: fastText language identification models by Facebook AI Research. See the fastText documentation and license for details.
  • Note: If you redistribute or modify the model files, you must comply with CC BY-SA 3.0. Inference usage via this library does not change the license of the model files themselves.
