Quickly detect text language and segment language

fast-langdetect 🚀

Overview

fast-langdetect is a fast, accurate language detection library built on FastText, a library developed by Facebook. It is 80x faster than conventional methods and delivers up to 95% accuracy.

  • Supports Python 3.9 to 3.13.
  • Works offline with the lite model.
  • No numpy required (thanks to @dalf).

Background

This project builds upon zafercavdar/fasttext-langdetect with enhancements in packaging. For more information about the underlying model, see the official FastText documentation: Language Identification.

Memory note

The lite model runs offline and is memory-friendly; the full model is larger and offers higher accuracy.

Approximate memory usage (RSS after load):

  • Lite: ~45–60 MB
  • Full: ~170–210 MB
  • Auto: tries full first, falls back to lite only on MemoryError.

Notes:

  • Measurements vary by Python version, OS, allocator, and import graph; treat these as practical ranges.
  • Validate on your system if memory is constrained; see examples/memory_usage_check.py (credit: script by github@JackyHe398).
  • Run memory checks in a clean terminal session. IDEs/REPLs may preload frameworks and inflate peak RSS (ru_maxrss), leading to very large peaks with near-zero deltas.
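
The clean-session advice above can be tried with a small standard-library sketch. This is not the project's examples/memory_usage_check.py; the helper names peak_rss_mb and peak_rss_delta_mb are made up for illustration, and the code assumes a Unix system, where the resource module reports ru_maxrss in kilobytes on Linux and in bytes on macOS.

```python
import resource
import sys


def peak_rss_mb() -> float:
    """Return this process's peak resident set size in MB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def peak_rss_delta_mb(load):
    """Call load() and return (result, growth of peak RSS in MB)."""
    before = peak_rss_mb()
    result = load()
    return result, peak_rss_mb() - before


if __name__ == "__main__":
    # Stand-in workload; replace the lambda with your own model-loading call.
    _, delta = peak_rss_delta_mb(lambda: bytearray(50 * 1024 * 1024))
    print(f"peak RSS grew by ~{delta:.1f} MB")
```

If the process had already peaked higher before the call (for example inside an IDE that preloads frameworks), the delta stays near zero, which is the effect described in the note above.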

Choose the model that best fits your constraints.

Installation 💻

To install fast-langdetect, you can use either pip or pdm:

Using pip

pip install fast-langdetect

Using pdm

pdm add fast-langdetect

Usage 🖥️

For higher accuracy, prefer the full model via detect(text, model='full'). For robust behavior under memory pressure, use detect(text, model='auto'), which falls back to the lite model only on MemoryError.

Prerequisites

  • Accuracy drops if the sample is too long or too short.
  • The model is downloaded to the system temporary directory by default. You can change the location by:
    • Setting FTLANG_CACHE environment variable
    • Using LangDetectConfig(cache_dir="your/path")

Simple Usage (Recommended)

Select the model explicitly for clear, predictable behavior, and use k to get multiple candidates. The function always returns a list of results:

from fast_langdetect import detect

# Lite model (offline, smaller, faster) — never falls back
print(detect("Hello", model='lite', k=1))          # -> [{'lang': 'en', 'score': ...}]

# Full model (downloaded to cache, higher accuracy) — never falls back
print(detect("Hello", model='full', k=1))          # -> [{'lang': 'en', 'score': ...}]

# Auto mode: try full, fall back to lite only on MemoryError
print(detect("Hello", model='auto', k=1))          # -> [{'lang': 'en', 'score': ...}]

# Multilingual: top 3 candidates (always a list)
print(detect("Hello 世界 こんにちは", model='auto', k=3))

If you need a custom cache directory, pass LangDetectConfig:

from fast_langdetect import LangDetectConfig, detect

cfg = LangDetectConfig(cache_dir="/custom/cache/path")
print(detect("Hello", model='full', config=cfg))

# Set a default model via config and let calls omit model
cfg_lite = LangDetectConfig(model="lite")
print(detect("Hello", config=cfg_lite))          # uses lite by default
print(detect("Bonjour", config=cfg_lite))        # uses lite by default
print(detect("Hello", model='full', config=cfg_lite))  # per-call override to full

Native API (Recommended)

from fast_langdetect import detect, LangDetector, LangDetectConfig

# Simple detection (uses config default if not provided; defaults to 'auto')
print(detect("Hello, world!", k=1))
# Output: [{'lang': 'en', 'score': 0.98}]

# Using full model for better accuracy
print(detect("Hello, world!", model='full', k=1))
# Output: [{'lang': 'en', 'score': 0.99}]

# Custom configuration
config = LangDetectConfig(cache_dir="/custom/cache/path", model="auto")  # Custom cache + default model
detector = LangDetector(config)

# Omit model to use config.model; pass model to override
result = detector.detect("Hello world", k=1)
print(result)  # [{'lang': 'en', 'score': 0.98}]

# Multiline text is handled automatically (newlines are replaced)
multiline_text = "Hello, world!\nThis is a multiline text."
print(detect(multiline_text, k=1))
# Output: [{'lang': 'en', 'score': 0.85}]

# Multi-language detection
results = detect(
    "Hello 世界 こんにちは",
    model='auto',
    k=3               # Return top 3 languages (auto model loading)
)
print(results)
# Output: [
#     {'lang': 'ja', 'score': 0.4}, 
#     {'lang': 'zh', 'score': 0.3}, 
#     {'lang': 'en', 'score': 0.2}
# ]
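
Because the result is always a list of dicts with lang and score keys, post-processing is plain list handling. A minimal sketch: confident_langs and the 0.25 threshold are illustrative choices rather than part of the library, and the sample list simply mirrors the example output above.

```python
from typing import Dict, List, Union

Result = Dict[str, Union[str, float]]


def confident_langs(results: List[Result], min_score: float = 0.25) -> List[str]:
    """Return language codes whose score is at least min_score, best first."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [r["lang"] for r in ranked if r["score"] >= min_score]


# Sample shaped like the output shown above (scores are illustrative).
sample = [
    {"lang": "ja", "score": 0.4},
    {"lang": "zh", "score": 0.3},
    {"lang": "en", "score": 0.2},
]
print(confident_langs(sample))  # ['ja', 'zh']
```

A threshold like this is one way to decide whether a text should be treated as mixed-language; the right cutoff depends on your data.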

Fallback Policy (Keep It Simple)

  • Only MemoryError triggers fallback (in model='auto'): when loading the full model runs out of memory, it falls back to the lite model.
  • I/O/network/permission/path/integrity errors raise standard exceptions (e.g., FileNotFoundError, PermissionError) or library-specific errors where applicable — no silent fallback.
  • model='lite' and model='full' never fall back, by design.
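
The policy above can be pictured as a narrow try/except. The following is a sketch of the pattern only, not the library's internal code; load_auto and the two loader callables are hypothetical names used for illustration.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def load_auto(load_full: Callable[[], T], load_lite: Callable[[], T]) -> Tuple[T, str]:
    """Try the full loader; fall back to the lite loader only on MemoryError."""
    try:
        return load_full(), "full"
    except MemoryError:
        # Only out-of-memory triggers the fallback. Any other exception
        # (FileNotFoundError, PermissionError, ...) propagates to the caller.
        return load_lite(), "lite"


def full_out_of_memory():
    """Simulate a full-model load that runs out of memory."""
    raise MemoryError("simulated: not enough memory for the full model")


model, used = load_auto(full_out_of_memory, lambda: "lite-model")
print(used)  # lite
```

Catching only MemoryError keeps configuration mistakes such as a bad path or missing permissions visible instead of hiding them behind a quieter, less accurate model.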

Errors

  • Base error: FastLangdetectError (library-specific failures).
  • Model loading failures: ModelLoadError.
  • Standard Python exceptions (e.g., ValueError, TypeError, FileNotFoundError, MemoryError) propagate when they are not library-specific.

Convenient detect_language Function

from fast_langdetect import detect_language

# Single language detection
print(detect_language("Hello, world!"))
# Output: EN

print(detect_language("Привет, мир!"))
# Output: RU

print(detect_language("你好,世界!"))
# Output: ZH

Load Custom Models

from fast_langdetect import LangDetectConfig, LangDetector

# Load a model from a local file
config = LangDetectConfig(custom_model_path="/path/to/your/model.bin")
detector = LangDetector(config)
result = detector.detect("Hello world", model='auto', k=1)

Splitting Text by Language 🌐

For text splitting based on language, please refer to the split-lang repository.

Input Handling

You can control log verbosity and input normalization via LangDetectConfig:

from fast_langdetect import LangDetectConfig, LangDetector

config = LangDetectConfig(
    max_input_length=80,    # default: auto-truncate long inputs for stable results
)
detector = LangDetector(config)
print(detector.detect("Some very long text..."))

  • Newlines are always replaced with spaces to avoid FastText errors (silent, no log).
  • When truncation happens, a WARNING is logged because it may reduce accuracy.
  • max_input_length=80 truncates overly long inputs; set it to None to disable truncation.
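
The behavior in these bullets can be approximated in a few lines of plain Python. This sketch only mirrors the description above and is not the library's implementation; the function name normalize_input, the logger name, and the warning text are assumptions.

```python
import logging
from typing import Optional

logger = logging.getLogger("langdetect_sketch")


def normalize_input(text: str, max_input_length: Optional[int] = 80) -> str:
    """Replace newlines with spaces, then truncate if a limit is set."""
    # Newline replacement is silent.
    text = text.replace("\n", " ")
    if max_input_length is not None and len(text) > max_input_length:
        # Truncation may reduce accuracy, so it is logged as a warning.
        logger.warning("Input truncated to %d characters", max_input_length)
        text = text[:max_input_length]
    return text


print(normalize_input("Hello, world!\nThis is a multiline text."))
# Hello, world! This is a multiline text.
print(len(normalize_input("x" * 200)))                         # 80
print(len(normalize_input("x" * 200, max_input_length=None)))  # 200
```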

Cache Directory Behavior

  • Default cache: if cache_dir is not set, models are stored in the directory given by FTLANG_CACHE or, if that is unset, in an internal default under the system temp directory. This directory is created automatically when needed.
  • User-provided cache_dir: if you set LangDetectConfig(cache_dir=...) to a path that does not exist, the library raises FileNotFoundError instead of silently creating or using another location. Create the directory yourself if that’s intended.
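
Since a missing user-provided cache_dir raises FileNotFoundError, one option is to create the directory before building the config. A minimal sketch: ensure_cache_dir is a hypothetical helper, and the demo path under the system temp directory is only an example.

```python
import tempfile
from pathlib import Path


def ensure_cache_dir(path: str) -> str:
    """Create path (and any missing parents) and return it as a string."""
    cache = Path(path)
    cache.mkdir(parents=True, exist_ok=True)
    return str(cache)


# Demo location under the system temp directory; use your own path in practice.
cache_dir = ensure_cache_dir(str(Path(tempfile.gettempdir()) / "ftlang-cache-demo"))
print(Path(cache_dir).is_dir())  # True

# With the library installed, the directory can then be passed to the config:
# from fast_langdetect import LangDetectConfig, detect
# cfg = LangDetectConfig(cache_dir=cache_dir)
# print(detect("Hello", model="full", config=cfg))
```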

Advanced Options (Optional)

The LangDetectConfig constructor exposes a few advanced knobs (proxy, normalize_input, max_input_length). These are rarely needed for typical usage and can be ignored. Prefer detect(..., model=...) unless you know you need them.

Language Codes → English Names

The detector returns fastText language codes (e.g., en, zh, ja, pt-br). To present user-friendly names, you can map codes to English names using a third-party library. Example using langcodes:

# pip install langcodes
from langcodes import Language

OVERRIDES = {
    # fastText-specific or variant tags commonly used
    "yue": "Cantonese",
    "wuu": "Wu Chinese",
    "arz": "Egyptian Arabic",
    "ckb": "Central Kurdish",
    "kab": "Kabyle",
    "zh-cn": "Chinese (China)",
    "zh-tw": "Chinese (Taiwan)",
    "pt-br": "Portuguese (Brazil)",
}

def code_to_english_name(code: str) -> str:
    code = code.replace("_", "-").lower()
    if code in OVERRIDES:
        return OVERRIDES[code]
    try:
        # Display name in English; e.g. 'Portuguese (Brazil)'
        return Language.get(code).display_name("en")
    except Exception:
        # Try the base language (e.g., 'pt' from 'pt-br')
        base = code.split("-")[0]
        try:
            return Language.get(base).display_name("en")
        except Exception:
            return code

# Usage
from fast_langdetect import detect
result = detect("Olá mundo", model='full', k=1)
print(code_to_english_name(result[0]["lang"]))  # Portuguese (Brazil) or Portuguese

Alternatively, pycountry can be used for ISO 639 lookups (install with pip install pycountry), combined with a small override dict for non-standard tags like pt-br, zh-cn, yue, etc.

Benchmark 📊

For detailed benchmark results, refer to zafercavdar/fasttext-langdetect#benchmark.

References 📚

[1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}

[2] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models

@article{joulin2016fasttext,
  title={FastText.zip: Compressing text classification models},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, Herv{\'e} and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1612.03651},
  year={2016}
}

License 📄

  • Code: Released under the MIT License (see LICENSE).
  • Models: This package uses the pre-trained fastText language identification models (lid.176.ftz bundled for offline use and lid.176.bin downloaded as needed). These models are licensed under the Creative Commons Attribution-ShareAlike 3.0 (CC BY-SA 3.0) license.
  • Attribution: fastText language identification models by Facebook AI Research. See the fastText docs and license for details.
  • Note: If you redistribute or modify the model files, you must comply with CC BY-SA 3.0. Inference usage via this library does not change the license of the model files themselves.
