
Quickly detect text language and segment text by language

Project description

fast-langdetect 🚀


Overview

fast-langdetect provides ultra-fast and highly accurate language detection based on FastText, a library developed by Facebook. This package is 80x faster than traditional methods and offers 95% accuracy.

It supports Python versions 3.9 to 3.12.

Supports offline usage.

This project builds upon zafercavdar/fasttext-langdetect with enhancements in packaging.

For more information on the underlying FastText model, refer to the official documentation: FastText Language Identification.

[!NOTE] This library requires over 200 MB of memory even when used in low-memory mode.

Installation 💻

To install fast-langdetect, you can use either pip or pdm:

Using pip

pip install fast-langdetect

Using pdm

pdm add fast-langdetect
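If you track dependencies in a requirements file, a minimal entry is sketched below; the version shown is simply the release described on this page and is only an example pin, so adjust it to your own needs.

# requirements.txt (example pin only)
fast-langdetect>=0.2.2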

Usage 🖥️

For optimal performance and accuracy in language detection, use detect(text, low_memory=False) to load the larger model.

The model will be downloaded to the /tmp/fasttext-langdetect directory upon first use.
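As a quick sketch of that recommendation (using only the detect call documented below; the sample sentence is arbitrary), the first call with low_memory=False triggers the one-time download of the larger model:

from fast_langdetect import detect

# The first call with low_memory=False downloads the larger FastText model
# (cached under /tmp/fasttext-langdetect) and reuses it for later calls.
result = detect("Bonjour tout le monde", low_memory=False)
print(result["lang"], result["score"])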

Native API (Recommended)

[!NOTE] This function expects a single line of text; remove \n characters before passing text in. If the sample is too long or too short, accuracy will drop (for example, very short Chinese text may be misidentified as Japanese).

from fast_langdetect import detect, detect_multilingual

# Single language detection
print(detect("Hello, world!"))
# Output: {'lang': 'en', 'score': 0.12450417876243591}

# `use_strict_mode` controls whether model loading enforces strict conditions instead of falling back.
# If `use_strict_mode` is True, only the selected model is loaded; no fallback model is used.
print(detect("Hello, world!", low_memory=False, use_strict_mode=True))

# How to handle multiline text
multiline_text = """
Hello, world!
This is a multiline text.
We need to remove the `\n` characters, or detect() will raise a ValueError.
"""
multiline_text = multiline_text.replace("\n", "")  # NOTE: it is important to remove \n characters
print(detect(multiline_text))
# Output: {'lang': 'en', 'score': 0.8509423136711121}

print(detect("Привет, мир!")["lang"])
# Output: ru

# Multi-language detection
print(detect_multilingual("Hello, world!你好世界!Привет, мир!"))
# Output: [{'lang': 'ja', 'score': 0.32009604573249817}, {'lang': 'uk', 'score': 0.27781224250793457}, {'lang': 'zh', 'score': 0.17542070150375366}, {'lang': 'sr', 'score': 0.08751443773508072}, {'lang': 'bg', 'score': 0.05222449079155922}]

# Multi-language detection with low memory mode disabled
print(detect_multilingual("Hello, world!你好世界!Привет, мир!", low_memory=False))
# Output: [{'lang': 'ru', 'score': 0.39008623361587524}, {'lang': 'zh', 'score': 0.18235979974269867}, {'lang': 'ja', 'score': 0.08473210036754608}, {'lang': 'sr', 'score': 0.057975586503744125}, {'lang': 'en', 'score': 0.05422825738787651}]
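Because detect() expects newline-free input (see the note above), a small wrapper that normalizes text before detection can be handy. The helper below is an illustrative sketch, not part of the library's API:

from fast_langdetect import detect

def detect_normalized(text: str, low_memory: bool = True) -> dict:
    # Hypothetical helper: join lines with spaces so detect() receives a single line.
    single_line = " ".join(text.splitlines())
    return detect(single_line, low_memory=low_memory)

print(detect_normalized("Hello, world!\nThis is a multiline text."))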

Convenient detect_language Function

from fast_langdetect import detect_language

# Single language detection
print(detect_language("Hello, world!"))
# Output: EN

print(detect_language("Привет, мир!"))
# Output: RU

print(detect_language("你好,世界!"))
# Output: ZH
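As a small usage sketch built only on the detect_language call shown above (the sample strings are the ones from this section), inputs can be grouped by their detected language code:

from collections import defaultdict

from fast_langdetect import detect_language

samples = ["Hello, world!", "Привет, мир!", "你好,世界!"]

# Bucket each string under the upper-cased language code returned by detect_language
by_lang = defaultdict(list)
for text in samples:
    by_lang[detect_language(text)].append(text)

print(dict(by_lang))  # e.g. {'EN': ['Hello, world!'], 'RU': [...], 'ZH': [...]}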

Splitting Text by Language 🌐

For text splitting based on language, please refer to the split-lang repository.

Benchmark 📊

For detailed benchmark results, refer to zafercavdar/fasttext-langdetect#benchmark.

References 📚

[1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}

[2] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models

@article{joulin2016fasttext,
  title={FastText.zip: Compressing text classification models},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, H{\'e}rve and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1612.03651},
  year={2016}
}



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fast_langdetect-0.2.2.tar.gz (788.1 kB)

Uploaded Source

Built Distribution

fast_langdetect-0.2.2-py3-none-any.whl (786.3 kB)

Uploaded Python 3

File details

Details for the file fast_langdetect-0.2.2.tar.gz.

File metadata

  • Download URL: fast_langdetect-0.2.2.tar.gz
  • Upload date:
  • Size: 788.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: pdm/2.19.1 CPython/3.10.12 Linux/6.8.0-1014-azure

File hashes

Hashes for fast_langdetect-0.2.2.tar.gz

  • SHA256: 7efcf12321782dda2aaca69a7a32bbff8fedb4ab144a3352037d74e44971de7d
  • MD5: 308f3e333956ede2a33ef2bd3ebe3e3a
  • BLAKE2b-256: f23f217100346b803e68ebb5974d6162d771e765615471be6de4b9ec769593e7


File details

Details for the file fast_langdetect-0.2.2-py3-none-any.whl.

File metadata

  • Download URL: fast_langdetect-0.2.2-py3-none-any.whl
  • Upload date:
  • Size: 786.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: pdm/2.19.1 CPython/3.10.12 Linux/6.8.0-1014-azure

File hashes

Hashes for fast_langdetect-0.2.2-py3-none-any.whl

  • SHA256: 7339f845832d25f421ce6405afce97d1f7cd168ea62c8cfeb9c63bba5d3f1db6
  • MD5: 05c1fb7359784637b38dc22f71c49d04
  • BLAKE2b-256: 07f204ce971c41bead1027d0a12b2398f4c2d3b3c73fa9e1f674da2c6cb0e3e8

