

Efficient Language Detector


Efficient Language Detector (Nito-ELD or ELD) is a fast and accurate language detector. It is one of the fastest non-compiled detectors, while its accuracy is within the range of the heaviest and slowest detectors.

It's 100% Python, easy to install, with no dependencies other than regex.
ELD is also available in Javascript and PHP.

This is the first version of a port of the original PHP version; the structure might not be definitive and the code can still be optimized. My knowledge of Python is basic, so feel free to suggest improvements.

  1. Installation
  2. How to use
  3. Benchmarks
  4. Languages

Installation

$ pip install eld

Alternatively, downloading or cloning the files also works, after adjusting the import path.

How to use?

from eld import LanguageDetector
detector = LanguageDetector()

detect() expects a UTF-8 string and returns an object with a language attribute, which is either an ISO 639-1 code or None.

print(detector.detect('Hola, cómo te llamas?'))
# Object { language: "es", scores(): {"es": 0.53, "et": 0.21, ...}, is_reliable(): True }
# Object { language: None|str, scores(): None|dict, is_reliable(): bool }

print(detector.detect('Hola, cómo te llamas?').language)
# "es"

# if clean_text(True), detect() removes URLs, domains, emails, alphanumerical strings & numbers
detector.clean_text(True)  # Default is False
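As a rough illustration of the kind of cleaning clean_text(True) enables, a stdlib-only sketch could look like the following. This is not ELD's actual implementation; the function name and the patterns are hypothetical:

```python
import re

def clean_text_sketch(text):
    """Illustrative cleaning pass: strip URLs, emails, bare domains,
    and alphanumerical tokens / numbers before detection."""
    text = re.sub(r'https?://\S+', ' ', text)                   # URLs
    text = re.sub(r'\S+@\S+\.\S+', ' ', text)                   # emails
    text = re.sub(r'\b[a-zA-Z0-9-]+\.[a-z]{2,}\b', ' ', text)   # bare domains
    text = re.sub(r'\b\w*\d\w*\b', ' ', text)                   # tokens with digits
    return re.sub(r'\s+', ' ', text).strip()

print(clean_text_sketch('Visita https://example.com, código ab12, año 2024'))
```

The goal of such a pass is to keep only natural-language characters, since URLs and numbers carry no language signal and can skew the n-gram scores.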
To reduce the languages to be detected, there are 3 different options; they only need to be executed once. (Check available languages below.)
lang_subset = ['en', 'es', 'fr', 'it', 'nl', 'de']

# Option 1
# with dynamic_lang_subset(), detect() executes normally, and then filters excluded languages
detector.dynamic_lang_subset(lang_subset)
# Returns an object with a list named 'languages' containing the validated languages, or None

# Option 2. lang_subset() will first remove the excluded languages from the n-grams database
# For a single detection it is slower than dynamic_lang_subset(), but for several it is faster
# If the save option is True (default), the new n-grams subset is stored and loaded on the next call
detector.lang_subset(lang_subset)  # lang_subset(langs, save=True)
# Returns object {success: True, languages: ['de', 'en', ...], error: None, file: 'ngramsM60...'}

# To remove either dynamic_lang_subset() or lang_subset(), call the methods with None as argument
detector.lang_subset(None)

# Finally, the optimal way to regularly use a language subset: create the instance with a file
# The file argument can be a subset created by lang_subset() or another database such as 'ngramsL60'
langSubsetDetect = LanguageDetector('ngramsL60')
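A toy sketch of the difference between the two subset approaches, using hypothetical data rather than ELD's internals: dynamic_lang_subset() filters the full result after detection, while lang_subset() prunes the n-gram database itself, so excluded languages are never scored at all.

```python
# Hypothetical full scores produced by one detection
scores = {'es': 0.53, 'et': 0.21, 'pt': 0.15, 'fi': 0.11}
subset = {'en', 'es', 'fr', 'it', 'nl', 'de'}

# dynamic_lang_subset() style: detect everything, then filter the result
dynamic = {lang: s for lang, s in scores.items() if lang in subset}
print(dynamic)  # {'es': 0.53}

# lang_subset() style: prune the database first, so excluded languages
# never consume scoring time on later detections
ngrams = {'ol': {'es': 5, 'et': 3}, 'la': {'es': 4, 'pt': 2}}
pruned = {g: {l: f for l, f in langs.items() if l in subset}
          for g, langs in ngrams.items()}
print(pruned)  # {'ol': {'es': 5}, 'la': {'es': 4}}
```

This is why pruning is slower for a single detection (the database must be rebuilt once) but faster for many.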

Benchmarks

I compared ELD with a variety of detectors, since the interesting part is the algorithm.

| URL | Version | Language |
|-----|---------|----------|
| https://github.com/nitotm/efficient-language-detector-py/ | 0.9.0 | Python |
| https://github.com/nitotm/efficient-language-detector/ | 1.0.0 | PHP |
| https://github.com/pemistahl/lingua-py | 1.3.2 | Python |
| https://github.com/CLD2Owners/cld2 | Aug 21, 2015 | C++ |
| https://github.com/google/cld3 | Aug 28, 2020 | C++ |
| https://github.com/wooorm/franc | 6.1.0 | Javascript |

Benchmarks: Tweets: 760KB, short sentences of 140 characters max; Big test: 10MB, sentences in all 60 supported languages; Sentences: 8MB, the Lingua sentences test minus unsupported languages.
Short sentences are what ELD and most detectors focus on, since very short text is unreliable, but I included the Lingua Word pairs (1.5MB) and Single words (880KB) tests to see how they all compare beyond their reliable limits.
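A benchmark of this kind can be sketched with a minimal timing harness. The stand-in detector below is purely illustrative, since the exact benchmark scripts are not shown here; in practice you would pass detector.detect(text).language:

```python
import time

def benchmark(detect, sentences):
    """Time a detector over (text, expected) pairs and report accuracy.

    `detect` is any callable returning an ISO 639-1 code."""
    start = time.perf_counter()
    correct = sum(1 for text, expected in sentences if detect(text) == expected)
    elapsed = time.perf_counter() - start
    return correct / len(sentences), elapsed

# Stand-in detector for illustration only
sample = [('hello world', 'en'), ('hola mundo', 'es')]
accuracy, seconds = benchmark(lambda t: 'en' if 'hello' in t else 'es', sample)
print(f'accuracy={accuracy:.2f} time={seconds:.4f}s')
```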

These are the results: first accuracy, then execution time.

[Accuracy table] [Execution time table]

1. Lingua could have a small advantage, as it participates with 54 languages, 6 fewer.
2. CLD2 and CLD3 return a list of languages; the ones not included in this test were discarded. Since they usually return one language, I believe they are at a disadvantage. Also, I confirm that CLD2's results for short text are correct; contrary to the test on the Lingua page, they did not use the parameter bestEffort = True, so their benchmark for CLD2 is unfair.

Lingua is the average accuracy winner, but at what cost: the same test that takes ELD or CLD2 under 10 seconds takes Lingua more than 5 hours! It behaves like brute-force software. Also, its lead comes from single words and word pairs, which are unreliable regardless of the detector.

The Python version of Nito-ELD is not the fastest, but it is still fast: faster than any other non-compiled detector tested.

I added ELD-L for comparison; it has a 2.3x bigger database but only increases execution time marginally, a testament to the efficiency of the algorithm. ELD-L is not the main database because it does not improve language detection of sentences.
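One reason a much bigger database barely affects speed: scoring is essentially one hash lookup per n-gram, and a dict lookup does not get slower as the dict grows. A simplified sketch of n-gram scoring (hypothetical weights, not ELD's actual data layout or scoring formula) looks like:

```python
def ngrams(text, n=3):
    """Overlapping character n-grams of the input text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Hypothetical n-gram frequency database: gram -> {language: weight}
db = {
    'hol': {'es': 5, 'pt': 2},
    'ola': {'es': 6, 'pt': 3},
    'hel': {'en': 5, 'de': 4},
    'ell': {'en': 6},
    'llo': {'en': 5},
}

def score(text):
    """Sum weights per language; each n-gram costs one dict lookup,
    so a 2.3x bigger `db` does not make detection 2.3x slower."""
    totals = {}
    for gram in ngrams(text.lower()):
        for lang, w in db.get(gram, {}).items():
            totals[lang] = totals.get(lang, 0) + w
    return max(totals, key=totals.get) if totals else None

print(score('hello'))  # 'en'
print(score('hola'))   # 'es'
```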

Here is the average, per benchmark, of Tweets, Big test & Sentences.

[Sentences tests average]

Languages

These are the ISO 639-1 codes of the 60 languages supported by Nito-ELD v1:

'am', 'ar', 'az', 'be', 'bg', 'bn', 'ca', 'cs', 'da', 'de', 'el', 'en', 'es', 'et', 'eu', 'fa', 'fi', 'fr', 'gu', 'he', 'hi', 'hr', 'hu', 'hy', 'is', 'it', 'ja', 'ka', 'kn', 'ko', 'ku', 'lo', 'lt', 'lv', 'ml', 'mr', 'ms', 'nl', 'no', 'or', 'pa', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sq', 'sr', 'sv', 'ta', 'te', 'th', 'tl', 'tr', 'uk', 'ur', 'vi', 'yo', 'zh'

Full name languages:

Amharic, Arabic, Azerbaijani (Latin), Belarusian, Bulgarian, Bengali, Catalan, Czech, Danish, German, Greek, English, Spanish, Estonian, Basque, Persian, Finnish, French, Gujarati, Hebrew, Hindi, Croatian, Hungarian, Armenian, Icelandic, Italian, Japanese, Georgian, Kannada, Korean, Kurdish (Arabic), Lao, Lithuanian, Latvian, Malayalam, Marathi, Malay (Latin), Dutch, Norwegian, Oriya, Punjabi, Polish, Portuguese, Romanian, Russian, Slovak, Slovene, Albanian, Serbian (Cyrillic), Swedish, Tamil, Telugu, Thai, Tagalog, Turkish, Ukrainian, Urdu, Vietnamese, Yoruba, Chinese

Future improvements

  • Train on bigger datasets, and add more languages.
  • The tokenizer could separate characters from languages that have their own alphabet, potentially improving accuracy and reducing the size of the n-grams database. Retraining and testing are needed.
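The second idea can be sketched with stdlib unicodedata: group characters into runs of a single script, so that a run in a script used by one language family only needs to be matched against that family's n-grams. This is illustrative only, not ELD's planned design; the script list and function names are made up for the example:

```python
import unicodedata

def script_of(ch):
    """Crude script tag derived from the Unicode character name."""
    try:
        name = unicodedata.name(ch)
    except ValueError:
        return 'OTHER'
    for script in ('HIRAGANA', 'KATAKANA', 'CJK', 'HANGUL', 'CYRILLIC',
                   'GREEK', 'ARABIC', 'HEBREW', 'THAI', 'LATIN'):
        if script in name:
            return script
    return 'OTHER'

def split_by_script(text):
    """Split text into (script, run) pairs, dropping spaces/punctuation."""
    runs, current, cur_script = [], '', None
    for ch in text:
        s = script_of(ch)
        if s == 'OTHER':
            if current:
                runs.append((cur_script, current))
                current, cur_script = '', None
            continue
        if s != cur_script and current:
            runs.append((cur_script, current))
            current = ''
        current, cur_script = current + ch, s
    if current:
        runs.append((cur_script, current))
    return runs

print(split_by_script('hola こんにちは'))  # [('LATIN', 'hola'), ('HIRAGANA', 'こんにちは')]
```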

Donate / Hire
If you wish to donate for open-source improvements, hire me for private modifications / upgrades, or contact me, use the following link: https://linktr.ee/nitotm
