Language Identification library.

GeezSwitch

GeezSwitch is a Language Identification (LI) library for 60 languages. It is adapted from Michal Danilak's great package langdetect and adds support, based on the GeezSwitch dataset, for low-resource languages that use the Ge'ez script as a writing system.

The GeezSwitch dataset was published in the paper "GeezSwitch: Language Identification in Typologically Related Low-resourced East African Languages" at LREC 2022 and the data can be found here.

Installation

$ pip install geezswitch

Supported Python versions: 2.7 and 3.4+.

Languages

The library supports identification across 60 languages in total.

Five languages that use the Ge'ez script are supported, based on the GeezSwitch dataset. They are identified by ISO 639-3 codes, because some of them have no ISO 639-1 code:

amh (Amharic), byn (Blin), gez (Ge'ez), tig (Tigre), tir (Tigrinya)

The other 55 languages are inherited from the original langdetect package. They keep their ISO 639-1 codes for backward compatibility:

af, ar, bg, bn, ca, cs, cy, da, de, el, en, es, et, fa, fi, fr, gu, he,
hi, hr, hu, id, it, ja, kn, ko, lt, lv, mk, ml, mr, ne, nl, no, pa, pl,
pt, ro, ru, sk, sl, so, sq, sv, sw, ta, te, th, tl, tr, uk, ur, vi, zh-cn,
zh-tw

Basic usage

To detect the language of the text:

>>> from geezswitch import detect
>>> detect("ብኮምፒዩተር ናይ ምስራሕ ክእለት")
'tir'
>>> detect("ኳዅረስ ይድ ባሪ ፣ ይት እሺ ይት ገውሪ")
'byn'
>>> detect("ወዲብለ ታክያተ ክልኦት አሕድ")
'tig'
>>> detect("ነጭ አበባ ያለው ተክል")
'amh'
>>> detect("ወይቤሎ ዮናታን ሐሰ ለከ ወእምከመሰ")
'gez'

To find out the probabilities for the top languages:

>>> from geezswitch import detect_langs
>>> detect_langs("Otec matka syn.")
[sk:0.572770823327, pl:0.292872522702, cs:0.134356653968]
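When only one answer is wanted, but only if it is reasonably confident, the candidate list can be filtered by probability. The sketch below is a minimal illustration, not part of the library: it assumes each item returned by detect_langs exposes `lang` and `prob` attributes (as in the original langdetect package), and it uses a hypothetical `Candidate` stand-in and a hypothetical helper `top_language` so that it runs without the library installed.

```python
from collections import namedtuple

# Hypothetical stand-in for the items returned by detect_langs.
# Assumption: the real objects expose .lang and .prob, as in langdetect.
Candidate = namedtuple("Candidate", ["lang", "prob"])


def top_language(candidates, min_prob=0.5):
    """Return the most probable language code, or None if it is below min_prob."""
    if not candidates:
        return None
    best = max(candidates, key=lambda c: c.prob)
    return best.lang if best.prob >= min_prob else None


# Probabilities rounded from the example output above, for illustration only.
sample = [Candidate("sk", 0.57), Candidate("pl", 0.29), Candidate("cs", 0.13)]

print(top_language(sample))                # sk
print(top_language(sample, min_prob=0.9))  # None
```

With the library installed, the same helper could be called as top_language(detect_langs(text)), provided the attribute assumption holds.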

NOTE

The language detection algorithm is non-deterministic: if you run it on a text that is either too short or too ambiguous, you might get different results every time you run it.

To enforce consistent results, run the following code before the first language detection:

from geezswitch import DetectorFactory
DetectorFactory.seed = 0

How to add a new language?

New language contributions are very welcome, particularly for languages written in the Ge'ez script. You can either follow the steps below or just contribute example text for the target language, and we can help with the integration.

Language identification works best when the model is trained on examples of many languages.

To add a new language, you need to create a new language profile. The easiest way to do it is to use the langdetect.jar tool, which can generate language profiles from Wikipedia abstract database files or plain text.

Wikipedia abstract database files can be retrieved from "Wikipedia Downloads" (http://download.wikimedia.org/). Their names have the form '(language code)wiki-(version)-abstract.xml' (e.g. 'enwiki-20101004-abstract.xml').

usage: java -jar langdetect.jar --genprofile -d [directory path] [language codes]

  • Specify the directory which has abstract databases by -d option.
  • The tool can handle gzip-compressed files.

Remark: The Chinese database filenames look like 'zhwiki-(version)-abstract-zh-cn.xml' or 'zhwiki-(version)-abstract-zh-tw.xml', so they must be renamed to 'zh-cnwiki-(version)-abstract.xml' or 'zh-twwiki-(version)-abstract.xml'.
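The renaming described in the remark could be scripted along these lines. This is only a sketch: the helper name `renamed_abstract_file` is hypothetical, and it handles only the two uncompressed filename patterns quoted above, leaving every other name untouched.

```python
import re

# Matches 'zhwiki-(version)-abstract-zh-cn.xml' and the zh-tw equivalent.
_ZH_ABSTRACT = re.compile(
    r"^zhwiki-(?P<version>.+)-abstract-(?P<variant>zh-cn|zh-tw)\.xml$"
)


def renamed_abstract_file(name):
    """Return the filename expected by the profile generator.

    Names that do not match the Chinese pattern are returned unchanged.
    """
    match = _ZH_ABSTRACT.match(name)
    if match is None:
        return name
    return "{0}wiki-{1}-abstract.xml".format(
        match.group("variant"), match.group("version")
    )


print(renamed_abstract_file("zhwiki-20101004-abstract-zh-cn.xml"))
# zh-cnwiki-20101004-abstract.xml
print(renamed_abstract_file("enwiki-20101004-abstract.xml"))
# enwiki-20101004-abstract.xml
```

The result can be passed to os.rename to rename the file on disk before running the tool.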

To generate a language profile from plain text, use the genprofile-text command.

usage: java -jar langdetect.jar --genprofile-text -l [language code] [text file path]
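To generate profiles for several text files from a script, the command line above can be assembled as an argument list. The sketch below only builds the arguments and does not run Java. The helper name `genprofile_text_command`, the default jar location, and the example file name are assumptions for illustration; the flags are exactly the ones in the usage line.

```python
def genprofile_text_command(lang_code, text_path, jar_path="langdetect.jar"):
    """Build the argument list for the genprofile-text usage line above.

    jar_path defaults to a jar in the current directory (an assumption).
    """
    return [
        "java", "-jar", jar_path,
        "--genprofile-text",
        "-l", lang_code,
        text_path,
    ]


# 'tir' is one of the ISO 639-3 codes listed earlier; the file name is made up.
command = genprofile_text_command("tir", "tigrinya_sample.txt")
print(" ".join(command))
# java -jar langdetect.jar --genprofile-text -l tir tigrinya_sample.txt
```

The list can be handed to subprocess.run(command, check=True) once Java and the jar are available. Passing a list rather than a single string avoids shell quoting problems with paths that contain spaces.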

For more details, see the language-detection wiki.

Original project

This library is adapted from langdetect, which in turn is a direct port of Google's language-detection library from Java to Python. For more information, please refer to those repositories.
