Language Identification library.
GeezSwitch
A Language Identification (LI) library for 60 languages, adapted from Michal Danilak's great package langdetect. It adds support for low-resource languages written in the Ge'ez script, based on the GeezSwitch dataset.
The GeezSwitch dataset was published in the paper "GeezSwitch: Language Identification in Typologically Related Low-resourced East African Languages" at LREC 2022 and the data can be found here.
Installation
$ pip install geezswitch
Supported Python versions: 2.7 and 3.4+.
Languages
The library supports identification across 60 languages in total.
Support for five languages that use the Ge'ez script, based on the GeezSwitch dataset. These use ISO 639-3 codes, since some of the languages have no ISO 639-1 code:
amh (Amharic), byn (Blin), gez (Ge'ez), tig (Tigre), tir (Tigrinya)
Support for 55 languages inherited from the original langdetect package. These keep their ISO 639-1 codes for backward compatibility:
af, ar, bg, bn, ca, cs, cy, da, de, el, en, es, et, fa, fi, fr, gu, he,
hi, hr, hu, id, it, ja, kn, ko, lt, lv, mk, ml, mr, ne, nl, no, pa, pl,
pt, ro, ru, sk, sl, so, sq, sv, sw, ta, te, th, tl, tr, uk, ur, vi, zh-cn,
zh-tw
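Because the library returns two kinds of codes (ISO 639-3 for the Ge'ez-script languages, ISO 639-1 for the rest), a small lookup table can make results easier to read. The sketch below is not part of geezswitch; the names GEEZ_LANGUAGES, describe_code and is_geez_code are hypothetical helpers built only from the five codes listed above.

```python
# ISO 639-3 codes for the five Ge'ez-script languages listed above.
# This table is a hypothetical helper, not something exported by geezswitch.
GEEZ_LANGUAGES = {
    "amh": "Amharic",
    "byn": "Blin",
    "gez": "Ge'ez",
    "tig": "Tigre",
    "tir": "Tigrinya",
}


def is_geez_code(code):
    """Return True if the code is one of the five Ge'ez-script languages."""
    return code in GEEZ_LANGUAGES


def describe_code(code):
    """Return a readable name for a Ge'ez-script code, or the code unchanged."""
    return GEEZ_LANGUAGES.get(code, code)
```

For example, describe_code("tir") gives "Tigrinya", while an inherited code such as "en" is passed through unchanged.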
Basic usage
To detect the language of a text:
>>> from geezswitch import detect
>>> detect("ብኮምፒዩተር ናይ ምስራሕ ክእለት")
'tir'
>>> detect("ኳዅረስ ይድ ባሪ ፣ ይት እሺ ይት ገውሪ")
'byn'
>>> detect("ወዲብለ ታክያተ ክልኦት አሕድ")
'tig'
>>> detect("ነጭ አበባ ያለው ተክል")
'amh'
>>> detect("ወይቤሎ ዮናታን ሐሰ ለከ ወእምከመሰ")
'gez'
To find out the probabilities for the top languages:
>>> from geezswitch import detect_langs
>>> detect_langs("Otec matka syn.")
[sk:0.572770823327, pl:0.292872522702, cs:0.134356653968]
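When the top probability is low, as in the example above, it can be safer to treat the result as undecided. The sketch below shows one way to do that. It assumes each returned item exposes lang and prob attributes, as the result objects in the original langdetect do; that assumption is not confirmed here for geezswitch. Candidate and best_language are hypothetical names, and Candidate is only a stand-in so the example runs without the library.

```python
from collections import namedtuple

# Stand-in for the items returned by detect_langs (assumed .lang and .prob).
Candidate = namedtuple("Candidate", ["lang", "prob"])


def best_language(candidates, min_prob=0.5):
    """Return the most probable language code, or None if below min_prob."""
    if not candidates:
        return None
    top = max(candidates, key=lambda c: c.prob)
    return top.lang if top.prob >= min_prob else None


# The probabilities from the example above, rebuilt as stand-in objects.
sample = [
    Candidate("sk", 0.572770823327),
    Candidate("pl", 0.292872522702),
    Candidate("cs", 0.134356653968),
]
```

With the default threshold, best_language(sample) returns 'sk'; with min_prob=0.9 it returns None. The threshold value is an arbitrary choice for illustration.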
NOTE
The language detection algorithm is non-deterministic: if you run it on a text that is either too short or too ambiguous, you might get different results every time you run it.
To enforce consistent results, run the following code before the first language detection:
from geezswitch import DetectorFactory
DetectorFactory.seed = 0
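To see whether a given text is affected by this, one option is to call the detector several times and compare the answers. The helper below is a hypothetical sketch, not part of geezswitch; it accepts any callable that maps a text to a language code, so it can be tried without the library installed.

```python
def is_stable(detect_fn, text, runs=5):
    """Call detect_fn(text) several times and report whether all results agree."""
    results = set(detect_fn(text) for _ in range(runs))
    return len(results) == 1
```

With the library installed, is_stable(detect, some_text) can be run with and without the seed set, to check whether a short or ambiguous input produces varying results.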
How to add a new language?
New language contributions are very welcome, particularly for languages written in the Ge'ez script. You can either follow the steps below or just contribute example text for the target language, and we can help with the integration.
Language identification works best when the model is trained on examples of many languages.
To add a new language, you need to create a new language profile. The easiest way to do it is to use the langdetect.jar tool, which can generate language profiles from Wikipedia abstract database files or plain text.
Wikipedia abstract database files can be retrieved from "Wikipedia Downloads" (http://download.wikimedia.org/). Their filenames have the form '(language code)wiki-(version)-abstract.xml' (e.g. 'enwiki-20101004-abstract.xml').
usage: java -jar langdetect.jar --genprofile -d [directory path] [language codes]
- Specify the directory which has abstract databases by -d option.
- This tool can handle gzip-compressed files.
Remark: The database filename for Chinese looks like 'zhwiki-(version)-abstract-zh-cn.xml' or 'zhwiki-(version)-abstract-zh-tw.xml', so it must be renamed to 'zh-cnwiki-(version)-abstract.xml' or 'zh-twwiki-(version)-abstract.xml'.
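The renaming described in the remark can be sketched as a small function. This is an illustration only: normalized_abstract_name is a hypothetical helper, it only computes the new name (it does not touch files on disk), and it covers just the two uncompressed patterns quoted above.

```python
import re

# Matches 'zhwiki-(version)-abstract-zh-cn.xml' and the zh-tw equivalent.
_ZH_PATTERN = re.compile(
    r"^zhwiki-(?P<version>.+)-abstract-(?P<variant>zh-cn|zh-tw)\.xml$"
)


def normalized_abstract_name(filename):
    """Return the filename in '(language code)wiki-(version)-abstract.xml' form.

    Names that do not match the Chinese pattern are returned unchanged.
    """
    match = _ZH_PATTERN.match(filename)
    if match is None:
        return filename
    return "{0}wiki-{1}-abstract.xml".format(
        match.group("variant"), match.group("version")
    )
```

For example, 'zhwiki-20101004-abstract-zh-cn.xml' maps to 'zh-cnwiki-20101004-abstract.xml', while 'enwiki-20101004-abstract.xml' is left as it is.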
To generate a language profile from plain text, use the genprofile-text command.
usage: java -jar langdetect.jar --genprofile-text -l [language code] [text file path]
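If profiles are generated from a script, the command line above can be assembled as an argument list. The sketch below only builds the list from the usage line; genprofile_text_command is a hypothetical helper, and the jar location and the text file path are assumptions that depend on your setup.

```python
def genprofile_text_command(language_code, text_path, jar_path="langdetect.jar"):
    """Build the argument list for the genprofile-text usage line above."""
    return [
        "java", "-jar", jar_path,
        "--genprofile-text",
        "-l", language_code,
        text_path,
    ]


# Example with a hypothetical corpus file for Tigre.
command = genprofile_text_command("tig", "corpus/tig.txt")
```

The resulting list can be passed to subprocess.run(command, check=True) on a machine where Java and langdetect.jar are available.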
For more details see language-detection Wiki.
Original project
This library is adapted from langdetect, which in turn is a direct port of Google's language-detection library from Java to Python. For more information, please refer to those repos.