Language Identification library.

Project description

GeezSwitch

Language Identification (LI) library for 60 languages, adapted from Michal Danilak's great package langdetect. It adds support for low-resource languages written in the Ge'ez script, based on the GeezSwitch dataset.

The GeezSwitch dataset was published in the paper "GeezSwitch: Language Identification in Typologically Related Low-resourced East African Languages" at LREC 2022 and the data can be found here.

Installation

$ pip install geezswitch

Supported Python versions: 2.7, 3.4+.

Languages

The library supports identification across 60 languages in total.

Five languages that use the Ge'ez script are supported, based on the GeezSwitch dataset. They are identified by ISO 639-3 codes, since some of them have no ISO 639-1 code:

amh (Amharic), byn (Blin), gez (Ge'ez), tig (Tigre), tir (Tigrinya)

Support for 55 languages is inherited from the original langdetect package. These keep their ISO 639-1 codes for backward compatibility:

af, ar, bg, bn, ca, cs, cy, da, de, el, en, es, et, fa, fi, fr, gu, he,
hi, hr, hu, id, it, ja, kn, ko, lt, lv, mk, ml, mr, ne, nl, no, pa, pl,
pt, ro, ru, sk, sl, so, sq, sv, sw, ta, te, th, tl, tr, uk, ur, vi, zh-cn,
zh-tw
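
Because the two groups use different code standards, a caller that needs to treat them differently can tell them apart by the returned code. A minimal sketch, with the table copied from the list above; the names GEEZ_LANGUAGES, is_geez_code and describe are hypothetical helpers and are not part of geezswitch:

```python
# ISO 639-3 codes for the five Ge'ez-script languages listed above.
GEEZ_LANGUAGES = {
    "amh": "Amharic",
    "byn": "Blin",
    "gez": "Ge'ez",
    "tig": "Tigre",
    "tir": "Tigrinya",
}


def is_geez_code(code):
    """Return True if `code` is one of the five ISO 639-3 codes above."""
    return code in GEEZ_LANGUAGES


def describe(code):
    """Return a readable name for a Ge'ez-script code, else the code itself."""
    return GEEZ_LANGUAGES.get(code, code)
```

Codes from the inherited set (for example 'en' or 'zh-cn') pass through describe unchanged.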

Basic usage

To detect the language of the text:

>>> from geezswitch import detect
>>> detect("ብኮምፒዩተር ናይ ምስራሕ ክእለት")
'tir'
>>> detect("ኳዅረስ ይድ ባሪ ፣ ይት እሺ ይት ገውሪ")
'byn'
>>> detect("ወዲብለ ታክያተ ክልኦት አሕድ")
'tig'
>>> detect("ነጭ አበባ ያለው ተክል")
'amh'
>>> detect("ወይቤሎ ዮናታን ሐሰ ለከ ወእምከመሰ")
'gez'
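
Before calling detect, it may be useful to check whether a string contains any Ge'ez (Ethiopic) characters at all. The sketch below relies only on the Unicode Ethiopic block ranges; contains_ethiopic is a hypothetical helper, not something geezswitch provides:

```python
# Unicode blocks that hold Ethiopic (Ge'ez script) characters.
ETHIOPIC_RANGES = (
    (0x1200, 0x137F),  # Ethiopic
    (0x1380, 0x139F),  # Ethiopic Supplement
    (0x2D80, 0x2DDF),  # Ethiopic Extended
    (0xAB00, 0xAB2F),  # Ethiopic Extended-A
)


def contains_ethiopic(text):
    """Return True if any character of `text` falls in an Ethiopic block."""
    return any(
        low <= ord(ch) <= high
        for ch in text
        for low, high in ETHIOPIC_RANGES
    )
```

A check like this only says which script is present; telling the five languages apart is still the job of detect.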

To find out the probabilities for the top languages:

>>> from geezswitch import detect_langs
>>> detect_langs("Otec matka syn.")
[sk:0.572770823327, pl:0.292872522702, cs:0.134356653968]
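
The candidates can be filtered by probability. In langdetect each result exposes lang and prob attributes; the sketch below assumes geezswitch keeps those names and uses a stand-in Candidate type so that it runs without the library. The confident_langs helper and its min_prob threshold are illustrative only:

```python
from collections import namedtuple

# Stand-in for the result objects. Assumption: results expose `lang` and
# `prob`, as in langdetect.
Candidate = namedtuple("Candidate", ["lang", "prob"])


def confident_langs(candidates, min_prob=0.25):
    """Keep candidates at or above `min_prob`, highest probability first."""
    kept = [c for c in candidates if c.prob >= min_prob]
    return sorted(kept, key=lambda c: c.prob, reverse=True)


# Values taken from the example output above.
sample = [
    Candidate("sk", 0.572770823327),
    Candidate("pl", 0.292872522702),
    Candidate("cs", 0.134356653968),
]
top = confident_langs(sample)
print([c.lang for c in top])  # ['sk', 'pl']
```

With the real library, the list returned by detect_langs would take the place of sample, provided the attribute assumption holds.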

NOTE

The language detection algorithm is non-deterministic: if you run it on text that is either too short or too ambiguous, you might get a different result every time you run it.

To enforce consistent results, run the following code before the first language detection:

from geezswitch import DetectorFactory
DetectorFactory.seed = 0
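
When the seed is left unset, one way to gauge how stable a prediction is for a given text is to repeat the detection and count the answers. This is only a sketch of that idea, not something the library does itself; majority_vote is a hypothetical helper, and the detector is passed in as detect_fn so that the block runs on its own:

```python
from collections import Counter


def majority_vote(text, detect_fn, runs=5):
    """Call `detect_fn(text)` `runs` times.

    Returns the most frequent label and its share of the votes.
    """
    votes = Counter(detect_fn(text) for _ in range(runs))
    label, count = votes.most_common(1)[0]
    return label, count / runs
```

For example, majority_vote("Otec matka syn.", detect) with detect imported from geezswitch; a share well below 1.0 suggests the text is too short or too ambiguous.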

How to add a new language?

New language contributions are very welcome, particularly for languages written in the Ge'ez script. You can either follow the steps below or just contribute example text for the target language, and we can help with the integration.

Language identification works best when the model is trained on examples of many languages.

To add a new language, you need to create a new language profile. The easiest way to do it is to use the langdetect.jar tool, which can generate language profiles from Wikipedia abstract database files or plain text.

Wikipedia abstract database files can be retrieved from "Wikipedia Downloads" (http://download.wikimedia.org/). Their names have the form '(language code)wiki-(version)-abstract.xml' (e.g. 'enwiki-20101004-abstract.xml').

usage: java -jar langdetect.jar --genprofile -d [directory path] [language codes]

  • Specify the directory that contains the abstract databases with the -d option.
  • This tool can handle gzip-compressed files.

Remark: The Chinese database filenames look like 'zhwiki-(version)-abstract-zh-cn.xml' or 'zhwiki-(version)-abstract-zh-tw.xml', so they must be renamed to 'zh-cnwiki-(version)-abstract.xml' or 'zh-twwiki-(version)-abstract.xml'.
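
The renaming described in the remark can be scripted. A minimal sketch, assuming uncompressed '.xml' names exactly as quoted above; normalized_abstract_name is a hypothetical helper and only computes the new name, leaving the actual rename to the caller:

```python
import re

# Matches e.g. 'zhwiki-20101004-abstract-zh-cn.xml'.
_ZH_PATTERN = re.compile(
    r"^zhwiki-(?P<version>[^-]+)-abstract-(?P<variant>zh-(?:cn|tw))\.xml$"
)


def normalized_abstract_name(filename):
    """Return the name in '(language code)wiki-(version)-abstract.xml' form.

    Names that do not match the Chinese variant pattern are returned unchanged.
    """
    match = _ZH_PATTERN.match(filename)
    if match is None:
        return filename
    return "{variant}wiki-{version}-abstract.xml".format(
        variant=match.group("variant"),
        version=match.group("version"),
    )
```

The result can be passed to os.rename together with the original path.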

To generate a language profile from plain text, use the genprofile-text command.

usage: java -jar langdetect.jar --genprofile-text -l [language code] [text file path]
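
When profiles are generated from a script, both usage lines above can be assembled as argument lists. The sketch only builds the commands from the flags shown above and does not run them; the function names, and the assumption that langdetect.jar sits in the working directory, are illustrative:

```python
def genprofile_command(directory, language_codes, jar="langdetect.jar"):
    """Argument list for generating profiles from abstract database files."""
    return ["java", "-jar", jar, "--genprofile", "-d", directory] + list(language_codes)


def genprofile_text_command(language_code, text_path, jar="langdetect.jar"):
    """Argument list for generating one profile from a plain text file."""
    return ["java", "-jar", jar, "--genprofile-text", "-l", language_code, text_path]
```

Either list can be handed to subprocess.run(cmd, check=True), for example genprofile_text_command("tir", "tigrinya.txt"), where the file name is a placeholder.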

For more details see language-detection Wiki.

Original project

This library is adapted from langdetect, which in turn is a direct port of Google's language-detection library from Java to Python. For more information, please refer to those repos.
