An NLP library for Uzbek. It includes morphological analysis, language identification, transliterators and tokenizers.
fitrat
fitrat is an NLP library for Uzbek. It includes morphological analysis, transliteration, language identification, tokenization, and more.
It is named after historian and linguist Abdurauf Fitrat, who was one of the creators of Modern Uzbek as well as the first Uzbek professor.
Usage
Installation
pip install fitrat
Transliteration
We used the hfst library to create the transliterators. It provides finite-state transducers: finite-state machines that are handy for efficiently mapping one text to another.
from fitrat import Transliterator, WritingType
t = Transliterator(to=WritingType.LAT)
result = t.convert("Кеча циркка бордим.")
print(result)
# Kecha sirkka bordim.
t2 = Transliterator(to=WritingType.CYR)
result = t2.convert("Kecha sirkka bordim.")
print(result)
# Кеча циркка бордим.
While Cyrillic-to-Latin conversion is simple and rule-based, the reverse direction is not. We include a special pre-compiled exceptions transducer for Latin-to-Cyrillic conversion that handles all exceptions known to us. We will continue to improve the exceptions list.
If you want to compile the transliterators from source, you need the hfst-dev or hfst library. The package itself uses only pre-compiled binaries and the hfstol library for efficient lookup.
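To make the "rules plus exceptions" idea concrete, here is a minimal pure-Python sketch. It is not how fitrat works internally (the package uses compiled finite-state transducers). The `RULES` table is a small toy subset that ignores context-dependent rules, and `EXCEPTIONS`, `apply_rules`, `to_cyrillic_word` and `to_cyrillic` are hypothetical names made up for this illustration.

```python
import re

# Toy subset of Latin-to-Cyrillic letter rules (illustrative, incomplete).
RULES = {
    "ch": "ч", "sh": "ш",
    "a": "а", "b": "б", "d": "д", "e": "е", "i": "и", "k": "к",
    "m": "м", "o": "о", "r": "р", "s": "с",
}

# Hypothetical exception list: stems that the letter rules alone get wrong.
EXCEPTIONS = {"sirk": "цирк"}


def apply_rules(word):
    """Map Latin letters to Cyrillic, trying two-letter digraphs first."""
    out, i = [], 0
    while i < len(word):
        for size in (2, 1):
            chunk = word[i:i + size]
            mapped = RULES.get(chunk.lower())
            if mapped is not None and len(chunk) == size:
                # Preserve capitalization of the first source letter.
                out.append(mapped.capitalize() if chunk[0].isupper() else mapped)
                i += size
                break
        else:
            out.append(word[i])  # leave unknown characters untouched
            i += 1
    return "".join(out)


def to_cyrillic_word(word):
    """Check the exception list before falling back to the letter rules."""
    lower = word.lower()
    for stem, replacement in EXCEPTIONS.items():
        if lower.startswith(stem):
            head = replacement.capitalize() if word[0].isupper() else replacement
            return head + apply_rules(word[len(stem):])
    return apply_rules(word)


def to_cyrillic(text):
    """Transliterate every run of Latin letters, keeping other characters."""
    return re.sub(r"[A-Za-z]+", lambda m: to_cyrillic_word(m.group()), text)


print(to_cyrillic("Kecha sirkka bordim."))
# Кеча циркка бордим.
```

Without the exception entry, the letter rules alone would turn "sirkka" into "сиркка", which is why the Latin-to-Cyrillic direction needs an exceptions list.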
Language Identification
We can recognize Uzbek text in both Latin and Cyrillic script. Additionally, we can recognize other major languages, such as Russian, English, and Arabic.
from fitrat import LanguageDetector
lang_detector = LanguageDetector()
print(lang_detector.is_uzbek("bu o'zbekchada yozilgan matn"))
# True
print(lang_detector.is_uzbek("бу нотугри йозилган булсаям, лекин узбекча матн"))
# True
print(lang_detector.is_uzbek("Текст на русском языке"))
# False
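Because Uzbek text can arrive in either script, it can be useful to know which script a string is mostly written in, for example before choosing a transliteration direction. The following is a rough sketch using only the standard library. `dominant_script` is a hypothetical helper, not part of fitrat, and it says nothing about the language itself (Russian and Cyrillic Uzbek both come out as Cyrillic).

```python
import unicodedata


def dominant_script(text):
    """Return "LATIN", "CYRILLIC", or None, based on letter counts."""
    counts = {"LATIN": 0, "CYRILLIC": 0}
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, spaces and punctuation
        # Unicode character names start with the script name,
        # e.g. "LATIN SMALL LETTER B" or "CYRILLIC CAPITAL LETTER TE".
        name = unicodedata.name(ch, "")
        for script in counts:
            if name.startswith(script):
                counts[script] += 1
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


print(dominant_script("bu o'zbekchada yozilgan matn"))
# LATIN
print(dominant_script("Текст на русском языке"))
# CYRILLIC
```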
Tokenization
from fitrat import word_tokenize
s = "Bugun o'zbekchada gapirishga qaror qildim!"
print(word_tokenize(s))
# ['Bugun', "o'zbekchada", 'gapirishga', 'qaror', 'qildim', '!']
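The main subtlety in the output above is that the apostrophe in "o'zbekchada" stays inside the word while the final "!" becomes its own token. A minimal regex-based sketch of that behavior is shown below. It is an assumption about one way to get this result, not the implementation behind `word_tokenize`, and `simple_word_tokenize` is a hypothetical name.

```python
import re

# A word is a run of letters, optionally continued by an apostrophe plus more
# letters; digits form their own tokens; any other non-space symbol stands alone.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*|\d+|[^\w\s]")


def simple_word_tokenize(text):
    """Split text into word, number and punctuation tokens."""
    return TOKEN_RE.findall(text)


print(simple_word_tokenize("Bugun o'zbekchada gapirishga qaror qildim!"))
# ['Bugun', "o'zbekchada", 'gapirishga', 'qaror', 'qildim', '!']
```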
Authors
- Mukhammadsaid Mamasaidov
- Jasur Yusupov
Download files
File details
Details for the file fitrat-0.0.9.tar.gz.
File metadata
- Download URL: fitrat-0.0.9.tar.gz
- Upload date:
- Size: 16.3 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.8.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cce818d4247bb29bd9c279833c66c4e736458808fcc551e4775ea1cbdbedd45e |
| MD5 | acdb0e6eb5e9055d57b4769fbde3d09f |
| BLAKE2b-256 | 15a8bd57eb9098fbef77fa377e421bb314041f1b8ea2e5a6733d2b0f24562441 |
File details
Details for the file fitrat-0.0.9-py3-none-any.whl.
File metadata
- Download URL: fitrat-0.0.9-py3-none-any.whl
- Upload date:
- Size: 15.8 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.8.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fdfbaaf627420bb1a66d0061ebcf8589d7b378ca5e32e3e146bdb4d1bf5511a8 |
| MD5 | 6702612730cb858c160172bb33da91aa |
| BLAKE2b-256 | a8f422eb8e9e7bcc846655754ce20965ee6d93af85e693c8814d7747a5e2d55f |