An NLP library for Uzbek. It includes morphological analysis, language identification, transliterators and tokenizers.

fitrat

An NLP library for Uzbek. It includes morphological analysis, transliterators, language identifiers, tokenizers, and more.

It is named after the historian and linguist Abdurauf Fitrat, one of the creators of Modern Uzbek and the first Uzbek professor.


Usage

Installation

pip install fitrat

Transliteration

We used the hfst library to build the transliterators. It provides finite-state transducers: finite-state machines that map one string to another, which makes them very handy for efficient text-to-text conversion.

from fitrat import Transliterator, WritingType

t = Transliterator(to=WritingType.LAT)
result = t.convert("Кеча циркка бордим.")
print(result)
# Kecha sirkka bordim.

t2 = Transliterator(to=WritingType.CYR)
result = t2.convert("Kecha sirkka bordim.")
print(result)
# Кеча циркка бордим.

Cyrillic-to-Latin conversion is rule-based and simple, but the reverse direction is not. For Latin-to-Cyrillic we include a special pre-compiled exceptions transducer that handles all the exceptions we know of. We will keep improving the exceptions list.

If you want to compile the transliterators from source, you need the hfst-dev or hfst library. The package itself ships only pre-compiled binaries and uses the hfstol library for efficient lookup.
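
To illustrate why the Latin-to-Cyrillic direction needs an exceptions list, here is a minimal pure-Python sketch. It is not how fitrat works internally (the package uses compiled finite-state transducers); the tables `LAT_TO_CYR`, `DIGRAPHS`, and `EXCEPTIONS` and the helper names are hypothetical, cover only a handful of letters, and the input is lowercased for simplicity.

```python
import re

# Illustrative subset of letter rules (an assumption, not fitrat's real table).
LAT_TO_CYR = {
    "a": "а", "b": "б", "d": "д", "e": "е", "i": "и",
    "k": "к", "m": "м", "o": "о", "r": "р", "s": "с",
}
# Two-letter sequences must be matched before single letters.
DIGRAPHS = {"ch": "ч", "sh": "ш"}
# Hypothetical stem-level exception: plain rules would give "сирк".
EXCEPTIONS = {"sirk": "цирк"}


def apply_rules(s):
    """Convert a lowercase Latin string letter by letter, digraphs first."""
    out = []
    i = 0
    while i < len(s):
        pair = s[i:i + 2]
        if pair in DIGRAPHS:
            out.append(DIGRAPHS[pair])
            i += 2
        else:
            # Unknown characters are passed through unchanged.
            out.append(LAT_TO_CYR.get(s[i], s[i]))
            i += 1
    return "".join(out)


def convert_word(word):
    """Use an exception stem if the word starts with one, else plain rules."""
    for stem, cyr_stem in EXCEPTIONS.items():
        if word.startswith(stem):
            return cyr_stem + apply_rules(word[len(stem):])
    return apply_rules(word)


def lat_to_cyr(text):
    """Lowercase the text and convert each Latin word; punctuation is kept."""
    return re.sub(r"[a-z']+", lambda m: convert_word(m.group(0)), text.lower())


print(apply_rules("sirkka"))               # сиркка  (rules alone)
print(lat_to_cyr("kecha sirkka bordim."))  # кеча циркка бордим.
```

The rules alone map every "s" to "с", so loanwords such as "sirk" come out wrong unless an exception overrides them. A transducer-based implementation can encode the same idea by giving the exceptions priority over the general rules.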

Language Identification

We can recognize Uzbek text in both Latin and Cyrillic script. We can also recognize other major languages, such as Russian, English, and Arabic.

from fitrat import LanguageDetector

lang_detector = LanguageDetector()

print(lang_detector.is_uzbek("bu o'zbekchada yozilgan matn"))
# True

print(lang_detector.is_uzbek("бу нотугри йозилган булсаям, лекин узбекча матн"))
# True

print(lang_detector.is_uzbek("Текст на русском языке"))
# False
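
Since Uzbek text may arrive in either script, it can help to check which script dominates before choosing a transliteration direction. The sketch below uses only the standard library; `dominant_script` is a hypothetical helper, not part of fitrat, and it detects the script only, not the language (Russian and Cyrillic Uzbek both count as "cyrillic").

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return 'latin', 'cyrillic', 'other', or 'unknown' for the given text."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, spaces, punctuation, apostrophes
        # Unicode character names start with the script name,
        # e.g. "CYRILLIC CAPITAL LETTER KA" or "LATIN SMALL LETTER A".
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            counts["cyrillic"] += 1
        elif name.startswith("LATIN"):
            counts["latin"] += 1
        else:
            counts["other"] += 1
    if not counts:
        return "unknown"
    return counts.most_common(1)[0][0]


print(dominant_script("bu o'zbekchada yozilgan matn"))  # latin
print(dominant_script("Кеча циркка бордим."))           # cyrillic
```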

Tokenization

from fitrat import word_tokenize

s = "Bugun o'zbekchada gapirishga qaror qildim!"
print(word_tokenize(s))
# ['Bugun', "o'zbekchada", 'gapirishga', 'qaror', 'qildim', '!']
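
The main difficulty in tokenizing Latin-script Uzbek is that the apostrophe is part of letters such as o' and g', so it cannot be treated as plain punctuation. The following regex-based sketch shows one way to handle that; it is an assumption for illustration, not fitrat's actual tokenizer, and `simple_word_tokenize` is a hypothetical name.

```python
import re

# A word is a run of letters, optionally followed by apostrophe-like
# characters and more letters (o'zbek, tog'). Numbers and single
# punctuation marks become separate tokens.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['’ʻʼ][^\W\d_]*)*|\d+|[^\w\s]")


def simple_word_tokenize(text):
    """Split text into words, numbers, and punctuation marks."""
    return TOKEN_RE.findall(text)


print(simple_word_tokenize("Bugun o'zbekchada gapirishga qaror qildim!"))
# ['Bugun', "o'zbekchada", 'gapirishga', 'qaror', 'qildim', '!']
```

One trade-off of this pattern: a closing single quote directly after a word is kept with the word, because it looks the same as the apostrophe in tog'.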

Authors

  • Mukhammadsaid Mamasaidov
  • Jasur Yusupov
