Skip to main content

Klingon NLP toolkit

Project description

yajwI’ is a Klingon NLP toolkit that includes basic tokenization, morphological analysis and POS tagging.

It heavily uses the boQwI’ dictionary.

Installation

yajwI’ requires Python 3.8 or newer.

It can be installed from PyPI:

pip install yajwiz

Updating and using the boQwI’ dictionary

When yajwI’ is first imported, it will download a copy of the boQwI’ dictionary. After this the update_dictionary() function must be called whenever the dictionary needs to be updated. The function will check for updates and install them.

The downloaded dictionary can be accessed through the load_dictionary() function.

>>> import yajwiz
>>> yajwiz.update_dictionary()
>>> dictionary = yajwiz.load_dictionary()
>>> dictionary.version
'2021.03.18a'

Tokenization

The library includes very simple tokenization.

>>> import yajwiz
>>> yajwiz.tokenize("Hegh neH chav qoH. qanchoHpa' qoH, Hegh qoH.")
[('WORD', 'Hegh'), ('SPACE', ' '), ('WORD', 'neH'), ('SPACE', ' '), ('WORD', 'chav'), ('SPACE', ' '), ('WORD', 'qoH'), ('PUNCT', '.'), ('SPACE', ' '), ('WORD', "qanchoHpa'"), ('SPACE', ' '), ('WORD', 'qoH'), ('PUNCT', ','), ('SPACE', ' '), ('WORD', 'Hegh'), ('SPACE', ' '), ('WORD', 'qoH'), ('PUNCT', '.')]

Morphological analysis

The yajwiz.analyze function parses a word and returns a list of possible parses and a lot of extra information.

>>> yajwiz.analyze("yInwI'")
[{'BOQWIZ_ID': 'yIn:n',
  'BOQWIZ_POS': 'n:klcp1',
  'LEMMA': 'yIn',
  'PARTS': ['yIn:n', "-wI':n"],
  'POS': 'N',
  'SUFFIX': {'N4': "-wI'"},
  'UNGRAMMATICAL': 'ILLEGAL PLURAL OR POSSESSIVE SUFFIX',
  'WORD': "yInwI'",
  'XPOS': 'N',
  'XPOS_GSUFF': 'N'},
 {'BOQWIZ_ID': 'yIn:v',
  'BOQWIZ_POS': 'v:t_c,klcp1',
  'LEMMA': 'yIn',
  'PARTS': ['yIn:v', "-wI':v"],
  'POS': 'V',
  'SUFFIX': {'V9': "-wI'"},
  'WORD': "yInwI'",
  'XPOS': 'VT',
  'XPOS_GSUFF': "VT.wI'"}]

Currently the analyzer is very permissive and does allow using wrong plurals and possessive suffixes (eg. yInwI’ instead of yInwIj). It will try to mark this kind of errors with 'UNGRAMMATICAL': True. It detects the following errors:

  • Using -pu’, -wI’, -lI’, etc. when the noun is not a person noun

  • Using -Du’ when the noun is not a body part

  • Using -vIS without using -taH

  • Using -lu’ with an illegal verb prefix

  • Using intransitive verbs with prefixes indicating object

  • Using -ghach without any other verb suffix

  • Using aspect suffix with -jaj

There is also a simpler function yajwiz.split_to_morphemes, that returns a set of tuples of strings (usually there will be only one tuple in the set):

>>> yajwiz.split_to_morphemes("yInwI'")
{('yIn', "-wI'")}

List of Parts of Speech

XPOS

Explanation

VS

Stative verb

VT

Transitive verb

VI

Intransitive verb

VA

Transitive and intransitive verb

V?

Verb with unknown transitivity

NL

Person noun

NB

Body part noun

PRON

Pronoun (including ‘Iv and nuq: it is a noun that can function as a copula)

NUM

Number

N

Other noun

ADV

Adverb

EXCL

Exclamation

CONJ

Conjunction

QUES

Question word (other than ‘Iv and nuq)

UNK

Unknown

Grammar checker

yajwI’ can be used to find common grammar errors. You can either use the method yajwiz.get_errors or the following command line interface:

python -m yajwiz.grammar_check file.txt

CONLL-U files and POS tagger

CONLL-U files are a popular data format for storing annotated linguistic data.

yajwI’ can generate CONLL-U files filled with morphological information (it does not support dependency parsing).

Below is an example script that first parses a text without a trained POS tagger, then trains a POS tagger with it and finally parses the text with the tagger and saves the result to a CONLL-U file.

import yajwiz

with open("prose-corpus.txt", "r") as f:
    text = f.read()

conllu = yajwiz.text_to_conllu(text)

tagger = yajwiz.Tagger()
tagger.train(yajwiz.conllu_to_tagged_list(conllu))

conllu = yajwiz.text_to_conllu(text, tagger)

with open("prose-corpus.conllu", "w") as f:
    f.write(conllu)

Without a trained POS tagger, ambiguous words will be left without a tag:

# Hegh neH chav qoH.
1   Hegh    _       _       _       _       _       _       _       _
2   neH     _       _       _       _       _       _       _       _
3   chav    _       _       _       _       _       _       _       _
4   qoH     qoH     NOUN    N       _       _       _       _       _
5   .       .       PUNCT   PUNCT   _       _       _       _       _

# qanchoHpa' qoH, Hegh qoH.
1   qanchoHpa'      qan     VERB    V?.pa'  Person=3|ObjPerson=3,0  _       _       _       SuffixV3=-choH|SuffixV9=-pa'
2   qoH     qoH     NOUN    N       _       _       _       _       _
3   ,       ,       PUNCT   PUNCT   _       _       _       _       _
4   Hegh    _       _       _       _       _       _       _       _
5   qoH     qoH     NOUN    N       _       _       _       _       _
6   .       .       PUNCT   PUNCT   _       _       _       _       _

After training the tagger, it will take the “best guess” when deciding the POS.

# Hegh neH chav qoH.
1   Hegh    Hegh    VERB    VT      Person=3|ObjPerson=3,0  _       _       _       _
2   neH     neH     ADV     ADV     _       _       _       _       _
3   chav    chav    VERB    VT      Person=3|ObjPerson=3,0  _       _       _       _
4   qoH     qoH     NOUN    N       _       _       _       _       _
5   .       .       PUNCT   PUNCT   _       _       _       _       _

# qanchoHpa' qoH, Hegh qoH.
1   qanchoHpa'      qan     VERB    V?.pa'  Person=3|ObjPerson=3,0  _       _       _       SuffixV3=-choH|SuffixV9=-pa'
2   qoH     qoH     NOUN    N       _       _       _       _       _
3   ,       ,       PUNCT   PUNCT   _       _       _       _       _
4   Hegh    Hegh    VERB    VT      Person=3|ObjPerson=3,0  _       _       _       _
5   qoH     qoH     NOUN    N       _       _       _       _       _
6   .       .       PUNCT   PUNCT   _       _       _       _       _

In this example the tagger made a mistake: it classified the first Hegh as VT, although it should be N. I don’t have a correctly tagged corpus, so evaluating the tagger is currently impossible. :(

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

yajwiz-0.10.3.tar.gz (21.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

yajwiz-0.10.3-py3-none-any.whl (2.2 MB view details)

Uploaded Python 3

File details

Details for the file yajwiz-0.10.3.tar.gz.

File metadata

  • Download URL: yajwiz-0.10.3.tar.gz
  • Upload date:
  • Size: 21.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.25.1 setuptools/53.0.0 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.9.6

File hashes

Hashes for yajwiz-0.10.3.tar.gz
Algorithm Hash digest
SHA256 e706c039d92e5441109756fc8eee77c651b72f514fba46d6ca75a71e39b08820
MD5 bedc479c56bab656d6b8ded5e35ab9c6
BLAKE2b-256 51660e4eb32e8733cbe4560b0a98e9df31b48a16aad629f4292e16d36bb41d46

See more details on using hashes here.

File details

Details for the file yajwiz-0.10.3-py3-none-any.whl.

File metadata

  • Download URL: yajwiz-0.10.3-py3-none-any.whl
  • Upload date:
  • Size: 2.2 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.25.1 setuptools/53.0.0 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.9.6

File hashes

Hashes for yajwiz-0.10.3-py3-none-any.whl
Algorithm Hash digest
SHA256 5d837f86b3d262b0ac9dfd1fe400f6fb389dbe3e106d99d8a5aabd9b5578e2c1
MD5 f8d4fbec7487129d90a435153c375e06
BLAKE2b-256 2bd0fb513313655bd3e4b2394c02c7b906698032a8502b73d31e539e21bce784

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page