mordl

Morphological parser (POS, lemmata, NER etc.)

MorDL: Morphological Tagger (POS, lemmata, NER etc.)

MorDL is a tool for organizing a pipeline for complete morphological sentence parsing (POS tagging, lemmatization, morphological feature tagging) and named-entity recognition.

Scores (accuracy) on the SynTagRus test dataset: UPOS: 99.35%; FEATS: 98.87% (tokens), 99.31% (tags); LEMMA: 99.50%. In all experiments, we used seed=42. Other seed values may help achieve better results. The models' hyperparameters can also be tuned.
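The two FEATS figures above can be read as two different ways of counting. The sketch below shows one plausible reading, not MorDL's own scoring code: token accuracy counts a token as correct only if its whole FEATS string matches, while tag accuracy scores each feature separately. The helper names `parse_feats` and `feats_accuracy` are hypothetical, and MorDL's exact definitions may differ.

```python
def parse_feats(feats):
    """Parse a CoNLL-U FEATS string like 'Case=Nom|Number=Sing' into a dict."""
    if feats in ('_', ''):
        return {}
    return dict(item.split('=', 1) for item in feats.split('|'))


def feats_accuracy(gold, pred):
    """Return (token_accuracy, tag_accuracy) for parallel lists of FEATS strings.

    Assumption: a tag is any feature name present in the gold or the
    predicted set; it is correct when both sides give it the same value.
    """
    token_hits = tag_hits = tag_total = 0
    for gold_feats, pred_feats in zip(gold, pred):
        g, p = parse_feats(gold_feats), parse_feats(pred_feats)
        token_hits += g == p
        names = set(g) | set(p)
        tag_total += len(names)
        tag_hits += sum(g.get(name) == p.get(name) for name in names)
    token_acc = token_hits / len(gold) if gold else 0.0
    tag_acc = tag_hits / tag_total if tag_total else 1.0
    return token_acc, tag_acc
```

Under this reading, a token with one wrong feature out of several lowers token accuracy by a whole token but tag accuracy by only one tag, which is consistent with the tag score being the higher of the two.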

Validation with the official evaluation script of the CoNLL 2018 Shared Task:

For inference on the SynTagRus test corpus, when the predicted fields were emptied and all other fields were left intact, the scores are the same as outlined above.
When the UPOS, FEATS and LEMMA taggers were applied serially, the scores were: UPOS: 99.35%; UFeats: 98.36%; AllTags: 98.21%; Lemmas: 98.88%.

For completeness, we included that script in our distribution, so you can use it to evaluate your own models, too. To simplify its use, we also made a wrapper for it, mordl.conll18_ud_eval.
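To reproduce the first evaluation setup, the fields to be predicted have to be emptied in the test corpus while everything else stays intact. The following is a minimal sketch of that preparation step on raw CoNLL-U text, using only the standard ten-column CoNLL-U layout. The function name `empty_fields` is hypothetical and is not part of MorDL.

```python
# Standard CoNLL-U column order
CONLLU_FIELDS = ('ID', 'FORM', 'LEMMA', 'UPOS', 'XPOS',
                 'FEATS', 'HEAD', 'DEPREL', 'DEPS', 'MISC')


def empty_fields(conllu_text, fields):
    """Replace the given fields with '_' in every token line.

    Comment lines (starting with '#') and blank sentence separators
    are returned unchanged.
    """
    columns = [CONLLU_FIELDS.index(name) for name in fields]
    result = []
    for line in conllu_text.split('\n'):
        if line and not line.startswith('#'):
            parts = line.split('\t')
            if len(parts) == len(CONLLU_FIELDS):
                for column in columns:
                    parts[column] = '_'
                line = '\t'.join(parts)
        result.append(line)
    return '\n'.join(result)
```

For example, `empty_fields(text, ['UPOS'])` would produce the input for evaluating a UPOS tagger alone.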

Installation

pip

MorDL supports Python 3.6 and Transformers 4.3.3 or later. To install via pip, run:

$ pip install mordl

If you currently have a previous version of MorDL installed, run:

$ pip install mordl -U

From Source

Alternatively, you can install MorDL from the source of this git repository:

$ git clone https://github.com/fostroll/mordl.git
$ cd mordl
$ pip install -e .

This gives you access to examples that are not included in the PyPI package.

Usage

Our taggers use separate models, so they can be used independently. However, to achieve the best results, the FEATS tagger uses UPOS tags during training, and the LEMMA and NER taggers use both UPOS and FEATS tags. Thus, for a fully untagged corpus, the tagging pipeline applies the taggers serially, as shown below (assuming that our goal is NER and we already have trained taggers of all types):

from mordl import UposTagger, FeatsTagger, NeTagger

tagger_u, tagger_f, tagger_n = UposTagger(), FeatsTagger(), NeTagger()
tagger_u.load('upos_model')
tagger_f.load('feats_model')
tagger_n.load('misc-ne_model')

tagger_n.predict(
    tagger_f.predict(
        tagger_u.predict('untagged.conllu')
    ), save_to='result.conllu'
)

Any tagger in our pipeline may be replaced with a better one if you have it. The weakness of separate taggers is that they take more space. If all models were created with BERT embeddings and you load them into memory simultaneously, they may take up to 9 GB on the GPU. If that does not fit on your GPU, you can use the device and dataset_device parameters during loading to distribute your models across several GPUs. Alternatively, if you just need to tag a corpus once, you may load the models serially:

from mordl import UposTagger, FeatsTagger, NeTagger

# Load, apply and release one tagger at a time, so that only one model
# is held in memory at any moment
tagger = UposTagger()
tagger.load('upos_model')
tagger.predict('untagged.conllu', save_to='result_upos.conllu')
del tagger  # just to be sure
tagger = FeatsTagger()
tagger.load('feats_model')
tagger.predict('result_upos.conllu', save_to='result_feats.conllu')
del tagger
tagger = NeTagger()
tagger.load('misc-ne_model')
tagger.predict('result_feats.conllu', save_to='result.conllu')
del tagger
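The serial pattern above can be wrapped in a small loop. This is only a sketch: `run_serially` and the `.stageN` naming of intermediate files are assumptions made for illustration, not part of MorDL. It relies only on the `.load(name)` and `.predict(corpus, save_to=...)` calls shown above.

```python
def run_serially(stages, input_path, output_path):
    """Apply taggers one after another, keeping one model loaded at a time.

    stages: list of (tagger_class, model_name) pairs, in pipeline order.
    Each stage reads the file written by the previous one; the last
    stage writes to output_path. Returns the path of the final file.
    """
    current = input_path
    last = len(stages) - 1
    for index, (tagger_class, model_name) in enumerate(stages):
        # Intermediate files get a distinct (hypothetical) suffix so that
        # no stage reads from and writes to the same file.
        target = output_path if index == last else f'{output_path}.stage{index}'
        tagger = tagger_class()
        tagger.load(model_name)
        tagger.predict(current, save_to=target)
        del tagger  # release the model before loading the next one
        current = target
    return current
```

With MorDL installed, a call might look like `run_serially([(UposTagger, 'upos_model'), (FeatsTagger, 'feats_model'), (NeTagger, 'misc-ne_model')], 'untagged.conllu', 'result.conllu')`.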

Don't use identical names for the input and output files when you call the .predict() methods. Normally, this causes no problem, because by default the methods load the whole input file into memory before tagging. But if the input file is large, you may want to use the split parameter so that the methods process the file in parts. In that case, the first part of the tagged data is saved before the next part is loaded, so identical names will lead to data loss.
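A simple guard can catch this mistake before any tagging starts. The sketch below uses only the standard library; `check_distinct_paths` is a hypothetical helper, not a MorDL function.

```python
import os


def check_distinct_paths(input_path, output_path):
    """Raise ValueError if both paths refer to the same file.

    Compares normalized absolute paths and, when both files already
    exist, also asks the OS whether they are the same file (this
    catches links pointing to the same file).
    """
    same = os.path.abspath(input_path) == os.path.abspath(output_path)
    if not same and os.path.exists(input_path) and os.path.exists(output_path):
        same = os.path.samefile(input_path, output_path)
    if same:
        raise ValueError(
            f'input and output refer to the same file: {input_path!r}'
        )
    return input_path, output_path
```

Calling it right before `.predict()` with the input name and the `save_to` value makes the failure explicit instead of silently truncating the corpus.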

The training process is also simple. If you have training corpora and you don't want any experiments, just run:

from mordl import UposTagger

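tagger = UposTagger()
# train_corpus and dev_corpus are assumed to be defined beforehand,
# e.g. as names of CoNLL-U files (see the supported input formats below)
tagger.load_train_corpus(train_corpus)
tagger.load_test_corpus(dev_corpus)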

stat = tagger.train('upos_model', device='cuda:0',
                    stage3_params={'save_as': 'upos_bert_model'})

This is the training pipeline for the UPOS tagger; the pipelines for the other taggers are identical.

For a more complete understanding of MorDL toolkit usage, refer to the Python notebook with the pipeline example in the examples directory of the MorDL GitHub repository. Also, the detailed descriptions are available in the docs:

MorDL Basics

Part of Speech Tagging

Single Feature Tagging

Multiple Feature Tagging

Lemmata Prediction

Named-entity Recognition

Supplements

Also, you can find training pipelines for different taggers in our example notebook.

This project was developed with a focus on the Russian language, but the few language-specific nuances we use are unlikely to worsen the quality of processing for other languages.

MorDL supports CoNLL-U (if the input/output is a file) and Parsed CoNLL-U (if the input/output is an object). MorDL also accepts Corpuscula's corpora wrappers as input.

License

MorDL is released under the BSD License. See the LICENSE file for more details.
