Comparing the quality and performance of NLP systems for the Russian language

Naeval compares the quality and performance of NLP systems for the Russian language. It is used to evaluate the components of the Natasha project: Razdel, Navec, and Slovnet.

Tokenization

See the Razdel evaluation section for more info.

                                    corpora         syntag          gicrya          rnc
                                    errors  time    errors  time    errors  time    errors  time
re.findall(\w+|\d+|\p+)             4161    0.5     2660    0.5     2277    0.4     7606    0.4
spacy                               4388    6.2     2103    5.8     1740    4.1     4057    3.9
nltk.word_tokenize                  14245   3.4     60893   3.3     13496   2.7     41485   2.9
mystem                              4514    5.0     3153    4.7     2497    3.7     2028    3.9
mosestokenizer                      1886    2.1     1330    1.9     1796    1.6     2123    1.7
segtok.word_tokenize                2772    2.3     1288    2.3     1759    1.8     1229    1.8
aatimofeev/spacy_russian_tokenizer  2930    48.7    719     51.1    678     39.5    2681    52.2
koziev/rutokenizer                  2627    1.1     1386    1.0     2893    0.8     9411    0.9
razdel.tokenize                     1510    2.9     1483    2.8     322     2.0     2124    2.2
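
The first row of the table is a plain regular-expression baseline. The following is a minimal sketch of such a baseline, not Naeval's own code. Python's built-in re module has no \p class, so the sketch assumes that "punctuation" means any run of characters that are neither word characters nor whitespace. The name regex_tokenize is hypothetical.

```python
import re

# Approximation of the baseline pattern \w+|\d+|\p+ from the table.
# Assumption: punctuation is matched as [^\w\s]+ because the standard
# re module does not support \p. Digits are already covered by \w.
TOKEN = re.compile(r'\w+|[^\w\s]+')


def regex_tokenize(text):
    """Return (start, stop, token) triples for every match in text."""
    return [(m.start(), m.end(), m.group()) for m in TOKEN.finditer(text)]


tokens = regex_tokenize('Привет, мир!')
print([token for _, _, token in tokens])
```

Keeping character offsets next to each token makes it possible to compare the predicted token boundaries with a reference segmentation of the same text.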

Sentence segmentation

                            corpora         syntag          gicrya          rnc
                            errors  time    errors  time    errors  time    errors  time
re.split([.?!…])            20456   0.9     6576    0.6     10084   0.7     23356   1.0
segtok.split_single         19008   17.8    4422    13.4    159738  1.1     164218  2.8
mosestokenizer              41666   8.9     22082   5.7     12663   6.4     50560   7.4
nltk.sent_tokenize          16420   10.1    4350    5.3     7074    5.6     32534   8.9
deeppavlov/rusenttokenize   10192   10.9    1210    7.9     8910    6.8     21410   7.0
razdel.sentenize            9274    6.1     824     3.9     11414   4.5     10594   7.5
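
The re.split baseline in the first row can be sketched as below. This is an illustration under assumptions, not the code Naeval runs: the helper name regex_sentenize is hypothetical, and stripping whitespace and dropping empty pieces is a choice made here for readability.

```python
import re

# Same character class as the baseline row: split on . ? ! and the ellipsis.
SENT_END = re.compile(r'[.?!…]')


def regex_sentenize(text):
    """Split text at every sentence-final character, dropping empty pieces."""
    parts = SENT_END.split(text)
    return [part.strip() for part in parts if part.strip()]


print(regex_sentenize('Привет. Как дела? Хорошо!'))
# An abbreviation such as "г." is also treated as a sentence end.
print(regex_sentenize('Он живёт в г. Москва.'))
```

The second call shows one kind of mistake such a baseline makes: it cannot tell an abbreviation dot from a sentence boundary.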

Pretrained embeddings

See the Navec evaluation section for more info.

                                            type      init, s   get, µs   disk, mb  ram, mb   vocab
ruscorpora_upos_cbow_300_20_2019            w2v       12.1      1.6       220.6     236.1     189K
ruwikiruscorpora_upos_skipgram_300_2_2019   w2v       15.7      1.7       290.0     309.4     248K
tayga_upos_skipgram_300_2_2019              w2v       15.7      1.2       290.7     310.9     249K
tayga_none_fasttextcbow_300_10_2019         fasttext  11.3      14.3      2741.9    2746.9    192K
araneum_none_fasttextcbow_300_5_2018        fasttext  7.8       15.4      2752.1    2754.7    195K
hudlit_12B_500K_300d_100q                   navec     1.0       19.9      50.6      95.3      500K
news_1B_250K_300d_100q                      navec     0.5       20.3      25.4      47.7      250K

                                            type      simlex  hj      rt      ae      ae2     lrwc
ruscorpora_upos_cbow_300_20_2019            w2v       0.359   0.685   0.852   0.758   0.896   0.602
ruwikiruscorpora_upos_skipgram_300_2_2019   w2v       0.321   0.723   0.817   0.801   0.860   0.629
tayga_upos_skipgram_300_2_2019              w2v       0.429   0.749   0.871   0.771   0.899   0.639
tayga_none_fasttextcbow_300_10_2019         fasttext  0.369   0.639   0.793   0.682   0.813   0.536
araneum_none_fasttextcbow_300_5_2018        fasttext  0.349   0.671   0.801   0.706   0.793   0.579
hudlit_12B_500K_300d_100q                   navec     0.310   0.707   0.842   0.931   0.923   0.604
news_1B_250K_300d_100q                      navec     0.230   0.590   0.784   0.866   0.861   0.589

Morphology taggers

                  news      wiki      fiction   social    poetry
rupostagger       0.673     0.645     0.661     0.641     0.636
rnnmorph          0.896     0.812     0.890     0.860     0.838
maru              0.894     0.808     0.887     0.861     0.840
udpipe            0.918     0.811     0.957     0.870     0.776
spacy             0.919     0.812     0.938     0.836     0.729
deeppavlov        0.940     0.841     0.944     0.870     0.857
deeppavlov_bert   0.951     0.868     0.964     0.892     0.865

                  init, s   disk, mb  ram, mb   speed, it/s
rupostagger       4.8       3         118       48.0
rnnmorph          8.7       10        289       16.6
maru              15.8      44        370       36.4
udpipe            6.9       45        242       56.2
spacy             10.9      89        579       30.6
deeppavlov        4.0       32        10240     90.0 (gpu)
deeppavlov_bert   20.0      1393      8704      85.0 (gpu)

Syntax parser

                  news            wiki            fiction         social          poetry
                  uas     las     uas     las     uas     las     uas     las     uas     las
udpipe            0.873   0.823   0.622   0.531   0.910   0.876   0.700   0.624   0.625   0.534
spacy             0.876   0.818   0.770   0.665   0.880   0.833   0.757   0.666   0.657   0.544
deeppavlov_bert   0.962   0.910   0.882   0.786   0.963   0.929   0.844   0.761   0.784   0.691

                  init, s   disk, mb  ram, mb   speed, it/s
udpipe            6.9       45        242       56.2
spacy             10.9      89        579       31.6
deeppavlov_bert   34.0      1427      8704      75.0 (gpu)
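
The uas and las columns are the usual attachment scores for dependency parsing: the unlabeled score counts tokens whose predicted head is correct, and the labeled score also requires the correct relation label. The sketch below shows that computation on a toy sentence. The function name and the (head, label) input layout are assumptions made for illustration, and Naeval's own scoring may treat details such as punctuation differently.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) for two parallel lists of (head, label) pairs."""
    if len(gold) != len(pred):
        raise ValueError('gold and pred must cover the same tokens')
    if not gold:
        raise ValueError('at least one token is required')
    # UAS: the head index matches. LAS: head index and label both match.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    full_hits = sum(g == p for g, p in zip(gold, pred))
    return head_hits / len(gold), full_hits / len(gold)


# Toy four-token sentence; head 0 stands for the root.
gold = [(2, 'nsubj'), (0, 'root'), (2, 'obj'), (2, 'punct')]
pred = [(2, 'nsubj'), (0, 'root'), (2, 'iobj'), (3, 'punct')]
uas, las = attachment_scores(gold, pred)
print(uas, las)  # 0.75 0.5
```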

NER

See the Slovnet evaluation section for more info.

                  factru                  gareev          ne5                     bsnlp
f1                PER     LOC     ORG     PER     ORG     PER     LOC     ORG     PER     LOC     ORG
deeppavlov        0.910   0.886   0.742   0.944   0.798   0.942   0.919   0.881   0.866   0.767   0.624
deeppavlov_bert   0.971   0.928   0.825   0.980   0.916   0.997   0.990   0.976   0.954   0.840   0.741
pullenti          0.905   0.814   0.686   0.939   0.639   0.952   0.862   0.683   0.900   0.769   0.566
texterra          0.900   0.800   0.597   0.888   0.561   0.901   0.777   0.594   0.858   0.783   0.548
tomita            0.929                   0.921           0.945                   0.881
natasha           0.867   0.753   0.297   0.873   0.347   0.852   0.709   0.394   0.836   0.755   0.350
mitie             0.888   0.861   0.532   0.849   0.452   0.753   0.642   0.432   0.736   0.801   0.524

                  init, s   disk, mb  ram, mb   speed, articles/s
deeppavlov        5.9       1024      3072      24.3 (gpu)
deeppavlov_bert   34.5      2048      6144      13.1 (gpu)
pullenti          2.9       16        253       6.0
texterra          47.6      193       3379      4.0
tomita            2.0       64        63        29.8
natasha           2.0       1         160       8.8
mitie             28.3      327       261       32.8
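
The quality table reports F1 separately for each entity type (PER, LOC, ORG). A minimal sketch of a per-type F1 follows. It assumes that a predicted span counts as correct only when its start, stop, and type all match a reference span exactly; Naeval's actual matching rule may differ. The name f1_by_type and the (start, stop, type) layout are hypothetical.

```python
from collections import Counter


def f1_by_type(gold, pred):
    """Return {type: F1} for spans given as (start, stop, type) triples."""
    gold, pred = set(gold), set(pred)
    n_gold = Counter(type_ for _, _, type_ in gold)
    n_pred = Counter(type_ for _, _, type_ in pred)
    # Assumption: only exact (start, stop, type) matches count as hits.
    n_hit = Counter(type_ for _, _, type_ in gold & pred)
    scores = {}
    for type_ in sorted(set(n_gold) | set(n_pred)):
        precision = n_hit[type_] / n_pred[type_] if n_pred[type_] else 0.0
        recall = n_hit[type_] / n_gold[type_] if n_gold[type_] else 0.0
        total = precision + recall
        scores[type_] = 2 * precision * recall / total if total else 0.0
    return scores


gold = [(0, 5, 'PER'), (10, 16, 'LOC'), (20, 30, 'ORG')]
pred = [(0, 5, 'PER'), (10, 16, 'ORG'), (20, 30, 'ORG')]
print(f1_by_type(gold, pred))
```

In this toy case the mislabeled span lowers ORG precision and leaves LOC with no correct prediction, while PER is unaffected.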
