Links to Russian corpora, functions for loading and parsing
Links to publicly available Russian corpora + code for loading and parsing. 20+ datasets, 350Gb+ of text.
Usage
For example, let's use the dump of lenta.ru by @yutkin. Download the archive manually (link in the Reference section):
wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz
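Where wget is not available, the same download could be done from Python. The following is a minimal sketch using only the standard library. The helper names filename_from_url and download are hypothetical and not part of corus, and the sketch does not handle interrupted downloads.

```python
import os
import shutil
import urllib.request
from urllib.parse import urlparse

LENTA_URL = 'https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz'


def filename_from_url(url):
    """Return the last path component of the URL (hypothetical helper)."""
    return os.path.basename(urlparse(url).path)


def download(url, directory='.'):
    """Fetch url into directory unless the file already exists; return its path."""
    path = os.path.join(directory, filename_from_url(url))
    if not os.path.exists(path):
        # Stream the response to disk instead of reading it into memory.
        with urllib.request.urlopen(url) as response, open(path, 'wb') as target:
            shutil.copyfileobj(response, target)
    return path


# Usage (performs a network request, so it is left commented out):
# path = download(LENTA_URL)
```

The returned path can then be passed to load_lenta as in the example that follows.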
Use corus to load the data:
>>> from corus import load_lenta
>>> path = 'lenta-ru-news.csv.gz'
>>> records = load_lenta(path)
>>> next(records)
LentaRecord(
    url='https://lenta.ru/news/2018/12/14/cancer/',
    title='Названы регионы России с\xa0самой высокой смертностью от\xa0рака',
    text='Вице-премьер по социальным вопросам Татьяна Голикова рассказала, в каких регионах России зафиксирована наиболее высокая смертность от рака, сооб...',
    topic='Россия',
    tags='Общество'
)
Iterate over texts:
>>> records = load_lenta(path)
>>> for record in records:
...     text = record.text
...     ...
For links to other datasets and their loaders see the Reference section.
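Because the records are consumed one at a time, a large dump can be summarized without holding it in memory. The sketch below counts records per topic. To stay runnable without the archive it uses a stand-in Record type with the same fields as the LentaRecord shown above and a hypothetical sample_records generator. With the real data, load_lenta(path) would presumably be passed in instead.

```python
from collections import Counter, namedtuple
from itertools import islice

# Stand-in with the same fields as the LentaRecord shown above,
# so the sketch runs without the downloaded archive.
Record = namedtuple('Record', ['url', 'title', 'text', 'topic', 'tags'])


def sample_records():
    """Hypothetical generator that yields records lazily, one at a time."""
    yield Record('https://example.org/1', 'First', 'Text one', 'Россия', 'Общество')
    yield Record('https://example.org/2', 'Second', 'Text two', 'Мир', 'Политика')
    yield Record('https://example.org/3', 'Third', 'Text three', 'Россия', 'Общество')


def count_topics(records, limit=None):
    """Count records per topic, reading at most `limit` records from the iterator."""
    return Counter(record.topic for record in islice(records, limit))


counts = count_topics(sample_records())
print(counts.most_common(1))  # [('Россия', 2)]
```

Passing a limit, for example count_topics(records, limit=1000), is a cheap way to inspect a dataset before committing to a full pass.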
Install
corus supports Python 3.5+, PyPy 3.
$ pip install corus
Reference
Dataset | API (from corus import) | Tags | Texts | Uncompressed | Description |
---|---|---|---|---|---|
Lenta.ru | load_lenta | news | 739 351 | 1.66 Gb | wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz |
Lib.rus.ec | load_librusec | lit | 301 871 | 144.92 Gb | Dump of lib.rus.ec prepared for RUSSE workshop. wget http://panchenko.me/data/russe/librusec_fb2.plain.gz |
Rossiya Segodnya | load_ria_raw, load_ria | news | 1 003 869 | 3.70 Gb | wget https://github.com/RossiyaSegodnya/ria_news_dataset/raw/master/ria.json.gz |
factRuEval-2016 | load_factru | ner, news | 254 | 969.27 Kb | Manual PER, LOC, ORG markup prepared for 2016 Dialog competition. wget https://github.com/dialogue-evaluation/factRuEval-2016/archive/master.zip; unzip master.zip; rm master.zip |
Gareev | load_gareev | ner, news | 97 | 455.02 Kb | Manual PER, ORG markup (no LOC). Email Rinat Gareev (gareev-rm@yandex.ru) to ask for the dataset, then: tar -xvf rus-ner-news-corpus.iob.tar.gz; rm rus-ner-news-corpus.iob.tar.gz |
Collection5 | load_ne5 | ner, news | 1 000 | 2.96 Mb | News articles with manual PER, LOC, ORG markup. wget http://www.labinform.ru/pub/named_entities/collection5.zip; unzip collection5.zip; rm collection5.zip |
WiNER | load_wikiner | ner | 203 287 | 36.15 Mb | Sentences from Wiki auto annotated with PER, LOC, ORG tags. wget https://github.com/dice-group/FOX/raw/master/input/Wikiner/aij-wikiner-ru-wp3.bz2 |
BSNLP-2019 | load_bsnlp | ner | 464 | 1.16 Mb | Markup prepared for 2019 BSNLP Shared Task. wget http://bsnlp.cs.helsinki.fi/TRAININGDATA_BSNLP_2019_shared_task.zip; wget http://bsnlp.cs.helsinki.fi/TESTDATA_BSNLP_2019_shared_task.zip; unzip TRAININGDATA_BSNLP_2019_shared_task.zip; unzip TESTDATA_BSNLP_2019_shared_task.zip -d test_pl_cs_ru_bg; rm TRAININGDATA_BSNLP_2019_shared_task.zip TESTDATA_BSNLP_2019_shared_task.zip |
Persons-1000 | load_persons | ner, news | 1 000 | 2.96 Mb | Same as Collection5, only PER markup + normalized names. wget http://ai-center.botik.ru/Airec/ai-resources/Persons-1000.zip |
Mokoron Russian Twitter Corpus | load_mokoron | social | 17 633 417 | 1.86 Gb | Russian Twitter sentiment markup. Manually download https://www.dropbox.com/s/9egqjszeicki4ho/db.sql |
Wikipedia | load_wiki | | 1 541 401 | 12.94 Gb | Russian Wiki dump. wget https://dumps.wikimedia.org/ruwiki/latest/ruwiki-latest-pages-articles.xml.bz2 |
GramEval2020 | load_gramru | | 162 372 | 30.04 Mb | wget https://github.com/dialogue-evaluation/GramEval2020/archive/master.zip; unzip master.zip; mv GramEval2020-master/dataTrain train; mv GramEval2020-master/dataOpenTest dev; rm -r master.zip GramEval2020-master; wget https://github.com/AlexeySorokin/GramEval2020/raw/master/data/GramEval_private_test.conllu |
OpenCorpora | load_corpora | morph | 4 030 | 20.21 Mb | wget http://opencorpora.org/files/export/annot/annot.opcorpora.xml.zip |
RusVectores SimLex-965 | load_simlex | emb, sim | | | wget https://rusvectores.org/static/testsets/ru_simlex965_tagged.tsv; wget https://rusvectores.org/static/testsets/ru_simlex965.tsv |
Taiga | Large collection of Russian texts from various sources: news sites, magazines, literature, social networks. wget https://linghub.ru/static/Taiga/retagged_taiga.tar.gz; tar -xzvf retagged_taiga.tar.gz | ||||
Arzamas | load_taiga_arzamas | news | 311 | 4.50 Mb | |
Fontanka | load_taiga_fontanka | news | 342 683 | 786.23 Mb | |
Interfax | load_taiga_interfax | news | 46 429 | 77.55 Mb | |
KP | load_taiga_kp | news | 45 503 | 61.79 Mb | |
Lenta | load_taiga_lenta | news | 36 446 | 95.15 Mb | |
Taiga/N+1 | load_taiga_nplus1 | news | 7 696 | 24.96 Mb | |
Magazines | load_taiga_magazines | | 39 890 | 2.19 Gb | |
Subtitles | load_taiga_subtitles | | 19 011 | 909.08 Mb | |
Social | load_taiga_social | social | 1 876 442 | 648.18 Mb | |
Proza | load_taiga_proza | lit | 1 732 434 | 38.25 Gb | |
Stihi | load_taiga_stihi | | 9 157 686 | 12.80 Gb | |
Russian NLP Datasets | Several Russian news datasets from webhose.io, lenta.ru and other news sites. | ||||
Lenta | load_buriy_lenta | news | 699 777 | 1.57 Gb | wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/lenta.tar.bz2 |
News | load_buriy_news | news | 2 154 801 | 6.84 Gb | Dump of top 40 news + 20 fashion news sites. wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2014.tar.bz2; wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part1.tar.bz2; wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part2.tar.bz2 |
Webhose | load_buriy_webhose | news | 285 965 | 859.32 Mb | Dump from webhose.io, 300 sources for one month. wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/stress.tar.gz |
ODS #proj_news_viz | Several news sites scraped by members of #proj_news_viz ODS project. | ||||
Interfax | load_ods_interfax | news | 543 962 | 1.22 Gb | Manually download interfax_v1.csv.zip from https://drive.google.com/file/d/1M7z0YoOgpm53IsJ3qOhT_nfiDnGUPeys/view |
Gazeta | load_ods_gazeta | news | 865 847 | 1.63 Gb | Manually download gazeta_v1.csv.zip from https://drive.google.com/file/d/18B8CvHgmwwyz9GWBZ0TS6dE_x6gYnWCb/view |
Universal Dependencies | |||||
GSD | load_ud_gsd | morph, syntax | 5 030 | 1.01 Mb | wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-dev.conllu; wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-test.conllu; wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-train.conllu |
Taiga | load_ud_taiga | morph, syntax | 3 264 | 353.80 Kb | wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-dev.conllu; wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-test.conllu; wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-train.conllu |
PUD | load_ud_pud | morph, syntax | 1 000 | 207.78 Kb | wget https://github.com/UniversalDependencies/UD_Russian-PUD/raw/master/ru_pud-ud-test.conllu |
SynTagRus | load_ud_syntag | morph, syntax | 61 889 | 11.33 Mb | wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-dev.conllu; wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-test.conllu; wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-train.conllu |
morphoRuEval-2017 | |||||
General Internet-Corpus | load_morphoru_gicrya | morph | 83 148 | 10.58 Mb | wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/GIKRYA_texts_new.zip; unzip GIKRYA_texts_new.zip; rm GIKRYA_texts_new.zip |
Russian National Corpus | load_morphoru_rnc | morph | 98 892 | 12.71 Mb | wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/RNC_texts.rar; unrar x RNC_texts.rar; rm RNC_texts.rar |
OpenCorpora | load_morphoru_corpora | morph | 38 510 | 4.80 Mb | wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/OpenCorpora_Texts.rar; unrar x OpenCorpora_Texts.rar; rm OpenCorpora_Texts.rar |
RUSSE Russian Semantic Relatedness | |||||
HJ: Human Judgements of Word Pairs | load_russe_hj | emb, sim | | | wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/hj.csv |
RT: Synonyms and Hypernyms from the Thesaurus RuThes | load_russe_rt | emb, sim | | | wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/rt.csv |
AE: Cognitive Associations from the Sociation.org Experiment | load_russe_ae | emb, sim | | | wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-train.csv; wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-test.csv; wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/ae2.csv |
Toloka Datasets | |||||
Lexical Relations from the Wisdom of the Crowd (LRWC) | load_toloka_lrwc | emb, sim | | | wget https://tlk.s3.yandex.net/dataset/LRWC.zip; unzip LRWC.zip; rm LRWC.zip |
Support
- Chat — https://telegram.me/natural_language_processing
- Issues — https://github.com/natasha/corus/issues
Development
Tests:
make test
Add new source:
- Implement corus/sources/<source>.py
- Add import into corus/sources/__init__.py
- Add meta into corus/sources/meta.py
- Add example into docs.ipynb (check meta table is correct)
- Run tests (readme is updated)
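As a rough illustration of the first step, a source module could expose a record type and a loader that yields records lazily. The sketch below is an assumption about the general shape only. ExampleRecord, load_example and the title,text column layout are hypothetical, and the sketch uses the standard library rather than whatever internal helpers corus provides.

```python
import csv
import gzip
from collections import namedtuple

# Hypothetical record type; the field names are illustrative only.
ExampleRecord = namedtuple('ExampleRecord', ['title', 'text'])


def load_example(path):
    """Lazily yield ExampleRecord items from a gzipped CSV with title,text columns."""
    # Text mode with newline='' lets the csv module handle quoted line breaks.
    with gzip.open(path, mode='rt', encoding='utf8', newline='') as file:
        for row in csv.DictReader(file):
            yield ExampleRecord(row['title'], row['text'])
```

Yielding records one by one keeps memory use flat for multi-gigabyte dumps, which matches how load_lenta is used in the Usage section.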
Package:
make version
git push
git push --tags
make clean wheel upload