Links to Russian corpora, functions for loading and parsing
Links to publicly available Russian corpora + code for loading and parsing. 20+ datasets, 350Gb+ of text.
Usage
For example, let's use the dump of lenta.ru by @yutkin. Manually download the archive (the link is in the Reference section):
wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz
Use corus to load the data:
>>> from corus import load_lenta
>>> path = 'lenta-ru-news.csv.gz'
>>> records = load_lenta(path)
>>> next(records)
LentaRecord(
url='https://lenta.ru/news/2018/12/14/cancer/',
title='Названы регионы России с\xa0самой высокой смертностью от\xa0рака',
text='Вице-премьер по социальным вопросам Татьяна Голикова рассказала, в каких регионах России зафиксирована наиболее высокая смертность от рака, сооб...',
topic='Россия',
tags='Общество'
)
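The title in the record above contains `\xa0` (non-breaking space) characters, which often show up in text scraped from news sites. A minimal sketch of replacing them before further processing, using only the standard library; the helper name `replace_nbsp` is hypothetical and not part of corus:

```python
# Hypothetical helper, not part of corus: swap non-breaking spaces
# for ordinary spaces so that naive whitespace splitting works.
def replace_nbsp(text):
    return text.replace('\xa0', ' ')


# Title copied from the LentaRecord shown above.
title = 'Названы регионы России с\xa0самой высокой смертностью от\xa0рака'
clean_title = replace_nbsp(title)
print(clean_title)
```

The same helper could be applied to `record.text` while iterating over records.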
Iterate over texts:
>>> records = load_lenta(path)
>>> for record in records:
... text = record.text
... ...
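In the examples above the records are consumed with `next()` and the loader is called again before the loop, which suggests a one-pass iterator, so aggregates are best computed in a single streaming pass. A minimal sketch that counts records per `topic`; the `Record` namedtuple and the sample values are stand-ins for illustration only, and with corus installed one would presumably pass `load_lenta(path)` instead:

```python
from collections import Counter, namedtuple
from itertools import islice

# Stand-in for corus's LentaRecord, for illustration only.
Record = namedtuple('Record', ['title', 'text', 'topic'])


def topic_counts(records, limit=None):
    """Count records per topic in one pass over at most `limit` records."""
    return Counter(record.topic for record in islice(records, limit))


# Made-up sample data; a real run would stream records from the loader.
sample = iter([
    Record('a', '...', 'Россия'),
    Record('b', '...', 'Мир'),
    Record('c', '...', 'Россия'),
])
counts = topic_counts(sample)
print(counts.most_common())
```

Passing `limit` is handy for a quick look at a large dump without reading the whole archive.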
For links to other datasets and their loaders see the Reference section.
Documentation
Materials are in Russian:
Install
corus supports Python 3.5+ and PyPy 3.
$ pip install corus
Reference
| Dataset | API (from corus import) | Tags | Texts | Uncompressed | Description |
|---|---|---|---|---|---|
| Lenta.ru | | | | | |
| Lenta.ru v1.0 | load_lenta | news | 739 351 | 1.66 Gb | wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz |
| Lenta.ru v1.1+ | load_lenta2 | news | 800 975 | 1.94 Gb | wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.1/lenta-ru-news.csv.bz2 |
| Lib.rus.ec | load_librusec | fiction | 301 871 | 144.92 Gb | Dump of lib.rus.ec prepared for RUSSE workshop<br>wget http://panchenko.me/data/russe/librusec_fb2.plain.gz |
| Rossiya Segodnya | load_ria_raw, load_ria | news | 1 003 869 | 3.70 Gb | wget https://github.com/RossiyaSegodnya/ria_news_dataset/raw/master/ria.json.gz |
| Mokoron Russian Twitter Corpus | load_mokoron | social, sentiment | 17 633 417 | 1.86 Gb | Russian Twitter sentiment markup<br>Manually download https://www.dropbox.com/s/9egqjszeicki4ho/db.sql |
| Wikipedia | load_wiki | | 1 541 401 | 12.94 Gb | Russian Wiki dump<br>wget https://dumps.wikimedia.org/ruwiki/latest/ruwiki-latest-pages-articles.xml.bz2 |
| GramEval2020 | load_gramru | | 162 372 | 30.04 Mb | wget https://github.com/dialogue-evaluation/GramEval2020/archive/master.zip<br>unzip master.zip<br>mv GramEval2020-master/dataTrain train<br>mv GramEval2020-master/dataOpenTest dev<br>rm -r master.zip GramEval2020-master<br>wget https://github.com/AlexeySorokin/GramEval2020/raw/master/data/GramEval_private_test.conllu |
| OpenCorpora | load_corpora | morph | 4 030 | 20.21 Mb | wget http://opencorpora.org/files/export/annot/annot.opcorpora.xml.zip |
| RusVectores SimLex-965 | load_simlex | emb, sim | | | wget https://rusvectores.org/static/testsets/ru_simlex965_tagged.tsv<br>wget https://rusvectores.org/static/testsets/ru_simlex965.tsv |
| Omnia Russica | load_omnia | morph, web, fiction | | 489.62 Gb | Taiga + Wiki + Araneum. Read "Even larger Russian corpus" https://events.spbu.ru/eventsContent/events/2019/corpora/corp_sborn.pdf<br>Manually download http://bit.ly/2ZT4BY9 |
| factRuEval-2016 | load_factru | ner, news | 254 | 969.27 Kb | Manual PER, LOC, ORG markup prepared for 2016 Dialog competition<br>wget https://github.com/dialogue-evaluation/factRuEval-2016/archive/master.zip<br>unzip master.zip<br>rm master.zip |
| Gareev | load_gareev | ner, news | 97 | 455.02 Kb | Manual PER, ORG markup (no LOC)<br>Email Rinat Gareev (gareev-rm@yandex.ru) to ask for the dataset<br>tar -xvf rus-ner-news-corpus.iob.tar.gz<br>rm rus-ner-news-corpus.iob.tar.gz |
| Collection5 | load_ne5 | ner, news | 1 000 | 2.96 Mb | News articles with manual PER, LOC, ORG markup<br>wget http://www.labinform.ru/pub/named_entities/collection5.zip<br>unzip collection5.zip<br>rm collection5.zip |
| WiNER | load_wikiner | ner | 203 287 | 36.15 Mb | Sentences from Wiki auto-annotated with PER, LOC, ORG tags<br>wget https://github.com/dice-group/FOX/raw/master/input/Wikiner/aij-wikiner-ru-wp3.bz2 |
| BSNLP-2019 | load_bsnlp | ner | 464 | 1.16 Mb | Markup prepared for 2019 BSNLP Shared Task<br>wget http://bsnlp.cs.helsinki.fi/TRAININGDATA_BSNLP_2019_shared_task.zip<br>wget http://bsnlp.cs.helsinki.fi/TESTDATA_BSNLP_2019_shared_task.zip<br>unzip TRAININGDATA_BSNLP_2019_shared_task.zip<br>unzip TESTDATA_BSNLP_2019_shared_task.zip -d test_pl_cs_ru_bg<br>rm TRAININGDATA_BSNLP_2019_shared_task.zip TESTDATA_BSNLP_2019_shared_task.zip |
| Persons-1000 | load_persons | ner, news | 1 000 | 2.96 Mb | Same as Collection5, only PER markup + normalized names<br>wget http://ai-center.botik.ru/Airec/ai-resources/Persons-1000.zip |
| The Russian Drug Reaction Corpus (RuDReC) | load_rudrec | ner | 4 809 | 1.73 Kb | RuDReC is a new partially annotated corpus of consumer reviews in Russian about pharmaceutical products for the detection of health-related named entities and the effectiveness of pharmaceutical products. Here you can download and work with the annotated part; to get the raw part (1.4M reviews), please refer to https://github.com/cimm-kzn/RuDReC.<br>wget https://github.com/cimm-kzn/RuDReC/raw/master/data/rudrec_annotated.json |
| Taiga | | | | | Large collection of Russian texts from various sources: news sites, magazines, literature, social networks<br>wget https://linghub.ru/static/Taiga/retagged_taiga.tar.gz<br>tar -xzvf retagged_taiga.tar.gz |
| Arzamas | load_taiga_arzamas | news | 311 | 4.50 Mb | |
| Fontanka | load_taiga_fontanka | news | 342 683 | 786.23 Mb | |
| Interfax | load_taiga_interfax | news | 46 429 | 77.55 Mb | |
| KP | load_taiga_kp | news | 45 503 | 61.79 Mb | |
| Lenta | load_taiga_lenta | news | 36 446 | 95.15 Mb | |
| Taiga/N+1 | load_taiga_nplus1 | news | 7 696 | 24.96 Mb | |
| Magazines | load_taiga_magazines | | 39 890 | 2.19 Gb | |
| Subtitles | load_taiga_subtitles | | 19 011 | 909.08 Mb | |
| Social | load_taiga_social | social | 1 876 442 | 648.18 Mb | |
| Proza | load_taiga_proza | fiction | 1 732 434 | 38.25 Gb | |
| Stihi | load_taiga_stihi | | 9 157 686 | 12.80 Gb | |
| Russian NLP Datasets | | | | | Several Russian news datasets from webhose.io, lenta.ru and other news sites. |
| News | load_buriy_news | news | 2 154 801 | 6.84 Gb | Dump of top 40 news + 20 fashion news sites.<br>wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2014.tar.bz2<br>wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part1.tar.bz2<br>wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part2.tar.bz2 |
| Webhose | load_buriy_webhose | news | 285 965 | 859.32 Mb | Dump from webhose.io, 300 sources for one month.<br>wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/webhose-2016.tar.bz2 |
| ODS #proj_news_viz | | | | | Several news sites scraped by members of #proj_news_viz ODS project. |
| Interfax | load_ods_interfax | news | 543 961 | 1.22 Gb | wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/interfax.csv.gz |
| Gazeta | load_ods_gazeta | news | 865 847 | 1.63 Gb | wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/gazeta.csv.gz |
| Izvestia | load_ods_izvestia | news | 86 601 | 307.19 Mb | wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/iz.csv.gz |
| Meduza | load_ods_meduza | news | 71 806 | 270.11 Mb | wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/meduza.csv.gz |
| RIA | load_ods_ria | news | 101 543 | 233.88 Mb | wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/ria.csv.gz |
| Russia Today | load_ods_rt | news | 106 644 | 187.12 Mb | wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/rt.csv.gz |
| TASS | load_ods_tass | news | 1 135 635 | 3.27 Gb | wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/tass-001.csv.gz |
| Universal Dependencies | | | | | |
| GSD | load_ud_gsd | morph, syntax | 5 030 | 1.01 Mb | wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-dev.conllu<br>wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-test.conllu<br>wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-train.conllu |
| Taiga | load_ud_taiga | morph, syntax | 3 264 | 353.80 Kb | wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-dev.conllu<br>wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-test.conllu<br>wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-train.conllu |
| PUD | load_ud_pud | morph, syntax | 1 000 | 207.78 Kb | wget https://github.com/UniversalDependencies/UD_Russian-PUD/raw/master/ru_pud-ud-test.conllu |
| SynTagRus | load_ud_syntag | morph, syntax | 61 889 | 11.33 Mb | wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-dev.conllu<br>wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-test.conllu<br>wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-train.conllu |
| morphoRuEval-2017 | | | | | |
| General Internet-Corpus | load_morphoru_gicrya | morph | 83 148 | 10.58 Mb | wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/GIKRYA_texts_new.zip<br>unzip GIKRYA_texts_new.zip<br>rm GIKRYA_texts_new.zip |
| Russian National Corpus | load_morphoru_rnc | morph | 98 892 | 12.71 Mb | wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/RNC_texts.rar<br>unrar x RNC_texts.rar<br>rm RNC_texts.rar |
| OpenCorpora | load_morphoru_corpora | morph | 38 510 | 4.80 Mb | wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/OpenCorpora_Texts.rar<br>unrar x OpenCorpora_Texts.rar<br>rm OpenCorpora_Texts.rar |
| RUSSE Russian Semantic Relatedness | | | | | |
| HJ: Human Judgements of Word Pairs | load_russe_hj | emb, sim | | | wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/hj.csv |
| RT: Synonyms and Hypernyms from the Thesaurus RuThes | load_russe_rt | emb, sim | | | wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/rt.csv |
| AE: Cognitive Associations from the Sociation.org Experiment | load_russe_ae | emb, sim | | | wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-train.csv<br>wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-test.csv<br>wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/ae2.csv |
| Toloka Datasets | | | | | |
| Lexical Relations from the Wisdom of the Crowd (LRWC) | load_toloka_lrwc | emb, sim | | | wget https://tlk.s3.yandex.net/dataset/LRWC.zip<br>unzip LRWC.zip<br>rm LRWC.zip |
| The Russian Adverse Drug Reaction Corpus of Tweets (RuADReCT) | load_ruadrect | social | 9 515 | 2.09 Mb | This corpus was developed for the Social Media Mining for Health Applications (#SMM4H) Shared Task 2020<br>wget https://github.com/cimm-kzn/RuDReC/raw/master/data/RuADReCT.zip<br>unzip RuADReCT.zip<br>rm RuADReCT.zip |
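In the Usage example, the path passed to the loader is the file name that `wget` saves, that is, the last path segment of the download URL. A small sketch of deriving that name with the standard library; `local_name` is a hypothetical helper, not part of corus, and it does not cover rows marked "Manually download" or archives that have to be unpacked first:

```python
from posixpath import basename
from urllib.parse import urlparse


# Hypothetical helper, not part of corus: the file name wget would use
# by default for a plain download URL.
def local_name(url):
    return basename(urlparse(url).path)


url = 'https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz'
path = local_name(url)
print(path)  # lenta-ru-news.csv.gz
```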
Support
- Chat — https://t.me/natural_language_processing
- Issues — https://github.com/natasha/corus/issues
- Commercial support — https://lab.alexkuk.ru
Add new source
- Implement corus/sources/<source>.py
- Add import into corus/sources/__init__.py
- Add meta into corus/sources/meta.py
- Add example into docs.ipynb (check meta table is correct)
- Run tests (readme is updated)
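As a rough illustration of the first step, a source module might boil down to a record type plus a `load_<source>(path)` function that lazily yields records, mirroring how `load_lenta` is used above. The sketch below relies only on the standard library and assumes a gzip-compressed CSV with a header row; the names `ExampleRecord` and `load_example` and the column layout are hypothetical, and the actual corus sources may be built on different internal helpers:

```python
import csv
import gzip
import os
import tempfile
from collections import namedtuple

# Hypothetical record type; the column layout is an assumption.
ExampleRecord = namedtuple('ExampleRecord', ['url', 'title', 'text'])


def load_example(path):
    """Lazily yield one ExampleRecord per CSV row, skipping the header."""
    with gzip.open(path, mode='rt', encoding='utf8', newline='') as file:
        rows = csv.reader(file)
        next(rows, None)  # skip the header row
        for row in rows:
            yield ExampleRecord(*row)


# Demo on a tiny made-up archive written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'example.csv.gz')
with gzip.open(path, mode='wt', encoding='utf8', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['url', 'title', 'text'])
    writer.writerow(['https://example.com/1', 'Заголовок', 'Текст'])

records = load_example(path)
first = next(records)
print(first)
```

Yielding records from a generator keeps memory use flat, which matters for the multi-gigabyte dumps listed in the Reference section.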
Development
Dev env
python -m venv ~/.venvs/natasha-corus
source ~/.venvs/natasha-corus/bin/activate
pip install -r requirements/dev.txt
pip install -e .
python -m ipykernel install --user --name natasha-corus
Lint + update docs
make lint
make exec-docs
Release
# Update setup.py version
git commit -am 'Up version'
git tag v0.10.0
git push
git push --tags