Skip to main content

Links to russian corpora, functions for loading and parsing

Project description

CI

Links to publicly available Russian corpora + code for loading and parsing. 20+ datasets, 350Gb+ of text.

Usage

For example lets use dump of lenta.ru by @yutkin. Manually download the archive (link in the Reference section):

wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz

Use corus to load the data:

>>> from corus import load_lenta

>>> path = 'lenta-ru-news.csv.gz'
>>> records = load_lenta(path)
>>> next(records)

LentaRecord(
    url='https://lenta.ru/news/2018/12/14/cancer/',
    title='Названы регионы России с\xa0самой высокой смертностью от\xa0рака',
    text='Вице-премьер по социальным вопросам Татьяна Голикова рассказала, в каких регионах России зафиксирована наиболее высокая смертность от рака, сооб...',
    topic='Россия',
    tags='Общество'
)

Iterate over texts:

>>> records = load_lenta(path)
>>> for record in records:
...     text = record.text
...     ...

For links to other datasets and their loaders see the Reference section.

Documentation

Materials are in Russian:

Install

corus supports Python 3.5+, PyPy 3.

$ pip install corus

Reference

Dataset API from corus import Tags Texts Uncompressed Description
Lenta.ru
Lenta.ru v1.0 load_lenta # news 739 351 1.66 Gb wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz
Lenta.ru v1.1+ load_lenta2 # news 800 975 1.94 Gb wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.1/lenta-ru-news.csv.bz2
Lib.rus.ec load_librusec # fiction 301 871 144.92 Gb Dump of lib.rus.ec prepared for RUSSE workshop

wget http://panchenko.me/data/russe/librusec_fb2.plain.gz
Rossiya Segodnya load_ria_raw #
load_ria #
news 1 003 869 3.70 Gb wget https://github.com/RossiyaSegodnya/ria_news_dataset/raw/master/ria.json.gz
Mokoron Russian Twitter Corpus load_mokoron # social sentiment 17 633 417 1.86 Gb Russian Twitter sentiment markup

Manually download https://www.dropbox.com/s/9egqjszeicki4ho/db.sql
Wikipedia load_wiki # 1 541 401 12.94 Gb Russian Wiki dump

wget https://dumps.wikimedia.org/ruwiki/latest/ruwiki-latest-pages-articles.xml.bz2
GramEval2020 load_gramru # 162 372 30.04 Mb wget https://github.com/dialogue-evaluation/GramEval2020/archive/master.zip
unzip master.zip
mv GramEval2020-master/dataTrain train
mv GramEval2020-master/dataOpenTest dev
rm -r master.zip GramEval2020-master
wget https://github.com/AlexeySorokin/GramEval2020/raw/master/data/GramEval_private_test.conllu
OpenCorpora load_corpora # morph 4 030 20.21 Mb wget http://opencorpora.org/files/export/annot/annot.opcorpora.xml.zip
RusVectores SimLex-965 load_simlex # emb sim wget https://rusvectores.org/static/testsets/ru_simlex965_tagged.tsv
wget https://rusvectores.org/static/testsets/ru_simlex965.tsv
Omnia Russica load_omnia # morph web fiction 489.62 Gb Taiga + Wiki + Araneum. Read "Even larger Russian corpus" https://events.spbu.ru/eventsContent/events/2019/corpora/corp_sborn.pdf

Manually download http://bit.ly/2ZT4BY9
factRuEval-2016 load_factru # ner news 254 969.27 Kb Manual PER, LOC, ORG markup prepared for 2016 Dialog competition

wget https://github.com/dialogue-evaluation/factRuEval-2016/archive/master.zip
unzip master.zip
rm master.zip
Gareev load_gareev # ner news 97 455.02 Kb Manual PER, ORG markup (no LOC)

Email Rinat Gareev (gareev-rm@yandex.ru) ask for dataset
tar -xvf rus-ner-news-corpus.iob.tar.gz
rm rus-ner-news-corpus.iob.tar.gz
Collection5 load_ne5 # ner news 1 000 2.96 Mb News articles with manual PER, LOC, ORG markup

wget http://www.labinform.ru/pub/named_entities/collection5.zip
unzip collection5.zip
rm collection5.zip
WiNER load_wikiner # ner 203 287 36.15 Mb Sentences from Wiki auto annotated with PER, LOC, ORG tags

wget https://github.com/dice-group/FOX/raw/master/input/Wikiner/aij-wikiner-ru-wp3.bz2
BSNLP-2019 load_bsnlp # ner 464 1.16 Mb Markup prepared for 2019 BSNLP Shared Task

wget http://bsnlp.cs.helsinki.fi/TRAININGDATA_BSNLP_2019_shared_task.zip
wget http://bsnlp.cs.helsinki.fi/TESTDATA_BSNLP_2019_shared_task.zip
unzip TRAININGDATA_BSNLP_2019_shared_task.zip
unzip TESTDATA_BSNLP_2019_shared_task.zip -d test_pl_cs_ru_bg
rm TRAININGDATA_BSNLP_2019_shared_task.zip TESTDATA_BSNLP_2019_shared_task.zip
Persons-1000 load_persons # ner news 1 000 2.96 Mb Same as Collection5, only PER markup + normalized names

wget http://ai-center.botik.ru/Airec/ai-resources/Persons-1000.zip
The Russian Drug Reaction Corpus (RuDReC) load_rudrec # ner 4 809 1.73 Kb RuDReC is a new partially annotated corpus of consumer reviews in Russian about pharmaceutical products for the detection of health-related named entities and the effectiveness of pharmaceutical products. Here you can download and work with the annotated part, to get the raw part (1.4M reviews) please refer to https://github.com/cimm-kzn/RuDReC.

wget https://github.com/cimm-kzn/RuDReC/raw/master/data/rudrec_annotated.json
Taiga Large collection of Russian texts from various sources: news sites, magazines, literacy, social networks

wget https://linghub.ru/static/Taiga/retagged_taiga.tar.gz
tar -xzvf retagged_taiga.tar.gz
Arzamas load_taiga_arzamas # news 311 4.50 Mb
Fontanka load_taiga_fontanka # news 342 683 786.23 Mb
Interfax load_taiga_interfax # news 46 429 77.55 Mb
KP load_taiga_kp # news 45 503 61.79 Mb
Lenta load_taiga_lenta # news 36 446 95.15 Mb
Taiga/N+1 load_taiga_nplus1 # news 7 696 24.96 Mb
Magazines load_taiga_magazines # 39 890 2.19 Gb
Subtitles load_taiga_subtitles # 19 011 909.08 Mb
Social load_taiga_social # social 1 876 442 648.18 Mb
Proza load_taiga_proza # fiction 1 732 434 38.25 Gb
Stihi load_taiga_stihi # 9 157 686 12.80 Gb
Russian NLP Datasets Several Russian news datasets from webhose.io, lenta.ru and other news sites.
News load_buriy_news # news 2 154 801 6.84 Gb Dump of top 40 news + 20 fashion news sites.

wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2014.tar.bz2
wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part1.tar.bz2
wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part2.tar.bz2
Webhose load_buriy_webhose # news 285 965 859.32 Mb Dump from webhose.io, 300 sources for one month.

wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/webhose-2016.tar.bz2
ODS #proj_news_viz Several news sites scraped by members of #proj_news_viz ODS project.
Interfax load_ods_interfax # news 543 961 1.22 Gb wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/interfax.csv.gz
Gazeta load_ods_gazeta # news 865 847 1.63 Gb wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/gazeta.csv.gz
Izvestia load_ods_izvestia # news 86 601 307.19 Mb wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/iz.csv.gz
Meduza load_ods_meduza # news 71 806 270.11 Mb wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/meduza.csv.gz
RIA load_ods_ria # news 101 543 233.88 Mb wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/ria.csv.gz
Russia Today load_ods_rt # news 106 644 187.12 Mb wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/rt.csv.gz
TASS load_ods_tass # news 1 135 635 3.27 Gb wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/tass-001.csv.gz
Universal Dependencies
GSD load_ud_gsd # morph syntax 5 030 1.01 Mb wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-dev.conllu
wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-test.conllu
wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-train.conllu
Taiga load_ud_taiga # morph syntax 3 264 353.80 Kb wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-dev.conllu
wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-test.conllu
wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-train.conllu
PUD load_ud_pud # morph syntax 1 000 207.78 Kb wget https://github.com/UniversalDependencies/UD_Russian-PUD/raw/master/ru_pud-ud-test.conllu
SynTagRus load_ud_syntag # morph syntax 61 889 11.33 Mb wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-dev.conllu
wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-test.conllu
wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-train.conllu
morphoRuEval-2017
General Internet-Corpus load_morphoru_gicrya # morph 83 148 10.58 Mb wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/GIKRYA_texts_new.zip
unzip GIKRYA_texts_new.zip
rm GIKRYA_texts_new.zip
Russian National Corpus load_morphoru_rnc # morph 98 892 12.71 Mb wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/RNC_texts.rar
unrar x RNC_texts.rar
rm RNC_texts.rar
OpenCorpora load_morphoru_corpora # morph 38 510 4.80 Mb wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/OpenCorpora_Texts.rar
unrar x OpenCorpora_Texts.rar
rm OpenCorpora_Texts.rar
RUSSE Russian Semantic Relatedness
HJ: Human Judgements of Word Pairs load_russe_hj # emb sim wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/hj.csv
RT: Synonyms and Hypernyms from the Thesaurus RuThes load_russe_rt # emb sim wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/rt.csv
AE: Cognitive Associations from the Sociation.org Experiment load_russe_ae # emb sim wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-train.csv
wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-test.csv
wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/ae2.csv
Toloka Datasets
Lexical Relations from the Wisdom of the Crowd (LRWC) load_toloka_lrwc # emb sim wget https://tlk.s3.yandex.net/dataset/LRWC.zip
unzip LRWC.zip
rm LRWC.zip
The Russian Adverse Drug Reaction Corpus of Tweets (RuADReCT) load_ruadrect # social 9 515 2.09 Mb This corpus was developed for the Social Media Mining for Health Applications (#SMM4H) Shared Task 2020

wget https://github.com/cimm-kzn/RuDReC/raw/master/data/RuADReCT.zip
unzip RuADReCT.zip
rm RuADReCT.zip

Support

Add new source

  1. Implement corus/sources/<source>.py
  2. Add import into corus/sources/__init__.py
  3. Add meta into corus/source/meta.py
  4. Add example into docs.ipynb (check meta table is correct)
  5. Run tests (readme is updated)

Development

Dev env

python -m venv ~/.venvs/natasha-corus
source ~/.venvs/natasha-corus/bin/activate

pip install -r requirements/dev.txt
pip install -e .

python -m ipykernel install --user --name natasha-corus

Lint + update docs

make lint
make exec-docs

Release

# Update setup.py version

git commit -am 'Up version'
git tag v0.10.0

git push
git push --tags

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

corus-0.10.0.tar.gz (76.5 kB view hashes)

Uploaded Source

Built Distribution

corus-0.10.0-py3-none-any.whl (83.7 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page