corus

Links to Russian corpora, functions for loading and parsing

These details have not been verified by PyPI

Project links

Homepage

Project description

Links to publicly available Russian corpora, plus code for loading and parsing them. 20+ datasets, 350+ Gb of text.

Usage

For example, let's use the dump of lenta.ru by @yutkin. Manually download the archive (the link is in the Reference section):

wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz

Use corus to load the data:

>>> from corus import load_lenta

>>> path = 'lenta-ru-news.csv.gz'
>>> records = load_lenta(path)
>>> next(records)

LentaRecord(
    url='https://lenta.ru/news/2018/12/14/cancer/',
    title='Названы регионы России с\xa0самой высокой смертностью от\xa0рака',
    text='Вице-премьер по социальным вопросам Татьяна Голикова рассказала, в каких регионах России зафиксирована наиболее высокая смертность от рака, сооб...',
    topic='Россия',
    tags='Общество'
)

Iterate over texts:

>>> records = load_lenta(path)
>>> for record in records:
...     text = record.text
...     ...
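
Since `next(records)` works on the result, `load_lenta` evidently returns a lazy iterator, so a quick look at the data does not require reading the whole dump. Below is a minimal sketch of counting topics over the first few records. `count_topics` is a hypothetical helper, not part of corus, and it only assumes that each record has a `topic` attribute, as in the `LentaRecord` shown above.

```python
from collections import Counter, namedtuple
from itertools import islice


def count_topics(records, limit=1000):
    """Count `topic` values over at most `limit` records from an iterator."""
    # islice stops after `limit` items, so the rest of the dump is never read
    return Counter(record.topic for record in islice(records, limit))


# Stand-in records so the sketch runs without the archive; with corus
# installed, pass load_lenta(path) instead of `sample`.
Record = namedtuple('Record', ['title', 'topic'])
sample = iter([
    Record('a', 'Россия'),
    Record('b', 'Мир'),
    Record('c', 'Россия'),
])
topics = count_topics(sample, limit=2)
print(topics.most_common())
```

Keeping the helper independent of any particular loader means it should work with any other loader whose records expose a `topic` attribute.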

For links to other datasets and their loaders see the Reference section.

Documentation

Materials are in Russian:

Install

corus supports Python 3.5+ and PyPy 3.

$ pip install corus

Reference

Dataset	API `from corus import`	Tags	Texts	Uncompressed	Description
Lenta.ru
Lenta.ru v1.0	`load_lenta` `#`	`news`	739 351	1.66 Gb	`wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz`
Lenta.ru v1.1+	`load_lenta2` `#`	`news`	800 975	1.94 Gb	`wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.1/lenta-ru-news.csv.bz2`
Lib.rus.ec	`load_librusec` `#`	`fiction`	301 871	144.92 Gb	Dump of lib.rus.ec prepared for RUSSE workshop `wget http://panchenko.me/data/russe/librusec_fb2.plain.gz`
Rossiya Segodnya	`load_ria_raw` `#` `load_ria` `#`	`news`	1 003 869	3.70 Gb	`wget https://github.com/RossiyaSegodnya/ria_news_dataset/raw/master/ria.json.gz`
Mokoron Russian Twitter Corpus	`load_mokoron` `#`	`social` `sentiment`	17 633 417	1.86 Gb	Russian Twitter sentiment markup. Manually download https://www.dropbox.com/s/9egqjszeicki4ho/db.sql
Wikipedia	`load_wiki` `#`		1 541 401	12.94 Gb	Russian Wiki dump `wget https://dumps.wikimedia.org/ruwiki/latest/ruwiki-latest-pages-articles.xml.bz2`
GramEval2020	`load_gramru` `#`		162 372	30.04 Mb	`wget https://github.com/dialogue-evaluation/GramEval2020/archive/master.zip` `unzip master.zip` `mv GramEval2020-master/dataTrain train` `mv GramEval2020-master/dataOpenTest dev` `rm -r master.zip GramEval2020-master` `wget https://github.com/AlexeySorokin/GramEval2020/raw/master/data/GramEval_private_test.conllu`
OpenCorpora	`load_corpora` `#`	`morph`	4 030	20.21 Mb	`wget http://opencorpora.org/files/export/annot/annot.opcorpora.xml.zip`
RusVectores SimLex-965	`load_simlex` `#`	`emb` `sim`			`wget https://rusvectores.org/static/testsets/ru_simlex965_tagged.tsv` `wget https://rusvectores.org/static/testsets/ru_simlex965.tsv`
Omnia Russica	`load_omnia` `#`	`morph` `web` `fiction`		489.62 Gb	Taiga + Wiki + Araneum. Read "Even larger Russian corpus" (https://events.spbu.ru/eventsContent/events/2019/corpora/corp_sborn.pdf). Manually download http://bit.ly/2ZT4BY9
factRuEval-2016	`load_factru` `#`	`ner` `news`	254	969.27 Kb	Manual PER, LOC, ORG markup prepared for 2016 Dialog competition `wget https://github.com/dialogue-evaluation/factRuEval-2016/archive/master.zip` `unzip master.zip` `rm master.zip`
Gareev	`load_gareev` `#`	`ner` `news`	97	455.02 Kb	Manual PER, ORG markup (no LOC). Email Rinat Gareev (gareev-rm@yandex.ru) to ask for the dataset. `tar -xvf rus-ner-news-corpus.iob.tar.gz` `rm rus-ner-news-corpus.iob.tar.gz`
Collection5	`load_ne5` `#`	`ner` `news`	1 000	2.96 Mb	News articles with manual PER, LOC, ORG markup `wget http://www.labinform.ru/pub/named_entities/collection5.zip` `unzip collection5.zip` `rm collection5.zip`
WiNER	`load_wikiner` `#`	`ner`	203 287	36.15 Mb	Sentences from Wiki auto annotated with PER, LOC, ORG tags `wget https://github.com/dice-group/FOX/raw/master/input/Wikiner/aij-wikiner-ru-wp3.bz2`
BSNLP-2019	`load_bsnlp` `#`	`ner`	464	1.16 Mb	Markup prepared for 2019 BSNLP Shared Task `wget http://bsnlp.cs.helsinki.fi/TRAININGDATA_BSNLP_2019_shared_task.zip` `wget http://bsnlp.cs.helsinki.fi/TESTDATA_BSNLP_2019_shared_task.zip` `unzip TRAININGDATA_BSNLP_2019_shared_task.zip` `unzip TESTDATA_BSNLP_2019_shared_task.zip -d test_pl_cs_ru_bg` `rm TRAININGDATA_BSNLP_2019_shared_task.zip TESTDATA_BSNLP_2019_shared_task.zip`
Persons-1000	`load_persons` `#`	`ner` `news`	1 000	2.96 Mb	Same as Collection5, only PER markup + normalized names `wget http://ai-center.botik.ru/Airec/ai-resources/Persons-1000.zip`
The Russian Drug Reaction Corpus (RuDReC)	`load_rudrec` `#`	`ner`	4 809	1.73 Kb	RuDReC is a new partially annotated corpus of consumer reviews in Russian about pharmaceutical products for the detection of health-related named entities and the effectiveness of pharmaceutical products. Here you can download and work with the annotated part, to get the raw part (1.4M reviews) please refer to https://github.com/cimm-kzn/RuDReC. `wget https://github.com/cimm-kzn/RuDReC/raw/master/data/rudrec_annotated.json`
Taiga	Large collection of Russian texts from various sources: news sites, magazines, literature, social networks `wget https://linghub.ru/static/Taiga/retagged_taiga.tar.gz` `tar -xzvf retagged_taiga.tar.gz`
Arzamas	`load_taiga_arzamas` `#`	`news`	311	4.50 Mb
Fontanka	`load_taiga_fontanka` `#`	`news`	342 683	786.23 Mb
Interfax	`load_taiga_interfax` `#`	`news`	46 429	77.55 Mb
KP	`load_taiga_kp` `#`	`news`	45 503	61.79 Mb
Lenta	`load_taiga_lenta` `#`	`news`	36 446	95.15 Mb
Taiga/N+1	`load_taiga_nplus1` `#`	`news`	7 696	24.96 Mb
Magazines	`load_taiga_magazines` `#`		39 890	2.19 Gb
Subtitles	`load_taiga_subtitles` `#`		19 011	909.08 Mb
Social	`load_taiga_social` `#`	`social`	1 876 442	648.18 Mb
Proza	`load_taiga_proza` `#`	`fiction`	1 732 434	38.25 Gb
Stihi	`load_taiga_stihi` `#`		9 157 686	12.80 Gb
Russian NLP Datasets	Several Russian news datasets from webhose.io, lenta.ru and other news sites.
News	`load_buriy_news` `#`	`news`	2 154 801	6.84 Gb	Dump of top 40 news + 20 fashion news sites. `wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2014.tar.bz2` `wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part1.tar.bz2` `wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part2.tar.bz2`
Webhose	`load_buriy_webhose` `#`	`news`	285 965	859.32 Mb	Dump from webhose.io, 300 sources for one month. `wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/webhose-2016.tar.bz2`
ODS #proj_news_viz	Several news sites scraped by members of #proj_news_viz ODS project.
Interfax	`load_ods_interfax` `#`	`news`	543 961	1.22 Gb	`wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/interfax.csv.gz`
Gazeta	`load_ods_gazeta` `#`	`news`	865 847	1.63 Gb	`wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/gazeta.csv.gz`
Izvestia	`load_ods_izvestia` `#`	`news`	86 601	307.19 Mb	`wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/iz.csv.gz`
Meduza	`load_ods_meduza` `#`	`news`	71 806	270.11 Mb	`wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/meduza.csv.gz`
RIA	`load_ods_ria` `#`	`news`	101 543	233.88 Mb	`wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/ria.csv.gz`
Russia Today	`load_ods_rt` `#`	`news`	106 644	187.12 Mb	`wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/rt.csv.gz`
TASS	`load_ods_tass` `#`	`news`	1 135 635	3.27 Gb	`wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/tass-001.csv.gz`
Universal Dependencies
GSD	`load_ud_gsd` `#`	`morph` `syntax`	5 030	1.01 Mb	`wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-dev.conllu` `wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-test.conllu` `wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-train.conllu`
Taiga	`load_ud_taiga` `#`	`morph` `syntax`	3 264	353.80 Kb	`wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-dev.conllu` `wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-test.conllu` `wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-train.conllu`
PUD	`load_ud_pud` `#`	`morph` `syntax`	1 000	207.78 Kb	`wget https://github.com/UniversalDependencies/UD_Russian-PUD/raw/master/ru_pud-ud-test.conllu`
SynTagRus	`load_ud_syntag` `#`	`morph` `syntax`	61 889	11.33 Mb	`wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-dev.conllu` `wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-test.conllu` `wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-train.conllu`
morphoRuEval-2017
General Internet-Corpus	`load_morphoru_gicrya` `#`	`morph`	83 148	10.58 Mb	`wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/GIKRYA_texts_new.zip` `unzip GIKRYA_texts_new.zip` `rm GIKRYA_texts_new.zip`
Russian National Corpus	`load_morphoru_rnc` `#`	`morph`	98 892	12.71 Mb	`wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/RNC_texts.rar` `unrar x RNC_texts.rar` `rm RNC_texts.rar`
OpenCorpora	`load_morphoru_corpora` `#`	`morph`	38 510	4.80 Mb	`wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/OpenCorpora_Texts.rar` `unrar x OpenCorpora_Texts.rar` `rm OpenCorpora_Texts.rar`
RUSSE Russian Semantic Relatedness
HJ: Human Judgements of Word Pairs	`load_russe_hj` `#`	`emb` `sim`			`wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/hj.csv`
RT: Synonyms and Hypernyms from the Thesaurus RuThes	`load_russe_rt` `#`	`emb` `sim`			`wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/rt.csv`
AE: Cognitive Associations from the Sociation.org Experiment	`load_russe_ae` `#`	`emb` `sim`			`wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-train.csv` `wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-test.csv` `wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/ae2.csv`
Toloka Datasets
Lexical Relations from the Wisdom of the Crowd (LRWC)	`load_toloka_lrwc` `#`	`emb` `sim`			`wget https://tlk.s3.yandex.net/dataset/LRWC.zip` `unzip LRWC.zip` `rm LRWC.zip`
The Russian Adverse Drug Reaction Corpus of Tweets (RuADReCT)	`load_ruadrect` `#`	`social`	9 515	2.09 Mb	This corpus was developed for the Social Media Mining for Health Applications (#SMM4H) Shared Task 2020 `wget https://github.com/cimm-kzn/RuDReC/raw/master/data/RuADReCT.zip` `unzip RuADReCT.zip` `rm RuADReCT.zip`
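
Several of the downloads in the table are single `.gz` or `.bz2` files that expand to gigabytes, so it can help to look at the first few lines before processing anything. The sketch below uses only the Python standard library. `open_text` and `head` are hypothetical helpers and not part of corus, and the sketch does not cover the `.zip`, `.rar` and `.tar` archives, which have to be unpacked as shown in the table.

```python
import bz2
import gzip
import os
import tempfile
from itertools import islice


def open_text(path, encoding='utf8'):
    """Open a plain, .gz or .bz2 file as a text stream, chosen by extension."""
    if path.endswith('.gz'):
        return gzip.open(path, mode='rt', encoding=encoding)
    if path.endswith('.bz2'):
        return bz2.open(path, mode='rt', encoding=encoding)
    return open(path, encoding=encoding)


def head(path, count=5):
    """Return the first `count` lines, decompressing only as much as needed."""
    with open_text(path) as file:
        return [line.rstrip('\n') for line in islice(file, count)]


# Demo on a tiny temporary archive instead of a real download
with tempfile.TemporaryDirectory() as directory:
    demo_path = os.path.join(directory, 'demo.csv.gz')
    with gzip.open(demo_path, mode='wt', encoding='utf8') as file:
        file.write('url,title\n')
        file.write('https://example.org/1,Первая\n')
        file.write('https://example.org/2,Вторая\n')
    first_lines = head(demo_path, count=2)
    print(first_lines)
```

For actual parsing, the corus loaders listed in the table take the path to the downloaded file directly, as in the Usage section.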

Support

Chat — https://t.me/natural_language_processing
Issues — https://github.com/natasha/corus/issues
Commercial support — https://lab.alexkuk.ru

Add new source

Implement corus/sources/<source>.py
Add import into corus/sources/__init__.py
Add meta into corus/sources/meta.py
Add example into docs.ipynb (check meta table is correct)
Run tests (readme is updated)
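
The first step above could look roughly like the sketch below. It is an illustration only: the `ExampleRecord` class, the `load_example` name and the CSV columns are all hypothetical and are not taken from the corus code base, whose internal helpers and record classes may differ. The sketch only follows the pattern visible in the Usage section, where a loader takes a path and lazily yields record objects.

```python
import csv
import gzip
import os
import tempfile
from collections import namedtuple

# Hypothetical record type; real corus record classes may be defined differently
ExampleRecord = namedtuple('ExampleRecord', ['url', 'title', 'text'])


def load_example(path):
    """Lazily yield ExampleRecord objects from a gzipped CSV with a header row."""
    # newline='' lets the csv module handle line breaks inside quoted fields
    with gzip.open(path, mode='rt', encoding='utf8', newline='') as file:
        for row in csv.DictReader(file):
            yield ExampleRecord(row['url'], row['title'], row['text'])


# Demo on a tiny temporary file so the sketch runs without any download
with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, 'example.csv.gz')
    with gzip.open(path, mode='wt', encoding='utf8', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(['url', 'title', 'text'])
        writer.writerow(['https://example.org/1', 'Заголовок', 'Текст новости'])
    # Consume the generator while the temporary file still exists
    records = list(load_example(path))
    print(records[0])
```

Writing the loader as a generator keeps memory use flat regardless of corpus size, which matters for the larger datasets in the Reference table.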

Development

Dev env

python -m venv ~/.venvs/natasha-corus
source ~/.venvs/natasha-corus/bin/activate

pip install -r requirements/dev.txt
pip install -e .

python -m ipykernel install --user --name natasha-corus

Lint + update docs

make lint
make exec-docs

Release

# Update setup.py version

git commit -am 'Up version'
git tag v0.10.0

git push
git push --tags

Release history

Version	Date
0.10.0	Jul 24, 2023
0.9.0	Mar 9, 2021
0.7.0	Jul 28, 2020
0.6.0	Mar 26, 2020
0.5.0	Jan 15, 2020
0.4.0	Aug 30, 2019
0.3.0	Jun 27, 2019
0.2.0	May 31, 2019
0.1.1	May 4, 2019


