rnc · PyPI

API for Russian National Corpus

Reason this release was yanked:

Very old, RIP

Project description

Installation

pip install rnc

Structure

A Corpus object contains a list of the obtained examples. There are two types of examples:

If out is normal, the API uses the normal example type, whose name matches the Corpus class name:

ru = rnc.MainCorpus(...)
ru.request_examples()

print(type(ru[0]))
>>> MainExample

If out is kwic, the API uses KwicExample.

Fields of Example objects

Usage

import rnc

ru = rnc.MainCorpus(
    query='корпус', 
    p_count=5,
    file='filename.csv',
    marker=str.upper,
    **kwargs
)

ru.request_examples()

query – a str or a dict with tags. The words to find; give them in their dictionary form.
p_count – number of PAGES to request.
file – path to a local csv file, optional. Example: file='data\\filename.csv'.
marker – function used to mark the found wordforms, optional.
kwargs – additional params.
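A marker is any callable that takes a found wordform and returns its marked version. The following is a minimal sketch of a custom marker; the name `bracket_marker` is hypothetical and not part of rnc, and the way the library applies the marker internally is not shown.

```python
def bracket_marker(wordform: str) -> str:
    """Hypothetical custom marker: wrap a found wordform in square brackets."""
    return f"[{wordform}]"


# The built-in alternative used in the examples of this document.
upper_marker = str.upper

print(bracket_marker("корпус"))  # [корпус]
print(upper_marker("корпус"))    # КОРПУС
```

Such a function would be passed uncalled, for example marker=bracket_marker.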

Corpora you can use.

Full query form

query = {
    'word1': {
        'gramm': 'acc', # grammar tags for lexgramm search
        'flags': 'bdot' # additional tags for lexgramm search
    },
    # a value may be a single string or a dict of params;
    # in a dict of params the key names are arbitrary, the values are tag names (listed below)
    'word2': {
        'gramm': {
            # the NAMES of these keys may be anything
            'pos (any name)': ['S', 'A'], # one value ('S') or a list of values
            'case (any name)': ['acc', 'nom'], # one value ('acc') or a list of values
        },
        'flags': {}, # same format as 'gramm'
        # distance between the first and the second word
        'min': 1,
        'max': 3
    },
}

corp = rnc.MainCorpus(
    query=query,
    p_count=5,
    file='filename.csv',
    marker=str.upper,
    **kwargs
)
corp.request_examples()
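Before sending a dict query it can help to check its shape locally. The sketch below is only an illustration under the rules described in this section (a value is a string or a dict, and min should not exceed max); `find_query_problems` is a hypothetical helper, not an rnc function, and rnc may perform its own, different validation.

```python
def find_query_problems(query: dict) -> list:
    """Hypothetical pre-flight check; returns a list of human-readable problems."""
    problems = []
    for word, params in query.items():
        if isinstance(params, str):
            continue  # a plain string of tags (or '') needs no further checks
        if not isinstance(params, dict):
            problems.append(f"{word!r}: params must be a str or a dict")
            continue
        low, high = params.get("min"), params.get("max")
        if low is not None and high is not None and low > high:
            problems.append(f"{word!r}: min ({low}) is greater than max ({high})")
    return problems


query = {
    "word1": {"gramm": "acc", "flags": "bdot"},
    "word2": {"gramm": {"pos": ["S", "A"]}, "flags": {}, "min": 1, "max": 3},
}
print(find_query_problems(query))                          # []
print(find_query_problems({"word": {"min": 5, "max": 2}}))
```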

Lexgramm search params

Query as a string

You can also pass the query as a string of dictionary forms separated by spaces: query = 'get down' or query = 'я получить'. The distance between the words is then the default one.
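One way to picture a string query is as a dict in which every word has empty params. This equivalence is an assumption made for illustration only; the source does not state how rnc parses the string, and `string_to_query` is a hypothetical helper.

```python
def string_to_query(text: str) -> dict:
    """Hypothetical helper: one key per space-separated word, with empty params."""
    return {word: "" for word in text.split()}


as_dict = string_to_query("я получить")
print(as_dict)  # {'я': '', 'получить': ''}
```

Note that a dict cannot hold the same word twice, so this picture breaks down for queries that repeat a word.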

Additional request params

These params are optional. The default values are shown here.

corp = rnc.ParallelCorpus(
    query=query, 
    p_count=5,
    file='filename.csv',
    marker=str.upper,

    dpp=5, # documents per page
    spd=10, # sentences per document
    text='lexgramm', # way to search: 'lexgramm' or 'lexform'
    out='normal', # output format: 'normal' or 'kwic'
    kwsz=5, # if out=kwic, count of words in context
    sort='i_grtagging', # way to sort the results
    subcorpus='', # see below how to set it
    accent=0, # with accentology (1) or without (0), if it's available
)

Sort keys

The API can work with a local database too

ru = rnc.SpokenCorpus(file='local_database.csv') # it must exist
print(ru)

If the file exists, the API works with it and you can't request new examples.

When working with a file, the only argument the Corpus requires is the file name (via file=...).
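Because the local file must already exist, a quick check before constructing the Corpus can save a failed call. This is a sketch using only the standard library; `can_use_local_file` is a hypothetical helper and not part of rnc.

```python
import tempfile
from pathlib import Path


def can_use_local_file(path: str) -> bool:
    """Hypothetical check: True if `path` points to an existing .csv file."""
    candidate = Path(path)
    return candidate.is_file() and candidate.suffix.lower() == ".csv"


# Demonstrate with a temporary directory instead of a real database.
with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "local_database.csv"
    existing.write_text("", encoding="utf-8")
    existing_result = can_use_local_file(str(existing))
    missing_result = can_use_local_file(str(Path(tmp) / "missing.csv"))

print(existing_result, missing_result)  # True False
```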

Working with corpora

corp = rnc.corpus_name(...)

corp.request_examples() – request examples. An exception is raised if:
- Data already exist.
- No results are found.
- The requested page doesn't exist (for example, there are 10 pages in the Corpus but you requested more than 10).
- There's a mistake in the request.
- You have no Internet access.
- There's a problem accessing the Corpus.
- Other problems occur.
corp.data – list of examples (only getter)
corp.query – query (only getter).
corp.forms_in_query – requested wordforms (only getter).
corp.p_count – requested count of pages (only getter).
corp.file – path to the local csv file (only getter).
corp.marker – marker (only getter).
corp.params – dict, HTTP tags (only getter).
corp.found_wordforms – dict with found wordforms and their frequency (only getter).
corp.ex_type – type of example (only getter).
corp.amount_of_docs – amount of docs where the query was found.
corp.amount_of_contexts – amount of contexts where the query was found.
corp.graphic_link – link to the distribution by years graphic.
corp.dump() – write two files: csv file with all data and json file with request params.
corp.copy() – create a copy.
corp.shuffle() – shuffle data.
corp.sort_data(key=, reverse=) – sort the list of examples. HTTP sort keys don't work here; key is applied to Example objects.
corp.pop(index) – remove and return the example at the index.
corp.clear() – empty the data list.
corp.filter(key) – remove some examples from the data list using the key. Key is applied to the Example objects.
corp.url – URL of the first Corpus page (only getter).
corp.open_url() – open the first Corpus page.
corp.open_graphic() – open the distribution by years graphic.
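The key passed to sort_data is described above as a function applied to Example objects. The sketch below shows what such keys look like, using the built-in sorted() on stand-in objects rather than rnc itself; `FakeKwicExample` is hypothetical, and only the field names left and src are taken from this document.

```python
from dataclasses import dataclass


@dataclass
class FakeKwicExample:
    """Stand-in for a KWIC example; not the rnc class."""
    left: str
    src: str


examples = [
    FakeKwicExample(left="в национальном", src="Source B"),
    FakeKwicExample(left="этот", src="Source A"),
]

# A key receives one example object and returns the value to sort by.
by_src = sorted(examples, key=lambda ex: ex.src)
by_left_len_desc = sorted(examples, key=lambda ex: len(ex.left), reverse=True)

print([ex.src for ex in by_src])             # ['Source A', 'Source B']
print([ex.left for ex in by_left_len_desc])  # ['в национальном', 'этот']
```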

Magic methods:

corp.dpp or another request param (only getter).
corp() – the same as request_examples().
str(corp) or print(corp) – str with info about the Corpus and enumerated examples. By default the Corpus shows the first 50 examples, but you can change this limit or turn it off.

Info about Corpus:
```
Russian National Corpus (https://ruscorpora.ru)
Class: CorpusName, len = amount of examples 
Pages: n of 'words' requested
```
len(corp) – count of examples.
bool(corp) – whether data exist.
corp[index or slice] – get element at the index or create new obj with sliced data:

from_2_to_10 = corp[2:10:2]

del corp[10] or del corp[:10] – remove some examples from the data list.
You can also iterate with a for loop. For example, to see only the left context (out=kwic) and the source:

corp = rnc.ParallelCorpus(
    'corpus', 5, 
    out='kwic', kwsz=7, 
    subcorpus=rnc.Subcorpus.Parallel.English
)
corp.request_examples()

for r in corp:
    print(r.left)
    print(r.src)

Compare the length of corp with an int or with the length of another Corpus object:

corp >
corp >=
corp <
corp <=
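The indexing, slicing and comparison behaviour listed in this section can be pictured with a small stand-in container. This is a sketch of the general pattern only; `ExampleList` is hypothetical and is not how rnc is implemented.

```python
class ExampleList:
    """Hypothetical stand-in container; not the rnc implementation."""

    def __init__(self, data):
        self.data = list(data)

    def __len__(self):
        return len(self.data)

    def __bool__(self):
        return bool(self.data)

    def __getitem__(self, item):
        # A slice yields a new container, an index yields a single element.
        if isinstance(item, slice):
            return ExampleList(self.data[item])
        return self.data[item]

    @staticmethod
    def _size(other):
        # Use an int as is, otherwise take the length of the other object.
        return other if isinstance(other, int) else len(other)

    def __lt__(self, other):
        return len(self) < self._size(other)

    def __le__(self, other):
        return len(self) <= self._size(other)

    def __gt__(self, other):
        return len(self) > self._size(other)

    def __ge__(self, other):
        return len(self) >= self._size(other)


box = ExampleList(range(12))
from_2_to_10 = box[2:10:2]
print(from_2_to_10.data)    # [2, 4, 6, 8]
print(box > 10)             # True
print(box <= from_2_to_10)  # False
```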

Set default values for all objects you create afterwards:

corpus_name.set_dpp(value) – change the default documents per page value.
corpus_name.set_spd(value) – change the default sentences per document value.
corpus_name.set_text(value) – change the default way to search.
corpus_name.set_sort(value) – change the default sort key.
corpus_name.set_min(value) – change the default min distance between words.
corpus_name.set_max(value) – change the default max distance between words.
corpus_name.set_restrict_show(value) – change the default number of examples shown by print. If it is equal to False, the Corpus shows all examples.
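Class-wide defaults of this kind are commonly built with a class attribute and a classmethod setter. The sketch below shows that pattern under the assumption that only objects created after the call pick up the new value; `DemoCorpus` is hypothetical and does not reproduce the rnc code.

```python
class DemoCorpus:
    """Hypothetical class showing class-level defaults; not the rnc implementation."""

    _dpp = 5  # default documents per page

    def __init__(self, dpp=None):
        # Fall back to the class-wide default when no value is passed.
        self.dpp = self._dpp if dpp is None else dpp

    @classmethod
    def set_dpp(cls, value: int) -> None:
        """Change the default used by objects created from now on."""
        cls._dpp = value


before = DemoCorpus()
DemoCorpus.set_dpp(10)
after = DemoCorpus()
print(before.dpp, after.dpp)  # 5 10
```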

Corpora features

ParallelCorpus

The query may be in the language you selected or in Russian.

MultilingualParaCorpus

Working with files has been removed.
The subcorpus param is not required by default, but it may be passed; see the How to section below.

MultimodalCorpus

corp.download_all() – download all media files. It's recommended to use this method instead of expl.download_file.

ATTENTION

Don't forget to call this function

corp.request_examples()

If you request more than 10 pages, RNC returns a 429 error (Too Many Requests). For example, when requesting 100 pages you should expect to wait about 3 minutes.
To see log messages about this:

rnc.set_stream_handlers_level('INFO') # or 'DEBUG'

If you want to turn off all messages:

rnc.set_stream_handlers_level('CRITICAL')
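The 429 behaviour mentioned in this section is usually handled with a wait-and-retry loop. The waiting time quoted above suggests the library already does this for you, so the sketch below only illustrates the general pattern; `TooManyRequests`, `fetch_with_retry` and `fake_fetch` are hypothetical names, and the delays are arbitrary.

```python
import time


class TooManyRequests(Exception):
    """Stand-in for an HTTP 429 response."""


def fetch_with_retry(fetch, page, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(page); on TooManyRequests wait and retry, doubling the delay."""
    for attempt in range(retries + 1):
        try:
            return fetch(page)
        except TooManyRequests:
            if attempt == retries:
                raise  # give up after the last allowed retry
            sleep(delay)
            delay *= 2


# Demo with a fake fetcher that fails twice before succeeding.
calls = {"n": 0}


def fake_fetch(page):
    calls["n"] += 1
    if calls["n"] <= 2:
        raise TooManyRequests
    return f"page {page}"


waits = []  # record the delays instead of really sleeping
result = fetch_with_retry(fake_fetch, 1, sleep=waits.append)
print(result, waits)  # page 1 [1.0, 2.0]
```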

Don't call the marker you pass

RIGHT:

ru = rnc.MainCorpus(marker=str.upper)

WRONG:

ru = rnc.MainCorpus(marker=str.upper())
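The difference can be checked without rnc at all: str.upper is a function object, while str.upper() tries to call it without a string and fails. A short demonstration in plain Python:

```python
marker = str.upper        # RIGHT: a reference to the function
print(callable(marker))   # True
print(marker("нету"))     # НЕТУ

failed = False
try:
    marker = str.upper()  # WRONG: calls the method without a string
except TypeError as error:
    failed = True
    print(f"TypeError: {error}")
```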

Pass an empty string as a word's params if you don't want to set any

query = {
    'word1': '',
    'word2': {'min': 2, 'max': 5}
}

If accent=1, marker doesn't work.

How to

How to set sort?

Sort keys.

How to set language in ParallelCorpus?

en = rnc.ParallelCorpus('get', 5, subcorpus=rnc.Subcorpus.Parallel.English)

To search in several languages, choose and set the subcorpus on the site, then pass that param to the Corpus.

How to set subcorpus?

There are default keys in rnc.Subcorpus.Person (checked to work in MainCorpus) for Russian writers and poets:

Pushkin
Dostoyevsky
TolstoyLN
Chekhov
Gogol
Turgenev

Example:

ru = rnc.MainCorpus('нету', 1, subcorpus=rnc.Subcorpus.Person.Pushkin)

Documentation
Source

If you find a bug (please attach logs to the email) or have an idea for improving the API, write to me at alniconim@gmail.com.

P.S. If Russian is your native language or you know it well, please write to me in Russian.

Release history

0.10.0

Aug 8, 2022

0.9.0

Mar 8, 2022

0.8.0

Mar 8, 2022

0.7.0

Aug 21, 2021

0.6.5

Apr 30, 2021

0.6.4.1

Feb 16, 2021

0.6.4

Dec 29, 2020

0.6.3 yanked

Dec 29, 2020

Reason this release was yanked:

deepcopy doesn't work

0.6.2

Dec 21, 2020

0.6.1

Dec 9, 2020

0.6 yanked

Dec 9, 2020

Reason this release was yanked:

Stream handler level is NOTSET

This version

0.5 yanked

Aug 15, 2020

Reason this release was yanked:

Very old, RIP

0.4.1 yanked

Aug 12, 2020

Reason this release was yanked:

Very old, RIP

0.4 yanked

Aug 8, 2020

Reason this release was yanked:

Very old, RIP

0.3.2 yanked

Aug 6, 2020

Reason this release was yanked:

Very old, RIP

0.3.1 yanked

Aug 4, 2020

Reason this release was yanked:

Very old, RIP

0.3 yanked

Aug 2, 2020

Reason this release was yanked:

Very old, RIP

0.2.1 yanked

Jul 31, 2020

Reason this release was yanked:

Very old, RIP

0.2 yanked

Jul 26, 2020

Reason this release was yanked:

Very old, RIP

0.1 yanked

Jul 26, 2020

Reason this release was yanked:

Very old, RIP

Download files

Source Distribution

rnc-0.5.tar.gz (33.3 kB)

Uploaded Aug 15, 2020 Source

Built Distribution

rnc-0.5-py3-none-any.whl (32.8 kB)

Uploaded Aug 15, 2020 Python 3

Hashes for rnc-0.5.tar.gz
Algorithm	Hash digest
SHA256	`c63381a8c2a29bd122157ec1624d1922cdb2671a6f823bd5ee7ee8fc7782026e`
MD5	`b9b5a454ebb089eeebc1ac08cd89c365`
BLAKE2b-256	`56708a3b86f9a510613b0acd15f24402faaf17eb69f6595717cefe9d8e2d507d`

Hashes for rnc-0.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c764c8dec9bff8020da28def8fab98bd90ecb696bceedc6541c7a3f43c7e1679`
MD5	`852f75d1e3004b6d4fa0ed358808f5e3`
BLAKE2b-256	`6b35481802bbd483c3f5ae8d89d5ea04b676d66c0af534dd7c710276ded88580`
