API for Russian National Corpus

Reason this release was yanked:

Very old, RIP

Project description

Installation

pip install rnc

Structure

A Corpus object contains the list of obtained examples. There are two types of examples:

  • If out is normal, the API uses a normal example, whose name matches the Corpus class name:
ru = rnc.MainCorpus(...)
ru.request_examples()

print(type(ru[0]))
>>> MainExample
  • If out is kwic, the API uses KwicExample.

Examples' objects fields


Usage

import rnc

ru = rnc.MainCorpus(
    query='корпус', 
    p_count=5,
    file='filename.csv',
    marker=str.upper,
    **kwargs
)

ru.request_examples()
  • query – one str or a dict with tags. The words to find; give them in their vocabulary (dictionary) form.
  • p_count – number of PAGES to request.
  • file – path to a local csv file, optional. Example: file='data\\filename.csv'.
  • marker – function used to mark the found wordforms, optional.
  • kwargs – additional params.
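
As an illustration of the marker parameter, here is a minimal sketch of a custom marker. It assumes, by analogy with str.upper, that the marker receives one found wordform as a str and returns the marked str; bold_marker is a hypothetical name, not part of rnc.

```python
def bold_marker(wordform: str) -> str:
    """Wrap a found wordform in double asterisks (hypothetical marker)."""
    return f"**{wordform}**"


# The function object itself is passed, exactly like marker=str.upper:
#   rnc.MainCorpus(query='корпус', p_count=5, marker=bold_marker)

# What each marker would do to a single found wordform:
print(bold_marker("корпус"))
print(str.upper("корпус"))
```

Any callable with this shape should work the same way, so a lambda such as lambda w: f"<b>{w}</b>" is an equally plausible choice.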

Corpora you can use.


Full query form
query = {
    'word1': {
        'gramm': 'acc', # grammar tags for lexgramm search
        'flags': 'bdot' # additional tags for lexgramm search
    },
    # the value for a word may be a single string or a dict of params
    # param dict keys are either arbitrary names or tag names (see them below)
    'word2': {
        'gramm': { 
            # the NAMES of these keys are arbitrary
            'pos (any name)': 'S' or ['S', 'A'], # one value or a list of values
            'case (any name)': 'acc' or ['acc', 'nom'],
        },
        'flags': {}, # same format as 'gramm'
        # distance between the first and the second words
        'min': 1,  
        'max': 3
    },  
}

corp = rnc.MainCorpus(
    query=query,
    p_count=5,
    file='filename.csv',
    marker=str.upper,
    **kwargs
)
corp.request_examples()
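
In the template above, 'S' or ['S', 'A'] is shorthand for "either form is accepted"; evaluated as real Python it simply yields 'S'. A concrete query, sketched here with only the tags that already appear in this README, might look like this. The lemmas and the key names 'pos' and 'case' are illustrative assumptions.

```python
# In real Python, `x or y` returns x when x is truthy,
# so the template expression collapses to the single value.
assert ('acc' or ['acc', 'nom']) == 'acc'

# A concrete two-word query: each key is a lemma, each value its params.
query = {
    'получить': '',                 # empty string: no params for this word
    'корпус': {
        'gramm': {
            'pos': 'S',             # a single value
            'case': ['acc', 'nom'], # a list of values
        },
        'min': 1,                   # distance from the previous word
        'max': 3,
    },
}

# dicts keep insertion order, so the word order of the query is preserved
print(list(query))
```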

Lexgramm search params


Query as a string

You can also pass the query as a string containing the vocabulary forms of the words, separated by spaces: query = 'get down' or query = 'я получить'. The distance between the words will then be the default one.


Additional request params

These params are optional, so you can ignore them. The default values are shown here.

corp = rnc.ParallelCorpus(
    query=query, 
    p_count=5,
    file='filename.csv',
    marker=str.upper,

    dpp=5, # documents per page
    spd=10, # sentences per document
    text='lexgramm' or 'lexform', # way to search
    out='normal' or 'kwic', # output format
    kwsz=5, # if out=kwic, count of words in context
    sort='i_grtagging', # way to sort the results
    subcorpus='', # see below how to set it
    accent=0, # with accentology (1) or without (0), if it's available
)
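
Since these params are plain keyword arguments, they can be collected in a dict and unpacked into the constructor. The sketch below uses a hypothetical helper, build_request_params, that starts from the defaults listed above and includes kwsz only when out='kwic', since the README describes kwsz as applying to that output format.

```python
def build_request_params(out='normal', kwsz=5, **overrides):
    """Return request params as a dict (hypothetical helper, not part of rnc)."""
    params = {
        'dpp': 5,               # documents per page
        'spd': 10,              # sentences per document
        'text': 'lexgramm',     # way to search
        'out': out,             # output format
        'sort': 'i_grtagging',  # way to sort the results
        'subcorpus': '',
        'accent': 0,
    }
    if out == 'kwic':
        params['kwsz'] = kwsz   # context size only matters for kwic output
    params.update(overrides)
    return params


params = build_request_params(out='kwic', kwsz=7, dpp=10)
print(params)

# Intended use (requires rnc and network access):
#   corp = rnc.ParallelCorpus(query='corpus', p_count=5, **params)
```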

Sort keys

The API can work with a local database too
ru = rnc.SpokenCorpus(file='local_database.csv') # it must exist
print(ru)

If the file exists, the API works with it and you can't request new examples.

If you work with a file, you don't need to pass any argument to the Corpus except the file name (via file=...).
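
Because an existing file switches the Corpus to local mode, it can help to check the path before constructing the object. A minimal sketch using only the standard library; corpus_mode is a hypothetical helper, and the 'local'/'remote' labels are just names chosen for this example.

```python
from pathlib import Path


def corpus_mode(file: str) -> str:
    """Return 'local' if the csv file already exists, otherwise 'remote'."""
    return 'local' if Path(file).is_file() else 'remote'


mode = corpus_mode('local_database.csv')
if mode == 'local':
    # the file exists: only file= is needed, new requests are not possible
    print('would load examples from the existing file')
else:
    # no file yet: query and p_count are needed to request examples
    print('would request examples and could dump them to the file later')
```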


Working with corpora

corp = rnc.corpus_name(...) 
  • corp.request_examples() – request examples. An exception is raised if:
    • Data already exist.
    • No results are found.
    • The requested page doesn't exist (for example, the Corpus has 10 pages but you requested more than 10).
    • There's a mistake in the request.
    • You have no Internet access.
    • There's a problem accessing the Corpus.
    • Other problems occur.
  • corp.data – list of examples (only getter)
  • corp.query – query (only getter).
  • corp.forms_in_query – requested wordforms (only getter).
  • corp.p_count – requested count of pages (only getter).
  • corp.file – path to the local csv file (only getter).
  • corp.marker – marker (only getter).
  • corp.params – dict, HTTP tags (only getter).
  • corp.found_wordforms – dict with found wordforms and their frequency (only getter).
  • corp.ex_type – type of example (only getter).
  • corp.amount_of_docs – amount of docs where the query was found.
  • corp.amount_of_contexts – amount of contexts where the query was found.
  • corp.graphic_link – link to the distribution by years graphic.
  • corp.dump() – write two files: csv file with all data and json file with request params.
  • corp.copy() – create a copy.
  • corp.shuffle() – shuffle data.
  • corp.sort_data(key=, reverse=) – sort the list of examples. HTTP sort keys don't work here; key is applied to the Example objects.
  • corp.pop(index) – remove and return the example at the index.
  • corp.clear() – empty the data list.
  • corp.filter(key) – remove some examples from the data list using the key. Key is applied to the Example objects.
  • corp.url – URL of the first Corpus page (only getter).
  • corp.open_url() – open the first Corpus page.
  • corp.open_graphic() – open the distribution by years graphic.
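
To illustrate what "key is applied to the Example objects" means for sort_data, the sketch below sorts a plain list of stand-in objects with the built-in sorted. FakeExample is a hypothetical class; its left and src fields mirror the KWIC example fields used elsewhere in this README.

```python
from dataclasses import dataclass


@dataclass
class FakeExample:
    """Stand-in for an example object; not an rnc class."""
    left: str  # left context
    src: str   # source


examples = [
    FakeExample(left='один два три', src='Source B'),
    FakeExample(left='один', src='Source A'),
    FakeExample(left='один два', src='Source C'),
]


def left_context_size(ex):
    """Key function: number of words in the left context."""
    return len(ex.left.split())


# The same kind of callable would be passed as corp.sort_data(key=..., reverse=...)
longest_first = sorted(examples, key=left_context_size, reverse=True)
by_source = sorted(examples, key=lambda ex: ex.src)

print([ex.src for ex in longest_first])
print([ex.src for ex in by_source])
```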

Magic methods:

  • corp.dpp or another request param (only getter).

  • corp() – the same as request_examples().

  • str(corp) or print(corp) – str with info about the Corpus and the enumerated examples. By default the Corpus shows the first 50 examples, but you can change this limit or turn it off.

    Info about Corpus:

    Russian National Corpus (https://ruscorpora.ru)
    Class: CorpusName, len = amount of examples 
    Pages: n of 'words' requested
    
  • len(corp) – count of examples.

  • bool(corp) – whether data exist.

  • corp[index or slice] – get the element at the index, or create a new object with the sliced data:

from_2_to_10 = corp[2:10:2]
  • del corp[10] or del corp[:10] – remove some examples from the data list.

  • You can also iterate over the Corpus with a for loop. For example, to see only the left context (out=kwic) and the source:

corp = rnc.ParallelCorpus(
    'corpus', 5, 
    out='kwic', kwsz=7, 
    subcorpus=rnc.Subcorpus.Parallel.English
)
corp.request_examples()

for r in corp:
    print(r.left)
    print(r.src)

Compare the length of corp with an int or with the length of another Corpus object:

  • corp >
  • corp >=
  • corp <
  • corp <=
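
One way such length-based comparisons can be built in Python is with the rich comparison methods. The class below is a hypothetical stand-in written for illustration; it is not the library's implementation.

```python
class SizedBox:
    """Container whose comparison operators compare lengths (illustration only)."""

    def __init__(self, items):
        self._items = list(items)

    def __len__(self):
        return len(self._items)

    @staticmethod
    def _size(other):
        # an int is used as is; anything else is measured with len()
        return other if isinstance(other, int) else len(other)

    def __lt__(self, other):
        return len(self) < self._size(other)

    def __le__(self, other):
        return len(self) <= self._size(other)

    def __gt__(self, other):
        return len(self) > self._size(other)

    def __ge__(self, other):
        return len(self) >= self._size(other)


small = SizedBox(range(3))
large = SizedBox(range(5))

print(small > 2)        # compares len(small) with the int 2
print(small <= large)   # compares the two lengths
```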

Set default values to all objects you'll create:

  • corpus_name.set_dpp(value) – change the default documents per page value.
  • corpus_name.set_spd(value) – change the default sentences per document value.
  • corpus_name.set_text(value) – change the default search mode.
  • corpus_name.set_sort(value) – change the default sort key.
  • corpus_name.set_min(value) – change the default min distance between words.
  • corpus_name.set_max(value) – change the default max distance between words.
  • corpus_name.set_restrict_show(value) – change the default number of examples shown by print. If it is False, the Corpus shows all examples.

Corpora features

ParallelCorpus
  • The query may be in the language you choose or in Russian.
MultilingualParaCorpus
  • Working with files has been removed.
  • The subcorpus param is not required by default, but it may be passed; see the How to section below.
MultimodalCorpus
  • corp.download_all() – download all media files. It's recommended to use this method instead of expl.download_file.

ATTENTION

  • Don't forget to call this method:
corp.request_examples()
  • If you request more than 10 pages, RNC returns a 429 error (Too Many Requests). For example, when requesting 100 pages you should wait about 3 minutes.
  • If you want to see such messages:
rnc.set_stream_handlers_level('INFO' or 'DEBUG')
  • If you want to turn off all messages:
rnc.set_stream_handlers_level('CRITICAL')
  • Don't call the marker you pass; pass the function itself:

RIGHT:

ru = rnc.MainCorpus(marker=str.upper)

WRONG:

ru = rnc.MainCorpus(marker=str.upper())
  • Pass an empty string as a word's params if you don't want to set any:
query = {
    'word1': '',
    'word2': {'min': 2, 'max': 5}
}
  • If accent=1, marker doesn't work.
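
For readers unfamiliar with handler levels, the sketch below shows with the standard logging module what raising or lowering a stream handler's level does. The logger name 'demo' and the helper set_handlers_level are assumptions made for this example; how rnc.set_stream_handlers_level works internally is not described in this README.

```python
import logging
import sys

logger = logging.getLogger('demo')      # hypothetical logger name
logger.setLevel(logging.DEBUG)          # the logger itself lets everything through
logger.propagate = False                # avoid duplicate output via the root logger

handler = logging.StreamHandler(sys.stderr)
logger.addHandler(handler)


def set_handlers_level(level):
    """Set the level of every stream handler attached to the demo logger."""
    for h in logger.handlers:
        if isinstance(h, logging.StreamHandler):
            h.setLevel(level)           # accepts names such as 'INFO' or 'CRITICAL'


set_handlers_level('INFO')
logger.debug('hidden: below the handler level')
logger.info('shown: at the handler level')

set_handlers_level('CRITICAL')
logger.info('hidden: handlers now pass only CRITICAL records')
```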

How to

How to set sort?

Sort keys.

How to set language in ParallelCorpus?
en = rnc.ParallelCorpus('get', 5, subcorpus=rnc.Subcorpus.Parallel.English)

If you want to search in several languages at once, choose and set the subcorpus on the site, then pass that param to the Corpus.

How to set subcorpus?

There are default keys in rnc.Subcorpus.Person (checked to work in MainCorpus) for Russian writers and poets:

  • Pushkin
  • Dostoyevsky
  • TolstoyLN
  • Chekhov
  • Gogol
  • Turgenev

Example:

ru = rnc.MainCorpus('нету', 1, subcorpus=rnc.Subcorpus.Person.Pushkin)

OR

1 2 3 4


Documentation
Source

If you find a bug (please attach the logs to the mail) or have an idea for improving the API, write to me – alniconim@gmail.com.

P.S. If Russian is your native language or you know it well, please write to me in Russian.

Download files

Source Distribution

rnc-0.5.tar.gz (33.3 kB)

Built Distribution

rnc-0.5-py3-none-any.whl (32.8 kB, Python 3)
