API for Russian National Corpus
Reason this release was yanked:
Very old, RIP
Installation
pip install rnc
Structure
A Corpus object contains a list of the obtained examples.
There are two types of examples:
- If out is normal, the API uses a normal example whose name matches the Corpus class name:
ru = rnc.MainCorpus(...)
ru.request_examples()
print(type(ru[0]))
>>> MainExample
- If out is kwic, the API uses KwicExample.
Fields of Example objects
Usage
import rnc
ru = rnc.MainCorpus(
    query='корпус',
    p_count=5,
    file='filename.csv',
    marker=str.upper,
    **kwargs
)
ru.request_examples()
- query – a str or a dict with tags: the words to find. Give each word in its dictionary (base) form.
- p_count – number of PAGES to request.
- file – name of a local csv file, optional.
- marker – function used to mark the found wordforms, optional.
- kwargs – additional params.
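As a minimal sketch of what a custom marker could look like: the description above only says that the marker is a function applied to found wordforms, so any callable that takes a str and returns a str, like str.upper, should fit. The name `bracket_marker` is hypothetical, and the code applies the marker by hand rather than through the library.

```python
def bracket_marker(wordform: str) -> str:
    """Hypothetical marker: wrap a found wordform in double angle brackets."""
    return f"<<{wordform}>>"


# Applied by hand here; how the library calls the marker internally is not shown.
marked_builtin = str.upper("корпус")
marked_custom = bracket_marker("корпус")

print(marked_builtin)
print(marked_custom)
```

Either function would be passed by name (marker=bracket_marker), without parentheses.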
Corpora you can use.
Full query form
query = {
    'word1': {
        'gramm': 'acc',  # grammar tags for lexgramm search
        'flags': 'bdot'  # additional tags for lexgramm search
    },
    # a value may be a single string or a dict of params;
    # inside a dict of params the key names are arbitrary and the values are tag names (see below)
    'word2': {
        'gramm': {
            # the NAMES of these keys can be anything
            'pos (any name)': ['S', 'A'],  # a single value ('S') or a list of values
            'case (any name)': ['acc', 'nom'],  # a single value ('acc') or a list of values
        },
        'flags': {},  # same format as 'gramm'
        # distance between the first and second words
        'min': 1,
        'max': 3
    },
}
corp = rnc.MainCorpus(
    query=query,
    p_count=5,
    file='filename.csv',
    marker=str.upper,
    **kwargs
)
corp.request_examples()
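The full query form is an ordinary nested dict, so it can be assembled programmatically. The sketch below uses a hypothetical helper, `word_params`, which is not part of the library; it only collects the keys shown above (gramm, flags, min, max) and leaves out the ones that are not set.

```python
from typing import Dict, List, Optional, Union

TagValue = Union[str, List[str], Dict[str, Union[str, List[str]]]]


def word_params(
    gramm: Optional[TagValue] = None,
    flags: Optional[TagValue] = None,
    min_dist: Optional[int] = None,
    max_dist: Optional[int] = None,
) -> dict:
    """Hypothetical helper: collect the params of one word, skipping unset ones."""
    candidates = {'gramm': gramm, 'flags': flags, 'min': min_dist, 'max': max_dist}
    return {name: value for name, value in candidates.items() if value is not None}


# 'word1' and 'word2' are placeholders for dictionary forms, as in the README.
query = {
    'word1': word_params(gramm='acc', flags='bdot'),
    'word2': word_params(
        gramm={'pos': ['S', 'A'], 'case': 'acc'},
        min_dist=1,
        max_dist=3,
    ),
}
print(query)
```

The resulting dict has the same shape as the hand-written one and would be passed as query=query.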
Query as a string
You can also pass the query as a string containing the dictionary forms of the words, separated by spaces: query = 'get down' or query = 'я получить'. The distance between the words is then the default one.
Additional request params
These params are optional and can be omitted. The default values are shown here.
corp = rnc.ParallelCorpus(
    query=query,
    p_count=5,
    file='filename.csv',
    marker=str.upper,
    dpp=5,  # documents per page
    spd=10,  # sentences per document
    text='lexgramm',  # way to search: 'lexgramm' or 'lexform'
    out='normal',  # output format: 'normal' or 'kwic'
    kwsz=5,  # number of context words, used if out='kwic'
    sort='i_grtagging',  # way to sort the results
    subcorpus='',  # see below how to set it
    accent=0,  # with accentology (1) or without (0), if it's available
)
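The `**kwargs` placeholder in the earlier snippets stands for these optional params. One way to keep them together, sketched here with a hypothetical stand-in function `describe_call` instead of a real Corpus constructor, is to collect them in a dict and unpack it at the call site.

```python
# Optional request params collected in one place; the names come from the README.
request_params = {
    'dpp': 5,
    'spd': 10,
    'text': 'lexgramm',
    'out': 'kwic',
    'kwsz': 5,
    'sort': 'i_grtagging',
    'subcorpus': '',
    'accent': 0,
}


def describe_call(query, p_count, **kwargs):
    """Hypothetical stand-in with the same calling shape as a Corpus constructor."""
    return {'query': query, 'p_count': p_count, **kwargs}


# Unpacking the dict is equivalent to writing each keyword argument by hand.
call = describe_call('корпус', 5, **request_params)
print(call)
```

With the library installed, the same dict would be unpacked into the Corpus call, for example rnc.MainCorpus('корпус', 5, **request_params).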
The API can work with a local database too
ru = rnc.SpokenCorpus(file='local_database.csv') # the file must exist
print(ru)
If the file exists, the API works with it and you can't request new examples.
If you work with a file, you don't need to pass any argument to Corpus other than the file name (via file=...).
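Because an existing file changes what the Corpus can do, it may help to check for the file before deciding which arguments to pass. The helper below, `corpus_arguments`, is a hypothetical sketch built only on the two rules stated above; it does not call the library.

```python
import tempfile
from pathlib import Path


def corpus_arguments(file: str, query: str, p_count: int) -> dict:
    """Hypothetical helper: choose constructor arguments by file existence."""
    if Path(file).exists():
        # Existing file: only the file name is needed, and no new request can be made.
        return {'file': file}
    # No file yet: pass the full request together with the file name.
    return {'query': query, 'p_count': p_count, 'file': file}


with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / 'local_database.csv'
    existing.write_text('', encoding='utf-8')
    missing = Path(tmp) / 'new_database.csv'

    args_existing = corpus_arguments(str(existing), 'корпус', 5)
    args_missing = corpus_arguments(str(missing), 'корпус', 5)

print(sorted(args_existing))
print(sorted(args_missing))
```

The returned dict would then be unpacked into the constructor, for example rnc.SpokenCorpus(**args_existing).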
Working with corpora
corp = rnc.corpus_name(...)
- corp.request_examples() – request examples. An exception is raised if:
  - Data already exist.
  - No results are found.
  - The requested page doesn't exist (for example, the Corpus has 10 pages but you requested more than 10).
  - There's a mistake in the request.
  - You have no Internet access.
  - There's a problem accessing the Corpus.
  - Other problems occur.
- corp.data – list of examples (getter only).
- corp.query – query (getter only).
- corp.forms_in_query – requested wordforms (getter only).
- corp.p_count – requested number of pages (getter only).
- corp.file – path to the local csv file (getter only).
- corp.marker – marker (getter only).
- corp.params – dict, HTTP tags (getter only).
- corp.found_wordforms – dict with found wordforms and their frequency (getter only).
- corp.ex_type – type of example (getter only).
- corp.dump() – write two files: a csv file with all data and a json file with the request params.
- corp.copy() – create a copy.
- corp.shuffle() – shuffle the data.
- corp.sort(key=, reverse=) – sort the list of examples. HTTP keys don't work here; the key is applied to Example objects.
- corp.pop(index) – remove and return the example at the index.
- corp.clear() – empty the data list.
- corp.filter(key) – remove some examples from the data list using the key. The key is applied to the Example objects.
- corp.url – URL of the first Corpus page (getter only).
- corp.open_url() – open the first Corpus page.
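The key passed to sort is an ordinary Python key function that receives one example object. The sketch below illustrates that with a stand-in class and the built-in sorted; `StandInExample` is hypothetical, and of its fields only the names left and src appear in this README (for kwic examples).

```python
from dataclasses import dataclass


@dataclass
class StandInExample:
    """Stand-in for a library Example; not the library's own class."""
    left: str
    src: str


examples = [
    StandInExample(left='a long left context', src='Source B'),
    StandInExample(left='short', src='Source A'),
]

# A key receives one example object and returns the value to order by.
by_source = sorted(examples, key=lambda ex: ex.src)
by_left_length = sorted(examples, key=lambda ex: len(ex.left), reverse=True)

print([ex.src for ex in by_source])
print([ex.left for ex in by_left_length])
```

With a real Corpus the same lambdas would be passed as corp.sort(key=..., reverse=...).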
Magic methods:
- corp.dpp or another request param (getter only).
- corp() – the same as request_examples().
- str(corp) or print(corp) – str with info about the Corpus and the enumerated examples. By default the Corpus shows the first 50 examples, but you can change this or turn the restriction off. Info about Corpus:
Russian National Corpus (https://ruscorpora.ru)
Class: CorpusName, len = amount of examples
Pages: n of 'words' requested
- len(corp) – number of examples.
- bool(corp) – whether data exist.
- corp[index or slice] – get the element at the index or create a new object with the sliced data:
from_2_to_10 = corp[2:10:2]
- del corp[10] or del corp[:10] – remove some examples from the data list.
- You can also iterate over the Corpus with a for loop. For example, to see only the left context (out=kwic) and the source:
corp = rnc.ParallelCorpus(
    'corpus', 5,
    out='kwic', kwsz=7,
    subcorpus=rnc.Subcorpus.Parallel.English
)
corp.request_examples()
for r in corp:
    print(r.left)
    print(r.src)
Compare the length of corp with an int or with the length of another Corpus object:
corp >
corp >=
corp <
corp <=
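To make the magic methods and comparisons concrete, here is a toy container that mimics the documented behaviour. `ToyCorpus` is hypothetical and is not the library's implementation; it only shows how len, bool, indexing, slicing, deletion and length comparison fit together.

```python
class ToyCorpus:
    """Toy container mimicking the documented behaviour; not the library's code."""

    def __init__(self, data):
        self.data = list(data)

    def __len__(self):
        return len(self.data)

    def __bool__(self):
        return bool(self.data)

    def __iter__(self):
        return iter(self.data)

    def __getitem__(self, item):
        # A slice yields a new object; an index yields a single example.
        if isinstance(item, slice):
            return ToyCorpus(self.data[item])
        return self.data[item]

    def __delitem__(self, item):
        del self.data[item]

    @staticmethod
    def _size(other):
        # Compare against an int directly, or against another object's length.
        return other if isinstance(other, int) else len(other)

    def __gt__(self, other):
        return len(self) > self._size(other)

    def __ge__(self, other):
        return len(self) >= self._size(other)

    def __lt__(self, other):
        return len(self) < self._size(other)

    def __le__(self, other):
        return len(self) <= self._size(other)


corp = ToyCorpus(f'example {i}' for i in range(12))
from_2_to_10 = corp[2:10:2]

print(len(corp), len(from_2_to_10))
print(corp > from_2_to_10, corp <= 12)
```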
Set default values for all objects you'll create:
- corpus_name.set_dpp(value) – change the default documents per page value.
- corpus_name.set_spd(value) – change the default sentences per document value.
- corpus_name.set_text(value) – change the default search way.
- corpus_name.set_sort(value) – change the default sort key.
- corpus_name.set_min(value) – change the default min distance between words.
- corpus_name.set_max(value) – change the default max distance between words.
- corpus_name.set_restrict_show(value) – change the default number of examples shown by print. If it is equal to False, the Corpus shows all examples.
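The setters above change a default shared by objects created later. A common way to get that behaviour in Python is a class attribute updated through a classmethod, sketched below with a hypothetical `ToyDefaults` class. Whether the library stores its defaults this way, and whether already created objects are affected, is an assumption not covered by this README.

```python
class ToyDefaults:
    """Toy class with a class-level default; not the library's implementation."""

    _default_dpp = 5  # value shown as the default in the README

    @classmethod
    def set_dpp(cls, value: int) -> None:
        # Changes the default for objects created after this call.
        cls._default_dpp = value

    def __init__(self, dpp=None):
        # An explicit argument wins over the class-level default.
        self.dpp = self._default_dpp if dpp is None else dpp


before = ToyDefaults()
ToyDefaults.set_dpp(10)
after = ToyDefaults()
explicit = ToyDefaults(dpp=3)

print(before.dpp, after.dpp, explicit.dpp)
```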
Corpora features
ParallelCorpus
- Query might be in the language you want or in Russian.
- Turnover search is not supported.
MultilingualParaCorpus
- Working with files has been removed.
- The subcorpus param is not required by default, but it may be passed; see the How to section below.
ATTENTION
- Don't forget to call corp.request_examples().
- If you request more than 10 pages, the Corpus returns a 429 error (Too many requests). For example, when requesting 100 pages you should wait about 3 minutes.
- If you want to see the related log messages:
rnc.set_stream_handlers_level('DEBUG')
- If you want to turn off all messages:
rnc.set_stream_handlers_level('CRITICAL')
- Pass the marker function itself; don't call it:
RIGHT:
ru = rnc.MainCorpus(marker=str.upper)
WRONG:
ru = rnc.MainCorpus(marker=str.upper())
- Pass an empty string as a param if you don't want to set it:
query = {
    'word1': '',
    'word2': {'min': 2, 'max': 5}
}
- If accent=1, the marker doesn't work.
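The 'DEBUG' and 'CRITICAL' levels in the list above are standard Python logging levels. Assuming the library builds on the standard logging module (not confirmed here), the sketch below shows what the two levels do to a stream handler. The logger name and the messages are made up for illustration.

```python
import io
import logging

stream = io.StringIO()
handler = logging.StreamHandler(stream)

logger = logging.getLogger('toy_rnc_logger')  # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.propagate = False
logger.addHandler(handler)

# At DEBUG the handler lets every message through.
handler.setLevel('DEBUG')
logger.debug('waiting before the next page')

# At CRITICAL everything below CRITICAL is dropped.
handler.setLevel('CRITICAL')
logger.warning('this message is filtered out')

output = stream.getvalue()
print(output)
```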
How to
How to set sort?
Here you can find sort keys and their descriptions.
How to set language in ParallelCorpus?
en = rnc.ParallelCorpus('get', 5, subcorpus=rnc.Subcorpus.Parallel.English)
If you want to search in several languages at once, select the subcorpus on the site and pass the resulting param to Corpus.
How to set subcorpus?
There are default keys in rnc.Subcorpus.Person for Russian writers and poets:
- Pushkin
- Dostoyevsky
- TolstoyLN
- Chekhov
- Gogol
- Turgenev
Example:
ru = rnc.MainCorpus('нету', 1, subcorpus=rnc.Subcorpus.Person.Pushkin)
OR
Documentation
Source
If you find a bug or have an idea for improving the API, write to me at alniconim@gmail.com.
P.S. If your native language is Russian or you know it well, please write to me in Russian.