API for National Russian Corpus
Reason this release was yanked:
Very old, RIP
Installation
pip install bs4 aiohttp lxml rnc
Structure
A Corpus object contains a list of the obtained examples.
There are two types of example:
- If out is normal, the API uses a normal example, whose name matches the Corpus class name:
ru = MainCorpus(...)
ru.request_examples()
print(type(ru[0]))
>>> MainExample
- If out is kwic, the API uses KwicExample.
Example objects properties
Usage
import rnc
ru = rnc.corpus_name(
    query='корпус',
    p_count=5,
    file='filename.csv',
    **kwargs
)
ru.request_examples()
- query – a str or a dict with tags. This is the word to find; give it in its dictionary (base) form.
- p_count – count of PAGES.
- file – name of local csv file, optional.
- kwargs – additional params.
Full version of query
query = {
    'word1': {
        'gramm': 'acc', # grammar tags for lexgramm search
        'flags': 'bdot' # additional tags for lexgramm search
    },
    # a value may be a single string or a dict of params
    # in a dict of params the key names are arbitrary; the values are tag names (see below)
    'word2': {
        'gramm': {
            # the NAMES of these keys may be anything
            'pos (any name)': ['S', 'A'], # a single value ('S') or a list of values
            'case (any name)': ['acc', 'nom'], # likewise, 'acc' or a list
        },
        'flags': {}, # same rules as for gramm
        # distance between the first and second words
        'min': 1,
        'max': 3
    },
}
corp = rnc.corpus_name(
    query=query,
    p_count=5,
    file='filename.csv',
    **kwargs
)
corp.request_examples()
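The nested query structure is easy to get wrong by hand. As a minimal sketch, a small helper can assemble the per-word dicts and catch an inverted distance range early. The helper `word_params` is hypothetical and not part of rnc; it assumes only the keys shown above (gramm, flags, min, max).

```python
from typing import Any, Dict, Optional, Union

Tags = Union[str, Dict[str, Any]]


def word_params(gramm: Optional[Tags] = None,
                flags: Optional[Tags] = None,
                min_dist: Optional[int] = None,
                max_dist: Optional[int] = None) -> Dict[str, Any]:
    """Build the per-word dict, leaving out anything that was not given."""
    if min_dist is not None and max_dist is not None and min_dist > max_dist:
        raise ValueError("min distance must not exceed max distance")
    params: Dict[str, Any] = {}
    if gramm is not None:
        params["gramm"] = gramm
    if flags is not None:
        params["flags"] = flags
    if min_dist is not None:
        params["min"] = min_dist
    if max_dist is not None:
        params["max"] = max_dist
    return params


# Same shape as the full query shown above (hypothetical helper, plain dicts out).
query = {
    "word1": word_params(gramm="acc", flags="bdot"),
    "word2": word_params(
        gramm={"pos": ["S", "A"], "case": "acc"},
        min_dist=1,
        max_dist=3,
    ),
}
```

The result is an ordinary dict, so it can be passed as the query argument in the same way as a hand-written one.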
Additional params
These params are optional, you can ignore them.
ru = rnc.corpus_name(
    query=query,
    p_count=5,
    file='filename.csv',
    marker=str.upper, # function used to mark the found wordforms
    dpp=5, # documents per page
    spd=1, # sentences per document
    text='lexgramm', # search mode: 'lexgramm' or 'lexform'
    out='normal', # output format: 'normal' or 'kwic'
    kwsz=5, # if out='kwic', number of words in the context
    sort='sort_key', # how to sort the results
    subcorpus='', # see below how to set it
    accent=0, # with accentology (1) or without (0), if it's available
)
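The marker param takes any callable that accepts a string and returns a string, as str.upper does. The sketch below shows a custom marker of that shape. Both `bracket_marker` and `mark_words` are illustrative names, and `mark_words` only imitates the effect of marking; how rnc applies the marker internally is an assumption here.

```python
from typing import Callable, Iterable


def bracket_marker(wordform: str) -> str:
    """Wrap a found wordform in double angle brackets."""
    return f"<<{wordform}>>"


def mark_words(text: str, found: Iterable[str],
               marker: Callable[[str], str]) -> str:
    """Apply `marker` to every whitespace-separated token listed in `found`.

    This only imitates what a marker does, for demonstration purposes.
    """
    targets = set(found)
    return " ".join(marker(tok) if tok in targets else tok
                    for tok in text.split())


sentence = "национальный корпус русского языка"
upper_marked = mark_words(sentence, ["корпус"], str.upper)
bracket_marked = mark_words(sentence, ["корпус"], bracket_marker)
# upper_marked   -> "национальный КОРПУС русского языка"
# bracket_marked -> "национальный <<корпус>> русского языка"
```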
The API can also work with a local database
ru = rnc.corpus_name(file='local_database.csv') # it must exist
print(ru)
If the file exists, the API works with it and you can't request new examples.
Working with corpora
corp = rnc.corpus_name(...)
corp.request_examples() – request examples. An exception is raised if:
- Data already exist.
- No results are found.
- The requested page doesn't exist (the Corpus has 10 pages, but you've requested more than 10).
- There's a mistake in the request.
- You have no Internet access.
- There's a problem accessing the Corpus.
- other problems...
corp() – the same as request_examples().
corp.data – list of examples.
corp.found_wordforms – dict with the found wordforms and their frequencies.
corp.dump() – write two files: a csv file with all data and a json file with the request params.
corp.copy() – create a copy.
corp.shuffle() – shuffle the data.
corp.pop(index) – remove and return the example at the index from the data list.
corp.sort(key=, reverse=) – sort the list of examples. HTTP keys don't work here.
corp.url – URL of the first page of the Corpus results.
corp.open_url() – open the first page of the Corpus results.
corp.add_pages() – in development...
str(corp) – str with info about the Corpus and the enumerated examples.
len(corp) – number of examples.
bool(corp) – whether data exist.
corp.dpp or another request param – the value of that request param.
corp[index or slice] – get the element at the index or create a new object with the sliced data:
first_ten = corp[:10]
Compare the length of corp with an int or with the length of another object:
corp >
corp >=
corp <
corp <=
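The indexing, length, truthiness and comparison behaviour described above maps onto Python's standard special methods. The class below is a stand-in written for illustration, not rnc's actual implementation; `ExampleList` and its internals are hypothetical.

```python
from typing import List, Union


class ExampleList:
    """Toy container that mimics the behaviour described for corpus objects."""

    def __init__(self, data: List[str]) -> None:
        self.data = list(data)

    def __len__(self) -> int:
        return len(self.data)

    def __bool__(self) -> bool:
        # True when there is at least one example.
        return bool(self.data)

    def __getitem__(self, item: Union[int, slice]):
        if isinstance(item, slice):
            # A slice yields a new container holding the sliced data.
            return ExampleList(self.data[item])
        return self.data[item]

    def __iter__(self):
        return iter(self.data)

    @staticmethod
    def _size(other) -> int:
        # Accept either a plain int or anything that has a length.
        return other if isinstance(other, int) else len(other)

    def __lt__(self, other) -> bool:
        return len(self) < self._size(other)

    def __le__(self, other) -> bool:
        return len(self) <= self._size(other)

    def __gt__(self, other) -> bool:
        return len(self) > self._size(other)

    def __ge__(self, other) -> bool:
        return len(self) >= self._size(other)


corp = ExampleList([f"example {i}" for i in range(25)])
first_ten = corp[:10]  # a new ExampleList with 10 items
```

With this stand-in, `corp > first_ten`, `corp >= 25` and `first_ten < 11` all evaluate to True, because every comparison is made on lengths only.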
You can also iterate over the corpus with a for loop. For example, to see only the left context (out=kwic) and the source:
corp = rnc.corpus_name('корпус', 5, out='kwic', kwsz=7)
corp.request_examples()
for r in corp:
    print(r.left)
    print(r.src)
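When the results need further processing rather than printing, the same iteration can collect them instead. This is a minimal sketch that runs without network access: `FakeKwic` is a hypothetical stand-in carrying only the left and src attributes used above, and `left_contexts_by_source` is an illustrative helper, not an rnc function.

```python
from collections import defaultdict, namedtuple
from typing import Dict, Iterable, List

# Stand-in for a KWIC example; only `left` and `src` are taken from the text.
FakeKwic = namedtuple("FakeKwic", ["left", "src"])


def left_contexts_by_source(examples: Iterable) -> Dict[str, List[str]]:
    """Group the left contexts of the examples under their source."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for ex in examples:
        grouped[ex.src].append(ex.left)
    return dict(grouped)


sample = [
    FakeKwic(left="в национальном", src="Source A"),
    FakeKwic(left="этот большой", src="Source B"),
    FakeKwic(left="читаю про", src="Source A"),
]
grouped = left_contexts_by_source(sample)
```

Any iterable of objects with those two attributes could be passed in place of `sample`.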
ATTENTION
- Don't forget to call this function
corp.request_examples()
- If you request more than 10 pages, the Corpus returns a 429 error (Too Many Requests). For example, when requesting 100 pages you should expect to wait about 3 minutes:
- If you want to see messages like that, set the logging level:
rnc.corpora.stream_handler.setLevel(level='DEBUG')
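Since the Corpus may answer with 429 when many pages are requested, a caller may want to retry after a pause. The following is a generic sketch of retrying with exponential backoff. `with_backoff`, `TooManyRequests` and the delay values are hypothetical, and rnc may already handle this waiting itself.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TooManyRequests(Exception):
    """Stand-in for whatever error signals an HTTP 429 response."""


def with_backoff(action: Callable[[], T],
                 retries: int = 4,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `action`, doubling the pause after each 429-style failure.

    `sleep` is injectable so the function can be tried out without waiting.
    """
    for attempt in range(retries + 1):
        try:
            return action()
        except TooManyRequests:
            if attempt == retries:
                # Out of retries: let the caller see the error.
                raise
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")
```

With the defaults the pauses are 1, 2, 4 and 8 seconds; longer waits would need a larger `base_delay` or more retries.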
How to
How to set sort?
Here you can find sort keys and their descriptions.
How to set subcorpus?
There're default keys in rnc.Subcorpus.Person – Russian writers and poets:
- Pushkin
- Dostoyevsky
- TolstoyLN
- Chekhov
- Gogol
- Turgenev
Example:
ru = rnc.MainCorpus('нету', 1, subcorpus=rnc.Subcorpus.Person.Pushkin)
OR
If you find a bug or have an idea for improving the API, write to me – alniconim@gmail.com.