Python interface to a free corpus subset from ruscorpora.ru
This package provides a Python interface to a free corpus subset available at http://ruscorpora.ru.
Installation
pip install ruscorpora-tools
Usage
Corpus downloading
Download the archive of XML files from http://www.ruscorpora.ru/corpora-usage.html and unpack it.
Corpus reading
The ruscorpora.parse_xml function parses a single XML file and returns an iterator over sentences. Each sentence is a list of ruscorpora.Token instances, and each token is annotated with a list of ruscorpora.Annotation instances.
ruscorpora.simplify simplifies the result of ruscorpora.parse_xml by removing ambiguous annotations, joining split tokens, and removing accent information.
>>> import ruscorpora as rnc
>>> for sent in rnc.simplify(rnc.parse_xml('fiction.xml')):
...     print(sent)
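Because parse_xml handles one file at a time, reading the whole unpacked archive means looping over its files. The following is a minimal sketch, assuming the archive unpacks into a directory tree of .xml files. The names iter_corpus and parse_file are hypothetical helpers introduced for illustration; they are not part of ruscorpora.

```python
import os


def iter_corpus(root, parse_file):
    """Yield sentences from every .xml file found under ``root``.

    ``parse_file`` is any callable that takes a file path and returns
    an iterable of sentences. It is passed in explicitly so that this
    helper makes no assumptions about the ruscorpora API.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a stable order
        for name in sorted(filenames):
            if name.lower().endswith('.xml'):
                for sent in parse_file(os.path.join(dirpath, name)):
                    yield sent


# Possible use with ruscorpora (the directory name is a placeholder):
#
#     import ruscorpora as rnc
#     parse = lambda path: rnc.simplify(rnc.parse_xml(path))
#     for sent in iter_corpus('unpacked-corpus', parse):
#         print(sent)
```

Passing the parser as an argument keeps the directory walk independent of the library, so the same loop works with or without ruscorpora.simplify.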
Development
Development happens on GitHub and Bitbucket.
The issue tracker is on GitHub: https://github.com/kmike/ruscorpora-tools/issues
Feel free to submit ideas, bugs, pull requests (git or hg) or regular patches.
Running tests
Make sure tox is installed and run
$ tox
from the source checkout. Tests should pass under Python 2.6 through 3.3 and PyPy > 1.8.