Tools for processing language data.
What is CorPy?
A fancy plural for corpus ;) Also, a collection of handy but not especially mutually integrated tools for dealing with linguistic data. It abstracts away functionality which is often needed in practice for teaching and/or day-to-day work at the Czech National Corpus, without aspiring to be a fully featured or consistent NLP framework.
The short URL to the docs is: https://corpy.rtfd.io/
Here’s an idea of what you can do with CorPy:
add linguistic annotation to raw textual data using either UDPipe or MorphoDiTa
wrangle corpora in the vertical format originally devised for CWB and also used by (No)SketchEngine
plus some command line utilities
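To give a rough idea of what the vertical format looks like and what wrangling it involves, here is a minimal standalone sketch. It does not use CorPy's own API; the names Token and iter_sentences, the three-attribute layout (word, lemma, tag) and the sample data are all assumptions made for illustration. It relies only on the general shape of the format: one token per line with tab-separated attributes, and XML-like structure tags on lines of their own.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    # Hypothetical attribute layout; real corpora define their own columns.
    word: str
    lemma: str
    tag: str


def iter_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of tokens per <s>...</s> structure."""
    sentence: List[Token] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("<") and line.endswith(">"):
            # Structure tag such as <doc id="..."> or </s>, not a token.
            if line == "</s>":
                yield sentence
                sentence = []
            continue
        # Token line: positional attributes separated by tabs.
        word, lemma, tag = line.split("\t")
        sentence.append(Token(word, lemma, tag))


# Made-up sample data, for illustration only.
SAMPLE = """<doc id="example">
<s>
Psi\tpes\tNOUN
štěkají\tštěkat\tVERB
.\t.\tPUNCT
</s>
</doc>
"""

sentences = list(iter_sentences(SAMPLE.splitlines()))
print([token.lemma for token in sentences[0]])
```

A generator keeps memory use flat even for large corpora, since only one sentence is held at a time.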
Installation
$ pip3 install corpy
Requirements
Only recent versions of Python 3 (3.6+) are supported by design.
Development
Dependencies and building the docs
The canonical dependency requirements are listed in pyproject.toml and frozen in poetry.lock. However, in order to use autodoc to build the API docs, the package has to be installed, and corpy has dependencies that are too resource-intensive to build on ReadTheDocs.
The solution is to use a dummy setup.py which lists only the dependencies needed to build the docs properly, and mock all other dependencies by listing them in autodoc_mock_imports in docs/conf.py. This dummy setup.py is used to install corpy only on ReadTheDocs (via the appropriate config option in .readthedocs.yml). The same goes for the MANIFEST.in file, which duplicates the tool.poetry.include entries in pyproject.toml for the sole benefit of ReadTheDocs.
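The mocking part of this setup might look roughly like the following fragment of docs/conf.py. This is a sketch only: autodoc_mock_imports is a standard Sphinx autodoc option, but the module names listed here are placeholders assumed for illustration, not necessarily the ones corpy actually mocks.

```python
# docs/conf.py (fragment, illustrative)

# Enable autodoc so that the API docs are generated from docstrings.
extensions = ["sphinx.ext.autodoc"]

# Modules listed here are replaced with mocks when autodoc imports the
# package, so they do not have to be installed on ReadTheDocs.
# NOTE: these names are assumptions for the sake of the example.
autodoc_mock_imports = [
    "ufal.udpipe",
    "ufal.morphodita",
]
```

Any dependency that is not mocked this way still has to be listed in the dummy setup.py, otherwise importing the package during the docs build fails.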
License
Copyright © 2016–present ÚČNK/David Lukeš
Distributed under the GNU General Public License v3.