corpy

Tools for processing language data.

What is CorPy?

A fancy plural for corpus ;) Also, a collection of handy but only loosely integrated tools for dealing with linguistic data. It abstracts away functionality that is often needed in practice for teaching and/or day-to-day work at the Czech National Corpus, without aspiring to be a fully featured or consistent NLP framework.

The short URL to the docs is: https://corpy.rtfd.io/

Here’s an idea of what you can do with CorPy:

add linguistic annotation to raw textual data using either UDPipe or MorphoDiTa

Note

Should I pick UDPipe or MorphoDiTa?

UDPipe is the successor to MorphoDiTa, extending and improving upon the original codebase. It has more features at the cost of being somewhat more complex: it does both morphological tagging (including lemmatization) and syntactic parsing, and it handles a number of different input and output formats. You can also download pre-trained models for many different languages.

By contrast, MorphoDiTa only has pre-trained models for Czech and English, and only performs morphological tagging (including lemmatization). However, its output is more straightforward – it just splits your text into tokens and annotates them, whereas UDPipe can (depending on the model) introduce additional tokens necessary for a more explicit analysis, add multi-word tokens, etc. This is because UDPipe is tailored to the type of linguistic analysis conducted within the Universal Dependencies project, which uses the CoNLL-U data format.

MorphoDiTa can also help you if you just want to tokenize text and don’t have a language model available.
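
To make the difference in output more concrete, here is a minimal sketch of what multi-word tokens look like in CoNLL-U. It uses only the Python standard library and does not call CorPy, UDPipe or MorphoDiTa. The sample sentence is hand-written for illustration, and the helper names (conllu_line, split_tokens) are hypothetical. A range ID such as 1-2 marks a surface token which the analysis splits into several syntactic words.

```python
# The ten tab-separated columns defined by the CoNLL-U format.
FIELDS = ("id", "form", "lemma", "upos", "xpos", "feats",
          "head", "deprel", "deps", "misc")


def conllu_line(id_, form, lemma="_"):
    """Build one CoNLL-U line, leaving the remaining seven columns as '_'."""
    return "\t".join([id_, form, lemma] + ["_"] * 7)


# Hand-written sample (an assumption for illustration, not real UDPipe output).
SAMPLE = "\n".join([
    "# text = vámonos al mar",
    conllu_line("1-2", "vámonos"),       # multi-word token covering words 1 and 2
    conllu_line("1", "vamos", "ir"),
    conllu_line("2", "nos", "nosotros"),
    conllu_line("3-4", "al"),            # multi-word token covering words 3 and 4
    conllu_line("3", "a", "a"),
    conllu_line("4", "el", "el"),
    conllu_line("5", "mar", "mar"),
])


def split_tokens(conllu):
    """Return (surface tokens, syntactic words) for one CoNLL-U sentence."""
    surface, words = [], []
    covered_until = 0  # last word ID covered by a multi-word token range
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        row = dict(zip(FIELDS, line.split("\t")))
        if "-" in row["id"]:
            # Range ID: a surface token spanning several syntactic words.
            _, end = map(int, row["id"].split("-"))
            surface.append(row["form"])
            covered_until = end
        elif "." in row["id"]:
            continue  # empty node (enhanced dependencies), not a token
        else:
            words.append(row["form"])
            if int(row["id"]) > covered_until:
                # Ordinary word which is also a surface token.
                surface.append(row["form"])
    return surface, words


surface, words = split_tokens(SAMPLE)
print(surface)
print(words)
```

A tagger that only tokenizes and annotates would give you one item per surface token, whereas CoNLL-U output carries both layers, which is why it takes a little more care to process.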

easily generate word clouds
generate phonetic transcripts of Czech texts
wrangle corpora in the vertical format originally devised for CWB and also used by (No)SketchEngine
plus some command line utilities
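
For readers who have not met the vertical format mentioned above: it stores one token per line with tab-separated attributes, and marks structures such as documents and sentences with XML-like tags on lines of their own. The following is a minimal standard-library sketch of reading it, not CorPy's own API. The sample data, the attribute names (word, lemma, tag) and the sentences function are assumptions made for illustration.

```python
import re

# Hand-written sample: structural tags on their own lines, tokens in between.
SAMPLE = """\
<doc id="example">
<s>
The\tthe\tDT
cat\tcat\tNN
sleeps\tsleep\tVBZ
</s>
<s>
It\tit\tPRP
purrs\tpurr\tVBZ
</s>
</doc>
"""

# Matches a line consisting solely of an opening or closing structural tag.
STRUCT_TAG = re.compile(r"^</?(\w+)[^>]*>$")


def sentences(vertical, attrs=("word", "lemma", "tag")):
    """Yield each <s> structure as a list of {attribute: value} dicts."""
    current = None  # tokens of the sentence being read, or None outside <s>
    for line in vertical.splitlines():
        match = STRUCT_TAG.match(line)
        if match:
            if match.group(1) == "s":
                if line.startswith("</"):
                    yield current
                    current = None
                else:
                    current = []
            continue  # other structures (e.g. <doc>) are ignored here
        if current is not None and line:
            current.append(dict(zip(attrs, line.split("\t"))))


for sent in sentences(SAMPLE):
    print(" ".join(token["word"] for token in sent))
```

Real corpora tend to have more attributes per token and more kinds of structures, so treat this as a way to get a feel for the format rather than as a robust parser.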

Installation

$ pip3 install corpy

Requirements

Only recent versions of Python 3 (3.6+) are supported by design.

Development

Dependencies and building the docs

The canonical dependency requirements are listed in pyproject.toml and frozen in poetry.lock. However, in order to use autodoc to build the API docs, the package has to be installed, and corpy has dependencies that are too resource-intensive to build on ReadTheDocs.

The solution is to use a dummy setup.py which lists only the dependencies needed to build the docs properly, and mock all other dependencies by listing them in autodoc_mock_imports in docs/conf.py. This dummy setup.py is used to install corpy only on ReadTheDocs (via the appropriate config option in .readthedocs.yml). The same goes for the MANIFEST.in file, which duplicates the tool.poetry.include entries in pyproject.toml for the sole benefit of ReadTheDocs.
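
As a rough sketch of the mocking part of this setup, an excerpt of docs/conf.py might look like the following. The extensions and autodoc_mock_imports settings are standard Sphinx options, but the module names listed are placeholders rather than corpy's actual dependencies; the real list should be derived from pyproject.toml.

```python
# docs/conf.py (excerpt): a hypothetical sketch, not corpy's actual configuration.

# autodoc imports the package in order to read its docstrings.
extensions = ["sphinx.ext.autodoc"]

# Modules named here are replaced by mock objects when autodoc imports the
# package, so they do not have to be installed in the docs build environment.
# These names are placeholders standing in for the heavy dependencies.
autodoc_mock_imports = [
    "heavy_dependency_a",
    "heavy_dependency_b",
]
```

Anything that is not mocked still has to be importable on ReadTheDocs, which is what the dummy setup.py takes care of.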

License

Distributed under the GNU General Public License v3.
