A library for preprocessing.

Project description

A library for processing text data

cophi is a Python library for handling, modeling and processing text corpora. You can easily pipe a collection of text files using the high-level API:

import cophi

# Build a corpus and its metadata from every .txt file below the directory.
corpus, metadata = cophi.corpus(directory="british-fiction-corpus",
                                filepath_pattern="**/*.txt",
                                encoding="utf-8",
                                lowercase=True,
                                token_pattern=r"\p{L}+\p{P}?\p{L}+")
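To get a feel for what the lowercase and token_pattern arguments select, here is a minimal standard-library sketch. It is not cophi's implementation: the name tokenize is hypothetical, and because Python's built-in re module does not support \p{L} or \p{P}, the letter class is emulated with [^\W\d_] and the punctuation class is reduced to a small hand-picked set, which is an assumption.

```python
import re

# Stdlib approximation of r"\p{L}+\p{P}?\p{L}+":
# [^\W\d_] matches any Unicode letter. The punctuation class below is a
# small hand-picked subset (assumption), not the full \p{P} category.
LETTER = r"[^\W\d_]"
TOKEN = re.compile(rf"{LETTER}+['’\-]?{LETTER}+")


def tokenize(text, lowercase=True):
    """Return all tokens of at least two letters, optionally lowercased."""
    if lowercase:
        text = text.lower()
    return TOKEN.findall(text)


tokens = tokenize("It's a well-known fact: Émile wrote 3 novels.")
print(tokens)
```

Note that the pattern requires at least two letters per token, so one-letter words such as "a" and bare digits are dropped, while forms with a single inner punctuation mark ("it's", "well-known") stay together.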

You can also plug the DARIAH-DKPro-Wrapper into this pipeline to lemmatize text or to keep only certain word types.

Check out the introductory Jupyter notebook.

Getting started

To install the latest stable version:

$ pip install cophi

To install the latest development version:

$ pip install --upgrade git+https://github.com/cophi-wue/cophi-toolbox.git@testing

Available complexity measures

There are also plenty of complexity measures for quantifying the lexical richness of (literary) texts.

Measures that use sample size and vocabulary size:

  • Type-Token Ratio TTR
  • Guiraud’s R
  • Herdan’s C
  • Dugast’s k
  • Maas’ a2
  • Dugast’s U
  • Tuldava’s LN
  • Brunet’s W
  • Carroll’s CTTR
  • Summer’s S
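As an illustration of the measures in this group, the sketch below computes four of them from the sample size N (number of tokens) and the vocabulary size V (number of types). It uses the commonly cited textbook definitions; the function name lexical_measures is hypothetical, and cophi's own implementation and naming may differ.

```python
import math


def lexical_measures(tokens):
    """Compute a few richness measures from N (tokens) and V (types).

    Requires at least two tokens, because log(N) is used as a divisor.
    """
    n = len(tokens)       # sample size N
    v = len(set(tokens))  # vocabulary size V
    return {
        "ttr": v / n,                           # Type-Token Ratio: V / N
        "guiraud_r": v / math.sqrt(n),          # Guiraud's R: V / sqrt(N)
        "herdan_c": math.log(v) / math.log(n),  # Herdan's C: log V / log N
        "carroll_cttr": v / math.sqrt(2 * n),   # Carroll's CTTR: V / sqrt(2N)
    }


tokens = "the cat sat on the mat and the dog sat too".split()
measures = lexical_measures(tokens)
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```

All four values depend only on N and V, which is why they are sensitive to text length: comparing texts of very different sizes with these measures should be done with care.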

Measures that use part of the frequency spectrum:

  • Honoré’s H
  • Sichel’s S
  • Michéa’s M
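The measures in this group look only at the low end of the frequency spectrum, that is, at the number of types occurring exactly once (V1) or twice (V2). The following sketch assumes the usual definitions H = 100 · log N / (1 − V1/V), S = V2/V and M = V/V2; the function names are hypothetical and not taken from cophi.

```python
import math
from collections import Counter


def frequency_spectrum(tokens):
    """Map each frequency i to the number of types occurring exactly i times."""
    type_counts = Counter(tokens)
    return Counter(type_counts.values())


def honore_h(tokens):
    """Honoré's H; undefined when every type occurs exactly once (V1 == V)."""
    n, v = len(tokens), len(set(tokens))
    v1 = frequency_spectrum(tokens)[1]  # hapax legomena
    return 100 * math.log(n) / (1 - v1 / v)


def sichel_s(tokens):
    """Sichel's S: share of types that occur exactly twice."""
    v = len(set(tokens))
    v2 = frequency_spectrum(tokens)[2]  # dis legomena
    return v2 / v


def michea_m(tokens):
    """Michéa's M: inverse of Sichel's S; undefined when V2 == 0."""
    v = len(set(tokens))
    v2 = frequency_spectrum(tokens)[2]
    return v / v2


tokens = "the cat sat on the mat and the dog sat too".split()
print(dict(frequency_spectrum(tokens)))
print(honore_h(tokens), sichel_s(tokens), michea_m(tokens))
```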

Measures that use the whole frequency spectrum:

  • Entropy S
  • Yule’s K
  • Simpson’s D
  • Herdan’s Vm
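Measures over the whole frequency spectrum sum over every frequency class i with its number of types V_i. As a hedged sketch, two of them are shown below with their commonly cited definitions, K = 10^4 · (Σ i² V_i − N) / N² and D = Σ V_i · (i/N) · ((i − 1)/(N − 1)); the function names are hypothetical, and cophi's implementation may differ in detail.

```python
from collections import Counter


def frequency_spectrum(tokens):
    """Map each frequency i to the number of types occurring exactly i times."""
    return Counter(Counter(tokens).values())


def yule_k(tokens):
    """Yule's K: 10^4 * (sum(i^2 * V_i) - N) / N^2."""
    n = len(tokens)
    spectrum = frequency_spectrum(tokens)
    weighted = sum(i ** 2 * v_i for i, v_i in spectrum.items())
    return 10 ** 4 * (weighted - n) / n ** 2


def simpson_d(tokens):
    """Simpson's D: chance that two tokens drawn without replacement match.

    Requires at least two tokens.
    """
    n = len(tokens)
    spectrum = frequency_spectrum(tokens)
    return sum(v_i * (i / n) * ((i - 1) / (n - 1)) for i, v_i in spectrum.items())


tokens = "the cat sat on the mat and the dog sat too".split()
print(yule_k(tokens), simpson_d(tokens))
```

Both values grow with repetition: a text in which a few types dominate yields a higher K and D than one whose tokens are spread evenly over many types.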

Parameters of probabilistic models:

  • Orlov’s Z

Download files

Files for cophi, version 1.3.2:

  • cophi-1.3.2-py3-none-any.whl (17.1 kB), wheel, Python version py3
  • cophi-1.3.2.tar.gz (14.5 kB), source distribution
