A library for preprocessing.

A library for processing text data

cophi is a Python library for handling, modeling and processing text corpora. You can easily pipe a collection of text files using the high-level API:

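import cophi

# Read every .txt file below the directory, lowercase the text, and tokenize it.
# \p{L} and \p{P} are the Unicode property classes for letters and punctuation.
corpus, metadata = cophi.corpus(directory="british-fiction-corpus",
                                filepath_pattern="**/*.txt",
                                encoding="utf-8",
                                lowercase=True,
                                token_pattern=r"\p{L}+\p{P}?\p{L}+")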
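
To get a feel for what the token pattern above selects, here is a minimal sketch that uses only the standard library. It is not cophi's implementation: the `tokenize` helper is hypothetical, and because the built-in `re` module does not support `\p{...}` classes, letters are approximated with `[^\W\d_]` and punctuation is narrowed to apostrophes and hyphens.

```python
import re

# Stdlib approximation of r"\p{L}+\p{P}?\p{L}+" (an assumption, not cophi's code):
# [^\W\d_] matches any Unicode letter; the punctuation class is limited to
# apostrophes and hyphens, whereas \p{P} covers far more characters.
LETTER = r"[^\W\d_]"
TOKEN_PATTERN = re.compile(rf"{LETTER}+['’\-]?{LETTER}+")


def tokenize(text, lowercase=True):
    """Return all substrings of `text` that match TOKEN_PATTERN."""
    if lowercase:
        text = text.lower()
    return TOKEN_PATTERN.findall(text)


print(tokenize("It's a well-known fact, isn't it?"))
# → ["it's", 'well-known', 'fact', "isn't", 'it']
```

Note that the pattern requires at least two letters, so single-letter words such as "a" are dropped, and at most one punctuation character is allowed inside a token.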

You can also plug the DARIAH-DKPro-Wrapper into this pipeline to lemmatize text, or just keep certain word types.

Check out the introductory Jupyter notebook.

Getting started

To install the latest stable version:

$ pip install cophi

To install the latest development version:

$ pip install --upgrade git+https://github.com/cophi-wue/cophi-toolbox.git@testing

Available complexity measures

There are also plenty of complexity measures for quantifying the lexical richness of (literary) texts.

Measures that use sample size and vocabulary size:

  • Type-Token Ratio TTR
  • Guiraud’s R
  • Herdan’s C
  • Dugast’s k
  • Maas’ a²
  • Dugast’s U
  • Tuldava’s LN
  • Brunet’s W
  • Carroll’s CTTR
  • Summer’s S
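
As an illustration of this first group, the sketch below computes four of the measures from the sample size N (number of tokens) and the vocabulary size V (number of types), using their textbook definitions. The function name `basic_measures` is hypothetical and is not part of cophi's API.

```python
import math
from collections import Counter


def basic_measures(tokens):
    """Compute measures that depend only on sample size N and vocabulary size V."""
    n = len(tokens)           # sample size N
    v = len(Counter(tokens))  # vocabulary size V
    if n < 2:
        # log(N) is zero for N = 1, which would make Herdan's C undefined.
        raise ValueError("need at least two tokens")
    return {
        "ttr": v / n,                           # Type-Token Ratio: V / N
        "guiraud_r": v / math.sqrt(n),          # Guiraud's R: V / sqrt(N)
        "herdan_c": math.log(v) / math.log(n),  # Herdan's C: log V / log N
        "carroll_cttr": v / math.sqrt(2 * n),   # Carroll's CTTR: V / sqrt(2N)
    }


print(basic_measures("a b a c".split()))
```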

Measures that use part of the frequency spectrum:

  • Honoré’s H
  • Sichel’s S
  • Michéa’s M
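
The measures in this group rely on the number of types that occur exactly once (V1, hapax legomena) or exactly twice (V2, dislegomena). The following sketch uses the commonly cited definitions; the helper names are hypothetical and not taken from cophi.

```python
import math
from collections import Counter


def frequency_spectrum(tokens):
    """Map each frequency m to the number of types occurring exactly m times."""
    return Counter(Counter(tokens).values())


def spectrum_part_measures(tokens):
    """Compute Honoré's H, Sichel's S and Michéa's M from V, V1 and V2."""
    n = len(tokens)
    spectrum = frequency_spectrum(tokens)
    v = sum(spectrum.values())  # vocabulary size V
    v1 = spectrum[1]            # hapax legomena
    v2 = spectrum[2]            # dislegomena
    if v1 == v or v2 == 0:
        # H divides by (1 - V1/V) and M divides by V2.
        raise ValueError("measures undefined for this frequency spectrum")
    return {
        "honore_h": 100 * math.log(n) / (1 - v1 / v),
        "sichel_s": v2 / v,
        "michea_m": v / v2,
    }


print(spectrum_part_measures("a a b b c d".split()))
```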

Measures that use the whole frequency spectrum:

  • Entropy S
  • Yule’s K
  • Simpson’s D
  • Herdan’s Vm
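
For the measures based on the whole frequency spectrum, a minimal sketch might look as follows. It assumes the standard formulations of Yule's K and Simpson's D, and it takes entropy to be Shannon entropy with the natural logarithm, which may differ from the variant cophi implements. The function name is hypothetical.

```python
import math
from collections import Counter


def whole_spectrum_measures(tokens):
    """Compute Yule's K, Simpson's D and Shannon entropy from type frequencies."""
    n = len(tokens)
    if n < 2:
        # Simpson's D divides by (N - 1).
        raise ValueError("need at least two tokens")
    freqs = list(Counter(tokens).values())  # one frequency per type
    # Yule's K: 10^4 * (sum of squared frequencies - N) / N^2
    yule_k = 1e4 * (sum(f * f for f in freqs) - n) / (n * n)
    # Simpson's D: probability that two tokens drawn without replacement
    # belong to the same type.
    simpson_d = sum((f / n) * ((f - 1) / (n - 1)) for f in freqs)
    # Shannon entropy over the relative type frequencies (natural log assumed).
    entropy = -sum((f / n) * math.log(f / n) for f in freqs)
    return {"yule_k": yule_k, "simpson_d": simpson_d, "entropy": entropy}


print(whole_spectrum_measures("a a b b c d".split()))
```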

Parameters of probabilistic models:

  • Orlov’s Z

