A library for preprocessing.

Project description

A library for processing text data

cophi is a Python library for handling, modeling and processing text corpora. You can load and preprocess a whole collection of text files in a single call to the high-level API:

import cophi

# Read every .txt file below the directory, lowercase it and tokenize it.
corpus, metadata = cophi.corpus(directory="british-fiction-corpus",
                                filepath_pattern="**/*.txt",
                                encoding="utf-8",
                                lowercase=True,
                                token_pattern=r"\p{L}+\p{P}?\p{L}+")
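
To see what a token pattern like the one above selects, here is a minimal standalone sketch. It does not use cophi itself. Python's built-in re module does not support \p{L} or \p{P} (the third-party regex module does), so the sketch assumes a rough standard-library approximation: letters are matched with [^\W\d_] and punctuation is narrowed to the apostrophe and hyphen. The names TOKEN_PATTERN and tokenize are hypothetical.

```python
import re

# Assumption: a stdlib approximation of r"\p{L}+\p{P}?\p{L}+".
# [^\W\d_] matches roughly any Unicode letter; the punctuation class is
# narrowed to apostrophes and the hyphen for illustration.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+['’\-]?[^\W\d_]+")


def tokenize(text, lowercase=True):
    """Return all pattern matches in text, optionally lowercasing first."""
    if lowercase:
        text = text.lower()
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("I don't know a well-known fact.")
print(tokens)
```

Because the pattern requires letters on both sides of the optional punctuation mark, every token has at least two letters: single-letter words such as "I" and "a" are dropped, while "don't" and "well-known" each stay one token.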

You can also plug the DARIAH-DKPro-Wrapper into this pipeline to lemmatize text, or just keep certain word types.

Check out the introductory Jupyter notebook.

Getting started

To install the latest stable version:

$ pip install cophi

To install the latest development version:

$ pip install --upgrade git+https://github.com/cophi-wue/cophi-toolbox.git@testing

Available complexity measures

There are also plenty of complexity measures for quantifying the lexical richness of (literary) texts.

Measures that use sample size and vocabulary size:

  • Type-Token Ratio TTR
  • Guiraud’s R
  • Herdan’s C
  • Dugast’s k
  • Maas’ a2
  • Dugast’s U
  • Tuldava’s LN
  • Brunet’s W
  • Carroll’s CTTR
  • Summer’s S
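
As an illustration of this group, the following sketch computes four of the measures from the sample size N (number of tokens) and the vocabulary size V (number of types), using the formulas commonly given in the literature. It is plain Python, not cophi's API; the function name size_based_measures is hypothetical.

```python
import math


def size_based_measures(tokens):
    """Compute a few richness measures from N (tokens) and V (types)."""
    n = len(tokens)
    v = len(set(tokens))
    if n < 2:
        # log(N) is zero for N == 1, so Herdan's C would be undefined.
        raise ValueError("need at least two tokens")
    return {
        "ttr": v / n,                           # Type-Token Ratio: V / N
        "guiraud_r": v / math.sqrt(n),          # Guiraud's R: V / sqrt(N)
        "herdan_c": math.log(v) / math.log(n),  # Herdan's C: log V / log N
        "carroll_cttr": v / math.sqrt(2 * n),   # Carroll's CTTR: V / sqrt(2N)
    }


measures = size_based_measures("the cat sat on the mat".split())
print(measures)
```

The example sentence has N = 6 tokens and V = 5 types, so the type-token ratio is 5/6.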

Measures that use part of the frequency spectrum:

  • Honoré’s H
  • Sichel’s S
  • Michéa’s M
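
These measures rely on the low end of the frequency spectrum: the number of types occurring exactly once (hapax legomena, V1) or exactly twice (dislegomena, V2). The sketch below uses the formulas commonly given in the literature and is not taken from cophi's implementation; the function names are hypothetical.

```python
import math
from collections import Counter


def frequency_spectrum(tokens):
    """Map each frequency i to the number of types occurring exactly i times."""
    return Counter(Counter(tokens).values())


def spectrum_part_measures(tokens):
    """Compute Honoré's H, Sichel's S and Michéa's M from V, V1 and V2."""
    n = len(tokens)
    spectrum = frequency_spectrum(tokens)
    v = sum(spectrum.values())  # vocabulary size
    v1 = spectrum[1]            # hapax legomena
    v2 = spectrum[2]            # dislegomena
    if v1 == v or v2 == 0:
        # Honoré's H needs V1 < V; Michéa's M needs V2 > 0.
        raise ValueError("measures undefined for this frequency spectrum")
    return {
        "honore_h": 100 * math.log(n) / (1 - v1 / v),
        "sichel_s": v2 / v,
        "michea_m": v / v2,
    }


part_measures = spectrum_part_measures("a a b b c d".split())
print(part_measures)
```

In the example, two types occur once and two occur twice, so Sichel's S is 0.5 and Michéa's M is 2.0.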

Measures that use the whole frequency spectrum:

  • Entropy S
  • Yule’s K
  • Simpson’s D
  • Herdan’s Vm
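
Measures in this group take every type frequency into account. The following sketch computes three of them from raw token counts, using the formulas commonly given in the literature. It is not cophi's API, the function name is hypothetical, and the base-2 logarithm for entropy is an assumption (other bases only rescale the value).

```python
import math
from collections import Counter


def whole_spectrum_measures(tokens):
    """Compute entropy, Yule's K and Simpson's D from all type frequencies."""
    n = len(tokens)
    if n < 2:
        raise ValueError("need at least two tokens")
    freqs = Counter(tokens).values()
    return {
        # Entropy: -sum(p * log2(p)) over relative type frequencies p.
        "entropy": -sum((f / n) * math.log2(f / n) for f in freqs),
        # Yule's K: 10^4 * (sum(f^2) - N) / N^2.
        "yule_k": 1e4 * (sum(f * f for f in freqs) - n) / (n * n),
        # Simpson's D: probability that two tokens drawn without
        # replacement belong to the same type.
        "simpson_d": sum(f * (f - 1) for f in freqs) / (n * (n - 1)),
    }


whole_measures = whole_spectrum_measures("a a b b c d".split())
print(whole_measures)
```

Higher values of Yule's K and Simpson's D indicate more repetition, that is, a less varied vocabulary, whereas higher entropy indicates a more even distribution of types.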

Parameters of probabilistic models:

  • Orlov’s Z

Download files

Source Distribution

cophi-1.3.2.tar.gz (14.5 kB)

Uploaded Source

Built Distribution

cophi-1.3.2-py3-none-any.whl (17.1 kB)

Uploaded Python 3

File details

Details for the file cophi-1.3.2.tar.gz.

File metadata

  • Download URL: cophi-1.3.2.tar.gz
  • Upload date:
  • Size: 14.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/40.6.2 requests-toolbelt/0.9.1 tqdm/4.26.0 CPython/3.7.2

File hashes

Hashes for cophi-1.3.2.tar.gz

Algorithm    Hash digest
SHA256       ffefc3997105dbd93dd8403c0bd7a452f5516d97d2119648dd135f765dce7e33
MD5          67c4b2a3af54300000e60b2108b298b7
BLAKE2b-256  20df520517d7092c8a579c8edab8c919f291d0b3a204e8116557e5977afd9b79


File details

Details for the file cophi-1.3.2-py3-none-any.whl.

File metadata

  • Download URL: cophi-1.3.2-py3-none-any.whl
  • Upload date:
  • Size: 17.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/40.6.2 requests-toolbelt/0.9.1 tqdm/4.26.0 CPython/3.7.2

File hashes

Hashes for cophi-1.3.2-py3-none-any.whl

Algorithm    Hash digest
SHA256       bafa4a504b700fd098d6a801c3aa4c4fa8e670f38c66f23a04953f04fc252272
MD5          7d2fd24fa1fb0d57e0cbcd16c75a35f6
BLAKE2b-256  97e7fb9fd78982253a9950e5ca2618a5755a58b1b5b28380bc34381a4bf3aa46

