A library for preprocessing.
cophi is a Python library for handling, modeling and processing text corpora. You
can easily pipe a collection of text files through the high-level API:
corpus, metadata = cophi.corpus(directory="british-fiction-corpus",
                                filepath_pattern="**/*.txt",
                                encoding="utf-8",
                                lowercase=True,
                                token_pattern=r"\p{L}+\p{P}?\p{L}+")
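To make the parameters above more concrete, here is a minimal standard-library sketch of the steps such a call implies: glob the files, decode them, lowercase the text and tokenize it. It is not cophi's implementation. The names `tokenize` and `read_corpus` are hypothetical, and the token pattern is only an approximation, because the `\p{L}` and `\p{P}` classes are supported by the third-party `regex` package but not by the built-in `re` module.

```python
import re
from collections import Counter
from pathlib import Path

# Stdlib approximation of r"\p{L}+\p{P}?\p{L}+" (an assumption, not cophi's code):
# [^\W\d_] matches roughly any Unicode letter, and the punctuation class is a
# small hand-picked subset of \p{P} (apostrophes and hyphen).
LETTER = r"[^\W\d_]"
TOKEN_PATTERN = re.compile(rf"{LETTER}+['’\-]?{LETTER}+")


def tokenize(text, lowercase=True):
    """Return all tokens in `text`, lowercasing the text first if requested."""
    if lowercase:
        text = text.lower()
    return TOKEN_PATTERN.findall(text)


def read_corpus(directory, filepath_pattern="**/*.txt", encoding="utf-8"):
    """Map each file stem to a Counter holding its token frequencies."""
    corpus = {}
    for path in sorted(Path(directory).glob(filepath_pattern)):
        text = path.read_text(encoding=encoding)
        corpus[path.stem] = Counter(tokenize(text))
    return corpus
```

Note that the pattern requires at least two letters per token, so single-letter words such as "a" are dropped, while forms with one inner punctuation mark such as "don't" or "well-known" stay in one piece.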
Getting started
To install the latest stable version:
$ pip install cophi
To install the latest development version:
$ pip install --upgrade git+https://github.com/cophi-wue/cophi-toolbox.git@testing
Check out the introductory Jupyter notebook.
Contents
- api: High-level API.
- model: Low-level model classes.
- complexity: Measures that assess the linguistic and stylistic complexity of (literary) texts.
- utils: Low-level helper functions.
Available complexity measures
Measures that use sample size and vocabulary size:
- Type-Token Ratio TTR
- Guiraud’s R
- Herdan’s C
- Dugast’s k
- Maas’ a2
- Dugast’s U
- Tuldava’s LN
- Brunet’s W
- Carroll’s CTTR
- Summer’s S
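As an illustration of this first group, the sketch below computes four of the measures from the sample size N (number of tokens) and the vocabulary size V (number of types). The formulas are the textbook definitions and the function names are hypothetical; cophi's own functions may be named and parameterized differently.

```python
import math


def ttr(num_types, num_tokens):
    """Type-Token Ratio: V / N."""
    return num_types / num_tokens


def guiraud_r(num_types, num_tokens):
    """Guiraud's R: V / sqrt(N)."""
    return num_types / math.sqrt(num_tokens)


def herdan_c(num_types, num_tokens):
    """Herdan's C: log(V) / log(N)."""
    return math.log(num_types) / math.log(num_tokens)


def carroll_cttr(num_types, num_tokens):
    """Carroll's CTTR: V / sqrt(2 * N)."""
    return num_types / math.sqrt(2 * num_tokens)


# Toy sample: 11 tokens, 8 distinct types.
tokens = "the cat sat on the mat and the dog sat too".split()
N = len(tokens)       # sample size
V = len(set(tokens))  # vocabulary size

for measure in (ttr, guiraud_r, herdan_c, carroll_cttr):
    print(measure.__name__, round(measure(V, N), 4))
```

All four depend only on V and N, which is why they tend to vary with text length when texts of different sizes are compared.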
Measures that use part of the frequency spectrum:
- Honoré’s H
- Sichel’s S
- Michéa’s M
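The measures in this group look at specific entries of the frequency spectrum, that is, the number of types V(i, N) that occur exactly i times in a sample of N tokens. The sketch below builds the spectrum and derives Honoré's H and Sichel's S from it using their textbook definitions. The function names are hypothetical and not taken from cophi.

```python
import math
from collections import Counter


def frequency_spectrum(tokens):
    """Map each frequency i to the number of types occurring exactly i times."""
    type_frequencies = Counter(tokens)
    return Counter(type_frequencies.values())


def honore_h(tokens):
    """Honoré's H: 100 * log(N) / (1 - V1 / V), with V1 the number of hapaxes.

    Raises ZeroDivisionError when every type occurs exactly once (V1 == V).
    """
    spectrum = frequency_spectrum(tokens)
    num_tokens = len(tokens)
    num_types = sum(spectrum.values())
    return 100 * math.log(num_tokens) / (1 - spectrum[1] / num_types)


def sichel_s(tokens):
    """Sichel's S: V2 / V, the share of types that occur exactly twice."""
    spectrum = frequency_spectrum(tokens)
    return spectrum[2] / sum(spectrum.values())


# Toy sample: "the" occurs 3 times, "sat" twice, six other types once each.
tokens = "the cat sat on the mat and the dog sat too".split()
spectrum = frequency_spectrum(tokens)
```

For the toy sample the spectrum has six hapax legomena, one type with frequency two and one with frequency three.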
Measures that use the whole frequency spectrum:
- Entropy S
- Yule’s K
- Simpson’s D
- Herdan’s Vm
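Measures in this group sum over every entry of the frequency spectrum V(i, N). The following sketch implements entropy, Yule's K and Simpson's D from their textbook definitions. The function names are hypothetical, and the use of the natural logarithm for entropy is an assumption; cophi may use a different base or signature.

```python
import math
from collections import Counter


def frequency_spectrum(tokens):
    """Map each frequency i to the number of types occurring exactly i times."""
    return Counter(Counter(tokens).values())


def entropy(tokens):
    """Entropy: sum over i of V(i, N) * -log(i / N) * (i / N), natural log."""
    n = len(tokens)
    spectrum = frequency_spectrum(tokens)
    return sum(v * -math.log(i / n) * (i / n) for i, v in spectrum.items())


def yule_k(tokens):
    """Yule's K: 10^4 * (sum over i of i^2 * V(i, N) - N) / N^2."""
    n = len(tokens)
    spectrum = frequency_spectrum(tokens)
    return 1e4 * (sum(v * i ** 2 for i, v in spectrum.items()) - n) / n ** 2


def simpson_d(tokens):
    """Simpson's D: sum over i of V(i, N) * (i / N) * ((i - 1) / (N - 1))."""
    n = len(tokens)
    spectrum = frequency_spectrum(tokens)
    return sum(v * (i / n) * ((i - 1) / (n - 1)) for i, v in spectrum.items())


tokens = "the cat sat on the mat and the dog sat too".split()
for measure in (entropy, yule_k, simpson_d):
    print(measure.__name__, round(measure(tokens), 4))
```

Yule's K and Simpson's D both grow with the amount of repetition in a text, whereas entropy is highest when all types are equally frequent.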
Parameters of probabilistic models:
- Orlov’s Z