A library for preprocessing.
A library for processing text data
cophi is a Python library for handling, modeling and processing text corpora. You can easily pipe a collection of text files through the high-level API:
import cophi

# Tokens are runs of letters, optionally joined by one punctuation mark.
corpus, metadata = cophi.corpus(directory="british-fiction-corpus",
                                filepath_pattern="**/*.txt",
                                encoding="utf-8",
                                lowercase=True,
                                token_pattern=r"\p{L}+\p{P}?\p{L}+")
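To make the parameters above more concrete, here is a minimal standard-library sketch of what such a call conceptually does: collect files by glob pattern, decode, lowercase, tokenize, and count. It is not cophi's implementation and does not reproduce its return values. The helper name `read_corpus` and the `TOKEN` pattern are assumptions made for illustration. The pattern only approximates `\p{L}+\p{P}?\p{L}+`, because Python's built-in `re` module does not support `\p{...}` classes (the third-party `regex` module does).

```python
import re
from collections import Counter
from pathlib import Path

# Assumption: a stdlib approximation of r"\p{L}+\p{P}?\p{L}+".
# [^\W\d_] stands in for "any letter"; the inner punctuation is limited
# to an apostrophe or a hyphen.
TOKEN = re.compile(r"[^\W\d_]+['’\-]?[^\W\d_]+")


def read_corpus(directory, filepath_pattern="**/*.txt",
                encoding="utf-8", lowercase=True):
    """Hypothetical helper: map each file's stem to its token counts."""
    counts = {}
    for path in sorted(Path(directory).glob(filepath_pattern)):
        text = path.read_text(encoding=encoding)
        if lowercase:
            text = text.lower()
        counts[path.stem] = Counter(TOKEN.findall(text))
    return counts
```

Note that the pattern requires at least two letters, so one-letter words such as "a" are dropped, while forms like "don't" or "well-known" stay together as single tokens.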
You can also plug the DARIAH-DKPro-Wrapper into this pipeline to lemmatize text, or just keep certain word types.
Check out the introductory Jupyter notebook.
Getting started
To install the latest stable version:
$ pip install cophi
To install the latest development version:
$ pip install --upgrade git+https://github.com/cophi-wue/cophi-toolbox.git@testing
Available complexity measures
cophi also provides a range of complexity metrics for measuring the lexical richness of (literary) texts.
Measures that use sample size and vocabulary size:
- Type-Token Ratio TTR
- Guiraud’s R
- Herdan’s C
- Dugast’s k
- Maas’ a2
- Dugast’s U
- Tuldava’s LN
- Brunet’s W
- Carroll’s CTTR
- Summer’s S
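As an illustration of this first group, the sketch below computes a few of the measures from the number of tokens N (sample size) and the number of types V (vocabulary size). It uses the commonly cited textbook definitions with natural logarithms. The function name `vocabulary_measures` is hypothetical, and cophi's own implementation may differ in details such as the log base.

```python
import math


def vocabulary_measures(tokens):
    """Hypothetical helper: measures based only on N (tokens) and V (types)."""
    n = len(tokens)       # sample size N; must be > 1 for the log-based measures
    v = len(set(tokens))  # vocabulary size V
    return {
        "ttr": v / n,                                  # Type-Token Ratio: V / N
        "guiraud_r": v / math.sqrt(n),                 # Guiraud's R: V / sqrt(N)
        "herdan_c": math.log(v) / math.log(n),         # Herdan's C: log V / log N
        "carroll_cttr": v / math.sqrt(2 * n),          # Carroll's CTTR: V / sqrt(2N)
        # Maas' a2: (log N - log V) / (log N)^2
        "maas_a2": (math.log(n) - math.log(v)) / math.log(n) ** 2,
    }
```

For example, `vocabulary_measures("a b a c".split())` works on N = 4 tokens and V = 3 types, giving a type-token ratio of 0.75.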
Measures that use part of the frequency spectrum:
- Honoré’s H
- Sichel’s S
- Michéa’s M
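These three measures rely on the low end of the frequency spectrum: V1, the number of types occurring exactly once (hapax legomena), and V2, the number occurring exactly twice (dislegomena). A minimal sketch under the usual textbook definitions follows. The function name `spectrum_measures` is hypothetical and this is not cophi's API.

```python
import math
from collections import Counter


def spectrum_measures(tokens):
    """Hypothetical helper: measures based on hapax and dislegomena counts."""
    n = len(tokens)                      # sample size N
    freqs = Counter(tokens)              # type -> frequency
    v = len(freqs)                       # vocabulary size V
    spectrum = Counter(freqs.values())   # frequency i -> number of types V_i
    v1, v2 = spectrum[1], spectrum[2]    # missing keys count as 0
    return {
        # Honoré's H: 100 * log N / (1 - V1 / V); undefined if every type is a hapax
        "honore_h": 100 * math.log(n) / (1 - v1 / v) if v1 < v else math.inf,
        # Sichel's S: V2 / V
        "sichel_s": v2 / v,
        # Michéa's M: V / V2; undefined if there are no dislegomena
        "michea_m": v / v2 if v2 else math.inf,
    }
```

The undefined cases are mapped to infinity here purely as a design choice for the sketch; raising an exception would be an equally reasonable option.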
Measures that use the whole frequency spectrum:
- Entropy S
- Yule’s K
- Simpson’s D
- Herdan’s Vm
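Measures in this group sum over every frequency class i, weighted by the number of types V_i that occur i times. The sketch below shows Yule's K, Simpson's D and a Shannon-style entropy under their standard definitions. The function name `full_spectrum_measures` is hypothetical, and cophi's own formulas (especially for entropy) may be defined differently.

```python
import math
from collections import Counter


def full_spectrum_measures(tokens):
    """Hypothetical helper: measures summing over the whole frequency spectrum."""
    n = len(tokens)                                  # sample size N; must be > 1
    spectrum = Counter(Counter(tokens).values())     # frequency i -> V_i
    return {
        # Yule's K: 10^4 * (sum_i i^2 * V_i - N) / N^2
        "yule_k": 1e4 * (sum(i * i * vi for i, vi in spectrum.items()) - n) / n ** 2,
        # Simpson's D: sum_i V_i * (i / N) * ((i - 1) / (N - 1))
        "simpson_d": sum(vi * (i / n) * ((i - 1) / (n - 1))
                         for i, vi in spectrum.items()),
        # Shannon entropy in nats: -sum_i V_i * (i / N) * log(i / N)
        "entropy": -sum(vi * (i / n) * math.log(i / n)
                        for i, vi in spectrum.items()),
    }
```

Both Yule's K and Simpson's D grow as a text repeats the same types more often, so higher values indicate lower lexical richness.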
Parameters of probabilistic models:
- Orlov’s Z