Legomena
Tool for exploring types, tokens, and n-legomena relationships in text, based on the Davis 2019 research paper [1].
Installation
pip install legomena
Data Sources
This package may be driven by any data source, but the author has tested two: the Natural Language ToolKit (NLTK) and the Standard Project Gutenberg Corpus (SPGC). The former is the gold standard of Python NLP applications but ships a rather small 18-book Gutenberg corpus. The latter contains the full Gutenberg corpus of 55,000+ books, already tokenized and counted. NOTE: Where the two datasets overlap, their exact type/token counts do not agree because their methodologies differ, but this package takes type/token counts as raw data and is therefore methodology-agnostic.
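# moby dick from NLTK
import nltk
from nltk.corpus import gutenberg
from legomena import Corpus  # import path assumed from the package name
nltk.download("gutenberg")
words = gutenberg.words("melville-moby_dick.txt")
corpus = Corpus(words)
# compare both counts as a tuple; `assert a, b == c` would only check that `a` is truthy
assert (corpus.M, corpus.N) == (260819, 19317)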
# moby dick from SPGC
# NOTE: download and unzip https://zenodo.org/record/2422561/files/SPGC-counts-2018-07-18.zip into DATA_FOLDER
import pandas as pd
fname = "%s/SPGC-counts-2018-07-18/PG2701_counts.txt" % DATA_FOLDER
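# the with-block closes the file on exit, so no explicit f.close() is needed
with open(fname) as f:
    df = pd.read_csv(f, delimiter="\t", header=None, names=["word", "freq"])
wfd = {str(row.word): int(row.freq) for row in df.itertuples()}
corpus = Corpus(wfd)
# compare both counts as a tuple; `assert a, b == c` would only check that `a` is truthy
assert (corpus.M, corpus.N) == (210258, 16402)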
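Because the package takes type/token counts as raw data, the two quantities checked above can be reproduced from any word-frequency mapping. The following is a minimal sketch in plain Python, independent of legomena; the toy sentence and the helper name `type_token_counts` are illustrative assumptions, not part of the package.

```python
from collections import Counter

def type_token_counts(wfd):
    """Return (M, N): total tokens and distinct types in a word-frequency mapping."""
    m_tokens = sum(wfd.values())  # M: every occurrence of every word counts
    n_types = len(wfd)            # N: each distinct word counts once
    return m_tokens, n_types

# toy data (hypothetical); a real run would use a tokenized book instead
words = "the whale the sea the ship a whale".split()
wfd = Counter(words)              # word -> frequency, the same shape as the SPGC counts
M, N = type_token_counts(wfd)
```

Two tokenizers that disagree on what counts as a word will produce different mappings, and therefore different values of M and N, for the same book.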
Basic Usage
Demo notebooks may be found here. Unit tests may be found here.
# basic properties
corpus.tokens # list of tokens
corpus.types # list of types
corpus.fdist # word frequency distribution dataframe
corpus.WFD # alias for corpus.fdist
corpus.M # number of tokens
corpus.N # number of types
corpus.k # n-legomena vector
corpus.k[n] # n-legomena count (n=1 -> number of hapaxes)
corpus.hapax # list of hapax legomena, alias for corpus.nlegomena(1)
corpus.dis # list of dis legomena, alias for corpus.nlegomena(2)
corpus.tris # list of tris legomena, alias for corpus.nlegomena(3)
corpus.tetrakis # list of tetrakis legomena, alias for corpus.nlegomena(4)
corpus.pentakis # list of pentakis legomena, alias for corpus.nlegomena(5)
# advanced properties
corpus.options # tuple of optional settings
corpus.resolution # number of samples to take to calculate TTR curve
corpus.dimension # n-legomena vector length to pre-compute (max 6)
corpus.seed # random number seed for sampling TTR data
corpus.TTR # type-token ratio dataframe
# basic functions
corpus.nlegomena(n:int) # list of types occurring exactly n times
corpus.sample(m:int) # samples m tokens from corpus *without replacement*
corpus.sample(x:float) # samples proportion x of corpus *without replacement*
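To make the terms above concrete, here is a minimal sketch of n-legomena counting and of sampling without replacement to build a type-token curve. It uses only the standard library and is not the package's implementation; the function names, the `resolution` default, and the toy sentence are assumptions chosen to mirror the property names listed above.

```python
import random
from collections import Counter

def nlegomena(words, n):
    """Types occurring exactly n times (n=1 gives the hapax legomena)."""
    fdist = Counter(words)
    return sorted(w for w, freq in fdist.items() if freq == n)

def legomena_counts(words, max_n=5):
    """Map n -> number of types occurring exactly n times, for n = 1..max_n."""
    freq_of_freqs = Counter(Counter(words).values())
    return {n: freq_of_freqs.get(n, 0) for n in range(1, max_n + 1)}

def sample_tokens(words, m, seed=None):
    """Draw m tokens without replacement."""
    rng = random.Random(seed)
    return rng.sample(list(words), m)

def ttr_curve(words, resolution=4, seed=None):
    """(m_tokens, n_types) pairs at evenly spaced sample sizes up to the full text."""
    total = len(words)
    sizes = [total * i // resolution for i in range(1, resolution + 1)]
    return [(m, len(set(sample_tokens(words, m, seed)))) for m in sizes]

# toy data (hypothetical)
words = "the whale the sea the ship a whale".split()
hapaxes = nlegomena(words, 1)
counts = legomena_counts(words, max_n=3)
curve = ttr_curve(words, resolution=4, seed=0)
```

Sampling without replacement means the largest sample is the whole text, so the last point of the curve always equals the corpus-level token and type counts.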
Type-Token Models
The literature offers a variety of models predicting the number of types as a function of the number of tokens, the best known being Heaps' Law. A few are implemented here, designed to be used alongside the Corpus class.
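# three models
from legomena import HeapsModel, InfSeriesModel, LogModel  # import path assumed from the package name
model = HeapsModel() # Heaps' Law
model = InfSeriesModel(corpus) # Infinite Series Model [1]
model = LogModel() # Logarithmic Model [1]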
# model fitting
m_tokens = corpus.TTR.m_tokens
n_types = corpus.TTR.n_types
model.fit(m_tokens, n_types)
predictions = model.fit_predict(m_tokens, n_types)
# model parameters
model.params
# model predictions
predictions = model.predict(m_tokens)
# log model only
dim = corpus.dimension
predicted_k = model.predict_k(m_tokens, dim)
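For intuition about what fitting a type-token model involves, the sketch below fits the power-law form N = K * M^B by ordinary least squares on the log-log scale. This is a standard textbook approach and is only an assumption about how such a fit could be done; it does not claim to reproduce how the package's model classes estimate their parameters. The names `fit_heaps` and `predict_heaps` and the synthetic data are hypothetical.

```python
import math

def fit_heaps(m_tokens, n_types):
    """Fit N = K * M**B via least squares on log N = log K + B * log M."""
    xs = [math.log(m) for m in m_tokens]
    ys = [math.log(n) for n in n_types]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    B = sxy / sxx                      # slope of the log-log line
    K = math.exp(y_mean - B * x_mean)  # intercept, mapped back from log space
    return K, B

def predict_heaps(m_tokens, K, B):
    """Predicted number of types for each token count."""
    return [K * m ** B for m in m_tokens]

# synthetic data generated from a known power law (K=10, B=0.5)
m = [100, 1000, 10000, 100000]
n = [10.0 * x ** 0.5 for x in m]
K, B = fit_heaps(m, n)
predicted = predict_heaps(m, K, B)
```

On real text the points do not fall exactly on a line in log-log space, so the fitted K and B depend on which sample sizes are included.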
Demo App
Check out the demo app to explore the type-token and n-legomena counts of a few Project Gutenberg books.