Legomena
Tool for exploring types, tokens, and n-legomena relationships in text, based on the Davis 2019 research paper [1].
Installation
pip install legomena
Data Sources
This package is driven by two data sources: the Natural Language ToolKit (NLTK) and/or the Standard Project Gutenberg Corpus (SPGC). The former is the gold standard for Python NLP applications, but ships with a rather weak 16-book Gutenberg corpus. The latter contains the full 40,000+ book Gutenberg corpus, already tokenized and counted. NOTE: Where the two datasets overlap, their exact type/token counts do not agree, since their methodologies differ; this package takes type/token counts as raw data and is therefore methodology-agnostic.
To download either data source from a Python console:
import nltk
nltk.download("gutenberg")
from legomena import SPGC
SPGC.download()
Basic Usage
Demo notebooks and unit tests may be found in the project repository.
# standard project gutenberg corpus
from legomena import SPGC
corpus = SPGC.get(2701)  # Moby Dick
# natural language toolkit
from nltk.corpus import gutenberg
from legomena import Corpus
corpus = Corpus(gutenberg.words("melville-moby_dick.txt"))
# basic properties
corpus.tokens # list of tokens
corpus.types # list of types
corpus.fdist # word frequency distribution dataframe
corpus.WFD # alias for corpus.fdist
corpus.M # number of tokens
corpus.N # number of types
corpus.k # n-legomena vector
corpus.hapax # number of hapax legomena, alias for corpus.nlegomena(1)
corpus.dis # number of dis legomena, alias for corpus.nlegomena(2)
corpus.tris # number of tris legomena, alias for corpus.nlegomena(3)
corpus.tetrakis # number of tetrakis legomena, alias for corpus.nlegomena(4)
corpus.pentakis # number of pentakis legomena, alias for corpus.nlegomena(5)
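The counts above can be reproduced independently of the library with a toy corpus and collections.Counter; the variable names mirror the properties listed here but the snippet is illustrative only, not the package's implementation:

```python
from collections import Counter

text = "the quick brown fox jumps over the lazy dog the fox".split()

fdist = Counter(text)            # word frequency distribution
M = sum(fdist.values())          # number of tokens
N = len(fdist)                   # number of types
k = Counter(fdist.values())      # n-legomena counts: k[n] = types occurring exactly n times
hapax = [w for w, c in fdist.items() if c == 1]  # hapax legomena
```

Here M == 11 and N == 8: "the" occurs three times, "fox" twice, and the remaining six words once each, so k[1] == 6, k[2] == 1, k[3] == 1.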
# advanced properties
corpus.options # tuple of optional settings
corpus.resolution # number of samples when calculating TTR curve
corpus.dimension # n-legomena vector length to pre-compute
corpus.seed # random number seed for sampling TTR data
corpus.TTR # type-token ratio dataframe
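For intuition, a type-token curve like the one corpus.TTR tabulates can be sketched by walking a token stream and recording how many distinct types have appeared after each token. This is a minimal stand-in, not the library's sampling-based method:

```python
tokens = "a b a c a b d a e a b c".split()

seen = set()
ttr_curve = []
for i, tok in enumerate(tokens, start=1):
    seen.add(tok)
    ttr_curve.append((i, len(seen)))  # (m_tokens, n_types) after i tokens
```

The curve starts at (1, 1) and ends at (12, 5): twelve tokens, five types.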
# basic functions
corpus.nlegomena(n:int) # list of types occurring exactly n times
corpus.sample(m:int) # samples m tokens from corpus *without replacement*
corpus.sample(x:float) # samples proportion x of corpus *without replacement*
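Sampling without replacement, as corpus.sample does, can be illustrated with the standard library's random.sample on a raw token list. This is a sketch under a fixed seed (mirroring corpus.seed), not the package's implementation:

```python
import random

tokens = ("to be or not to be that is the question "
          "whether tis nobler in the mind to suffer").split()

rng = random.Random(42)            # fixed seed for reproducibility
m = 10
subsample = rng.sample(tokens, m)  # m tokens drawn without replacement
n_types = len(set(subsample))      # distinct types in the subsample
```

Because duplicates exist in the token list, n_types is at most m and typically smaller.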
Type-Token Models
There are a variety of models in the literature predicting the number of types as a function of tokens, the best known being Heaps' law. A few are implemented here, to be overlaid on data from the Corpus class.
# three models
from legomena import HeapsModel, InfSeriesModel, LogModel
model = HeapsModel()            # Heaps' Law
model = InfSeriesModel(corpus)  # Infinite Series Model [1]
model = LogModel(corpus)        # Logarithmic Model [1]
# model fitting
m_tokens = corpus.TTR.m_tokens
n_types = corpus.TTR.n_types
model.fit(m_tokens, n_types)
predictions = model.fit_predict(m_tokens, n_types)
# model parameters
model.params
# model predictions
predictions = model.predict(m_tokens)
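Heaps' law posits N = K * M**beta, which is a straight line on log-log axes, so fitting it reduces to ordinary least squares on the logs. A minimal sketch on synthetic data (not the package's HeapsModel.fit):

```python
import math

# synthetic type-token data generated from N = 10 * M**0.5
m_tokens = [100, 400, 1600, 6400]
n_types = [10 * m ** 0.5 for m in m_tokens]

# least-squares fit of log N = log K + beta * log M
xs = [math.log(m) for m in m_tokens]
ys = [math.log(n) for n in n_types]
count = len(xs)
xbar = sum(xs) / count
ybar = sum(ys) / count
beta = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
        / sum((x - xbar) ** 2 for x in xs))   # slope -> Heaps exponent
K = math.exp(ybar - beta * xbar)              # intercept -> Heaps constant
```

On this exact synthetic data the fit recovers beta = 0.5 and K = 10 up to floating-point error.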