Legomena

Tool for exploring types, tokens, and n-legomena relationships in text. Based on the Davis 2019 research paper [1].

Installation

pip install legomena

Data Sources

This package is driven by two data sources: the Natural Language Toolkit (NLTK) and/or the Standard Project Gutenberg Corpus (SPGC). The former is the gold standard of Python NLP applications but ships with a rather weak 16-book Gutenberg corpus; the latter contains the full 40,000+ book Gutenberg corpus, already tokenized and counted. NOTE: Where the two datasets overlap, they do not agree in their exact type/token counts, since their methodologies differ, but this package takes type/token counts as raw data and is therefore methodology-agnostic.

To download either data source from a Python console:

import nltk
nltk.download("gutenberg")

from legomena import SPGC
SPGC.download()

Basic Usage

Demo notebooks may be found here. Unit tests may be found here.

# standard project gutenberg corpus
from legomena import SPGC
corpus = SPGC.get(2701)  # Moby Dick

# natural language toolkit
from nltk.corpus import gutenberg
from legomena import Corpus
corpus = Corpus(gutenberg.words("melville-moby_dick.txt"))

# basic properties
corpus.tokens  # list of tokens
corpus.types  # list of types
corpus.fdist  # word frequency distribution dataframe
corpus.WFD  # alias for corpus.fdist
corpus.M  # number of tokens
corpus.N  # number of types
corpus.k  # n-legomena vector
corpus.hapax  # number of hapax legomena, alias for corpus.nlegomena(1)
corpus.dis  # number of dis legomena, alias for corpus.nlegomena(2)
corpus.tris  # number of tris legomena, alias for corpus.nlegomena(3)
corpus.tetrakis  # number of tetrakis legomena, alias for corpus.nlegomena(4)
corpus.pentakis  # number of pentakis legomena, alias for corpus.nlegomena(5)
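As a quick sanity check, these properties can be inspected directly on the corpus loaded above. A minimal sketch (the printed values depend on the data source and tokenization, and it assumes k behaves like an ordinary sequence):

# inspect the corpus loaded above
print(corpus.M, corpus.N)        # token and type counts
print(len(corpus.nlegomena(1)))  # how many types occur exactly once
print(corpus.k[:6])              # n-legomena counts for small n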

# advanced properties
corpus.options  # tuple of optional settings
corpus.resolution  # number of samples when calculating TTR curve
corpus.dimension  # n-legomena vector length to pre-compute
corpus.seed  # random number seed for sampling TTR data
corpus.TTR  # type-token ratio dataframe
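Since TTR is a dataframe, the sampled type-token curve can be eyeballed with the usual pandas calls. A sketch, assuming the m_tokens and n_types columns used in the model-fitting example below:

# peek at the sampled type-token curve
df = corpus.TTR
print(df.head())  # one row per sample: m_tokens vs n_types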

# basic functions
corpus.nlegomena(n:int)  # list of types occurring exactly n times
corpus.sample(m:int)  # samples m tokens from corpus *without replacement*
corpus.sample(x:float)  # samples proportion x of corpus *without replacement*
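For example, the float signature of sample() can draw a half-sized sub-corpus for comparison. A sketch, assuming sample() returns a new Corpus object:

# draw half the corpus, sampled without replacement
half = corpus.sample(0.5)
print(half.M)                  # roughly half of corpus.M
print(len(half.nlegomena(1)))  # hapax count of the sample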

Type-Token Models

There are a variety of models in the literature predicting the number of types as a function of the number of tokens, the best known being Heaps' Law (roughly N = K·M^β for constants K and β). A few of them are implemented here, designed to work with the Corpus class.

# three models
from legomena import HeapsModel, InfSeriesModel, LogModel
model = HeapsModel()  # Heaps' Law
model = InfSeriesModel(corpus)  # Infinite Series Model [1]
model = LogModel(corpus)  # Logarithmic Model [1]

# model fitting
m_tokens = corpus.TTR.m_tokens
n_types = corpus.TTR.n_types
model.fit(m_tokens, n_types)
predictions = model.fit_predict(m_tokens, n_types)

# model parameters
model.params

# model predictions
predictions = model.predict(m_tokens)
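Putting the pieces together, one might fit Heaps' Law to the TTR data of the corpus loaded above and compare predicted against observed type counts. A sketch using only the calls shown here; fit quality will vary by text:

# fit Heaps' Law to the sampled type-token data
model = HeapsModel()
predicted = model.fit_predict(corpus.TTR.m_tokens, corpus.TTR.n_types)
print(model.params)  # fitted model parameters
print(list(predicted[:3]), list(corpus.TTR.n_types[:3]))  # predicted vs observed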
