Legomena

Tool for exploring types, tokens, and n-legomena relationships in text, based on the research paper Davis 2019 [1].


Installation

pip install legomena

Data Sources

This package may be driven by any data source, but the author has tested two: the Natural Language Toolkit (NLTK) and the Standard Project Gutenberg Corpus (SPGC). The former is the gold standard for Python NLP applications but ships a rather measly 18-book Gutenberg corpus; the latter contains the full 55,000+ book Gutenberg corpus, already tokenized and counted. NOTE: Where the two datasets overlap, their exact type/token counts disagree because their methodologies differ, but this package takes type/token counts as raw data and is therefore methodology-agnostic.

# moby dick from NLTK
import nltk
nltk.download("gutenberg")
from nltk.corpus import gutenberg
from legomena import Corpus

words = gutenberg.words("melville-moby_dick.txt")
corpus = Corpus(words)
assert (corpus.M, corpus.N) == (260819, 19317)

# moby dick from SPGC
# NOTE: download and unzip https://zenodo.org/record/2422561/files/SPGC-counts-2018-07-18.zip into DATA_FOLDER
import pandas as pd
from legomena import Corpus

fname = "%s/SPGC-counts-2018-07-18/PG2701_counts.txt" % DATA_FOLDER
df = pd.read_csv(fname, delimiter="\t", header=None, names=["word", "freq"])
wfd = {str(row.word): int(row.freq) for row in df.itertuples()}
corpus = Corpus(wfd)
assert (corpus.M, corpus.N) == (210258, 16402)

Basic Usage

Demo notebooks may be found here. Unit tests may be found here.

# basic properties
corpus.tokens  # list of tokens
corpus.types  # list of types
corpus.fdist  # word frequency distribution dataframe
corpus.WFD  # alias for corpus.fdist
corpus.M  # number of tokens
corpus.N  # number of types
corpus.k  # n-legomena vector
corpus.k[n]  # n-legomena count (n=1 -> number of hapaxes)
corpus.hapax  # list of hapax legomena, alias for corpus.nlegomena(1)
corpus.dis  # list of dis legomena, alias for corpus.nlegomena(2)
corpus.tris  # list of tris legomena, alias for corpus.nlegomena(3)
corpus.tetrakis  # list of tetrakis legomena, alias for corpus.nlegomena(4)
corpus.pentakis  # list of pentakis legomena, alias for corpus.nlegomena(5)
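For intuition, the quantities these properties expose can be reproduced with a plain `collections.Counter`. This is a self-contained sketch of the definitions, not the package's implementation:

```python
from collections import Counter

words = ["the", "whale", "the", "sea", "the", "whale", "ahab"]
wfd = Counter(words)                            # word frequency distribution

M = sum(wfd.values())                           # number of tokens
N = len(wfd)                                    # number of types
hapax = [w for w, f in wfd.items() if f == 1]   # types occurring exactly once

# k[n] = number of types occurring exactly n times
k = Counter(wfd.values())

assert M == 7 and N == 4
assert sorted(hapax) == ["ahab", "sea"]
assert k[1] == 2 and k[2] == 1 and k[3] == 1
```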

# advanced properties
corpus.options  # tuple of optional settings
corpus.resolution  # number of samples to take to calculate TTR curve
corpus.dimension  # n-legomena vector length to pre-compute (max 6)
corpus.seed  # random number seed for sampling TTR data
corpus.TTR  # type-token ratio dataframe
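Conceptually, the TTR dataframe is built by subsampling the corpus at a number of increasing sizes (controlled by `corpus.resolution`) and recording the type count at each size. A self-contained sketch of that construction on a toy corpus, not the package's code:

```python
import random

random.seed(42)
tokens = ["the", "whale", "sea"] * 50 + ["ahab", "ishmael"] * 5  # toy corpus

resolution = 10  # number of sample points along the curve
ttr = []
for i in range(1, resolution + 1):
    m = i * len(tokens) // resolution      # sample size grows linearly
    sample = random.sample(tokens, m)      # draw m tokens without replacement
    ttr.append((m, len(set(sample))))      # (m_tokens, n_types)

# the final point samples the whole corpus, so it recovers all 5 types
assert ttr[-1] == (len(tokens), 5)
```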

# basic functions
corpus.nlegomena(n:int)  # list of types occurring exactly n times
corpus.sample(m:int)  # samples m tokens from corpus *without replacement*
corpus.sample(x:float)  # samples proportion x of corpus *without replacement*
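The two `sample` signatures differ only in how the size is specified: an int is a token count, a float is a proportion of the corpus. A sketch of the same behavior in plain Python (`sample_tokens` is a hypothetical helper, not part of the package):

```python
import random

def sample_tokens(tokens, size):
    """Draw tokens without replacement; int = count, float = proportion."""
    m = size if isinstance(size, int) else round(size * len(tokens))
    return random.sample(tokens, m)

random.seed(0)
tokens = ["a", "b", "c", "d"] * 25  # 100 tokens
assert len(sample_tokens(tokens, 30)) == 30    # by count
assert len(sample_tokens(tokens, 0.5)) == 50   # by proportion
```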

Type-Token Models

There are a variety of models in the literature predicting the number of types as a function of tokens, the best known being Heaps' Law. A few of them are implemented here and may be overlaid on the Corpus class.

# three models
model = HeapsModel()  # Heaps' Law
model = InfSeriesModel(corpus)  # Infinite Series Model [1]
model = LogModel()  # Logarithmic Model [1]

# model fitting
m_tokens = corpus.TTR.m_tokens
n_types = corpus.TTR.n_types
model.fit(m_tokens, n_types)
predictions = model.fit_predict(m_tokens, n_types)

# model parameters
model.params

# model predictions
predictions = model.predict(m_tokens)

# log model only
dim = corpus.dimension
predicted_k = model.predict_k(m_tokens, dim)
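Heaps' Law models the type count as N(M) = K·M^β, so fitting it reduces to ordinary least squares in log-log space. A self-contained sketch of that fit (an illustration of the technique, not the HeapsModel implementation):

```python
import math

def fit_heaps(m_tokens, n_types):
    """Fit N = K * M**beta by least squares on log N = log K + beta * log M."""
    xs = [math.log(m) for m in m_tokens]
    ys = [math.log(n) for n in n_types]
    count = len(xs)
    xbar, ybar = sum(xs) / count, sum(ys) / count
    beta = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
           sum((x - xbar) ** 2 for x in xs)
    K = math.exp(ybar - beta * xbar)
    return K, beta

# exact power-law data recovers its own parameters
m = [100, 1000, 10000, 100000]
n = [10 * M ** 0.5 for M in m]
K, beta = fit_heaps(m, n)
assert abs(K - 10) < 1e-6 and abs(beta - 0.5) < 1e-9
```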

Demo App

Check out the demo app to explore the type-token and n-legomena counts of a few Project Gutenberg books.
