Legomena

Tool for exploring types, tokens, and n-legomena relationships in text. Based on the Davis 2019 [1] research paper.

Installation

pip install legomena

Data Sources

This package may be driven by any data source, but the author has tested two: the Natural Language Toolkit (NLTK) and the Standard Project Gutenberg Corpus (SPGC). The former is the gold standard for Python NLP applications but ships a rather measly 18-book Gutenberg corpus. The latter contains the full 55,000+ book Gutenberg corpus, already tokenized and counted. NOTE: Where the two datasets overlap, their exact type/token counts do not agree because their methodologies differ, but this package takes type/token counts as raw data and is therefore methodology-agnostic.

# moby dick from NLTK
import nltk
from legomena import Corpus
nltk.download("gutenberg")
from nltk.corpus import gutenberg
words = gutenberg.words("melville-moby_dick.txt")
corpus = Corpus(words)
# compare as a tuple: `assert a, b == c` only tests the truthiness of a
assert (corpus.M, corpus.N) == (260819, 19317)

# moby dick from SPGC
# NOTE: download and unzip https://zenodo.org/record/2422561/files/SPGC-counts-2018-07-18.zip into DATA_FOLDER
import pandas as pd
fname = "%s/SPGC-counts-2018-07-18/PG2701_counts.txt" % DATA_FOLDER
# the with-block closes the file on exit, so no explicit close() is needed
with open(fname) as f:
    df = pd.read_csv(f, delimiter="\t", header=None, names=["word", "freq"])
wfd = {str(row.word): int(row.freq) for row in df.itertuples()}
corpus = Corpus(wfd)
# compare as a tuple: `assert a, b == c` only tests the truthiness of a
assert (corpus.M, corpus.N) == (210258, 16402)
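Because the package takes type/token counts as raw data, any tokenizer can in principle feed it. As a minimal sketch that does not use the package itself, and assuming a naive lowercase regex tokenizer (the `tokenize` helper and the sample sentence are illustrative only, not part of legomena, NLTK, or SPGC), the token count M and type count N follow directly from a word frequency distribution:

```python
import re
from collections import Counter


def tokenize(text):
    """Hypothetical tokenizer: lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


text = (
    "Call me Ishmael. Some years ago, never mind how long precisely, "
    "call me what you like."
)
tokens = tokenize(text)

wfd = Counter(tokens)  # word -> frequency
M = sum(wfd.values())  # number of tokens
N = len(wfd)           # number of types (distinct words)
```

A different tokenizer (case-sensitive, punctuation-preserving, and so on) would give different M and N for the same text, which is presumably why the NLTK and SPGC counts for the same book disagree.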

Basic Usage

Demo notebooks may be found here. Unit tests may be found here.

# basic properties
corpus.tokens  # list of tokens
corpus.types  # list of types
corpus.fdist  # word frequency distribution dataframe
corpus.WFD  # alias for corpus.fdist
corpus.M  # number of tokens
corpus.N  # number of types
corpus.k  # n-legomena vector
corpus.k[n]  # n-legomena count (n=1 -> number of hapaxes)
corpus.hapax  # list of hapax legomena, alias for corpus.nlegomena(1)
corpus.dis  # list of dis legomena, alias for corpus.nlegomena(2)
corpus.tris  # list of tris legomena, alias for corpus.nlegomena(3)
corpus.tetrakis  # list of tetrakis legomena, alias for corpus.nlegomena(4)
corpus.pentakis  # list of pentakis legomena, alias for corpus.nlegomena(5)

# advanced properties
corpus.options  # tuple of optional settings
corpus.resolution  # number of samples to take to calculate TTR curve
corpus.dimension  # n-legomena vector length to pre-compute (max 6)
corpus.seed  # random number seed for sampling TTR data
corpus.TTR  # type-token ratio dataframe

# basic functions
corpus.nlegomena(n)  # n: int, list of types occurring exactly n times
corpus.sample(m)  # m: int, samples m tokens from corpus *without replacement*
corpus.sample(x)  # x: float, samples proportion x of corpus *without replacement*
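The properties and functions above can be approximated in plain Python, which may help clarify what they compute. The sketch below does not call legomena; `nlegomena_counts`, `nlegomena`, and `ttr_curve` are hypothetical helpers, and the evenly spaced sample sizes are an assumption about how a type-token curve might be sampled, not a description of the package's internals.

```python
import random
from collections import Counter


def nlegomena_counts(tokens, max_n=5):
    """Return a list k where k[n] is the number of types occurring exactly n times.

    k[0] is a placeholder set to 0; what the package stores there is not assumed.
    """
    freq = Counter(tokens)             # type -> frequency
    spectrum = Counter(freq.values())  # frequency -> number of types
    return [0] + [spectrum.get(n, 0) for n in range(1, max_n + 1)]


def nlegomena(tokens, n):
    """Return the sorted list of types occurring exactly n times."""
    freq = Counter(tokens)
    return sorted(word for word, count in freq.items() if count == n)


def ttr_curve(tokens, resolution=5, seed=None):
    """Sample without replacement at evenly spaced sizes.

    Returns a list of (m_tokens, n_types) pairs; the last pair covers the whole text.
    """
    rng = random.Random(seed)
    total = len(tokens)
    curve = []
    for i in range(1, resolution + 1):
        m = total * i // resolution
        sample = rng.sample(tokens, m)  # no token position is drawn twice
        curve.append((m, len(set(sample))))
    return curve


tokens = "the whale the sea the ship a whale a sea".split()
k = nlegomena_counts(tokens)           # [0, 1, 3, 1, 0, 0]
hapax = nlegomena(tokens, 1)           # ['ship']
curve = ttr_curve(tokens, resolution=5, seed=42)
```

Sampling without replacement means the full-size sample is just a permutation of the text, so the final point of the curve always equals the corpus's own (M, N).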

Type-Token Models

There are a variety of models in the literature predicting the number of types as a function of the number of tokens, the best known being Heaps' Law. A few are implemented here and can be overlaid on the type-token data provided by the Corpus class.

# three models
from legomena import HeapsModel, InfSeriesModel, LogModel
model = HeapsModel()  # Heaps' Law
model = InfSeriesModel(corpus)  # Infinite Series Model [1]
model = LogModel()  # Logarithmic Model [1]

# model fitting
m_tokens = corpus.TTR.m_tokens
n_types = corpus.TTR.n_types
model.fit(m_tokens, n_types)
predictions = model.fit_predict(m_tokens, n_types)

# model parameters
model.params

# model predictions
predictions = model.predict(m_tokens)

# log model only
dim = corpus.dimension
predicted_k = model.predict_k(m_tokens, dim)
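For orientation, Heaps' Law is commonly written as N = K * M^B, where M is the number of tokens, N the number of types, and K and B are fitted constants. The sketch below is one simple way to fit it, by ordinary least squares on log-transformed data using only the standard library. It is not necessarily how `HeapsModel` is implemented; `fit_heaps` and `predict_heaps` are hypothetical names, and the data are synthetic.

```python
import math


def fit_heaps(m_tokens, n_types):
    """Fit log N = log K + B * log M by least squares; return (K, B)."""
    xs = [math.log(m) for m in m_tokens]
    ys = [math.log(n) for n in n_types]
    count = len(xs)
    x_mean = sum(xs) / count
    y_mean = sum(ys) / count
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    B = sxy / sxx                        # slope in log-log space
    K = math.exp(y_mean - B * x_mean)    # intercept mapped back from log space
    return K, B


def predict_heaps(m_tokens, K, B):
    """Predict the number of types for each token count."""
    return [K * m ** B for m in m_tokens]


# synthetic data generated from K = 10, B = 0.5 (not taken from any real corpus)
m_tokens = [100, 400, 900, 1600, 2500]
n_types = [10 * m ** 0.5 for m in m_tokens]

K, B = fit_heaps(m_tokens, n_types)
predictions = predict_heaps(m_tokens, K, B)
```

Fitting in log space weights relative rather than absolute errors, so a nonlinear least-squares fit on the raw counts could give somewhat different parameters on real data.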

Demo App

Check out the demo app to explore the type-token and n-legomena counts of a few Project Gutenberg books.

Release history

1.2.0 (this version): Oct 22, 2019
1.1.0: Oct 18, 2019
1.0.0: Oct 16, 2019

