Compute Moving-Average Type-Token Ratio (MATTR) across POS-based token categories for corpus linguistics research.

posLD

posLD provides a Python function, compute_LD, that computes a measure of lexical diversity, the Moving-Average Type-Token Ratio (MATTR), over all tokens and over part-of-speech categories (content words, verbs, and nouns) for corpus linguistics research.


Install and Usage

First, install the package using pip:

pip install posLD

You will also need a spaCy English model. The default is en_core_web_lg:

python -m spacy download en_core_web_lg

Other available models and installation instructions: https://spacy.io/usage
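Because compute_LD depends on a spaCy model being installed, it can help to check for the model before processing a large folder. The snippet below is a minimal sketch, not part of posLD: the helper name model_installed is hypothetical, and it relies only on the fact that spaCy models downloaded with `python -m spacy download` are installed as importable Python packages.

```python
import importlib.util


def model_installed(model_name: str = "en_core_web_lg") -> bool:
    """Return True if a package with this name can be imported.

    Hypothetical helper (not part of posLD). spaCy models are installed
    as regular Python packages, so a missing model has no import spec.
    """
    return importlib.util.find_spec(model_name) is not None


if __name__ == "__main__":
    name = "en_core_web_lg"
    if model_installed(name):
        print(f"{name} is available")
    else:
        print(f"{name} is missing; run: python -m spacy download {name}")
```

The same check works for any alternative model name passed through the spacy_model argument.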

Once installed, import the package and call the function, replacing "path/to/your/folder" with the full path to your folder containing .txt files:

import posLD

posLD.compute_LD(input_path="path/to/your/folder")

You can also run it from the terminal:

posLD data/texts/

Function Arguments

| Argument | Required | Description |
|---|---|---|
| input_path | Yes | Folder containing .txt files to process |
| output_file | No | Path/name of the output CSV file. Defaults to "LD_results.csv" |
| window_size | No | Number of tokens in each sliding window. Defaults to 50. Texts shorter than this receive NA |
| increment | No | Number of tokens to advance the window per step. Defaults to 1 |
| spacy_model | No | spaCy model for POS tagging. Defaults to "en_core_web_lg". Alternatives: "en_core_web_sm", "en_core_web_md", "en_core_web_trf" |
| individual_output | No | If True, writes a word list per text to an individual_output/ subfolder. Defaults to True |
| form | No | Token form to compute MATTR on: "lemma", "surface", or "both". Defaults to "lemma" |

Examples

# Defaults: lemma form, window size 50, increment 1, en_core_web_lg
posLD.compute_LD(input_path="path/to/your/folder")

# Surface form only
posLD.compute_LD(input_path="path/to/your/folder", form="surface")

# Both lemma and surface, custom window size and increment
posLD.compute_LD(input_path="path/to/your/folder", form="both", window_size=100, increment=5)

CLI Arguments

| Argument | Required | Description |
|---|---|---|
| input_path | Yes | Path to folder with .txt files |
| -o, --output | No | Output CSV filename (default: LD_results.csv) |
| -w, --window-size | No | Sliding window size (default: 50) |
| -i, --increment | No | Window increment (default: 1) |
| -m, --model | No | spaCy model to use (default: en_core_web_lg) |
| -f, --form | No | Token form: lemma, surface, or both (default: lemma) |
| --no-individual | No | Skip writing individual output files |

Examples

# Defaults
posLD "path/to/your/folder"

# Surface form, window size 100
posLD "path/to/your/folder" -f surface -w 100

# Both forms, custom output
posLD "path/to/your/folder" -f both -o results/LD_results.csv

Features

  • Accepts a folder of .txt files
  • Computes MATTR using a sliding window over tokens
  • Excludes non-alphabetic tokens from all calculations
  • Excludes the verb be from verb and content word lists
  • Excludes proper nouns (e.g., London, Terry) from noun and content word lists
  • Reports NA when a text has fewer tokens than the window size
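To make the sliding-window idea concrete, here is a minimal sketch of a MATTR calculation. It is an illustration under stated assumptions, not posLD's actual implementation: the function name mattr is hypothetical, None stands in for the NA that posLD reports, and the handling of a final partial window (simply never formed here) is an assumption.

```python
from typing import Optional, Sequence


def mattr(tokens: Sequence[str], window_size: int = 50,
          increment: int = 1) -> Optional[float]:
    """Mean type-token ratio over sliding windows of fixed size.

    Hypothetical sketch, not posLD's code. Returns None (standing in
    for NA) when the text has fewer tokens than the window size.
    """
    if window_size <= 0 or increment <= 0:
        raise ValueError("window_size and increment must be positive")
    if len(tokens) < window_size:
        return None

    ratios = []
    # Slide the window from the start, advancing `increment` tokens per step.
    for start in range(0, len(tokens) - window_size + 1, increment):
        window = tokens[start:start + window_size]
        # Type-token ratio: distinct tokens divided by tokens in the window.
        ratios.append(len(set(window)) / window_size)
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    sample = ["a", "a", "b", "c"]
    print(mattr(sample, window_size=2, increment=1))
    print(mattr(sample, window_size=2, increment=2))
```

A larger increment visits fewer windows, which is faster but averages over a coarser sample of the text.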

posLD computes MATTR for four token categories (the output columns depend on the form argument):

| Category | Includes | Excludes |
|---|---|---|
| all | All alphabetic tokens | |
| content | Common NOUN, VERB (excl. be), ADJ, ADV | Proper nouns, be |
| verb | VERB only | be |
| noun | Common NOUN only | Proper nouns |
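The category rules in the table can be sketched as a simple filter over tagged tokens. This is an illustration only, not posLD's code: the function split_categories and the (surface, lemma, POS) tuple layout are hypothetical, the POS labels follow the Universal POS tag names that spaCy exposes (NOUN, VERB, ADJ, ADV, PROPN, AUX), and lowercasing is an assumption made here so that "The" and "the" count as one type.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical token layout: (surface form, lemma, Universal POS tag).
Token = Tuple[str, str, str]

CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}


def split_categories(tokens: Iterable[Token],
                     form: str = "lemma") -> Dict[str, List[str]]:
    """Sort tokens into the all/content/verb/noun lists (illustrative sketch)."""
    index = 1 if form == "lemma" else 0
    out: Dict[str, List[str]] = {"all": [], "content": [], "verb": [], "noun": []}
    for tok in tokens:
        surface, lemma, pos = tok
        if not surface.isalpha():
            continue  # non-alphabetic tokens are excluded everywhere
        value = tok[index].lower()  # lowercasing is an assumption of this sketch
        out["all"].append(value)
        if pos == "VERB" and lemma.lower() == "be":
            continue  # "be" is kept out of the verb and content lists
        if pos in CONTENT_POS:  # PROPN is not in this set, so proper nouns drop out
            out["content"].append(value)
        if pos == "VERB":
            out["verb"].append(value)
        elif pos == "NOUN":
            out["noun"].append(value)
    return out


if __name__ == "__main__":
    tagged = [("London", "London", "PROPN"), ("is", "be", "AUX"),
              ("big", "big", "ADJ"), ("cities", "city", "NOUN"),
              ("grow", "grow", "VERB"), (".", ".", "PUNCT")]
    for name, words in split_categories(tagged).items():
        print(name, words)
```

Each resulting list would then be scored separately, which is why a text can receive NA for the verb or noun category even when its overall token count exceeds the window size.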

Output CSV Columns

Columns included depend on the form argument. With form="lemma" (default):

| Column | Description |
|---|---|
| filename | Name of the input .txt file |
| MATTR(N)_all_lemma | MATTR over all lemmatized tokens |
| MATTR(N)_content_lemma | MATTR over content word lemmas |
| MATTR(N)_verb_lemma | MATTR over verb lemmas |
| MATTR(N)_noun_lemma | MATTR over noun lemmas |

With form="surface", the same four columns appear with _surface instead of _lemma. With form="both", all eight columns are included.

N = the window size used (default 50)
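When loading the results in a script, it can be useful to derive the column names from the settings used. The sketch below is hypothetical and rests on two assumptions that should be checked against a real output file: that N is written literally inside the parentheses (for example MATTR(50)_all_lemma), and that with form="both" the lemma columns come before the surface columns. The helper names expected_columns and parse_score are not part of posLD.

```python
from typing import List, Optional

CATEGORIES = ["all", "content", "verb", "noun"]


def expected_columns(window_size: int = 50, form: str = "lemma") -> List[str]:
    """Build the assumed CSV header for a given window size and form."""
    forms = ["lemma", "surface"] if form == "both" else [form]
    return ["filename"] + [
        f"MATTR({window_size})_{category}_{token_form}"
        for token_form in forms          # column order is an assumption
        for category in CATEGORIES
    ]


def parse_score(cell: str) -> Optional[float]:
    """Convert a CSV cell to a float, mapping the NA marker to None."""
    cell = cell.strip()
    return None if cell == "NA" else float(cell)


if __name__ == "__main__":
    print(expected_columns())
    print(expected_columns(window_size=100, form="both"))
```

Mapping NA to None keeps short texts from being silently treated as zeros in later averages.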


Individual Output Files

When individual_output=True (the default), a subfolder individual_output/ is created next to your CSV. Each .txt file produces a corresponding _output.txt file containing a word list: all tokens in original order, with VERB and NOUN POS tags shown.


License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


Citation

If you use posLD in your research, please cite it as (To be updated)
