Compute Moving-Average Type-Token Ratio (MATTR) across POS-based token categories for corpus linguistics research.

posLD

posLD provides a Python function, compute_LD, that computes a measure of lexical diversity (Moving-Average Type-Token Ratio; MATTR) for corpus linguistics research. MATTR is computed separately for part-of-speech-based token categories: all tokens, content words, verbs, and nouns.


Install and Usage

First, install the package using pip:

pip install posLD

You will also need a spaCy English model. The default is en_core_web_lg:

python -m spacy download en_core_web_lg

Other available models and installation instructions: https://spacy.io/usage

Once installed, import the package and call the function, replacing "path/to/your/folder" with the full path to your folder containing .txt files:

import posLD

posLD.compute_LD(input_path="path/to/your/folder")

You can also run it from the terminal:

posLD data/texts/

Function Arguments

| Argument | Required | Description |
|---|---|---|
| input_path | Yes | Folder containing .txt files to process |
| output_file | No | Path/name of the output CSV file. Defaults to "LD_results.csv" |
| window_size | No | Number of tokens in each sliding window. Defaults to 50. Texts shorter than this receive NA |
| increment | No | Number of tokens to advance the window per step. Defaults to 1 |
| spacy_model | No | spaCy model for POS tagging. Defaults to "en_core_web_lg". Alternatives: "en_core_web_sm", "en_core_web_md", "en_core_web_trf" |
| individual_output | No | If True, writes a word list per text to an individual_output/ subfolder. Defaults to True |
| form | No | Token form to compute MATTR on: "lemma", "surface", or "both". Defaults to "lemma" |
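
To make the roles of window_size and increment concrete, here is a minimal sketch of a moving-average type-token ratio in plain Python. It is an illustration of the general technique, not posLD's own code: the function name mattr and the choice to return None for short texts (where posLD reports NA) are assumptions made for this example.

```python
from typing import Optional, Sequence


def mattr(tokens: Sequence[str], window_size: int = 50, increment: int = 1) -> Optional[float]:
    """Mean type-token ratio over sliding windows; None if the text is too short."""
    if window_size <= 0 or increment <= 0:
        raise ValueError("window_size and increment must be positive")
    if len(tokens) < window_size:
        return None  # assumption: stands in for the NA that posLD reports
    ratios = []
    # Window start positions: 0, increment, 2 * increment, ... while a full window fits.
    for start in range(0, len(tokens) - window_size + 1, increment):
        window = tokens[start:start + window_size]
        # Type-token ratio of one window: distinct tokens / tokens in the window.
        ratios.append(len(set(window)) / window_size)
    return sum(ratios) / len(ratios)


tokens = "the cat saw the dog and the dog saw the cat".split()
print(mattr(tokens, window_size=4, increment=1))
print(mattr(tokens, window_size=50))  # None: fewer tokens than the window
```

A larger increment evaluates fewer windows, which is faster but averages over a coarser sample of the text.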

Examples

# Defaults: lemma form, window size 50, increment 1, en_core_web_lg
posLD.compute_LD(input_path="path/to/your/folder")

# Surface form only
posLD.compute_LD(input_path="path/to/your/folder", form="surface")

# Both lemma and surface, custom window size and increment
posLD.compute_LD(input_path="path/to/your/folder", form="both", window_size=100, increment=5)

CLI Arguments

| Argument | Required | Description |
|---|---|---|
| input_path | Yes | Path to folder with .txt files |
| -o, --output | No | Output CSV filename (default: LD_results.csv) |
| -w, --window-size | No | Sliding window size (default: 50) |
| -i, --increment | No | Window increment (default: 1) |
| -m, --model | No | spaCy model to use (default: en_core_web_lg) |
| -f, --form | No | Token form: lemma, surface, or both (default: lemma) |
| --no-individual | No | Skip writing individual output files |

Examples

# Defaults
posLD "path/to/your/folder"

# Surface form, window size 100
posLD "path/to/your/folder" -f surface -w 100

# Both forms, custom output
posLD "path/to/your/folder" -f both -o results/LD_results.csv
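
The CLI flags correspond one-to-one to the compute_LD arguments listed earlier. The sketch below shows that mapping with argparse. It is not the package's actual command-line code; build_parser and to_kwargs are hypothetical helpers written for this example, and only the flag names and defaults come from the tables above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented options with their documented defaults."""
    parser = argparse.ArgumentParser(prog="posLD")
    parser.add_argument("input_path", help="Path to folder with .txt files")
    parser.add_argument("-o", "--output", default="LD_results.csv")
    parser.add_argument("-w", "--window-size", type=int, default=50)
    parser.add_argument("-i", "--increment", type=int, default=1)
    parser.add_argument("-m", "--model", default="en_core_web_lg")
    parser.add_argument("-f", "--form", choices=["lemma", "surface", "both"], default="lemma")
    parser.add_argument("--no-individual", action="store_true")
    return parser


def to_kwargs(args: argparse.Namespace) -> dict:
    """Map parsed CLI options onto the compute_LD keyword arguments."""
    return {
        "input_path": args.input_path,
        "output_file": args.output,
        "window_size": args.window_size,
        "increment": args.increment,
        "spacy_model": args.model,
        "individual_output": not args.no_individual,  # the flag switches the default off
        "form": args.form,
    }


args = build_parser().parse_args(["path/to/your/folder", "-f", "surface", "-w", "100"])
print(to_kwargs(args))
```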

Features

  • Accepts a folder of .txt files
  • Computes MATTR using a sliding window over tokens
  • Excludes non-alphabetic tokens from all calculations
  • Excludes the verb be from verb and content word lists
  • Excludes proper nouns (e.g., London, Terry) from noun and content word lists
  • Reports NA when a text has fewer tokens than the window size

posLD computes MATTR for four token categories (the output columns depend on the form argument):

| Category | Includes | Excludes |
|---|---|---|
| all | All alphabetic tokens | |
| content | Common NOUN, VERB (excl. be), ADJ, ADV | Proper nouns, be |
| verb | VERB only | be |
| noun | Common NOUN only | Proper nouns |
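
The category rules in the table can be sketched as a filter over tagged tokens. This example does not load spaCy and is not posLD's implementation: the Tok tuple and the categorize function are hypothetical, lowercasing the forms is an assumption, and the tags follow the coarse scheme spaCy exposes (common nouns as NOUN, proper nouns as PROPN).

```python
from typing import Dict, List, NamedTuple


class Tok(NamedTuple):
    text: str   # surface form
    lemma: str  # lemmatized form
    pos: str    # coarse POS tag, e.g. "NOUN", "PROPN", "VERB", "AUX"


CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}


def categorize(tokens: List[Tok], form: str = "lemma") -> Dict[str, List[str]]:
    """Split tokens into the four categories (forms lowercased by assumption)."""
    out: Dict[str, List[str]] = {"all": [], "content": [], "verb": [], "noun": []}
    for tok in tokens:
        if not tok.text.isalpha():
            continue  # non-alphabetic tokens are excluded from every category
        value = (tok.lemma if form == "lemma" else tok.text).lower()
        out["all"].append(value)
        is_be = tok.lemma.lower() == "be"
        if tok.pos in CONTENT_POS and not is_be:
            out["content"].append(value)
        if tok.pos == "VERB" and not is_be:
            out["verb"].append(value)
        if tok.pos == "NOUN":  # proper nouns carry the separate tag PROPN
            out["noun"].append(value)
    return out


sentence = [
    Tok("Terry", "Terry", "PROPN"), Tok("is", "be", "AUX"),
    Tok("reading", "read", "VERB"), Tok("2", "2", "NUM"),
    Tok("old", "old", "ADJ"), Tok("books", "book", "NOUN"), Tok(".", ".", "PUNCT"),
]
print(categorize(sentence))
```

Each resulting list would then be passed to the MATTR computation separately, which is why a text can receive a value for one category and NA for another.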

Output CSV Columns

Columns included depend on the form argument. With form="lemma" (default):

| Column | Description |
|---|---|
| filename | Name of the input .txt file |
| MATTR(N)_all_lemma | MATTR over all lemmatized tokens |
| MATTR(N)_content_lemma | MATTR over content word lemmas |
| MATTR(N)_verb_lemma | MATTR over verb lemmas |
| MATTR(N)_noun_lemma | MATTR over noun lemmas |

With form="surface", the same four columns appear with _surface instead of _lemma. With form="both", all eight columns are included.

In the column names, N is the window size used (default: 50).
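
For downstream analysis, the results file can be read with the standard csv module. The sketch below rests on assumptions that should be checked against a real output file: that N is replaced by the literal window size in the header (for example MATTR(50)_all_lemma), that lemma columns precede surface columns when form="both", and that missing values are written as the string NA. The helper names and the sample row are invented for illustration.

```python
import csv
import io
from typing import Dict, List


def expected_columns(window_size: int = 50, form: str = "lemma") -> List[str]:
    """Column names implied by the naming scheme (assumed, not verified)."""
    forms = ["lemma", "surface"] if form == "both" else [form]
    return ["filename"] + [
        f"MATTR({window_size})_{category}_{f}"
        for f in forms
        for category in ("all", "content", "verb", "noun")
    ]


def read_results(handle) -> List[Dict[str, object]]:
    """Parse rows, converting scores to float and 'NA' to None."""
    rows = []
    for row in csv.DictReader(handle):
        parsed: Dict[str, object] = {"filename": row["filename"]}
        for key, value in row.items():
            if key != "filename":
                parsed[key] = None if value == "NA" else float(value)
        rows.append(parsed)
    return rows


# Invented sample content, only to exercise the parser.
sample = io.StringIO(
    ",".join(expected_columns(50, "lemma")) + "\n"
    "a.txt,0.75,0.8,NA,0.9\n"
)
print(read_results(sample))
```

With a real file, pass an open file handle instead of the in-memory sample, for example open("LD_results.csv", newline="").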


Individual Output Files

When individual_output=True (the default), a subfolder individual_output/ is created next to your CSV. Each input .txt file produces a corresponding _output.txt file containing a word list: all tokens in original order, with VERB and NOUN POS tags shown.
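
As a rough guide to where these files end up, the sketch below derives the expected location with pathlib. The exact file name is an assumption: it takes "_output.txt" to be appended to the input file's stem, which the description suggests but does not spell out. The function individual_output_path is hypothetical and not part of posLD.

```python
from pathlib import Path


def individual_output_path(output_file: str, input_txt: str) -> Path:
    """Expected word-list path for one input file (naming scheme assumed)."""
    csv_path = Path(output_file)
    # The individual_output/ subfolder sits next to the CSV file.
    return csv_path.parent / "individual_output" / f"{Path(input_txt).stem}_output.txt"


print(individual_output_path("results/LD_results.csv", "data/texts/essay1.txt"))
```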


License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


Citation

If you use posLD in your research, please cite it. (Citation details to be updated.)
