Compute Moving-Average Type-Token Ratio (MATTR) across POS-based token categories for corpus linguistics research.
posLD
posLD provides a Python function, compute_LD, that computes a measure of lexical diversity, the Moving-Average Type-Token Ratio (MATTR), over all tokens and over part-of-speech-based categories (content words, verbs, and nouns) for corpus linguistics research.
Install and Usage
First, install the package using pip:
```bash
pip install posLD
```
You will also need a spaCy English model. The default is en_core_web_lg:
```bash
python -m spacy download en_core_web_lg
```
Other available models and installation instructions: https://spacy.io/usage
Once installed, import the package and call the function, replacing "path/to/your/folder" with the full path to your folder containing .txt files:
```python
import posLD
posLD.compute_LD(input_path="path/to/your/folder")
```
You can also run it from the terminal:
```bash
posLD data/texts/
```
Function Arguments
| Argument | Required | Description |
|---|---|---|
| `input_path` | Yes | Folder containing .txt files to process |
| `output_file` | No | Path/name of the output CSV file. Defaults to "LD_results.csv" |
| `window_size` | No | Number of tokens in each sliding window. Defaults to 50. Texts shorter than this receive NA |
| `increment` | No | Number of tokens to advance the window per step. Defaults to 1 |
| `spacy_model` | No | spaCy model for POS tagging. Defaults to "en_core_web_lg". Alternatives: "en_core_web_sm", "en_core_web_md", "en_core_web_trf" |
| `individual_output` | No | If True, writes a word list per text to an individual_output/ subfolder. Defaults to True |
| `form` | No | Token form to compute MATTR on: "lemma", "surface", or "both". Defaults to "lemma" |
Examples
```python
# Defaults: lemma form, window size 50, increment 1, en_core_web_lg
posLD.compute_LD(input_path="path/to/your/folder")

# Surface form only
posLD.compute_LD(input_path="path/to/your/folder", form="surface")

# Both lemma and surface, custom window size and increment
posLD.compute_LD(input_path="path/to/your/folder", form="both", window_size=100, increment=5)
```
CLI Arguments
| Argument | Required | Description |
|---|---|---|
| `input_path` | Yes | Path to folder with .txt files |
| `-o`, `--output` | No | Output CSV filename (default: LD_results.csv) |
| `-w`, `--window-size` | No | Sliding window size (default: 50) |
| `-i`, `--increment` | No | Window increment (default: 1) |
| `-m`, `--model` | No | spaCy model to use (default: en_core_web_lg) |
| `-f`, `--form` | No | Token form: lemma, surface, or both (default: lemma) |
| `--no-individual` | No | Skip writing individual output files |
Examples
```bash
# Defaults
posLD "path/to/your/folder"

# Surface form, window size 100
posLD "path/to/your/folder" -f surface -w 100

# Both forms, custom output
posLD "path/to/your/folder" -f both -o results/LD_results.csv
```
Features
- Accepts a folder of `.txt` files
- Computes MATTR using a sliding window over tokens
- Excludes non-alphabetic tokens from all calculations
- Excludes the verb be from verb and content word lists
- Excludes proper nouns (e.g., London, Terry) from noun and content word lists
- Reports `NA` when a text has fewer tokens than the window size
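The sliding-window calculation can be illustrated with a minimal sketch. This is not posLD's own code: the function name `mattr` is hypothetical, and it assumes the standard MATTR definition (the mean of the type-token ratios of every full window), with `None` standing in for the NA reported for texts shorter than the window.

```python
from typing import Optional, Sequence


def mattr(tokens: Sequence[str], window_size: int = 50, increment: int = 1) -> Optional[float]:
    """Illustrative MATTR: mean type-token ratio over all full sliding windows.

    Hypothetical helper, not part of the posLD API.
    """
    if window_size <= 0 or increment <= 0:
        raise ValueError("window_size and increment must be positive")

    n = len(tokens)
    if n < window_size:
        # Fewer tokens than the window size: no full window exists (NA).
        return None

    ttrs = []
    # Slide the window forward by `increment` tokens per step.
    for start in range(0, n - window_size + 1, increment):
        window = tokens[start:start + window_size]
        # Type-token ratio of one window: unique tokens / window length.
        ttrs.append(len(set(window)) / window_size)

    return sum(ttrs) / len(ttrs)


if __name__ == "__main__":
    sample = ["a", "a", "b", "c"]
    print(mattr(sample, window_size=2))               # windows: aa, ab, bc
    print(mattr(sample, window_size=2, increment=2))  # windows: aa, bc
    print(mattr(sample, window_size=10))              # too short: None
```

With `increment=1` every possible window contributes to the mean; a larger increment skips windows, which reduces computation at the cost of a coarser average.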
Computes MATTR for four token categories (columns depend on the form argument):
| Category | Includes | Excludes |
|---|---|---|
| `all` | All alphabetic tokens | None |
| `content` | Common NOUN, VERB (excl. be), ADJ, ADV | Proper nouns, be |
| `verb` | VERB only | be |
| `noun` | Common NOUN only | Proper nouns |
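The category rules in the table can be sketched in plain Python. This is an illustration under assumptions, not the package's implementation: the `Tok` tuple and `split_categories` function are hypothetical, tokens are assumed to carry Universal POS tags of the kind spaCy assigns (where proper nouns are `PROPN`, separate from `NOUN`), and lowercasing is an added assumption that the package may not share.

```python
from typing import Dict, Iterable, List, NamedTuple


class Tok(NamedTuple):
    """Hypothetical stand-in for a tagged token (text, lemma, Universal POS tag)."""
    text: str
    lemma: str
    pos: str


CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}


def split_categories(tokens: Iterable[Tok], form: str = "lemma") -> Dict[str, List[str]]:
    """Sort tokens into the four categories described in the table above."""
    cats: Dict[str, List[str]] = {"all": [], "content": [], "verb": [], "noun": []}
    for tok in tokens:
        # Non-alphabetic tokens are excluded from every category.
        if not tok.text.isalpha():
            continue
        # Assumption: compare forms case-insensitively.
        value = (tok.lemma if form == "lemma" else tok.text).lower()
        cats["all"].append(value)
        # The verb "be" stays in "all" but is excluded from the POS-based lists.
        if tok.lemma.lower() == "be":
            continue
        if tok.pos in CONTENT_POS:
            cats["content"].append(value)
        if tok.pos == "VERB":
            cats["verb"].append(value)
        if tok.pos == "NOUN":  # PROPN (proper nouns) is a different tag
            cats["noun"].append(value)
    return cats


if __name__ == "__main__":
    sentence = [
        Tok("London", "London", "PROPN"), Tok("is", "be", "AUX"),
        Tok("a", "a", "DET"), Tok("big", "big", "ADJ"),
        Tok("city", "city", "NOUN"), Tok(".", ".", "PUNCT"),
    ]
    print(split_categories(sentence))
```

Each resulting list would then be passed to the MATTR calculation separately, which is why a text can receive a value for `all` but NA for `verb` or `noun` when a category has fewer tokens than the window size.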
Output CSV Columns
Columns included depend on the form argument. With form="lemma" (default):
| Column | Description |
|---|---|
| `filename` | Name of the input .txt file |
| `MATTR(N)_all_lemma` | MATTR over all lemmatized tokens |
| `MATTR(N)_content_lemma` | MATTR over content word lemmas |
| `MATTR(N)_verb_lemma` | MATTR over verb lemmas |
| `MATTR(N)_noun_lemma` | MATTR over noun lemmas |
With form="surface", the same four columns appear with _surface instead of _lemma. With form="both", all eight columns are included.
N = the window size used (default 50)
Individual Output Files
When individual_output=True (the default), a subfolder individual_output/ is created next to your CSV. Each .txt file produces a corresponding _output.txt file containing a word list: all tokens in their original order, with POS tags shown for VERB and NOUN tokens.
License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Citation
If you use posLD in your research, please cite it as (To be updated)