Compute Moving-Average Type-Token Ratio (MATTR) across POS-based token categories for corpus linguistics research.

Development Status
- 3 - Alpha
Intended Audience
- Science/Research
License
- Other/Proprietary License
Programming Language
Topic
- Text Processing :: Linguistic

Project description

posLD

posLD provides a Python function, compute_LD, that computes a measure of lexical diversity, the Moving-Average Type-Token Ratio (MATTR), over all tokens and over part-of-speech-based categories (content words, verbs, and nouns) for corpus linguistics research.

Install and Usage

First, install the package using pip:

pip install posLD

You will also need a spaCy English model. The default is en_core_web_lg:

python -m spacy download en_core_web_lg

Other available models and installation instructions: https://spacy.io/usage

Once installed, import the package and call the function, replacing "path/to/your/folder" with the full path to your folder containing .txt files:

import posLD

posLD.compute_LD(input_path="path/to/your/folder")

You can also run it from the terminal:

posLD data/texts/

Function Arguments

Argument	Required	Description
`input_path`	Yes	Folder containing `.txt` files to process
`output_file`	No	Path/name of the output CSV file. Defaults to `"LD_results.csv"`
`window_size`	No	Number of tokens in each sliding window. Defaults to `50`. Texts shorter than this receive `NA`
`increment`	No	Number of tokens to advance the window per step. Defaults to `1`
`spacy_model`	No	spaCy model for POS tagging. Defaults to `"en_core_web_lg"`. Alternatives: `"en_core_web_sm"`, `"en_core_web_md"`, `"en_core_web_trf"`
`individual_output`	No	If `True`, writes a word list per text to an `individual_output/` subfolder. Defaults to `True`
`form`	No	Token form to compute MATTR on: `"lemma"`, `"surface"`, or `"both"`. Defaults to `"lemma"`

Examples

# Defaults: lemma form, window size 50, increment 1, en_core_web_lg
posLD.compute_LD(input_path="path/to/your/folder")

# Surface form only
posLD.compute_LD(input_path="path/to/your/folder", form="surface")

# Both lemma and surface, custom window size and increment
posLD.compute_LD(input_path="path/to/your/folder", form="both", window_size=100, increment=5)

CLI Arguments

Argument	Required	Description
`input_path`	Yes	Path to folder with `.txt` files
`-o`, `--output`	No	Output CSV filename (default: `LD_results.csv`)
`-w`, `--window-size`	No	Sliding window size (default: `50`)
`-i`, `--increment`	No	Window increment (default: `1`)
`-m`, `--model`	No	spaCy model to use (default: `en_core_web_lg`)
`-f`, `--form`	No	Token form: `lemma`, `surface`, or `both` (default: `lemma`)
`--no-individual`	No	Skip writing individual output files

Examples

# Defaults
posLD "path/to/your/folder"

# Surface form, window size 100
posLD "path/to/your/folder" -f surface -w 100

# Both forms, custom output
posLD "path/to/your/folder" -f both -o results/LD_results.csv

Features

- Accepts a folder of .txt files
- Computes MATTR using a sliding window over tokens
- Excludes non-alphabetic tokens from all calculations
- Excludes the verb be from verb and content word lists
- Excludes proper nouns (e.g., London, Terry) from noun and content word lists
- Reports NA when a text has fewer tokens than the window size
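
The sliding-window calculation described in the list above can be sketched in a few lines. This is a minimal illustration of the general MATTR technique, not posLD's own code: the function name `mattr` is hypothetical, and the choice to skip any trailing partial window is an assumption. The `window_size` and `increment` parameters mirror the documented arguments.

```python
from typing import Optional, Sequence


def mattr(tokens: Sequence[str], window_size: int = 50, increment: int = 1) -> Optional[float]:
    """Mean type-token ratio over sliding windows of `window_size` tokens.

    Returns None (the NA case) when the text has fewer tokens than the window size.
    Hypothetical helper for illustration; not part of the posLD API.
    """
    if window_size < 1 or increment < 1:
        raise ValueError("window_size and increment must be positive")
    if len(tokens) < window_size:
        return None
    ratios = []
    # Slide a full-size window from the start, advancing `increment` tokens per step.
    # Assumption: a trailing partial window is not scored.
    for start in range(0, len(tokens) - window_size + 1, increment):
        window = tokens[start:start + window_size]
        ratios.append(len(set(window)) / window_size)
    return sum(ratios) / len(ratios)


words = "the cat saw the dog and the dog saw the cat".split()
print(mattr(words, window_size=4))   # 0.875 (eight windows of four tokens)
print(mattr(words, window_size=50))  # None: fewer tokens than the window size
```

Because every window has the same length, the score is less sensitive to text length than a plain type-token ratio, which is the usual motivation for MATTR.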

Computes MATTR for four token categories (columns depend on the form argument):

Category	Includes	Excludes
`all`	All alphabetic tokens	—
`content`	Common NOUN, VERB (excl. be), ADJ, ADV	Proper nouns, be
`verb`	VERB only	be
`noun`	Common NOUN only	Proper nouns
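
The category rules in the table can be expressed as a small filter over tagged tokens. The sketch below is an illustration under stated assumptions, not posLD's implementation: the `categorize` function and the `(text, pos, lemma)` tuple layout are hypothetical, the tags are assumed to be Universal POS tags such as those in spaCy's `token.pos_`, and lowercasing is an assumption the README does not confirm.

```python
from typing import Dict, Iterable, List, Tuple

# (surface text, universal POS tag, lemma), e.g. from spaCy's
# token.text, token.pos_ and token.lemma_ (assumed layout).
Token = Tuple[str, str, str]


def categorize(tokens: Iterable[Token], form: str = "lemma") -> Dict[str, List[str]]:
    """Split tagged tokens into the four token lists used for MATTR.

    Hypothetical helper for illustration; not part of the posLD API.
    """
    if form not in ("lemma", "surface"):
        raise ValueError("form must be 'lemma' or 'surface'")
    lists: Dict[str, List[str]] = {"all": [], "content": [], "verb": [], "noun": []}
    for text, pos, lemma in tokens:
        if not text.isalpha():
            continue  # non-alphabetic tokens are excluded from every list
        # Assumption: forms are lowercased before counting types.
        item = (lemma if form == "lemma" else text).lower()
        lists["all"].append(item)
        if pos == "VERB" and lemma.lower() != "be":
            lists["verb"].append(item)
            lists["content"].append(item)
        elif pos == "NOUN":  # proper nouns carry the tag PROPN and are skipped
            lists["noun"].append(item)
            lists["content"].append(item)
        elif pos in ("ADJ", "ADV"):
            lists["content"].append(item)
    return lists


tagged = [
    ("Terry", "PROPN", "Terry"), ("is", "AUX", "be"), ("quickly", "ADV", "quickly"),
    ("reading", "VERB", "read"), ("2", "NUM", "2"), ("old", "ADJ", "old"),
    ("books", "NOUN", "book"), (".", "PUNCT", "."),
]
result = categorize(tagged)
print(result["content"])  # ['quickly', 'read', 'old', 'book']
```

Each of the four lists would then be scored separately with the sliding-window measure.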

Output CSV Columns

Columns included depend on the form argument. With form="lemma" (default):

Column	Description
`filename`	Name of the input `.txt` file
`MATTR(N)_all_lemma`	MATTR over all lemmatized tokens
`MATTR(N)_content_lemma`	MATTR over content word lemmas
`MATTR(N)_verb_lemma`	MATTR over verb lemmas
`MATTR(N)_noun_lemma`	MATTR over noun lemmas

With form="surface", the same four columns appear with _surface instead of _lemma. With form="both", all eight columns are included.

In the column names, N is the window size used (default 50), so the default run produces columns such as MATTR(50)_all_lemma.
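
For downstream analysis, the naming scheme above makes the column names predictable. The following sketch builds the expected header for a given window size and form, then parses a results file while mapping NA to None. The helper names `expected_columns` and `read_results` are hypothetical, the column order for form="both" is an assumption, and the sample rows are invented purely for illustration.

```python
import csv
import io
from typing import Dict, List

CATEGORIES = ("all", "content", "verb", "noun")


def expected_columns(window_size: int = 50, form: str = "lemma") -> List[str]:
    """Column names implied by the README's MATTR(N)_<category>_<form> scheme."""
    forms = {"lemma": ["lemma"], "surface": ["surface"], "both": ["lemma", "surface"]}[form]
    # Assumption: with form="both", lemma columns come before surface columns.
    return ["filename"] + [
        f"MATTR({window_size})_{category}_{f}" for f in forms for category in CATEGORIES
    ]


def read_results(handle) -> List[Dict[str, object]]:
    """Parse a results CSV, turning NA into None and scores into floats."""
    rows = []
    for row in csv.DictReader(handle):
        parsed: Dict[str, object] = {}
        for key, value in row.items():
            if key == "filename":
                parsed[key] = value
            else:
                parsed[key] = None if value == "NA" else float(value)
        rows.append(parsed)
    return rows


# Invented sample content standing in for LD_results.csv.
sample = ",".join(expected_columns()) + "\nshort.txt,NA,NA,NA,NA\nlong.txt,0.75,0.5,0.5,0.25\n"
rows = read_results(io.StringIO(sample))
print(rows[0]["MATTR(50)_all_lemma"])  # None
print(rows[1]["MATTR(50)_all_lemma"])  # 0.75
```

To read a real output file, pass `open("LD_results.csv", newline="", encoding="utf-8")` in place of the `io.StringIO` object.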

Individual Output Files

When individual_output=True (the default), a subfolder individual_output/ is created next to your CSV. Each .txt file produces a corresponding _output.txt file containing a word list: all tokens in their original order, with VERB and NOUN POS tags shown.

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Citation

If you use posLD in your research, please cite it. The citation details are to be updated.

Release history

0.1.1 (May 7, 2026)

0.1.0 (May 7, 2026), this version

Download files

Source Distribution

posld-0.1.0.tar.gz (6.0 kB)

Uploaded May 7, 2026 Source

Built Distribution

posld-0.1.0-py3-none-any.whl (6.9 kB)

Uploaded May 7, 2026 Python 3

File details

Details for the file posld-0.1.0.tar.gz.

File metadata

Download URL: posld-0.1.0.tar.gz
Upload date: May 7, 2026
Size: 6.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.0

File hashes

Hashes for posld-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`a3c73e67bd18c056c3f097e68cfb3567088e48f423795bd9ae50b58a7742deb4`
MD5	`42abb059f835c09b16c47a0c6bdfcd50`
BLAKE2b-256	`dc55b54e73ee00967cdd112b0d512bd2d44c94ae43c0b3cb6b2b15517b630e4e`
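
A downloaded file can be checked against the SHA256 digest listed above using Python's standard hashlib module. This is a minimal sketch: the helper names `sha256_of` and `matches_published` are hypothetical, and the local path is an assumption about where the archive was saved.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected_hex: str) -> bool:
    """Compare a file's digest with a published hex digest, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()


# Digest copied from the table above; the path is an assumed download location.
EXPECTED = "a3c73e67bd18c056c3f097e68cfb3567088e48f423795bd9ae50b58a7742deb4"
archive = Path("posld-0.1.0.tar.gz")
if archive.exists():
    print("digest matches:", matches_published(archive, EXPECTED))
else:
    print("archive not found; download it first")
```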

File details

Details for the file posld-0.1.0-py3-none-any.whl.

File metadata

Download URL: posld-0.1.0-py3-none-any.whl
Upload date: May 7, 2026
Size: 6.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.0

File hashes

Hashes for posld-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`482eaf9d62767280f032c09f4c66f9658d51912914cb46a0a4d60b9b784fabe9`
MD5	`2666d1aa3a0c7c0649e7f9cad30d117b`
BLAKE2b-256	`f5b765612d0d908ea51f04fb899c560d944e73af3e87a70720c33fcefa26d2c7`

posLD 0.1.0
