Text preprocessing for downstream linguistic analyses

Python Linguistic Analysis Tools (pylats)

Pylats is designed to perform text preprocessing for further linguistic analyses (e.g., measuring lexical diversity and/or lexical sophistication). Currently, advanced features such as lemmatization and POS tagging are available for English, with other languages to follow: Spanish models will be added in the near future, and we plan to release other models as well. However, pylats can currently be used with other languages using basic features (see examples below).

Pylats currently uses spaCy as the starting point for many of its advanced features, and it can be adapted for use with any language that has a spaCy model. Our team has worked with English, Spanish, French, German, and Korean. Pylats has been tested using spaCy version 3.7. To install spaCy and a language model, see the spaCy installation instructions.

Installation

To install pylats, you can use pip:

pip install pylats

Getting Started

Import

from pylats import lats

Using pylats

pylats is designed to be the first step in conducting linguistic analyses using related analysis tools (such as lexical-diversity).

First, load a spaCy model using an appropriate language class. Pylats comes with a few pre-built language classes, and users can always create custom language classes.

lats.load_model(lats.EnSm) #English Small Model ("en_core_web_sm")

# class EnSm:
#     sp = True # use spaCy?
#     model = "en_core_web_sm" # which spaCy model to use
#     splitter = "\n" # paragraph separator
#     nlp = None # after load_model is run, this holds the loaded spaCy model named above
#     sspl = "spacy" # use spaCy to split sentences
#     punctse = [".","?","!"] # sentence-final punctuation used if spaCy does not split sentences
#     mwuDict = None # currently used in French, e.g. {"quelque+chose":{"idxHead":1,"uposNew":"PRON","xposNew":"PRON"}}
#     # In the French model, token.xpos and token.upos are both upos tags.
#     # For "idxHead", 0 means the first word is the syntactic head according to spaCy, 1 means the second word is the head, etc.
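As a rough sketch of what a custom language class might look like, the block below mirrors the EnSm attributes for Korean. The class name KoSm is hypothetical and not part of pylats; "ko_core_news_sm" is spaCy's small Korean pipeline and has to be installed separately. The sketch assumes that load_model only needs the attributes shown for EnSm above.

```python
class KoSm:
    """Hypothetical custom language class modeled on the EnSm attributes above."""
    sp = True                    # use spaCy
    model = "ko_core_news_sm"    # spaCy's small Korean pipeline (install separately)
    splitter = "\n"              # paragraph separator
    nlp = None                   # expected to be filled in when load_model is run
    sspl = "spacy"               # let spaCy split sentences
    punctse = [".", "?", "!"]    # fallback sentence-final punctuation
    mwuDict = None               # no multi-word units defined


# With pylats and the model installed, the class would be passed to load_model:
# from pylats import lats
# lats.load_model(KoSm)
```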

Second, create a version of your text that has been processed by spaCy using an appropriate language class (in this case lats.EnSm):

teststr = "I love pepperoni pizza."
preToks = lats.preProcess(teststr,lats.EnSm)

Finally, format the text with the Normalize class, here using lats.EnDefault, the default English word-definition parameters class. Each output token includes the lemma, the part-of-speech tag, and whether the word is a content word ("cw") or function word ("fw"):

normed = lats.Normalize(preToks,lats.EnDefault) #processed text string
print(normed.toks)

#output:
['i_PRON_fw', 'love_VERB_cw', 'pepperoni_NOUN_cw', 'pizza_NOUN_cw']
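Because each token is a single string, downstream code often needs to split it back into its parts. The following is a minimal sketch, not part of pylats: the Token tuple and parse_token helper are hypothetical names, and the code assumes the default "_" connector and that all three fields (lemma, tag, cw/fw) are present.

```python
from typing import NamedTuple


class Token(NamedTuple):
    lemma: str
    pos: str
    word_class: str  # "cw" (content word) or "fw" (function word)


def parse_token(tok: str, connector: str = "_") -> Token:
    # Split from the right so a connector inside the lemma is left untouched.
    lemma, pos, word_class = tok.rsplit(connector, 2)
    return Token(lemma, pos, word_class)


toks = ['i_PRON_fw', 'love_VERB_cw', 'pepperoni_NOUN_cw', 'pizza_NOUN_cw']
parsed = [parse_token(t) for t in toks]

# Keep only the lemmas of content words.
content_lemmas = [t.lemma for t in parsed if t.word_class == "cw"]
print(content_lemmas)  # ['love', 'pepperoni', 'pizza']
```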

Paragraphs and sentences

The .toks attribute provides a flat list of the tokens in a text. However, it is often useful to conduct analyses at the sentence and/or paragraph level. The .sents and .paras attributes represent the text as nested lists.

para_sample = """I love pepperoni pizza. Sometimes I like to add feta and banana peppers.
This is a second paragraph. In the original string there is a newline character before this paragraph."""

normedp = lats.Normalize(para_sample)

tokens

print(normedp.toks)

['i_PRON_fw', 'love_VERB_cw', 'pepperoni_NOUN_cw', 'pizza_NOUN_cw', 'sometimes_ADV_fw', 'i_PRON_fw', 'like_VERB_cw', 'to_PART_fw', 'add_VERB_cw', 'feta_NOUN_cw', 'and_CCONJ_fw', 'banana_NOUN_cw', 'pepper_NOUN_cw', 'this_PRON_fw', 'be_AUX_fw', 'a_DET_fw', 'second_ADJ_cw', 'paragraph_NOUN_cw', 'in_ADP_fw', 'the_DET_fw', 'original_ADJ_cw', 'string_NOUN_cw', 'there_PRON_fw', 'be_VERB_cw', 'a_DET_fw', 'character_NOUN_cw', 'before_ADP_fw', 'this_DET_fw', 'paragraph_NOUN_cw']
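A flat token list like this is the usual input for lexical diversity measures. As an illustration only, here is a plain type-token ratio computed over the tokens of the first paragraph; the type_token_ratio helper is a hypothetical name and is not the API of pylats or of the lexical-diversity package.

```python
def type_token_ratio(toks):
    """Number of distinct tokens divided by total tokens (0.0 for empty input)."""
    if not toks:
        return 0.0
    return len(set(toks)) / len(toks)


# Tokens of the first paragraph, copied from the output above.
first_para = [
    'i_PRON_fw', 'love_VERB_cw', 'pepperoni_NOUN_cw', 'pizza_NOUN_cw',
    'sometimes_ADV_fw', 'i_PRON_fw', 'like_VERB_cw', 'to_PART_fw',
    'add_VERB_cw', 'feta_NOUN_cw', 'and_CCONJ_fw', 'banana_NOUN_cw',
    'pepper_NOUN_cw',
]

print(len(set(first_para)), len(first_para))   # 12 13
print(round(type_token_ratio(first_para), 3))  # 0.923
```

Since the tag is part of each token string, forms such as 'this_PRON_fw' and 'this_DET_fw' count as different types.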

sentences

for x in normedp.sents:
	print(x) #print tokens in each sentence

['i_PRON_fw', 'love_VERB_cw', 'pepperoni_NOUN_cw', 'pizza_NOUN_cw']
['sometimes_ADV_fw', 'i_PRON_fw', 'like_VERB_cw', 'to_PART_fw', 'add_VERB_cw', 'feta_NOUN_cw', 'and_CCONJ_fw', 'banana_NOUN_cw', 'pepper_NOUN_cw']
['this_PRON_fw', 'be_AUX_fw', 'a_DET_fw', 'second_ADJ_cw', 'paragraph_NOUN_cw']
['in_ADP_fw', 'the_DET_fw', 'original_ADJ_cw', 'string_NOUN_cw', 'there_PRON_fw', 'be_VERB_cw', 'a_DET_fw', 'character_NOUN_cw', 'before_ADP_fw', 'this_DET_fw', 'paragraph_NOUN_cw']

paragraphs

for x in normedp.paras:
	print(x) #print sentences in each paragraph

[['i_PRON_fw', 'love_VERB_cw', 'pepperoni_NOUN_cw', 'pizza_NOUN_cw'], ['sometimes_ADV_fw', 'i_PRON_fw', 'like_VERB_cw', 'to_PART_fw', 'add_VERB_cw', 'feta_NOUN_cw', 'and_CCONJ_fw', 'banana_NOUN_cw', 'pepper_NOUN_cw']]
[['this_PRON_fw', 'be_AUX_fw', 'a_DET_fw', 'second_ADJ_cw', 'paragraph_NOUN_cw'], ['in_ADP_fw', 'the_DET_fw', 'original_ADJ_cw', 'string_NOUN_cw', 'there_PRON_fw', 'be_VERB_cw', 'a_DET_fw', 'character_NOUN_cw', 'before_ADP_fw', 'this_DET_fw', 'paragraph_NOUN_cw']]
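The nesting is paragraphs, then sentences, then tokens, so standard list tools are enough to move between levels. The sketch below copies the output above into a literal (so it runs without pylats) and derives sentence counts and lengths from it; the variable names are illustrative.

```python
from itertools import chain

# Nested structure copied from the .paras output above.
paras = [
    [['i_PRON_fw', 'love_VERB_cw', 'pepperoni_NOUN_cw', 'pizza_NOUN_cw'],
     ['sometimes_ADV_fw', 'i_PRON_fw', 'like_VERB_cw', 'to_PART_fw', 'add_VERB_cw',
      'feta_NOUN_cw', 'and_CCONJ_fw', 'banana_NOUN_cw', 'pepper_NOUN_cw']],
    [['this_PRON_fw', 'be_AUX_fw', 'a_DET_fw', 'second_ADJ_cw', 'paragraph_NOUN_cw'],
     ['in_ADP_fw', 'the_DET_fw', 'original_ADJ_cw', 'string_NOUN_cw', 'there_PRON_fw',
      'be_VERB_cw', 'a_DET_fw', 'character_NOUN_cw', 'before_ADP_fw', 'this_DET_fw',
      'paragraph_NOUN_cw']],
]

sents = list(chain.from_iterable(paras))  # flatten paragraphs into sentences
toks = list(chain.from_iterable(sents))   # flatten sentences into tokens

sent_lengths = [len(s) for s in sents]
mean_sent_length = sum(sent_lengths) / len(sent_lengths)

print(len(paras), len(sents), len(toks))  # 2 4 29
print(sent_lengths)                       # [4, 9, 5, 11]
print(mean_sent_length)                   # 7.25
```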

Default parameters

The default parameters for English are included below.

class EnDefault:
	lang = "en"
	punctuation = ['``', "''", "'", '.', ',', '?', '!', ')', '(', '%', '/', '-', '_', '-LRB-', '-RRB-', 'SYM', ':', ';', '"']
	punctse = [".","?","!"]
	abbrvs = ["mrs.","ms.","mr.","dr.","phd."]
	splitter = "\n" #for splitting paragraphs
	rwl = en_rwl #list of attested English words
	sp = True
	sspl = "spacy"
	removel = ['becuase'] #typos and other words not caught by the real words list
	attested = True #filter output using real words list?
	spaces = [" ","  ","   ","    "] #need to add more here
	override = [] #items the system ignores that should be overridden
	posignore = ["PROPN"] #ignore proper nouns
	numbers = ["NUM"] #pos_ tag for numbers
	wordConnect = "_"
	ngramConnect = "__" #for connecting ngrams
	contentPOS = ["VERB","NOUN","PROPN","ADJ","ADV"] #note that PROPN will be overridden by posignore in this case
	advMannerSuff = ["ly"]
	advMannerLex = ["well"]
	includeCwFw = True
	contentLemIgnore = [] #can be added, blank for now
	deprels = ["nsubj","dobj","amod","advmod"]
	depOrder = "dep2head" #options are "dep2head" or "orderofA"
	lemma = True
	lower = True #treat all words as lower case
	pos = "upos" #options are "pos" for Penn tags and "upos" for universal tags
	morphs = None #Currently used in Spanish and French
	morphsExtra = None #for more complicated situations; Currently used in French
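One way to vary a few settings without rewriting the whole class is to subclass it and override only what changes. The sketch below uses a small stand-in class with a handful of attributes copied from the listing above so that it runs without pylats installed; with pylats, the base class would be lats.EnDefault. The names EnDefaultStandIn and EnWordForms are hypothetical, and how Normalize responds to each overridden attribute is an assumption that should be checked against real output.

```python
class EnDefaultStandIn:
    """Stand-in holding a few attributes copied from EnDefault (not the real class)."""
    lemma = True
    lower = True
    includeCwFw = True
    posignore = ["PROPN"]


class EnWordForms(EnDefaultStandIn):
    """Hypothetical variant: keep word forms instead of lemmas, drop the cw/fw suffix."""
    lemma = False
    includeCwFw = False


# Overridden attributes change; everything else is inherited.
print(EnWordForms.lemma, EnWordForms.includeCwFw)  # False False
print(EnWordForms.lower, EnWordForms.posignore)    # True ['PROPN']

# With pylats installed, the variant would be passed in place of lats.EnDefault:
# normed = lats.Normalize(preToks, EnWordForms)
```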

Other languages

Any language with a spaCy model can be used with pylats with some adaptation. Currently, pre-built parameter settings are available for the following languages:

English

Use EnSm or EnTrf for preprocessing.

Use EnDefault for normalizing.

French

Use FrTrf for preprocessing.

Use FrDefault for normalizing.

German

Use DeTrf for preprocessing.

Use DeDefault for normalizing.

Spanish

Use EsTrf for preprocessing.

Use EsDefault for normalizing.
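The same three steps shown for English (load_model, preProcess, Normalize) should apply to the other pre-built classes. The helper below is a sketch of that workflow: run_pipeline is a hypothetical name, and the pylats module is passed in as an argument so that the block itself has no install-time dependency. It assumes the corresponding spaCy model is already installed.

```python
def run_pipeline(lats, text, lang_name, params_name):
    """Load a model, preprocess text, and normalize it with the named classes.

    lats        -- the imported pylats module (from pylats import lats)
    lang_name   -- name of a language class, e.g. "FrTrf"
    params_name -- name of a parameters class, e.g. "FrDefault"
    """
    lang_class = getattr(lats, lang_name)
    params = getattr(lats, params_name)
    lats.load_model(lang_class)                   # step 1: load the spaCy model
    pre_toks = lats.preProcess(text, lang_class)  # step 2: process the text with spaCy
    return lats.Normalize(pre_toks, params)       # step 3: format the tokens


# Usage, with pylats and the French model installed:
# from pylats import lats
# normed = run_pipeline(lats, "J'aime la pizza.", "FrTrf", "FrDefault")
# print(normed.toks)
```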

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Release history

0.64: Apr 10, 2026 (this version)
0.63.1: Mar 12, 2026
0.63: Mar 12, 2026
0.40: Oct 16, 2025
0.39: Oct 15, 2025
0.38: Oct 15, 2025
0.37: Jun 10, 2022
0.36: Jun 2, 2022
0.35: Jun 1, 2022
0.34: Jun 1, 2022
0.33: May 30, 2022
0.32: May 27, 2022
0.31: May 27, 2022
0.26: Apr 5, 2022
0.25: Apr 4, 2022
0.24: Feb 7, 2022
0.23: Feb 1, 2022
0.22: Feb 1, 2022
0.21: Feb 1, 2022
0.20: Feb 1, 2022
0.19: Feb 1, 2022
