
Text preprocessing for downstream linguistic analyses

Project description

Python Linguistic Analysis Tools (pylats)

Pylats is designed to perform text pre-processing for further linguistic analyses (e.g., measuring lexical diversity and/or lexical sophistication). Currently, advanced features such as lemmatization and POS tagging are available for English, with other languages to follow (Spanish models will be added in the near future, and we plan to release models for other languages as well). However, pylats CAN already be used with other languages via its basic features (see the examples below).

Pylats currently uses spacy as a starting point for many of its advanced features when working with English. Pylats was tested with spacy version 3.2 and by default uses the "en_core_web_sm" model. To install spacy and a language model, see the spacy installation instructions.

Installation

To install pylats, you can use pip:

pip install pylats
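
If you plan to use the advanced features (POS tagging, lemmatization), you will also need spacy and an English model. A typical setup (see the spacy installation instructions for details):

pip install spacy
python -m spacy download en_core_web_sm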

Getting Started

Import

from pylats import lats

Using pylats

pylats is designed to be the first step in conducting linguistic analyses using related analysis tools (such as lexical-diversity).

pylats uses the Normalize class to take a raw text string and format it:

teststr = "I love pepperoni pizza."
normed = lats.Normalize(teststr) #processed text string
print(normed.toks)
#output:
['i', 'love', 'pepperoni', 'pizza']
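
The normalized tokens can then be passed to a downstream tool. As a minimal sketch, assuming the lexical-diversity package is installed and exposes a lex_div module with a ttr function (check that package's documentation for its current API):

from lexical_diversity import lex_div as ld
print(ld.ttr(normed.toks)) #simple type-token ratio over the normalized tokens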

Paragraphs and sentences

The .toks attribute provides a flat list of the tokens in a text. However, it is often useful to conduct analyses at the sentence and/or paragraph level. The .sents and .paras attributes provide nested-list representations of the text.

teststr = """I love pepperoni pizza. Sometimes I like to add feta and banana peppers.
This is a second paragraph. In the original string there is a newline character before this paragraph."""

normedp = lats.Normalize(teststr) #processed text string

tokens

print(normedp.toks)
['i', 'love', 'pepperoni', 'pizza', 'sometimes', 'i', 'like', 'to', 'add', 'feta', 'and', 'banana', 'peppers', 'this', 'is', 'a', 'second', 'paragraph', 'in', 'the', 'original', 'string', 'there', 'is', 'a', 'newline', 'character', 'before', 'this', 'paragraph']

sentences

for x in normedp.sents:
	print(x) #print tokens in each sentence
['i', 'love', 'pepperoni', 'pizza']
['sometimes', 'i', 'like', 'to', 'add', 'feta', 'and', 'banana', 'peppers']
['this', 'is', 'a', 'second', 'paragraph']
['in', 'the', 'original', 'string', 'there', 'is', 'a', 'newline', 'character', 'before', 'this', 'paragraph']

paragraphs

for x in normedp.paras:
	print(x) #print sentences in each paragraph
[['i', 'love', 'pepperoni', 'pizza'], ['sometimes', 'i', 'like', 'to', 'add', 'feta', 'and', 'banana', 'peppers']]
[['this', 'is', 'a', 'second', 'paragraph'], ['in', 'the', 'original', 'string', 'there', 'is', 'a', 'newline', 'character', 'before', 'this', 'paragraph']]
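
Because the nested representations preserve sentence and paragraph boundaries, unit-level analyses are straightforward. For example, a simple type-token ratio (TTR) can be computed for each sentence in plain Python (using the normedp object from above):

for sent in normedp.sents:
	print(len(set(sent)) / len(sent)) #number of unique tokens divided by total tokens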

Changing parameters

By default, Normalize simply removes punctuation and converts words in the text to lower case. However, a wide range of customizations can be made by adjusting attributes of the parameters class.

For example, it may be useful to exclude particular words from some analyses. In studies of lexical diversity, for instance, we probably don't want to include misspelled words (misspelled words would artificially inflate diversity scores, but probably shouldn't count). Pylats includes a default list of "real" words, drawn from a large corpus of English, which can be used to filter out misspelled ones. Words can also be added to a list of items to remove, OR to an override list that takes precedence over the other lists.

Below, we create an instance of the parameters class and then change a setting:

new_params = lats.parameters() #create an instance of the parameters class
new_params.attested = True #set the attested attribute to True

Output with default settings:

#with default settings
msp_default = lats.Normalize("This is a smaple sentence")
print(msp_default.toks)
['this', 'is', 'a', 'smaple', 'sentence']

Output with new settings:

msp_new = lats.Normalize("This is a smaple sentence", params = new_params)
print(msp_new.toks)
['this', 'is', 'a', 'sentence']
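
The removel and override lists (see the default parameters below) can be adjusted in the same way. A minimal sketch, assuming we want to force out one word and keep another that the attested filter would otherwise drop:

custom_params = lats.parameters() #create an instance of the parameters class
custom_params.attested = True
custom_params.removel = custom_params.removel + ["pizza"] #always remove this item
custom_params.override = ["smaple"] #keep this item even if it is not on the real words list
msp_custom = lats.Normalize("This is a smaple sentence", params = custom_params)
print(msp_custom.toks) #'smaple' should now be retained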

Default parameters

class parameters:
	punctuation = ['``', "''", "'", '.', ',', '?', '!', ')', '(', '%', '/', '-', '_', '-LRB-', '-RRB-', 'SYM', ':', ';', '"']
	punctse = [".","?","!"]
	abbrvs = ["mrs.","ms.","mr.","dr.","phd."]
	splitter = "\n" #for splitting paragraphs
	rwl = realwords
	sp = True
	sspl = "spacy"
	pos = None #other options are "pos" for Penn tags and "upos" for universal tags
	removel = ['becuase'] #typos and other words not caught by the real words list
	lemma = False
	lower = True #treat all words as lower case
	attested = False #filter output using real words list?
	spaces = [" "] #need to add more here
	override = [] #items the system ignores that should be overridden
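
Any of these attributes can be changed on an instance. For example, a sketch that turns on lemmatization and preserves capitalization (this assumes spacy is installed, since lemmatization relies on the spacy pipeline):

lemma_params = lats.parameters() #create an instance of the parameters class
lemma_params.lemma = True #return lemmas instead of raw word forms
lemma_params.lower = False #preserve the original capitalization
lemma_sample = lats.Normalize("I like adding banana peppers.", params = lemma_params)
print(lemma_sample.toks) #e.g., 'peppers' should surface as the lemma 'pepper'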

Adding part of speech information

If spacy is installed (and loaded), part of speech tags can be added to each token, which can be useful for disambiguating homographic tokens (e.g., run as a verb in the sentence I like to run. versus run as a noun in the sentence I went for a run.). This is helpful in a number of applications, including calculating indices of lexical diversity.

pos_params = lats.parameters() #create an instance of the parameters class
pos_params.pos = "upos" #for coarse-grained universal parts of speech
run_sample = lats.Normalize("I like to run. I went for a run.", params = pos_params)
for x in run_sample.sents:
	print(x)
['i_PRON', 'like_VERB', 'to_PART', 'run_VERB']
['i_PRON', 'went_VERB', 'for_ADP', 'a_DET', 'run_NOUN']
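
Fine-grained Penn Treebank tags can be requested instead by setting pos to "pos" (the tag set differs, but the token_TAG output format should be the same):

penn_params = lats.parameters() #create an instance of the parameters class
penn_params.pos = "pos" #for fine-grained Penn Treebank tags
penn_sample = lats.Normalize("I like to run.", params = penn_params)
print(penn_sample.toks) #tokens tagged with Penn tags, e.g., run_VB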

Changing the spacy language model

To change the spacy language model that is used by pylats, first make sure that the desired model has been downloaded from spacy. Then, load the model:

#loading the "en_core_web_trf" model
lats.nlp = lats.spacy.load("en_core_web_trf")
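
The model can be downloaded with python -m spacy download en_core_web_trf. Once loaded, subsequent calls to lats.Normalize should use the new model:

trf_sample = lats.Normalize("I went for a run.", params = pos_params) #now processed with en_core_web_trf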

Using pylats with languages other than English

The early versions of pylats provide advanced features for English texts and basic features for other languages. As the tool expands, advanced feature support will be added for other languages (for example, advanced features for Spanish will be added in the near future).

To process texts with basic features only, simply set parameters.sp to False (as in the examples below). The processor will then treat any text between whitespace characters as a token; accordingly, some pre-processing may be necessary.

Example 1 (Spanish):

whtsp_params = lats.parameters() #create an instance of the parameters class
whtsp_params.sp = False #turn off spacy processing
span_sample = lats.Normalize("Me gustaría agua con gas", params = whtsp_params)
print(span_sample.toks)
['me', 'gustaría', 'agua', 'con', 'gas']

Example 2 (Korean):

whtsp_params = lats.parameters() #create an instance of the parameters class
whtsp_params.sp = False #turn off spacy processing
kor_sample = lats.Normalize("피자 좀 주세요", params = whtsp_params)
print(kor_sample.toks)
['피자', '좀', '주세요']
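
As a sketch of the kind of pre-processing that may be needed, punctuation attached to words can be stripped before normalization (a minimal example using Python's string module; substitute whatever cleaning your language requires):

import string
raw = "¡Me gustaría agua con gas!"
clean = raw.translate(str.maketrans("", "", string.punctuation + "¡¿")) #remove punctuation
span_clean = lats.Normalize(clean, params = whtsp_params)
print(span_clean.toks)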

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pylats-0.20.tar.gz (14.6 kB)

Uploaded Source

File details

Details for the file pylats-0.20.tar.gz.

File metadata

  • Download URL: pylats-0.20.tar.gz
  • Size: 14.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.10.1 pkginfo/1.8.2 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.11

File hashes

Hashes for pylats-0.20.tar.gz

Algorithm   Hash digest
SHA256      651c4d0b8550866e1ed9ad1377284eb435652c84c04beb655fd7b7bb6b75fbb4
MD5         ca48cdfd1367f3f92a96cd82b538ebc9
BLAKE2b-256 c102c4de2dda395d468b0443a51354fd5f0291cd1825c01d7012c4b0e9cf4379

