Lexicogrammatical tagging and tag counting tool

Lexicogrammatical Tagger (LxGrTgr)

Note that LxGrTgr is currently being beta tested and should not be used in research. Once the beta testing concludes, this message will change.

Quick Start Guide

LxGrTgr was developed using spaCy (version 3.5; en_core_web_trf model). Follow the instructions on spaCy's website to install spaCy for your system and to download the en_core_web_trf model.

Once spaCy is installed and the en_core_web_trf model has been downloaded, you can install LxGrTgr with pip:

pip install lxgrtgr

Demo site

In addition to using the code below, a demo web app (which uses a faster but slightly less accurate NLP backend) is also available.

Import LxGrTgr

First, import LxGrTgr:

import lxgrtgr as lxgr

Tag Strings and Print Output

Then, strings can be tagged and printed:

sample1 = lxgr.tag("This is a very important opportunity that only comes once in a lifetime.")
lxgr.printer(sample1)
0 This this None pro dem sg None None None None None None DT nsubj 1
1 is be None vbmain be pres simple active None None None None VBZ ROOT 1
2 a a None dt art None None None None None None None DT det 5
3 very very rb+adjmod|advmod rb othr None None None None None None None RB advmod 4
4 important important attr+nn+premod jj attr None None None None None None None JJ amod 5
5 opportunity opportunity None nn None nom None None None None None None NN attr 1
6 that that None relpro relpro_that sg None None None None None None WDT nsubj 8
7 only only rb+advl rb advl ly None None None None None None RB advmod 8
8 comes come nn+finite+relcl vbmain vblex pres simple active nmod_cls thatcls rel None VBZ relcl 5
9 once once rb+advl rb advl None None None None None None None RB advmod 8
10 in in None in in_othr None None None None None None None IN prep 9
11 a a None dt art None None None None None None None DT det 12
12 lifetime lifetime None nn None None None None None None None None NN pobj 10
13 . . None None None None None None None None None None . punct 1

These commands can also be combined into a single call:

lxgr.printer(lxgr.tag("This is a very important opportunity that only comes once in a lifetime."))

Write Output to File

Output can also be written to a file:

lxgr.writer("sample_results/sample1.tsv",sample1)
sample2 = lxgr.tag("I like pizza. I also enjoy eating it because it gives me a reason to drink beer.")
lxgr.writer("sample_results/sample2.tsv",sample2)

Batch Processing Corpora

Corpora come in all shapes and sizes. By default, LxGrTgr assumes that each corpus file is a UTF-8 text file and that all corpus files are in the same folder/directory.
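Before tagging, it can help to confirm that a corpus folder meets these assumptions. The following is a minimal sketch using only the Python standard library; find_unreadable_files is a hypothetical helper written for this illustration, not part of LxGrTgr.

```python
from pathlib import Path


def find_unreadable_files(target_dir, suff=".txt"):
    """Return the names of files ending in `suff` that are not valid UTF-8.

    Hypothetical helper for illustration only; it is not an LxGrTgr function.
    """
    problems = []
    for path in sorted(Path(target_dir).iterdir()):
        # Skip subfolders and files with a different filename ending.
        if not path.is_file() or not path.name.endswith(suff):
            continue
        try:
            path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            problems.append(path.name)
    return problems
```

Any filenames returned by this helper would need to be re-encoded as UTF-8 before the folder is tagged.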

Step 1: Tag Corpus Files

To tag a corpus with LxGrTgr, simply use the tagFolder() function.

tagFolder(targetDir,outputDir,suff = ".txt")

targetDir is the folder/directory where your corpus files are. outputDir is the folder where the tagged versions of your corpus files will be written.

An additional optional argument (suff) can also be used. By default, suff = ".txt". If your corpus filenames end in something other than ".txt", be sure to include the suff argument with the correct filename ending.

lxgr.tagFolder("folderWithCorpusFiles/","folderWhereTaggedVersionsWillBeWritten/")

Step 2: Check and Edit Tagged Corpus Files

Next, check the tagged files and correct any tagging errors before counting tags.
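For a large corpus, one way to approach checking is to review a random subset of the tagged files. The sketch below is an illustration under stated assumptions: sample_files_for_review is a hypothetical helper (LxGrTgr does not currently provide a sampling function), and the tagged files are assumed to end in ".txt".

```python
import random
from pathlib import Path


def sample_files_for_review(tagged_dir, n=5, suff=".txt", seed=0):
    """Pick up to `n` tagged files at random for manual checking.

    Hypothetical helper for illustration only; it is not an LxGrTgr function.
    A fixed seed makes the selection reproducible.
    """
    files = sorted(
        p.name
        for p in Path(tagged_dir).iterdir()
        if p.is_file() and p.name.endswith(suff)
    )
    rng = random.Random(seed)
    # Never ask for more files than the folder contains.
    return sorted(rng.sample(files, min(n, len(files))))
```

Recording the seed alongside the list of reviewed files makes it possible to reproduce the same sample later.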

Step 3: Counting Tags

After checking and editing the tags in your corpus, it is time to get tag counts for each document in your corpus using the countTagsFolder() function.

countTagsFolder(targetDir,tagList = None,suff = ".txt")

By default, complexity tags are counted. The countTagsFolder() function returns a dictionary with filenames as keys and feature counts as values.

sampleCountDictionary = lxgr.countTagsFolder("folderWhereTaggedVersionsWereWritten/")
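The returned dictionary can then be inspected with ordinary Python. The sketch below assumes a nested shape of {filename: {tag: count}}; the exact structure of the values, the filenames, and the tag names (tag_a, tag_b) are assumptions made for illustration, so check the real return value before relying on this.

```python
def total_by_tag(count_dict):
    """Sum each tag's count across all files.

    Assumes count_dict maps filename -> {tag: count}; this shape is an
    assumption, not something confirmed by the LxGrTgr documentation.
    """
    totals = {}
    for filename, counts in count_dict.items():
        for tag, value in counts.items():
            totals[tag] = totals.get(tag, 0) + value
    return totals


# Hypothetical data standing in for the output of countTagsFolder().
example_counts = {
    "doc1.txt": {"tag_a": 3, "tag_b": 1},
    "doc2.txt": {"tag_a": 2},
}
print(total_by_tag(example_counts))
```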

Step 4: Writing Tag Counts to a File

The writeCounts() function can be used to write the results to a file. By default, counts are normed as the incidence per 10,000 words, though this can be changed using the norming argument. Raw counts can be obtained by including normed = False.

writeCounts(outputD,outName, tagList = None, sep = "\t", normed = True,norming = 10000)

If the default options are desired, the writeCounts() function only needs two arguments: a dictionary of filenames and feature counts, and a filename for the output file:

lxgr.writeCounts(sampleCountDictionary,"sampleOutputFile.txt")
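The norming described above is simple arithmetic: a raw count is divided by the number of words in the document and multiplied by the norming constant. The sketch below shows that calculation; norm_count is a hypothetical helper, and how writeCounts() determines the word total internally is not documented here, so treat this as an illustration of the idea rather than the library's implementation.

```python
def norm_count(raw_count, n_words, norming=10000):
    """Convert a raw tag count to an incidence per `norming` words.

    Hypothetical helper for illustration only; it is not an LxGrTgr function.
    """
    if n_words <= 0:
        raise ValueError("n_words must be positive")
    return raw_count * norming / n_words


# A tag that occurs 12 times in a 2,400-word document.
print(norm_count(12, 2400))                # incidence per 10,000 words
print(norm_count(12, 2400, norming=1000))  # incidence per 1,000 words
```

Norming makes counts comparable across documents of different lengths, which is why it is the default; raw counts are mainly useful when the documents are already of similar length.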

Future Directions

Add more functions for random sampling and tag-fixing.

Tag Descriptions

We are currently developing tag descriptions and detailed annotation guidelines for complexity features. The guidelines document is updated and revised weekly.

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
