NeoSCA

Syntactic complexity analyzer of written English language samples

Description

NeoSCA is a rewrite of Xiaofei Lu's L2 Syntactic Complexity Analyzer, supporting Windows, macOS, and Linux.

Like L2SCA, NeoSCA takes written English language samples in plain text format as input and counts the frequency of the following 9 structures in the text:

  1. words (W)
  2. sentences (S)
  3. verb phrases (VP)
  4. clauses (C)
  5. T-units (T)
  6. dependent clauses (DC)
  7. complex T-units (CT)
  8. coordinate phrases (CP)
  9. complex nominals (CN)

and computes the following 14 syntactic complexity indices of the text:

  1. mean length of sentence (MLS)
  2. mean length of T-unit (MLT)
  3. mean length of clause (MLC)
  4. clauses per sentence (C/S)
  5. verb phrases per T-unit (VP/T)
  6. clauses per T-unit (C/T)
  7. dependent clauses per clause (DC/C)
  8. dependent clauses per T-unit (DC/T)
  9. T-units per sentence (T/S)
  10. complex T-unit ratio (CT/T)
  11. coordinate phrases per T-unit (CP/T)
  12. coordinate phrases per clause (CP/C)
  13. complex nominals per T-unit (CN/T)
  14. complex nominals per clause (CN/C)
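
The relationship between the 9 frequencies and the 14 indices can be sketched in a few lines of Python. This is an illustration rather than NeoSCA's actual code: the function name `compute_indices`, the dictionary layout, the sample counts, and the choice to return 0.0 for an empty denominator are all assumptions, and each "mean length" is read as words per unit.

```python
def compute_indices(counts):
    """Derive the 14 indices from the 9 frequency counts (hypothetical helper)."""

    def ratio(numerator, denominator):
        # Returning 0.0 for an empty denominator is an assumption of this sketch.
        if not counts[denominator]:
            return 0.0
        return counts[numerator] / counts[denominator]

    return {
        "MLS": ratio("W", "S"),
        "MLT": ratio("W", "T"),
        "MLC": ratio("W", "C"),
        "C/S": ratio("C", "S"),
        "VP/T": ratio("VP", "T"),
        "C/T": ratio("C", "T"),
        "DC/C": ratio("DC", "C"),
        "DC/T": ratio("DC", "T"),
        "T/S": ratio("T", "S"),
        "CT/T": ratio("CT", "T"),
        "CP/T": ratio("CP", "T"),
        "CP/C": ratio("CP", "C"),
        "CN/T": ratio("CN", "T"),
        "CN/C": ratio("CN", "C"),
    }


# Made-up counts, for illustration only.
sample_counts = {
    "W": 100, "S": 5, "VP": 12, "C": 10, "T": 8,
    "DC": 2, "CT": 2, "CP": 4, "CN": 16,
}
indices = compute_indices(sample_counts)
for name, value in indices.items():
    print(f"{name}: {value:.2f}")
```

Every index is a plain ratio of two counts, which is why the 9 frequencies are enough to derive all 14 values.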

Comparison

| L2SCA | NeoSCA |
| --- | --- |
| runs on macOS and Linux | runs on Windows, macOS, and Linux |
| handles single and multiple inputs with two separate commands | handles both cases with one command |
| runs only under its own home directory | runs under any directory |
| outputs only the frequencies of the 9 structures and the 14 indices | adds options to reserve intermediate results, i.e. Stanford Parser's parsing results and Tregex's querying results |

Usage

  1. Single input:
neosca sample1.txt
# output will be saved in result.csv
neosca sample1.txt -o sample1.csv
# custom output file
  2. Multiple input:
neosca sample1.txt sample2.txt
neosca sample*.txt
# wildcard characters are also supported
neosca sample[1-10].txt
  3. Use -p/--reserve-parsed to reserve parsed files of Stanford Parser. Use -m/--reserve-match to reserve match results of Stanford Tregex.
neosca sample1.txt -p -m
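
If you drive NeoSCA from a script, the commands above can be assembled programmatically. The sketch below only builds the argument list from the options shown in this section (-o, -p, -m). The helper name `build_neosca_command` is hypothetical, and combining -o with -p and -m in one call is an assumption not demonstrated above.

```python
def build_neosca_command(inputs, output=None, reserve_parsed=False, reserve_match=False):
    """Build an argument list for the neosca command line (hypothetical helper)."""
    command = ["neosca", *inputs]
    if output is not None:
        command += ["-o", output]  # custom output file
    if reserve_parsed:
        command.append("-p")  # keep Stanford Parser's parsed files
    if reserve_match:
        command.append("-m")  # keep Stanford Tregex's match results
    return command


command = build_neosca_command(
    ["sample1.txt"], output="sample1.csv", reserve_parsed=True, reserve_match=True
)
print(" ".join(command))
# To run it, pass the list to subprocess.run(command, check=True);
# this requires neosca to be installed and on PATH.
```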

Installation

  1. Install neosca
pip install neosca
  2. Install Java 8 or later

  3. Download the latest versions of Stanford Parser and Stanford Tregex

  4. Set STANFORD_PARSER_HOME and STANFORD_TREGEX_HOME

  • Windows:

In the Environment Variables window (press Windows+s, type env, and press Enter):

STANFORD_PARSER_HOME=\path\to\stanford-parser-full-2020-11-17
STANFORD_TREGEX_HOME=\path\to\stanford-tregex-2020-11-17
  • Linux/macOS:
export STANFORD_PARSER_HOME=/path/to/stanford-parser-full-2020-11-17
export STANFORD_TREGEX_HOME=/path/to/stanford-tregex-2020-11-17
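
Before running NeoSCA, it can help to confirm that both variables are visible to your session. The following is a small sanity check, not part of NeoSCA; the function name `check_stanford_homes` is hypothetical, and the check only verifies that each variable is set and points to an existing directory.

```python
import os
from pathlib import Path

REQUIRED_VARS = ("STANFORD_PARSER_HOME", "STANFORD_TREGEX_HOME")


def check_stanford_homes(environ=None):
    """Return a list of problems; an empty list means both variables look usable."""
    environ = os.environ if environ is None else environ
    problems = []
    for name in REQUIRED_VARS:
        value = environ.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not Path(value).is_dir():
            problems.append(f"{name} does not point to a directory: {value}")
    return problems


for problem in check_stanford_homes():
    print(problem)
```

If nothing is printed, both variables are set and resolve to directories. Remember that a variable exported in one terminal session is not visible in another unless it is added to your shell profile.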

Under the hood

Like L2SCA, NeoSCA works as a wrapper around Stanford Parser and Stanford Tregex. In case you are unfamiliar with the two dependencies, below are some quick examples. Detailed explanations can be found in the book Computational Methods for Corpus Annotation and Analysis (Lu, 2014).

  • Stanford Parser

Assume you have a file named sample.txt containing one sentence:

This is an example.

This command:

java -mx1500m -cp "/path/to/stanford-parser-full-2020-11-17/*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat penn edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz /path/to/sample.txt

gives the phrase structure tree:

(ROOT
  (S
    (NP (DT This))
    (VP (VBZ is)
      (NP (DT an) (NN example)))
    (. .)))

The tree can be visualized as follows, with quotation marks eliminated.

[Figure: a phrase structure tree]

In this tree, the starting symbol is the label "ROOT" at the root level of the tree, and the terminal symbols are the 4 words in the sentence plus the final period, located at the bottom of the branches of the tree. The non-terminal symbols, located between the starting symbol and the terminal symbols, include a number of labels for different clausal, phrasal, and lexical categories. For example, the non-terminal symbol "NP" indicates the phrasal category "Noun Phrase".

When parsing input files, NeoSCA runs the above command and, if you have specified the -p/--reserve-parsed option, saves the phrase structure trees in files with a .parsed extension.
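
To make the bracketed notation concrete, here is a minimal Python sketch that reads the tree shown above into nested tuples and lists its terminal symbols and labels. It is an illustration only and not how NeoSCA or Stanford Parser handle trees internally; the function names `parse_penn_tree`, `leaves`, and `labels` are hypothetical.

```python
import re


def parse_penn_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Terminal symbols (words and punctuation) are stored as plain strings.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    position = 0

    def parse_node():
        nonlocal position
        if tokens[position] != "(":
            raise ValueError("expected an opening bracket")
        position += 1
        label = tokens[position]
        position += 1
        children = []
        while tokens[position] != ")":
            if tokens[position] == "(":
                children.append(parse_node())
            else:
                children.append(tokens[position])
                position += 1
        position += 1  # skip the closing bracket
        return (label, children)

    return parse_node()


def leaves(node):
    """Collect the terminal symbols from left to right."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [leaf for child in children for leaf in leaves(child)]


def labels(node):
    """Collect every node label, starting symbol included."""
    if isinstance(node, str):
        return []
    label, children = node
    return [label] + [name for child in children for name in labels(child)]


tree = parse_penn_tree("""
(ROOT
  (S
    (NP (DT This))
    (VP (VBZ is)
      (NP (DT an) (NN example)))
    (. .)))
""")
print(leaves(tree))
print(labels(tree))
```

The leaves come out in sentence order, with the final period counted as a leaf alongside the four words, and the label list starts with "ROOT".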

  • Stanford Tregex

Tregex queries regex-like patterns, called Tregex patterns, against phrase structure trees generated by Stanford Parser.

Assume sample.parsed contains:

(ROOT
  (S
    (NP (DT This))
    (VP (VBZ is)
      (NP (DT an) (NN example)))
    (. .)))

This command:

java -mx100m -cp "/path/to/stanford-tregex-2020-11-17/stanford-tregex.jar" edu.stanford.nlp.trees.tregex.TregexPattern "NP" sample.parsed -o

gives

Pattern string:
NP
Parsed representation:
Root NP
Reading trees from file(s) sample.parsed
(NP (DT This))

(NP (DT an) (NN example))

There were 2 matches in total.

When querying parsed files, NeoSCA runs the above command and records the number of matches for each of the pre-specified Tregex patterns.

If you have specified the -m/--reserve-match option, it also saves the matches, the two NPs in our case, in files with a .matches extension.
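
For intuition about what the simplest kind of pattern, a bare label such as "NP", matches, the sketch below finds the labeled subtrees by counting brackets. It is not Tregex and supports none of Tregex's relation operators; the function name `find_subtrees` is hypothetical, and the approach assumes words never contain literal parentheses.

```python
import re


def find_subtrees(tree_text, label):
    """Return every bracketed subtree whose node label equals `label`."""
    matches = []
    # A node starts with "(", then the label, then whitespace.
    for opener in re.finditer(r"\(" + re.escape(label) + r"(?=\s)", tree_text):
        depth = 0
        for end in range(opener.start(), len(tree_text)):
            if tree_text[end] == "(":
                depth += 1
            elif tree_text[end] == ")":
                depth -= 1
                if depth == 0:
                    subtree = tree_text[opener.start():end + 1]
                    # Collapse line breaks and indentation into single spaces.
                    matches.append(" ".join(subtree.split()))
                    break
    return matches


parsed = """
(ROOT
  (S
    (NP (DT This))
    (VP (VBZ is)
      (NP (DT an) (NN example)))
    (. .)))
"""
matches = find_subtrees(parsed, "NP")
for match in matches:
    print(match)
print(f"There were {len(matches)} matches in total.")
```

On the sample tree this finds the same two NPs that Tregex reports above.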

Citing

Please use the following citation if you use NeoSCA in your work:

@misc{neol2sca,
author = {Tan, Long},
title = {NeoSCA},
howpublished = {\url{https://github.com/tanloong/NeoSCA}},
year = {2022}
}

Please also cite Lu's article describing L2SCA:

@article{lu2010automatic,
title={Automatic analysis of syntactic complexity in second language writing},
author={Lu, Xiaofei},
journal={International journal of corpus linguistics},
volume={15},
number={4},
pages={474--496},
year={2010},
publisher={John Benjamins}
}

License

Like L2SCA, NeoSCA is licensed under the GNU General Public License, version 2 or later.
