Syntactic complexity analyzer of written English language samples
NeoSCA
NeoSCA is a syntactic complexity analyzer of written English language samples. It is a rewrite of Xiaofei Lu's L2 Syntactic Complexity Analyzer, supporting Windows, macOS, and Linux.
Description
Like L2SCA, NeoSCA takes written English language samples in plain text format as input and counts the frequency of the following 9 structures in the text:
- words (W)
- sentences (S)
- verb phrases (VP)
- clauses (C)
- T-units (T)
- dependent clauses (DC)
- complex T-units (CT)
- coordinate phrases (CP)
- complex nominals (CN)
and computes the following 14 syntactic complexity indices of the text:
- mean length of sentence (MLS)
- mean length of T-unit (MLT)
- mean length of clause (MLC)
- clauses per sentence (C/S)
- verb phrases per T-unit (VP/T)
- clauses per T-unit (C/T)
- dependent clauses per clause (DC/C)
- dependent clauses per T-unit (DC/T)
- T-units per sentence (T/S)
- complex T-unit ratio (CT/T)
- coordinate phrases per T-unit (CP/T)
- coordinate phrases per clause (CP/C)
- complex nominals per T-unit (CN/T)
- complex nominals per clause (CN/C)
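Each index name spells out its own ratio, so the 14 indices can be derived from the 9 frequencies. Below is a minimal sketch in Python, not NeoSCA's actual code: the function name `compute_indices`, the sample counts, and the choice to return 0.0 for a zero denominator are all assumptions made for illustration.

```python
def compute_indices(counts: dict[str, int]) -> dict[str, float]:
    """Derive the 14 indices from the 9 structure frequencies."""

    def ratio(numerator: str, denominator: str) -> float:
        # Assumption: report 0.0 when the denominator count is zero.
        if counts[denominator] == 0:
            return 0.0
        return counts[numerator] / counts[denominator]

    return {
        "MLS": ratio("W", "S"),    # mean length of sentence
        "MLT": ratio("W", "T"),    # mean length of T-unit
        "MLC": ratio("W", "C"),    # mean length of clause
        "C/S": ratio("C", "S"),
        "VP/T": ratio("VP", "T"),
        "C/T": ratio("C", "T"),
        "DC/C": ratio("DC", "C"),
        "DC/T": ratio("DC", "T"),
        "T/S": ratio("T", "S"),
        "CT/T": ratio("CT", "T"),
        "CP/T": ratio("CP", "T"),
        "CP/C": ratio("CP", "C"),
        "CN/T": ratio("CN", "T"),
        "CN/C": ratio("CN", "C"),
    }


# Hypothetical counts, not taken from any real text.
sample_counts = {"W": 100, "S": 5, "VP": 15, "C": 10, "T": 8,
                 "DC": 2, "CT": 2, "CP": 3, "CN": 12}
indices = compute_indices(sample_counts)
print(indices["MLS"], indices["MLT"], indices["MLC"])  # 20.0 12.5 10.0
```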
Comparison
L2SCA | NeoSCA |
---|---|
runs on macOS and Linux | runs on Windows, macOS, and Linux |
single and multiple inputs are handled by two separate commands | one command for both cases, making your life easier |
runs only under its own home directory | runs under any directory |
outputs only the 9 frequencies and 14 indices | adds options to keep intermediate results, i.e. Stanford Parser's parsing results and Tregex's querying results |
Installation
- Install neosca
pip install neosca
- Install Java 8 or later
- Download and unzip the latest versions of Stanford Parser and Stanford Tregex
- Set STANFORD_PARSER_HOME and STANFORD_TREGEX_HOME
- Windows:
In the Environment Variables window (press Windows+s, type env, and press Enter):
STANFORD_PARSER_HOME=\path\to\stanford-parser-full-2020-11-17
STANFORD_TREGEX_HOME=\path\to\stanford-tregex-2020-11-17
- Linux/macOS:
export STANFORD_PARSER_HOME=/path/to/stanford-parser-full-2020-11-17
export STANFORD_TREGEX_HOME=/path/to/stanford-tregex-2020-11-17
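Before the first run it can help to confirm that the prerequisites above are in place. The following is a rough sketch, assuming only the two environment variable names given above; `check_setup` is a hypothetical helper and not part of neosca.

```python
import os
import shutil


def check_setup(env=os.environ) -> list[str]:
    """Return a list of human-readable problems; empty means nothing was found."""
    problems = []
    # Java must be reachable on PATH to run the Stanford tools.
    if shutil.which("java") is None:
        problems.append("java not found on PATH")
    for var in ("STANFORD_PARSER_HOME", "STANFORD_TREGEX_HOME"):
        path = env.get(var)
        if not path:
            problems.append(f"{var} is not set")
        elif not os.path.isdir(path):
            problems.append(f"{var} does not point to a directory: {path}")
    return problems


for problem in check_setup():
    print(problem)
```

An empty result only means the variables point at existing directories; it does not verify that the directories contain the expected jar files.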
Usage
- Single input:
nsca sample1.txt
# output will be saved in result.csv
nsca sample1.txt -o sample1.csv
# custom output file
- Multiple input:
nsca sample1.txt sample2.txt
nsca sample*.txt
# wildcard characters are supported
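nsca sample[1-9].txt sample10.txt
# sample1.txt through sample10.txt; a bracket expression such as [1-9] matches a single character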
- Use -p/--reserve-parsed to keep the parsed files of Stanford Parser. Use -m/--reserve-match to keep the match results of Stanford Tregex.
nsca sample1.txt -p -m
- Calling nsca without any arguments prints the help message.
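If you want to drive the command from a script, one option is to assemble the argument list in Python and hand it to `subprocess`. This is only a sketch: it uses the -o, -p, and -m options shown above, while the helper names `build_nsca_command` and `run_nsca` and their keyword arguments are made up for illustration.

```python
import subprocess


def build_nsca_command(inputs, output=None, keep_parsed=False, keep_matches=False):
    """Assemble an nsca command line as a list of arguments."""
    cmd = ["nsca", *inputs]
    if output is not None:
        cmd += ["-o", output]   # custom output file
    if keep_parsed:
        cmd.append("-p")        # keep Stanford Parser results
    if keep_matches:
        cmd.append("-m")        # keep Stanford Tregex results
    return cmd


def run_nsca(inputs, **options):
    """Run nsca and raise CalledProcessError if it exits with an error."""
    return subprocess.run(build_nsca_command(inputs, **options), check=True)


print(build_nsca_command(["sample1.txt"], output="sample1.csv"))
# ['nsca', 'sample1.txt', '-o', 'sample1.csv']
```

Because the arguments are passed as a list, no shell is involved, so wildcard patterns would have to be expanded beforehand, for example with `glob.glob`.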
Under the hood
Both NeoSCA and L2SCA rely on Stanford Parser and Stanford Tregex. In case you are unfamiliar with the two dependencies, below are some quick examples. Detailed explanations can be found in the book Computational Methods for Corpus Annotation and Analysis (Lu, 2014).
- Stanford Parser
Assume you have a file named sample.txt containing one sentence:
This is an example.
This command:
java -mx1500m -cp "/path/to/stanford-parser-full-2020-11-17/*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat penn edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz /path/to/sample.txt
gives the phrase structure tree:
(ROOT
(S
(NP (DT This))
(VP (VBZ is)
(NP (DT an) (NN example)))
(. .)))
The tree can be visualized as follows, with quotation marks eliminated.
In this tree, the starting symbol is the label "ROOT" at the root level of the tree, and the 4 terminal symbols are the 4 words in the sentence, located at the bottom of the branches of the tree. The non-terminal symbols, located between the starting symbol and the terminal symbols, include a number of labels for different clausal, phrasal, and lexical categories. (Lu, 2014)
When parsing input files, NeoSCA runs the above command and, if you have specified the -p option, saves phrase structure trees in files with the .parsed extension.
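To make the bracketed format more concrete, here is a small sketch that reads the terminal symbols back out of such a tree. It assumes the Penn-style bracketing shown above, where every node opens with "(" followed by its label; `read_leaves` is a hypothetical helper, not something NeoSCA or Stanford Parser provides.

```python
TREE = """(ROOT
  (S
    (NP (DT This))
    (VP (VBZ is)
      (NP (DT an) (NN example)))
    (. .)))"""


def read_leaves(tree: str) -> list[tuple[str, str]]:
    """Return (tag, terminal) pairs of a bracketed tree, left to right."""
    tokens = tree.replace("(", " ( ").replace(")", " ) ").split()
    brackets = ("(", ")")
    leaves = []
    for prev, tok in zip(tokens, tokens[1:]):
        # A label directly follows "("; a terminal directly follows its label.
        if tok not in brackets and prev not in brackets:
            leaves.append((prev, tok))
    return leaves


leaves = read_leaves(TREE)
print(leaves)
# [('DT', 'This'), ('VBZ', 'is'), ('DT', 'an'), ('NN', 'example'), ('.', '.')]
```

The list has five entries because the sentence-final period is a leaf of its own; the four remaining leaves are the words of the sentence.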
- Stanford Tregex
Tregex queries regex-like patterns, called Tregex patterns, against phrase structure trees generated by Stanford Parser.
Assume sample.parsed contains:
(ROOT
(S
(NP (DT This))
(VP (VBZ is)
(NP (DT an) (NN example)))
(. .)))
This command:
java -mx100m -cp "/path/to/stanford-tregex-2020-11-17/stanford-tregex.jar" edu.stanford.nlp.trees.tregex.TregexPattern "NP" sample.parsed -o
gives
Pattern string:
NP
Parsed representation:
Root NP
Reading trees from file(s) sample.parsed
(NP (DT This))
(NP (DT an) (NN example))
There were 2 matches in total.
When querying parsed files, NeoSCA runs the above command and records the number of matches for each of the pre-specified Tregex patterns. If you have specified the -m option, it also saves the matches, the two NPs in our case, in files with the .matches extension.
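A bare label pattern such as NP matches every node carrying that label, so the count for this simple case can be sanity-checked without Tregex. The sketch below is a plain regular-expression stand-in, not Tregex itself, and it only handles single-label patterns; `count_label` is a hypothetical name, and the approach assumes the bracketed format shown above.

```python
import re

PARSED = """(ROOT
  (S
    (NP (DT This))
    (VP (VBZ is)
      (NP (DT an) (NN example)))
    (. .)))"""


def count_label(parsed_text: str, label: str) -> int:
    """Count nodes whose label is exactly `label` in bracketed trees."""
    # A node opens with "(" immediately followed by its label,
    # then whitespace or the "(" of a nested node.
    pattern = r"\(" + re.escape(label) + r"(?=[\s(])"
    return len(re.findall(pattern, parsed_text))


print(count_label(PARSED, "NP"))  # 2
```

Real Tregex patterns can also express relations between nodes (dominance, sisterhood, and so on), which this shortcut does not attempt to cover.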
Citing
Please use the following citation if you use NeoSCA in your work:
@misc{tan2022neosca,
author = {Tan, Long},
title = {NeoSCA},
howpublished = {\url{https://github.com/tanloong/neosca}},
year = {2022}
}
Also, you need to cite Lu's article describing L2SCA:
@article{lu2010automatic,
title={Automatic analysis of syntactic complexity in second language writing},
author={Lu, Xiaofei},
journal={International journal of corpus linguistics},
volume={15},
number={4},
pages={474--496},
year={2010},
publisher={John Benjamins}
}
License
Like L2SCA, NeoSCA is licensed under the GNU General Public License, version 2 or later.