Another syntactic complexity analyzer of written English language samples

NeoSCA

NeoSCA is a rewrite of the L2 Syntactic Complexity Analyzer (L2SCA), developed by Xiaofei Lu, with added support for Windows and an improved command-line interface. Like L2SCA, NeoSCA takes written English language samples in plain text format as input and computes:

the frequency of 9 structures in the text:
  1. words (W)
  2. sentences (S)
  3. verb phrases (VP)
  4. clauses (C)
  5. T-units (T)
  6. dependent clauses (DC)
  7. complex T-units (CT)
  8. coordinate phrases (CP)
  9. complex nominals (CN), and
14 syntactic complexity indices of the text:
  1. mean length of sentence (MLS)
  2. mean length of T-unit (MLT)
  3. mean length of clause (MLC)
  4. clauses per sentence (C/S)
  5. verb phrases per T-unit (VP/T)
  6. clauses per T-unit (C/T)
  7. dependent clauses per clause (DC/C)
  8. dependent clauses per T-unit (DC/T)
  9. T-units per sentence (T/S)
  10. complex T-unit ratio (CT/T)
  11. coordinate phrases per T-unit (CP/T)
  12. coordinate phrases per clause (CP/C)
  13. complex nominals per T-unit (CN/T)
  14. complex nominals per clause (CN/C)
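
The relationship between the two lists can be sketched in Python. This is a minimal illustration, assuming each index is the simple ratio that its name suggests (for example, MLS as words divided by sentences); the function name `compute_indices`, the zero-denominator fallback, and the example counts are hypothetical and are not taken from NeoSCA's code.

```python
from typing import Dict

# Numerator/denominator pairs for the 14 indices, keyed by index name and
# using the structure abbreviations from the first list.
INDEX_DEFINITIONS = {
    "MLS": ("W", "S"),
    "MLT": ("W", "T"),
    "MLC": ("W", "C"),
    "C/S": ("C", "S"),
    "VP/T": ("VP", "T"),
    "C/T": ("C", "T"),
    "DC/C": ("DC", "C"),
    "DC/T": ("DC", "T"),
    "T/S": ("T", "S"),
    "CT/T": ("CT", "T"),
    "CP/T": ("CP", "T"),
    "CP/C": ("CP", "C"),
    "CN/T": ("CN", "T"),
    "CN/C": ("CN", "C"),
}


def compute_indices(freq: Dict[str, int]) -> Dict[str, float]:
    """Divide each numerator count by its denominator count.

    A zero denominator yields 0.0 here; that fallback is an assumption of
    this sketch, not documented NeoSCA behavior.
    """
    result = {}
    for name, (num, den) in INDEX_DEFINITIONS.items():
        result[name] = freq[num] / freq[den] if freq[den] else 0.0
    return result


# Hypothetical counts, invented for illustration only.
example = {"W": 120, "S": 6, "VP": 18, "C": 12, "T": 8,
           "DC": 4, "CT": 3, "CP": 5, "CN": 14}
indices = compute_indices(example)
print(indices["MLS"])  # 120 words / 6 sentences
```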

Highlights

  • Works on Windows/macOS/Linux
  • Flexible command-line options serving various needs

Install

Install NeoSCA

To install NeoSCA, you need to have Python 3.7 or later installed on your system. You can check if you already have Python installed by running the following command in your terminal:

python --version
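
If you would rather check the version from inside Python, a small sketch such as the following may help. The helper name `meets_minimum` is hypothetical; the 3.7 threshold comes from the requirement stated above.

```python
import sys

MINIMUM = (3, 7)  # minimum Python version stated in this README


def meets_minimum(version_info=sys.version_info, minimum=MINIMUM) -> bool:
    """Compare the (major, minor) part of a version tuple with the minimum."""
    return tuple(version_info[:2]) >= minimum


print(meets_minimum())  # checks the interpreter running this script
```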

If Python is not installed, you can download and install it from the Python website. Once Python is installed, you can install NeoSCA using pip:

pip install neosca

If you are in China and run into slow download speeds or network issues, you can use the Tsinghua University PyPI mirror to install NeoSCA:

pip install neosca -i https://pypi.tuna.tsinghua.edu.cn/simple

Install Dependencies

NeoSCA depends on Java, Stanford Parser, and Stanford Tregex. NeoSCA provides an option to install all of them:

nsca --check-depends

When called with --check-depends, NeoSCA downloads and unzips the archives of these three dependencies to %AppData% (on Windows, usually C:\Users\<username>\AppData\Roaming) or ~/.local/share (on macOS and Linux), and sets the environment variables JAVA_HOME, STANFORD_PARSER_HOME, and STANFORD_TREGEX_HOME. If you have previously installed any of the three, you need to set the corresponding environment variable manually.
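
To see which of the three environment variables are visible to a process, a short Python check like the one below can be used. It is only a sketch: the helper name `missing_vars` is hypothetical, and it tests merely whether each variable is set and non-empty, not whether the path it holds is valid.

```python
import os
from typing import List, Mapping

REQUIRED_VARS = ("JAVA_HOME", "STANFORD_PARSER_HOME", "STANFORD_TREGEX_HOME")


def missing_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variables that are unset or empty in env."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


for name in missing_vars():
    print(f"{name} is not set")
```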

Usage

NeoSCA is a CLI-based tool. You can see the help message by running nsca --help in your terminal.

Basic Usage

Single Input

To analyze a single text file, use the command nsca followed by the file path.

nsca ./samples/sample1.txt
# frequency output: ./result.csv

A result.csv file will be generated in the current directory. You can specify a different output filename using -o/--output-file.

nsca ./samples/sample1.txt -o sample1.csv
# frequency output: ./sample1.csv

When analyzing a file whose name contains spaces, enclose the file path in single or double quotes. Suppose you have a file named sample 1.txt to analyze:

nsca "./samples/sample 1.txt"

This ensures that the entire filename, including the spaces, is interpreted as a single argument. Without the quotes, the command would interpret "./samples/sample" and "1.txt" as two separate arguments and the analysis would fail.
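
The splitting can be reproduced with Python's standard `shlex` module, which follows POSIX shell quoting rules. This is an illustration of shell word splitting in general, not of anything NeoSCA does itself; other shells, such as the Windows command prompt, may handle single quotes differently.

```python
import shlex

# The same command line, without and with quotes around the path.
unquoted = shlex.split('nsca ./samples/sample 1.txt')
quoted = shlex.split('nsca "./samples/sample 1.txt"')

print(unquoted)  # the path is split at the space
print(quoted)    # the path stays in one piece
```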

Multiple Input

To analyze multiple text files at once, simply list them after nsca.

cd ./samples/
nsca sample1.txt sample2.txt

You can also use wildcards to select multiple files at once.

cd ./samples/
nsca sample*.txt # every file whose name starts with "sample" and ends with ".txt"
nsca sample[1-9].txt sample10.txt # sample1.txt -- sample10.txt
nsca sample10[1-9].txt sample1[1-9][0-9].txt sample200.txt # sample101.txt -- sample200.txt
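
To convince yourself that the last group of patterns covers exactly sample101.txt through sample200.txt, you can test them against a list of names with Python's `fnmatch` module. This is a sketch that assumes `fnmatch` matches these bracket patterns the same way your shell does; the candidate filenames are hypothetical.

```python
from fnmatch import fnmatchcase

patterns = ["sample10[1-9].txt", "sample1[1-9][0-9].txt", "sample200.txt"]

# Hypothetical filenames sample1.txt through sample250.txt.
candidates = [f"sample{n}.txt" for n in range(1, 251)]

# Keep every name that matches at least one of the patterns.
matched = [name for name in candidates
           if any(fnmatchcase(name, p) for p in patterns)]

print(matched[0], matched[-1], len(matched))
```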

Advanced Usage

Expand Wildcards

Use --expand-wildcards to print all files that match your wildcard pattern. This can help you ensure that your pattern matches all desired files and excludes any unwanted ones. Note that files that do not exist on the computer will not be included in the output, even if they match the specified pattern.

nsca sample10[1-9].txt sample1[1-9][0-9].txt sample200.txt --expand-wildcards
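
The point about non-existent files can be illustrated with Python's `glob` module, which, like shell expansion, only returns paths that exist on disk. This is an analogy rather than NeoSCA's implementation, and the files created in the temporary directory are hypothetical.

```python
import glob
import os
import tempfile

patterns = ["sample10[1-9].txt", "sample1[1-9][0-9].txt", "sample200.txt"]

with tempfile.TemporaryDirectory() as tmp:
    # Create only three files; sample200.txt is deliberately absent.
    for name in ("sample101.txt", "sample150.txt", "sample99.txt"):
        open(os.path.join(tmp, name), "w").close()

    # glob.escape protects any special characters in the directory name.
    expanded = sorted(
        os.path.basename(path)
        for pattern in patterns
        for path in glob.glob(os.path.join(glob.escape(tmp), pattern))
    )

print(expanded)  # sample99.txt matches no pattern; sample200.txt does not exist
```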

Treat Newlines as Sentence Breaks

By default, Stanford Parser does not treat newlines as sentence breaks during sentence segmentation. To make it do so, use:

nsca sample1.txt --newline-break always

--newline-break accepts 3 values: never (default), always, and two.

  • never means to ignore newlines for the purpose of sentence splitting. It is appropriate for continuous text with hard line breaks when just the non-whitespace characters should be used to determine sentence breaks.
  • always means to treat a newline as a sentence break, but there may still be more than one sentence per line.
  • two means to take two or more consecutive newlines as a sentence break. It is for text with hard line breaks and a blank line between paragraphs.
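
The difference between the three values can be sketched as a pre-splitting step in Python. This is a simplified model and not Stanford Parser's tokenizer: the hypothetical function `split_into_segments` only shows where newlines force a boundary, and real sentence splitting on punctuation would still happen inside each segment.

```python
import re
from typing import List


def split_into_segments(text: str, newline_break: str = "never") -> List[str]:
    """Return the chunks inside which sentence splitting would then run."""
    if newline_break == "never":
        # Newlines are treated like any other whitespace.
        return [" ".join(text.split())]
    if newline_break == "always":
        # Every newline ends a segment.
        parts = text.split("\n")
    elif newline_break == "two":
        # Only a run of two or more newlines ends a segment.
        parts = re.split(r"\n{2,}", text)
    else:
        raise ValueError(f"unknown mode: {newline_break!r}")
    # Normalize inner whitespace and drop empty segments.
    return [" ".join(p.split()) for p in parts if p.strip()]


text = "A title\nFirst line of a\nwrapped paragraph.\n\nSecond paragraph."
for mode in ("never", "always", "two"):
    print(mode, split_into_segments(text, mode))
```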

Select a Subset of Measures

NeoSCA by default outputs values of all of the available measures. You can use --select to only analyze measures that you are interested in. To see a full list of available measures, use nsca --list.

nsca --select VP T DC_C -- sample1.txt

To prevent the program from treating input filenames as selected measures and raising an error, use -- to separate the two. All arguments after -- are treated as input filenames, so specify every other argument to the left of --.
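
Why the separator matters can be shown with Python's `argparse`. The parser below is a hypothetical stand-in, not NeoSCA's actual argument parser: it assumes an option that accepts one or more values, which is the situation in which a trailing filename would be swallowed as a measure.

```python
import argparse

# Hypothetical parser: --select takes one or more values, followed by files.
parser = argparse.ArgumentParser(prog="demo")
parser.add_argument("--select", nargs="+", default=[])
parser.add_argument("files", nargs="*")

# Without "--", the filename is consumed as one more value of --select.
without_sep = parser.parse_args(["--select", "VP", "T", "DC_C", "sample1.txt"])

# With "--", everything after it is treated as a positional argument.
with_sep = parser.parse_args(["--select", "VP", "T", "DC_C", "--", "sample1.txt"])

print(without_sep.select, without_sep.files)
print(with_sep.select, with_sep.files)
```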

Combine Subfiles

Use -c/--combine-subfiles to add up the frequencies of the 9 syntactic structures across subfiles and compute the 14 syntactic complexity indices for the imaginary parent file. You can use this option multiple times to combine different lists of subfiles separately. Use -- to separate the names of input files from the names of subfiles.

nsca -c sample1-sub1.txt sample1-sub2.txt
nsca -c sample1-sub*.txt
nsca -c sample1-sub*.txt -c sample2-sub*.txt
nsca -c sample1-sub*.txt -c sample2-sub*.txt -- sample[3-9].txt
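
Because the indices of the parent file are computed from summed frequencies, they generally differ from the average of the subfiles' own indices. The Python sketch below illustrates this with hypothetical counts; it models the description above and is not taken from NeoSCA's code.

```python
from collections import Counter

# Hypothetical counts for two subfiles, invented for illustration.
sub1 = Counter({"W": 100, "S": 5})
sub2 = Counter({"W": 30, "S": 3})

# Frequencies are added structure by structure.
parent = sub1 + sub2

# Mean length of sentence for the parent uses the summed counts ...
mls_parent = parent["W"] / parent["S"]
# ... which is not the same as averaging the two subfile values.
mls_average = (sub1["W"] / sub1["S"] + sub2["W"] / sub2["S"]) / 2

print(parent["W"], parent["S"])
print(mls_parent, mls_average)
```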

Skip Long Sentences

Use --max-length to analyze only sentences no longer than a given length. For example, to skip sentences longer than 100:

nsca sample1.txt --max-length 100

When --max-length is not specified, the program tries to analyze sentences of any length, but may run out of memory doing so.
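
The effect of such a limit can be sketched in Python as a filter over sentences. This sketch assumes that length is counted in whitespace-separated tokens, which may not match how NeoSCA or Stanford Parser measures it; the function name `keep_short_sentences` and the sample sentences are hypothetical.

```python
from typing import List


def keep_short_sentences(sentences: List[str], max_length: int) -> List[str]:
    """Keep sentences whose whitespace-separated token count is <= max_length."""
    return [s for s in sentences if len(s.split()) <= max_length]


# Two pre-tokenized sentences with 3 and 6 tokens respectively.
sentences = ["Short one .", "This sentence has six tokens ."]
print(keep_short_sentences(sentences, 3))
```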

Reserve Intermediate Results

NeoSCA by default only saves frequency output. To keep the parsed trees, use -p or --reserve-parsed. To keep matched subtrees, use -m or --reserve-matched.

nsca samples/sample1.txt -p
# frequency output: ./result.csv
# parsed trees:     ./samples/sample1.parsed

nsca samples/sample1.txt -m
# frequency output: ./result.csv
# matched subtrees: ./result_matches/

nsca samples/sample1.txt -p -m
# frequency output: ./result.csv
# parsed trees:     ./samples/sample1.parsed
# matched subtrees: ./result_matches/
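
Judging from the examples above, the parsed trees are written next to the input file with the extension changed to .parsed. If you need to locate that file from a script, the naming can be sketched as follows; the helper `parsed_path` is hypothetical, and the rule is inferred from the examples rather than from NeoSCA's code.

```python
from pathlib import PurePosixPath


def parsed_path(input_path: str) -> str:
    """Swap the input file's extension for .parsed, keeping its directory."""
    return str(PurePosixPath(input_path).with_suffix(".parsed"))


print(parsed_path("samples/sample1.txt"))
```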

Misc

Pass Text Through the Command Line

If you want to analyze text that is passed directly through the command line, you can use --text followed by the text.

nsca --text 'The quick brown fox jumps over the lazy dog.'
# frequency output: ./result.csv

JSON Output

You can generate a JSON file either by passing --output-format json or by giving the output file a .json extension:

nsca ./samples/sample1.txt --output-format json
# frequency output: ./result.json
nsca ./samples/sample1.txt -o sample1.json
# frequency output: ./sample1.json

Just Parse Text and Exit

If you only want to save the parsed trees and exit, use --no-query. This is useful if you want to use the parsed trees for other purposes. When --no-query is specified, --reserve-parsed is set automatically.

nsca samples/sample1.txt --no-query
# parsed trees: samples/sample1.parsed
nsca --text 'This is a test.' --no-query
# parsed trees: ./cmdline_text.parsed

List Output Fields

Use --list to print a list of all the available output fields.

nsca --list
W: words
S: sentences
VP: verb phrases
C: clauses
T: T-units
DC: dependent clauses
CT: complex T-units
CP: coordinate phrases
CN: complex nominals
MLS: mean length of sentence
MLT: mean length of T-unit
MLC: mean length of clause
C/S: clauses per sentence
VP/T: verb phrases per T-unit
C/T: clauses per T-unit
DC/C: dependent clauses per clause
DC/T: dependent clauses per T-unit
T/S: T-units per sentence
CT/T: complex T-unit ratio
CP/T: coordinate phrases per T-unit
CP/C: coordinate phrases per clause
CN/T: complex nominals per T-unit
CN/C: complex nominals per clause

Citing

If you use NeoSCA in your research, please cite as follows.

BibTeX

@misc{tan2022neosca,
title        = {NeoSCA: A Rewrite of L2 Syntactic Complexity Analyzer, version 0.0.38},
author       = {Long Tan},
howpublished = {\url{https://github.com/tanloong/neosca}},
year         = {2022}
}

APA (7th edition)

Tan, L. (2022). NeoSCA (version 0.0.38) [Computer software]. GitHub. https://github.com/tanloong/neosca

MLA (9th edition)

Tan, Long. NeoSCA. version 0.0.38, GitHub, 2022, https://github.com/tanloong/neosca.

Please also cite Xiaofei Lu's article describing L2SCA.

BibTeX

@article{lu2010automatic,
title     = {Automatic analysis of syntactic complexity in second language writing},
author    = {Xiaofei Lu},
journal   = {International Journal of Corpus Linguistics},
volume    = {15},
number    = {4},
pages     = {474--496},
year      = {2010},
publisher = {John Benjamins Publishing Company},
doi       = {10.1075/ijcl.15.4.02lu},
}

APA (7th edition)

Lu, X. (2010). Automatic analysis of syntactic complexity in second language writing. International Journal of Corpus Linguistics, 15(4), 474–496. https://doi.org/10.1075/ijcl.15.4.02lu

MLA (9th edition)

Lu, Xiaofei. "Automatic Analysis of Syntactic Complexity in Second Language Writing." International Journal of Corpus Linguistics, vol. 15, no. 4, John Benjamins Publishing Company, 2010, pp. 474-96.

Related Efforts

License

Distributed under the terms of the GNU General Public License version 2 or later.

Contact

You can send bug reports, feature requests, or any questions via:
