A package for extracting syntactic complexity measures from CoNLL-U annotations.

syntaxcomp

This package calculates syntactic complexity measures from morphosyntactically annotated texts in CoNLL-U format. It also supports sentence segmentation (T-unit and clause extraction) and NP extraction.

Disclaimer: correct results are guaranteed only for texts annotated with UDPipe 2.12. Note that syntaxcomp relies heavily on CoNLL-U Parser (the conllu package).

Installation

pip install syntaxcomp

Usage Example

>>> from syntaxcomp.complexity import SentenceComplexity, TextComplexity

>>> example = """
# udpipe_model = english-ewt-ud-2.12-230717
# sent_id = 1
# text = This is a text containing two sentences.
1	This	this	PRON	DT	Number=Sing|PronType=Dem	4	nsubj	_	_
2	is	be	AUX	VBZ	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	4	cop	_	_
3	a	a	DET	DT	Definite=Ind|PronType=Art	4	det	_	_
4	text	text	NOUN	NN	Number=Sing	0	root	_	_
5	containing	contain	VERB	VBG	VerbForm=Ger	4	acl	_	_
6	two	two	NUM	CD	NumForm=Word|NumType=Card	7	nummod	_	_
7	sentences	sentence	NOUN	NNS	Number=Plur	5	obj	_	SpaceAfter=No
8	.	.	PUNCT	.	_	4	punct	_	_

# sent_id = 2
# text = This is the second sentence.
1	This	this	PRON	DT	Number=Sing|PronType=Dem	5	nsubj	_	_
2	is	be	AUX	VBZ	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	5	cop	_	_
3	the	the	DET	DT	Definite=Def|PronType=Art	5	det	_	_
4	second	second	ADJ	JJ	Degree=Pos|NumType=Ord	5	amod	_	_
5	sentence	sentence	NOUN	NN	Number=Sing	0	root	_	SpaceAfter=No
6	.	.	PUNCT	.	_	5	punct	_	SpaceAfter=No
"""

>>> tc = TextComplexity(example)
>>> tc.info()
Number of Sentences: 2
Number of Words: 12
Number of Clauses: 3
Number of T-Units: 2
Mean Sentence Length: 6.0
Mean Clause Length: 4.0
Mean T-Unit Length: 6.0
Mean Number of Clauses per Sentence: 1.5
Mean Number of Clauses per T-Unit: 1.5
Mean Tree Depth: 3
Median Tree Depth: 3.0
Minimum Tree Depth: 2
Maximum Tree Depth: 4
Mean Dependency Distance: 2.42
Node-to-Terminal-Node Ratio: 1.5
Average Levenshtein Distance between POS: 3
Average Levenshtein Distance between deprel: 4
Average NP Length: 1.8
Complex NP Ratio: 0.6
Number of Combined Clauses: 1
Number of Coordinate Clauses: 0
Number of Subordinate Clauses: 1
Coordinate to Combined Clause Ratio: 0.0
Subordinate to Combined Clause Ratio: 1.0
Coordinate to Subordinate Clause Ratio: 0.0
Coordinate Clause to Sentence Ratio: 0.0
Subordinate Clause to Sentence Ratio: 0.5
Percentage of root Clauses: 67.0%
Percentage of acl Clauses: 33.0%
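
To make two of the figures above easier to interpret, here is a minimal standard-library sketch of Mean Sentence Length and Mean Dependency Distance. It does not use syntaxcomp, and the function names are illustrative only. It assumes that punctuation is not counted as a word, that the distance of a token is |ID - HEAD|, that the root's HEAD of 0 is treated as position 0, and that distances are pooled over the whole text. Under these assumptions the sketch gives 6.0 and 2.42 for the example, but the source does not confirm that syntaxcomp computes the measures this way.

```python
# Rows of (ID, HEAD, UPOS) copied from the two example sentences above.
SENTENCES = [
    [(1, 4, "PRON"), (2, 4, "AUX"), (3, 4, "DET"), (4, 0, "NOUN"),
     (5, 4, "VERB"), (6, 7, "NUM"), (7, 5, "NOUN"), (8, 4, "PUNCT")],
    [(1, 5, "PRON"), (2, 5, "AUX"), (3, 5, "DET"), (4, 5, "ADJ"),
     (5, 0, "NOUN"), (6, 5, "PUNCT")],
]


def words(sentence):
    """Drop punctuation tokens (an assumption about what counts as a word)."""
    return [tok for tok in sentence if tok[2] != "PUNCT"]


def mean_sentence_length(sentences):
    """Average number of words per sentence."""
    return sum(len(words(s)) for s in sentences) / len(sentences)


def mean_dependency_distance(sentences):
    """Pooled mean of |ID - HEAD| over all words.

    The root's HEAD (0) is treated as position 0, which is an assumption.
    """
    distances = [abs(tok_id - head)
                 for s in sentences
                 for tok_id, head, _ in words(s)]
    return sum(distances) / len(distances)


print(mean_sentence_length(SENTENCES))
print(round(mean_dependency_distance(SENTENCES), 2))
```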

Alternatively, you can pass the result of conllu.parse directly as input:

>>> from conllu import parse
>>> anno = parse(example)
>>> tc = TextComplexity(anno)

For SentenceComplexity, conllu.models.TokenList is currently the only accepted input:

>>> sc = SentenceComplexity(anno[0])
>>> sc.info()
Number of Words: 7
Number of Clauses: 2
Clauses: ['This is a text', 'containing two sentences']
Number of T-Units: 1
T-Units: ['This is a text containing two sentences']
Number of NPs: 3
NPs: ['This', 'a text', 'two sentences']
Tree Depth: 4
Mean Dependency Distance: 2
POS Chain: ['PRON', 'AUX', 'DET', 'NOUN', 'VERB', 'NUM', 'NOUN']
deprel Chain: ['nsubj', 'cop', 'det', 'root', 'acl', 'nummod', 'obj']
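
The POS and deprel chains printed above are plausibly what the text-level Levenshtein measures compare. The following is a minimal sketch, not syntaxcomp's own code: it assumes the distance is a plain edit distance in which each tag counts as one symbol, computed between the chains of the two example sentences with punctuation left out. The chains for the second sentence are read off its CoNLL-U rows, and the function name is illustrative.

```python
def levenshtein(a, b):
    """Edit distance between two sequences, treating each tag as one symbol."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


# Chains for sentence 1 (as printed above) and sentence 2 (from its CoNLL-U rows).
pos_1 = ['PRON', 'AUX', 'DET', 'NOUN', 'VERB', 'NUM', 'NOUN']
pos_2 = ['PRON', 'AUX', 'DET', 'ADJ', 'NOUN']
deprel_1 = ['nsubj', 'cop', 'det', 'root', 'acl', 'nummod', 'obj']
deprel_2 = ['nsubj', 'cop', 'det', 'amod', 'root']

print(levenshtein(pos_1, pos_2))
print(levenshtein(deprel_1, deprel_2))
```

Under these assumptions the two distances are 3 and 4, the same values that tc.info() reports for this text.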

To display the text and the dependency tree, pass verbose=True (for TextComplexity, only the text will be printed):

>>> SentenceComplexity(anno[0], verbose=True)
This is a text containing two sentences.
(deprel:root) form:text lemma:text upos:NOUN [4]
    (deprel:nsubj) form:This lemma:this upos:PRON [1]
    (deprel:cop) form:is lemma:be upos:AUX [2]
    (deprel:det) form:a lemma:a upos:DET [3]
    (deprel:acl) form:containing lemma:contain upos:VERB [5]
        (deprel:obj) form:sentences lemma:sentence upos:NOUN [7]
            (deprel:nummod) form:two lemma:two upos:NUM [6]
    (deprel:punct) form:. lemma:. upos:PUNCT [8]
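
The indentation of the tree above suggests how the Tree Depth value of 4 can be read: it matches the number of nodes on the longest path from the root ('text') down to a leaf ('two'). A minimal sketch of that reading, using only the ID and HEAD columns and a hypothetical helper name; that syntaxcomp defines depth in exactly this way is an assumption.

```python
import statistics


def tree_depth(heads):
    """Number of nodes on the longest root-to-leaf path.

    heads maps each token ID to its HEAD ID; the root has HEAD 0.
    """
    def nodes_up_to_root(token_id):
        # Follow HEAD links upwards, counting every node passed on the way.
        count = 0
        while token_id != 0:
            count += 1
            token_id = heads[token_id]
        return count

    return max(nodes_up_to_root(token_id) for token_id in heads)


# ID -> HEAD for the two example sentences, punctuation left out.
sentence_1 = {1: 4, 2: 4, 3: 4, 4: 0, 5: 4, 6: 7, 7: 5}
sentence_2 = {1: 5, 2: 5, 3: 5, 4: 5, 5: 0}

depths = [tree_depth(sentence_1), tree_depth(sentence_2)]
print(depths)
print(statistics.mean(depths), statistics.median(depths))
```

With these inputs the per-sentence depths are 4 and 2, which agrees with the minimum, maximum, mean and median tree depths shown for TextComplexity.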
