Parser for dependency trees

STARK: A Tool for Dependency Tree Extraction and Comparison

A bottom-up tool for discovering syntactic patterns in parsed corpora — no predefined queries needed.

STARK is a highly customizable tool designed for extracting different types of syntactic structures (trees) from parsed corpora (treebanks). It quantifies these structures with respect to frequency and provides other useful corpus-linguistic statistics, such as the strength of association between the nodes of a tree or its statistical significance in comparison to another treebank.

STARK is primarily aimed at processing treebanks based on the Universal Dependencies annotation scheme, but it also accepts any other dependency treebank in the CoNLL-U format as input.

For an online demonstration of the tool (with a reduced set of features), please visit https://orodja.cjvt.si/stark/.

Installation and execution

Install Python 3 on your system (https://www.python.org/downloads/).

Linux users

Install pip and the other libraries required by the program by running the following commands in the terminal:

sudo apt install python3-pip
cd <PATH TO PROJECT DIRECTORY>
pip3 install -r requirements.txt

To run the extraction, move to the project directory and execute the script:

python3 stark.py 

Windows users

Download the pip installation file (https://bootstrap.pypa.io/get-pip.py) and install it by double-clicking it.

Install the other required libraries by going to the program directory and double-clicking install.bat. If Windows Defender prevents this file from running, you may have to unblock it by right-clicking the .bat file -> Properties -> General -> Security -> Select Unblock -> Select Apply.

Run the extraction by double-clicking run.bat (if it is blocked, repeat the same procedure as for install.bat).

Changing the settings

By default, running the program as described above extracts trees from the sample en_ewt-ud-dev.conllu file (taken from the English EWT UD v2.14 treebank) as defined by the parameter settings in the default config.ini file. To change the settings, you can modify the config.ini file directly or create your own configuration (.ini) file, which is then passed as an argument when running the program in the terminal (example below) or specified in the run.bat file.

python3 stark.py --config_file my-settings.ini

Alternatively, you can change a specific setting by passing it directly as a command-line argument, which overrides the value specified in the config.ini configuration file. In the example below, the tool extracts verb-headed trees with lemmas as nodes from a treebank named my-treebank.conllu, while all other options remain the same as in the default config.ini configuration file.

python3 stark.py --input my-treebank.conllu --node_type lemma --head upos=VERB
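When many runs with different overrides are needed, the command line can also be assembled programmatically. The following is a minimal sketch, assuming only the `--name value` argument pattern shown above; the helper `build_stark_command` is a hypothetical name and is not part of STARK.

```python
import shlex


def build_stark_command(script="stark.py", **settings):
    """Build an argument list of the form: python3 stark.py --name value ...

    `build_stark_command` is a hypothetical helper for illustration only.
    Setting names are passed through unchanged, so they must match the
    names documented in the list of settings.
    """
    command = ["python3", script]
    for name, value in settings.items():
        command += [f"--{name}", str(value)]
    return command


# Same overrides as in the example above.
cmd = build_stark_command(
    input="my-treebank.conllu", node_type="lemma", head="upos=VERB"
)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` from within the project directory.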

List of main settings

The types of trees to be extracted and the associated output information can be defined through the main parameters listed below and described in more detail here.

General settings:

  • input: location of the input file or directory (parsed corpus in .conllu)
  • output: location of the output file (list of trees in .tsv)

Tree specification:

  • node_type: node characteristic under investigation (form, lemma, upos, xpos, feats, deprel or none)
  • labeled: extraction of labeled or unlabeled trees (values yes or no)
  • fixed: differentiating trees by surface word order (values yes or no)

Tree restrictions:

  • size: number of nodes in the tree (integer or range, e.g. 2-10)
  • head: predefined characteristics of the head node (e.g. upos=NOUN)
  • ignored_labels: predefined list of dependency labels to be ignored when retrieving the trees (e.g. punct)
  • query: predefined tree structure based on the DepSearch query language (e.g. upos=VERB >obl upos=NOUN)

Statistics:

  • association_measures: calculates the strength of association between nodes by MI, MI3, t-test, logDice, Dice and simple-LL scores (values yes or no)
  • compare: calculates the keyness of a tree in comparison to another treebank by LL, BIC, log ratio, odds ratio and %DIFF scores (reference treebank in .conllu)
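To give an intuition for the keyness scores named under compare, the sketch below computes three of them (LL, log ratio and %DIFF) from raw counts using their standard corpus-linguistic definitions. It is an illustration under stated assumptions, not STARK's own code: the function name `keyness` and its parameters are hypothetical, and the choice of what counts as corpus "size" (e.g. tokens or trees) is left to the caller.

```python
import math


def keyness(freq_target, size_target, freq_ref, size_ref):
    """Textbook keyness scores for one tree in two treebanks.

    Hypothetical helper: STARK's implementation may differ in detail
    (e.g. smoothing of zero frequencies).
    """
    if freq_ref == 0:
        raise ValueError("log ratio and %DIFF need a non-zero reference frequency")

    # Expected frequencies if the tree were equally common in both corpora.
    total = freq_target + freq_ref
    expected_target = size_target * total / (size_target + size_ref)
    expected_ref = size_ref * total / (size_target + size_ref)

    # Log-likelihood: 2 * sum(observed * ln(observed / expected)).
    ll = 0.0
    for observed, expected in ((freq_target, expected_target), (freq_ref, expected_ref)):
        if observed > 0:
            ll += observed * math.log(observed / expected)
    ll *= 2

    rel_target = freq_target / size_target
    rel_ref = freq_ref / size_ref
    return {
        "LL": ll,
        "log_ratio": math.log2(rel_target / rel_ref) if rel_target > 0 else float("-inf"),
        "pct_diff": (rel_target - rel_ref) / rel_ref * 100,
    }


scores = keyness(freq_target=20, size_target=1000, freq_ref=10, size_ref=1000)
print(scores)
```

A tree that is twice as frequent in the target treebank as in an equally sized reference treebank gets a log ratio of 1 and a %DIFF of 100.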

Additional visualization:

  • example: prints a random sentence containing the tree
  • grew_match: provides links to examples in Grew-match and describes the tree structure using the grew query language

For a detailed explanation of these and many other settings, see the settings documentation here.

Output

STARK produces a tab-separated (.tsv) file with a list of all the trees matching the input criteria sorted by descending frequency, as illustrated by the first few lines of the default sample output below, which show the five most frequent trees occurring in the sample en_ewt-ud-dev.conllu treebank.

The description of the tree is given in the first column, while subsequent columns provide additional information on the absolute and relative frequencies, the surface node order, the number of nodes in the tree and the head node. For adding other types of information to the output, such as other useful statistics, examples and links to visualized trees, see the list of settings above or the detailed settings documentation here.

Tree                      Abs-Freq  Rel-Freq  Order  N  Head
DET <det NOUN             320       12644.6   AB     2  NOUN
ADP <case DET <det NOUN   183       7276.6    ABC    3  NOUN
ADP <case PROPN           175       6958.5    AB     2  PROPN
ADP <case NOUN            163       6481.4    AB     2  NOUN
ADJ <amod NOUN            117       4652.3    AB     2  NOUN
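Because the output is a plain tab-separated file with a header row, it can be loaded with Python's standard csv module for further analysis. The sketch below assumes the column names shown in the sample above (Tree, Abs-Freq, Rel-Freq, Order, N, Head); the function name `read_trees` is hypothetical, and the sample rows are embedded inline so that the example runs without an actual output file.

```python
import csv
import io

# The sample rows from above, joined with tabs as in a real .tsv file.
SAMPLE = "\n".join(
    "\t".join(fields)
    for fields in [
        ["Tree", "Abs-Freq", "Rel-Freq", "Order", "N", "Head"],
        ["DET <det NOUN", "320", "12644.6", "AB", "2", "NOUN"],
        ["ADP <case DET <det NOUN", "183", "7276.6", "ABC", "3", "NOUN"],
        ["ADP <case PROPN", "175", "6958.5", "AB", "2", "PROPN"],
        ["ADP <case NOUN", "163", "6481.4", "AB", "2", "NOUN"],
        ["ADJ <amod NOUN", "117", "4652.3", "AB", "2", "NOUN"],
    ]
)


def read_trees(handle):
    """Read tab-separated tree rows and convert the numeric columns."""
    # QUOTE_NONE keeps quotation marks in word forms as literal characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        row["Abs-Freq"] = int(row["Abs-Freq"])
        row["Rel-Freq"] = float(row["Rel-Freq"])
        row["N"] = int(row["N"])
        rows.append(row)
    return rows


rows = read_trees(io.StringIO(SAMPLE))
# Keep only trees headed by a common noun.
noun_headed = [row["Tree"] for row in rows if row["Head"] == "NOUN"]
print(noun_headed)
```

To read a real output file, replace the `io.StringIO(SAMPLE)` handle with `open(path, encoding="utf-8", newline="")`.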

Description of tree structure

The description of the trees given in the first column of the output is based on the dep_search query language (archived here), which is simple to learn and easy to read:

  • Dependencies are expressed using < and > operators, which mimic the "arrows" in the dependency graph.
    • A < B means that token A is governed by token B, e.g. rainy < morning
    • A > B means that token A governs token B, e.g. read > newspapers
  • Dependency labels are specified right after the dependency operator
    • A <amod B means that token A is the adjectival modifier of token B, e.g. rainy <amod morning
    • A >obj B means that token B is the direct object of token A, e.g. read >obj newspapers
  • Priority is marked using parentheses:
    • A > B > C means that A governs both B and C in parallel, e.g. read > newspapers > people for 'people read newspapers'
    • A > (B > C) means that A governs B which, in turn, governs C, e.g. read > (newspapers > interesting) for 'read interesting newspapers'
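The notation can be turned into explicit head-dependent edges with a small recursive parser. The sketch below is an illustration only, not STARK's or dep_search's own parser. It assumes an interpretation inferred from the examples on this page: within one level, every node followed by < and every node preceded by > attaches to the single remaining node (the head), and a parenthesized group is represented by its own head. The names `parse_tree` and `_parse_level` are hypothetical.

```python
import re

# Tokens: parentheses, operators with an optional label (e.g. "<amod", ">"), node names.
TOKEN = re.compile(r"\(|\)|[<>][^\s()<>]*|[^\s()<>]+")


def parse_tree(description):
    """Return (head, edges); each edge is a (head, label, dependent) tuple."""
    tokens = TOKEN.findall(description)
    head, edges, pos = _parse_level(tokens, 0)
    if pos != len(tokens):
        raise ValueError("unbalanced parentheses")
    return head, edges


def _parse_level(tokens, pos):
    items, ops, edges = [], [], []
    expect_item = True
    while pos < len(tokens) and tokens[pos] != ")":
        tok = tokens[pos]
        if expect_item:
            if tok == "(":
                # A group contributes its inner edges and is represented by its head.
                sub_head, sub_edges, pos = _parse_level(tokens, pos + 1)
                if pos >= len(tokens) or tokens[pos] != ")":
                    raise ValueError("missing closing parenthesis")
                items.append(sub_head)
                edges.extend(sub_edges)
            elif tok[0] in "<>":
                raise ValueError(f"expected a node, found operator {tok!r}")
            else:
                items.append(tok)
        else:
            if tok[0] not in "<>":
                raise ValueError(f"expected an operator, found {tok!r}")
            ops.append(tok)
        pos += 1
        expect_item = not expect_item
    if len(items) != len(ops) + 1:
        raise ValueError("empty group or dangling operator")

    # The head follows all "<" operators and precedes all ">" operators.
    head_idx = 0
    while head_idx < len(ops) and ops[head_idx].startswith("<"):
        head_idx += 1
    if any(op.startswith("<") for op in ops[head_idx:]):
        raise ValueError("no single head at this level")
    head = items[head_idx]
    for i, item in enumerate(items):
        if i < head_idx:
            edges.append((head, ops[i][1:], item))
        elif i > head_idx:
            edges.append((head, ops[i - 1][1:], item))
    return head, edges, pos


print(parse_tree("ADP <case DET <det NOUN"))
print(parse_tree("read > (newspapers > interesting)"))
```

Edges identify nodes by name only, so two nodes with the same name (for example two NOUN nodes) are not distinguished; a fuller implementation would assign each node a position index.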

Acknowledgment

This tool was developed by Luka Krsnik in collaboration with Kaja Dobrovoljc and Marko Robnik Šikonja. Financial and infrastructural support was provided by the Slovenian Research and Innovation Agency, CLARIN.SI and CJVT UL as part of the research projects SPOT: A Treebank-Driven Approach to the Study of Spoken Slovenian (Z6-4617) and Language Resources and Technologies for Slovene (P6-0411), as well as through the 2019 and 2024 CLARIN.SI Resource and Service Development grants.
