Dependency parse searching
Dependency parse searching for CONLL-U DataFrames
Note: this tool currently doesn't have tests, CI, etc. It is not yet advised to use this tool outside of the depgrep methods provided by the
pip install depgrep
The tool is designed to work with corpora made from CONLL-U files and parsed into DataFrames by buzz. The best thing to do is use buzz to model corpora, and then use its depgrep method.
pip install buzz
Then, in Python:
from buzz import Corpus
corpus = Corpus('path/to/conll/files')
query = 'l"have"'  # match the lemma "have"
depgrep searches work through a combination of nodes and relations, just like Tgrep2, on which this tool is based.
A node targets one token feature (word, lemma, POS, wordclass, dependency role, etc). It may be specified as a regular expression or a simple string match:
f/amod|nsubj/ will match tokens filling the nsubj or amod role;
l"be" will match the lemma, be.
The first part of the node query chooses which token attribute is to be searched. It can be any of:
w : word
l : lemma
p : part of speech tag
x : wordclass / XPOS
f : dependency role
i : index in sentence
s : sentence number
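To make the node syntax concrete, here is a minimal sketch of how a node expression could be split into its attribute and its pattern. It is an illustration only, not depgrep's actual parser: the names `ATTRIBUTES`, `NODE_RE` and `parse_node` are hypothetical.

```python
import re

# Attribute letters as listed above (the mapping itself is illustrative).
ATTRIBUTES = {
    "w": "word",
    "l": "lemma",
    "p": "part of speech tag",
    "x": "wordclass / XPOS",
    "f": "dependency role",
    "i": "index in sentence",
    "s": "sentence number",
}

# One attribute letter, then either /regex/ or "string".
NODE_RE = re.compile(r'^([A-Za-z])(?:/(.*)/|"(.*)")$')


def parse_node(node):
    """Split a node such as f/amod|nsubj/ into (attribute, pattern, is_regex)."""
    m = NODE_RE.match(node)
    if m is None or m.group(1).lower() not in ATTRIBUTES:
        raise ValueError(f"not a node expression: {node!r}")
    letter, regex, literal = m.groups()
    name = ATTRIBUTES[letter.lower()]
    if regex is not None:
        return name, regex, True
    return name, literal, False


print(parse_node('f/amod|nsubj/'))
print(parse_node('l"be"'))
```

The slash-delimited form is treated as a regular expression and the quoted form as a simple string, mirroring the two node styles described above.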
Case sensitivity is controlled by the case of the attribute you are searching:
p/VB/ is case-insensitive, and
P/VB/ is case sensitive. Therefore, the following query matches words ending in ing, ING, Ing, etc:
w/ing$/
For case-insensitivity across the query, use the
case_sensitive=False keyword argument.
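The case rule can be modelled in a few lines. The sketch below is not depgrep's implementation: `node_matches` is a hypothetical helper, tokens are plain dicts keyed by attribute letter, and it assumes that regex nodes use search semantics while quoted nodes must equal the whole value. Its `case_sensitive` parameter only imitates the keyword argument mentioned above.

```python
import re


def node_matches(node, token, case_sensitive=True):
    """Return True if a node such as w/ing$/ or P"VB" matches the token dict."""
    attr, body = node[0], node[1:]
    # A lowercase attribute letter means case-insensitive matching;
    # case_sensitive=False forces it for every node.
    ignore_case = attr.islower() or not case_sensitive
    value = str(token[attr.lower()])
    pattern = body[1:-1]  # strip the surrounding / / or " "
    if body.startswith("/"):
        flags = re.IGNORECASE if ignore_case else 0
        # Assumption: regex nodes search anywhere in the value.
        return re.search(pattern, value, flags) is not None
    # Assumption: quoted nodes compare against the whole value.
    if ignore_case:
        return value.lower() == pattern.lower()
    return value == pattern


token = {"w": "RUNNING", "p": "VBG"}
print(node_matches('w/ing$/', token))                        # lowercase w
print(node_matches('W/ing$/', token))                        # uppercase W
print(node_matches('W/ing$/', token, case_sensitive=False))  # forced insensitive
```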
Relations specify the relationship between nodes. For example, we can use
f"nsubj" <- f"ROOT" to locate nominal subjects governed by nodes in the role of ROOT. The thing you want to find is the leftmost node in the query. So, while the above query finds nominal subject tokens, you could use the inverse relation,
f"ROOT" -> f"nsubj" to return the ROOT tokens.
a = b : a and b are the same node
a & b : a and b are the same node (same as =)
a <- b : a is a dependent of b
a <<- b : a is a descendant of b, with any distance in between
a <-: b : a is the only dependent of b
a <-N b : a is a descendant of b by N generations
a -> b : a is the governor of b
a ->> b : a is an ancestor of b, with any distance in between
a ->: b : a is the only governor of b (as is normal in many grammars)
a ->N b : a is an ancestor of b by N generations
a + b : a is immediately to the left of b
a +N b : a is N places to the left of b
a <| b : a is left of b, with any distance in between
a - b : a is immediately to the right of b
a -N b : a is N places to the right of b
a |> b : a is right of b, with any distance in between
a $ b : a and b share a governor (i.e. are sisters)
a $> b : a is a sister of and to the right of b
a $< b : a is a sister of and to the left of b
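A few of these relations can be expressed as simple predicates over a toy sentence. This is a sketch for intuition only, not how depgrep evaluates queries: each token is a dict with a hypothetical `i` (index in sentence) and `g` (index of its governor, 0 for the root), and the function names are made up for the example.

```python
# "The quick dog barks", as hypothetical token dicts.
SENT = [
    {"i": 1, "w": "The", "f": "det", "g": 3},
    {"i": 2, "w": "quick", "f": "amod", "g": 3},
    {"i": 3, "w": "dog", "f": "nsubj", "g": 4},
    {"i": 4, "w": "barks", "f": "ROOT", "g": 0},
]
THE, QUICK, DOG, BARKS = SENT


def is_dependent(a, b):
    """a <- b : a is a dependent of b."""
    return a["g"] == b["i"]


def is_governor(a, b):
    """a -> b : a is the governor of b."""
    return b["g"] == a["i"]


def is_descendant(a, b, sent):
    """a <<- b : walk up from a through its governors until b or the root."""
    by_index = {t["i"]: t for t in sent}
    g = a["g"]
    while g != 0:
        if g == b["i"]:
            return True
        g = by_index[g]["g"]
    return False


def immediately_left(a, b):
    """a + b : a is immediately to the left of b."""
    return a["i"] + 1 == b["i"]


def sisters(a, b):
    """a $ b : a and b are different tokens sharing a governor."""
    return a["i"] != b["i"] and a["g"] == b["g"]


print(is_dependent(THE, DOG), is_dependent(THE, BARKS))
print(is_descendant(THE, BARKS, SENT))
print(sisters(THE, QUICK), immediately_left(THE, QUICK))
```

Note how "The" is a descendant of "barks" but not its direct dependent, which is the difference between <<- and <-.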
Add ! before a relation to negate it:
f"ROOT" != x"VERB" will find non-verbal ROOT nodes.
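Read as a filter, the negated query keeps tokens that satisfy the left node but for which the relation to the right node does not hold. A minimal sketch with hypothetical token dicts (not depgrep's own evaluation):

```python
# Hypothetical tokens; x is the wordclass and f the dependency role.
TOKENS = [
    {"w": "barks", "x": "VERB", "f": "ROOT"},
    {"w": "Silence", "x": "NOUN", "f": "ROOT"},
    {"w": "dog", "x": "NOUN", "f": "nsubj"},
]


def non_verbal_roots(tokens):
    """Model of f"ROOT" != x"VERB": ROOT tokens that are not also VERB tokens."""
    return [t["w"] for t in tokens if t["f"] == "ROOT" and not t["x"] == "VERB"]


print(non_verbal_roots(TOKENS))
```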
Brackets can be used to make more complex queries:
f"amod" = l/^[abc]/ <- (f/nsubj/ != x/NOUN/)
The above translates to: match adjectival modifiers starting with a, b or c that are governed by nominal subjects that are not nouns.
Note that without brackets, each relation/node refers to the leftmost node. In the following, the plural noun must be the same node as the nsubj, not the ROOT:
f"nsubj" <- f"ROOT" = p"NNS"
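The difference brackets make can be seen by writing both readings out by hand. The sketch below uses hypothetical token dicts (`g` is the governor's index) and hand-written filters rather than depgrep itself. The bracketed variant, f"nsubj" <- (f"ROOT" = p"NNS"), is an illustrative query built from the bracket rule described above.

```python
# "Dogs bark", as hypothetical token dicts.
SENT = [
    {"i": 1, "w": "Dogs", "p": "NNS", "f": "nsubj", "g": 2},
    {"i": 2, "w": "bark", "p": "VBP", "f": "ROOT", "g": 0},
]


def governor_of(tok, sent):
    """Return the token that governs tok, or None for the root."""
    return next((t for t in sent if t["i"] == tok["g"]), None)


def flat_query(sent):
    """f"nsubj" <- f"ROOT" = p"NNS": both relations constrain the leftmost node."""
    out = []
    for tok in sent:
        gov = governor_of(tok, sent)
        if (tok["f"] == "nsubj"
                and gov is not None and gov["f"] == "ROOT"
                and tok["p"] == "NNS"):  # the nsubj itself must be plural
            out.append(tok["w"])
    return out


def bracketed_query(sent):
    """f"nsubj" <- (f"ROOT" = p"NNS"): the = relation now constrains the ROOT."""
    out = []
    for tok in sent:
        gov = governor_of(tok, sent)
        if (tok["f"] == "nsubj"
                and gov is not None and gov["f"] == "ROOT"
                and gov["p"] == "NNS"):  # the ROOT must be plural
            out.append(tok["w"])
    return out


print(flat_query(SENT))
print(bracketed_query(SENT))
```

The flat reading finds "Dogs" because the subject is plural; the bracketed reading finds nothing because the ROOT is a verb.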
You can use the pipe (
|) to create an OR expression.
# match all kinds of modifiers
x"ADJ" | f"amod" | f"appos" | p/^JJ/
x"NOUN" <- f"ROOT" | = p"NNS"
Above, we match nouns that are either governed by ROOT, or are plural.
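The second query can be read as "a noun for which at least one of the alternatives holds". A hand-written sketch of that reading, using hypothetical token dicts (`g` is the governor's index) rather than depgrep itself:

```python
# "Dogs chase the tail of the cat", as hypothetical token dicts.
FIELDS = ("i", "w", "x", "p", "f", "g")
ROWS = [
    (1, "Dogs", "NOUN", "NNS", "nsubj", 2),
    (2, "chase", "VERB", "VBP", "ROOT", 0),
    (3, "the", "DET", "DT", "det", 4),
    (4, "tail", "NOUN", "NN", "obj", 2),
    (5, "of", "ADP", "IN", "case", 7),
    (6, "the", "DET", "DT", "det", 7),
    (7, "cat", "NOUN", "NN", "nmod", 4),
]
SENT = [dict(zip(FIELDS, row)) for row in ROWS]


def governed_by_root(tok, sent):
    """True if tok's governor fills the ROOT role."""
    return any(t["i"] == tok["g"] and t["f"] == "ROOT" for t in sent)


def nouns_under_root_or_plural(sent):
    """Model of x"NOUN" <- f"ROOT" | = p"NNS"."""
    return [
        tok["w"]
        for tok in sent
        if tok["x"] == "NOUN"
        and (governed_by_root(tok, sent) or tok["p"] == "NNS")
    ]


print(nouns_under_root_or_plural(SENT))
```

Here "Dogs" satisfies both alternatives, "tail" only the first, and "cat" neither, so "cat" is left out.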
You can use
__ or * to stand in for any token. To match any token that is the governor of a verb, do:
__ -> x"VERB"