Tregex written in Python
Tregex is a Java program for identifying patterns in constituency trees. PyTregex is a Python implementation of Tregex.
Usage
Command-line
Install PyTregex with pip and run it with python -m pytregex.
$ pip install pytregex
$ echo '(NP(DT The)(NN battery)(NN plant))' | python -m pytregex pattern 'NP < NN' -filter
# (NP (DT The) (NN battery) (NN plant))
# (NP (DT The) (NN battery) (NN plant))
# There were 2 matches in total.
$ echo '(NP(DT The)(NN battery)(NN plant))' > trees.txt
$ python -m pytregex pattern 'NP < NN' ./trees.txt
# (NP (DT The) (NN battery) (NN plant))
# (NP (DT The) (NN battery) (NN plant))
# There were 2 matches in total.
$ python -m pytregex pattern 'NP < NN' -C ./trees.txt
# 2
$ python -m pytregex pattern 'NP < NN=a' -h a ./trees.txt
# (NN battery)
# (NN plant)
# There were 2 matches in total.
$ python -m pytregex explain '<'
# 'A < B' means A immediately dominates B
$ python -m pytregex pprint '(NP(DT The)(NN battery)(NN plant))'
# NP
# ├── DT
# │   └── The
# ├── NN
# │   └── battery
# └── NN
#     └── plant
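The two matches reported for 'NP < NN' above follow from the meaning of '<': each (parent, child) pair that satisfies the relation counts as one match, so the NP is printed once per NN child. The sketch below illustrates that idea with the standard library only. It is not PyTregex's implementation; Node, parse, walk, and immediately_dominates are hypothetical names introduced for this example.

```python
import re
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical minimal tree node (not a PyTregex class)."""
    label: str
    children: List["Node"] = field(default_factory=list)

    def __str__(self) -> str:
        if not self.children:
            return self.label
        return "({} {})".format(self.label, " ".join(str(c) for c in self.children))


def parse(text: str) -> Node:
    """Parse a single bracketed tree such as '(NP(DT The)(NN battery))'."""
    tokens = re.findall(r"[()]|[^\s()]+", text)
    root: Optional[Node] = None
    stack: List[Node] = []
    opening = False  # True right after '(': the next token is a phrase label
    for tok in tokens:
        if tok == "(":
            opening = True
        elif tok == ")":
            stack.pop()
        else:
            node = Node(tok)
            if stack:
                stack[-1].children.append(node)
            elif root is None:
                root = node
            if opening:
                stack.append(node)
                opening = False
    if root is None or stack:
        raise ValueError("empty or unbalanced tree")
    return root


def walk(node: Node) -> Iterator[Node]:
    """Yield the node and all of its descendants in pre-order."""
    yield node
    for child in node.children:
        yield from walk(child)


def immediately_dominates(root: Node, a: str, b: str) -> Iterator[Tuple[Node, Node]]:
    """Yield one (parent, child) pair per match of 'a < b'."""
    for node in walk(root):
        if node.label == a:
            for child in node.children:
                if child.label == b:
                    yield node, child


tree = parse("(NP(DT The)(NN battery)(NN plant))")
pairs = list(immediately_dominates(tree, "NP", "NN"))
for parent, child in pairs:
    print(parent, "<", child)
print("There were {} matches in total.".format(len(pairs)))
```

Exact label comparison is a simplification; real Tregex patterns also allow regular expressions, wildcards, and many other relations.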
Inline
from pytregex.tregex import TregexPattern

# An NP that immediately dominates an NN; the NN is named "a"
tre = TregexPattern("NP < NN=a")
# One matched node is returned per successful match
matches = tre.findall("(NP(DT The)(NN battery)(NN plant))")
# Nodes bound to the name "a" during matching
handles = tre.get_nodes("a")
print("matched nodes:\n{}\n".format("\n".join(str(m) for m in matches)))
print("named nodes:\n{}".format("\n".join(str(h) for h in handles)))
# Output:
# matched nodes:
# (NP (DT The) (NN battery) (NN plant))
# (NP (DT The) (NN battery) (NN plant))
#
# named nodes:
# (NN battery)
# (NN plant)
See tests for more examples.
Differences from Tregex
Tregex is whitespace-sensitive: it distinguishes between | and ␣|␣ (␣ denotes a space). PyTregex ignores whitespace and uses different symbols in place of ␣|␣.
| | Tregex | PyTregex |
|---|---|---|
| node disjunction | A\|B | A\|B or A␣\|␣B |
| condition disjunction | A<B␣\|␣<C | A<B␣\|\|␣<C or A<B\|\|<C |
| expression disjunction | A␣\|␣B | N/A |
| expression separation | N/A | A;B or A␣;␣B |
In the table above, expression disjunction and expression separation differ in whether evaluation stops as soon as the result is known. For example, with the Tregex pattern NP=a | NNP=b, if NP matches, b is not assigned even if the tree contains an NNP. With the PyTregex pattern NP=a ; NNP=b, b is assigned whenever an NNP is found, regardless of whether NP matches.
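The contrast between short-circuit disjunction and separation can be sketched in a few lines of plain Python. This is only an illustration of the evaluation order described here, not code from Tregex or PyTregex: the tree is reduced to a list of node labels, and match_disjunction and match_separation are hypothetical helpers.

```python
from typing import Dict, List, Tuple

# Each expression is (label to look for, name to bind on success).
Expression = Tuple[str, str]


def match_disjunction(labels: List[str], exprs: List[Expression]) -> Dict[str, str]:
    """Short-circuit: stop at the first expression that matches."""
    bindings: Dict[str, str] = {}
    for label, name in exprs:
        if label in labels:
            bindings[name] = label
            break  # result is known, later expressions are never evaluated
    return bindings


def match_separation(labels: List[str], exprs: List[Expression]) -> Dict[str, str]:
    """No short-circuit: evaluate every expression independently."""
    bindings: Dict[str, str] = {}
    for label, name in exprs:
        if label in labels:
            bindings[name] = label
    return bindings


labels = ["NP", "NNP"]  # a tree containing both an NP and an NNP
exprs = [("NP", "a"), ("NNP", "b")]

print(match_disjunction(labels, exprs))  # {'a': 'NP'}
print(match_separation(labels, exprs))   # {'a': 'NP', 'b': 'NNP'}
```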
Missing features
Backreferencing
$ tree='(NP NP , NP ,)'
$ pattern='(@NP <, (@NP $+ (/,/ $+ (@NP $+ /,/=comma))) <- =comma)'
$ echo "$tree" | tregex.sh "$pattern" -filter -s 2>/dev/null
# (NP NP , NP ,)
$ echo "$tree" | python -m pytregex pattern "$pattern" -filter
# (@NP <, (@NP $+ (/,/ $+ (@NP $+ /,/=comma))) <- =comma)
# ˄
# Parsing error at token '='
Headfinders
PyTregex currently has only one HeadFinder, which is for English. If your patterns target trees in other languages and contain <#, >#, <<#, or >>#, they may not work as expected.
Variable groups
$ tree='(SBAR (WHNP-11 (WP who)) (S (NP-SBJ (-NONE- *T*-11)) (VP (VBD resigned))))'
$ pattern='@SBAR < /^WH.*-([0-9]+)$/#1%index << (__=empty < (/^-NONE-/ < /^\*T\*-([0-9]+)$/#1%index))'
$ echo "$tree" | tregex.sh "$pattern" -filter 2>/dev/null
# (SBAR
# (WHNP-11 (WP who))
# (S
# (NP-SBJ (-NONE- *T*-11))
# (VP (VBD resigned))))
$ echo "$tree" | python -m pytregex pattern "$pattern" -filter
# Tokenization error: Illegal character "#"
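For readers unfamiliar with variable groups: in the Tregex pattern above, #1%index binds capture group 1 of the node regex to a variable named index, and a later use of the same variable must capture the same string. The sketch below mimics that check for the two labels in the example using only Python's re module. It is not a PyTregex workaround, and shared_index is a hypothetical helper.

```python
import re
from typing import Optional

# The two node regexes from the pattern above, each with one capture group.
WH_LABEL = re.compile(r"^WH.*-([0-9]+)$")
TRACE_LABEL = re.compile(r"^\*T\*-([0-9]+)$")


def shared_index(wh_label: str, trace_label: str) -> Optional[str]:
    """Return the index if both labels match and capture the same group 1."""
    wh = WH_LABEL.match(wh_label)
    trace = TRACE_LABEL.match(trace_label)
    if wh is None or trace is None:
        return None
    variables = {"index": wh.group(1)}  # first use binds the variable
    if trace.group(1) != variables["index"]:  # later use must agree
        return None
    return variables["index"]


print(shared_index("WHNP-11", "*T*-11"))  # 11
print(shared_index("WHNP-11", "*T*-12"))  # None
```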
Acknowledgments
Thanks to Galen Andrew, Roger Levy, Anna Rafferty, and John Bauer for their work on Tregex. One-third of PyTregex's code is translated directly from Tregex.
This program uses David Beazley's PLY (Python Lex-Yacc) for pattern tokenization and parsing.