Tregex written in Python

Tregex is a Java program for identifying patterns in constituency trees. PyTregex is a Python implementation of Tregex.

Usage

Command-line

Install it with pip install pytregex and run it with python -m pytregex.

$ pip install pytregex

$ echo '(NP(DT The)(NN battery)(NN plant))' | python -m pytregex pattern 'NP < NN' -filter
# (NP (DT The) (NN battery) (NN plant))
# (NP (DT The) (NN battery) (NN plant))
# There were 2 matches in total.

$ echo '(NP(DT The)(NN battery)(NN plant))' > trees.txt
$ python -m pytregex pattern 'NP < NN' ./trees.txt
# (NP (DT The) (NN battery) (NN plant))
# (NP (DT The) (NN battery) (NN plant))
# There were 2 matches in total.

$ python -m pytregex pattern 'NP < NN' -C ./trees.txt
# 2

$ python -m pytregex pattern 'NP < NN=a' -h a ./trees.txt
# (NN battery)
# (NN plant)
# There were 2 matches in total.

$ python -m pytregex explain '<'
# 'A < B' means A immediately dominates B

$ python -m pytregex pprint '(NP(DT The)(NN battery)(NN plant))'
# NP
# ├── DT
# │   └── The
# ├── NN
# │   └── battery
# └── NN
#     └── plant
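
The two matches reported for 'NP < NN' above follow from the meaning of '<': the NP immediately dominates two NN children, so the same NP is reported once per child. The sketch below illustrates that counting with a small standalone parser and matcher. It is not PyTregex's implementation; parse_tree and immediately_dominates are hypothetical helpers written only for this example.

```python
def parse_tree(text):
    """Parse a bracketed tree into (label, children) tuples; leaves have no children."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1  # consume "("
            label = tokens[pos]
            pos += 1
            children = []
            while tokens[pos] != ")":
                children.append(parse())
            pos += 1  # consume ")"
            return (label, children)
        leaf = tokens[pos]
        pos += 1
        return (leaf, [])

    return parse()


def immediately_dominates(tree, parent_label, child_label):
    """Yield one (parent, child) pair per direct child carrying child_label."""
    label, children = tree
    for child in children:
        if label == parent_label and child[0] == child_label:
            yield tree, child
        # Keep searching deeper in the tree for further matches.
        yield from immediately_dominates(child, parent_label, child_label)


tree = parse_tree("(NP(DT The)(NN battery)(NN plant))")
pairs = list(immediately_dominates(tree, "NP", "NN"))
print(len(pairs))  # 2: the same NP matches once per NN child
```

Each pair holds the same NP as its parent, which is why the command-line output prints the NP twice.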

Inline

from pytregex.tregex import TregexPattern

# Compile the pattern; "=a" names each matched NN node "a".
tre = TregexPattern("NP < NN=a")
# findall() returns the matched node once per match (twice here, once per NN).
matches = tre.findall("(NP(DT The)(NN battery)(NN plant))")
# get_nodes() returns the nodes bound to the name "a" across those matches.
handles = tre.get_nodes("a")
print("matched nodes:\n{}\n".format("\n".join(str(m) for m in matches)))
print("named nodes:\n{}".format("\n".join(str(h) for h in handles)))

# Output:
# matched nodes:
# (NP (DT The) (NN battery) (NN plant))
# (NP (DT The) (NN battery) (NN plant))
#
# named nodes:
# (NN battery)
# (NN plant)

See tests for more examples.

Differences from Tregex

Tregex is whitespace-sensitive: it distinguishes between | and ␣|␣. PyTregex ignores whitespace and uses different symbols in place of ␣|␣.

                        Tregex      PyTregex
node disjunction        A|B         A|B or A␣|␣B
condition disjunction   A<B␣|␣<C    A<B␣||␣<C or A<B||<C
expression disjunction  A␣|␣B       N/A
expression separation   N/A         A;B or A␣;␣B

In the table above, expression disjunction and expression separation differ in whether evaluation short-circuits, that is, whether "expressions stop evaluating as soon as the result is known." For example, with the Tregex pattern NP=a | NNP=b, if NP matches, b is not assigned even if the tree contains an NNP. With the PyTregex pattern NP=a ; NNP=b, b is assigned whenever an NNP is found, regardless of whether NP matches.
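
The short-circuit distinction can be sketched in plain Python. This is a toy model, not code from Tregex or PyTregex: a "tree" is reduced to a flat list of node labels, and find_label, disjunction, and separation are hypothetical helpers that exist only for this illustration.

```python
def find_label(nodes, label):
    """Return the first node equal to `label`, or None (hypothetical helper)."""
    return next((n for n in nodes if n == label), None)


def disjunction(nodes, alternatives):
    """Short-circuit: stop at the first alternative that matches."""
    handles = {}
    for name, label in alternatives:
        node = find_label(nodes, label)
        if node is not None:
            handles[name] = node
            break  # later alternatives are never evaluated
    return handles


def separation(nodes, expressions):
    """No short-circuit: evaluate every expression independently."""
    handles = {}
    for name, label in expressions:
        node = find_label(nodes, label)
        if node is not None:
            handles[name] = node
    return handles


labels = ["NP", "NNP", "VP"]
pattern = [("a", "NP"), ("b", "NNP")]
print(disjunction(labels, pattern))  # {'a': 'NP'}
print(separation(labels, pattern))   # {'a': 'NP', 'b': 'NNP'}
```

With disjunction, b stays unassigned once NP has matched; with separation, b is assigned whether or not NP is present.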

Missing features

Backreferencing

$ tree='(NP NP , NP ,)'
$ pattern='(@NP <, (@NP $+ (/,/ $+ (@NP $+ /,/=comma))) <- =comma)' 

$ echo "$tree" | tregex.sh "$pattern" -filter -s 2>/dev/null
# (NP NP , NP ,)

$ echo "$tree" | python -m pytregex pattern "$pattern" -filter
# (@NP <, (@NP $+ (/,/ $+ (@NP $+ /,/=comma))) <- =comma)
#                                              ˄
# Parsing error at token '='
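
In the pattern above, the trailing =comma is a backreference: the last child of the outer NP must be the very node that was named comma earlier, not merely another comma. The sketch below hand-codes that one check on a list of child labels, using list position as a stand-in for node identity. It is an illustration only; matches_backreference_pattern is a hypothetical helper, and @NP is simplified to an exact label comparison.

```python
def matches_backreference_pattern(children):
    """Hand-coded check of '<, (@NP $+ (/,/ $+ (@NP $+ /,/=comma))) <- =comma'
    against the child labels of a single node."""
    if len(children) < 4:
        return False
    sequence_ok = (
        children[0] == "NP"      # <, @NP : first child is an NP (exact-label simplification)
        and "," in children[1]   # $+ /,/ : immediately followed by a comma
        and children[2] == "NP"  # $+ @NP : then another NP
        and "," in children[3]   # $+ /,/=comma : this node is named "comma"
    )
    comma_position = 3
    # <- =comma : the last child must be the same node as "comma",
    # modelled here as the same position in the child list.
    return sequence_ok and comma_position == len(children) - 1


print(matches_backreference_pattern(["NP", ",", "NP", ","]))       # True
print(matches_backreference_pattern(["NP", ",", "NP", ",", ","]))  # False
```

The second call fails even though the last child is a comma, because it is a different node from the one named comma.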

Headfinders

PyTregex currently has only one HeadFinder, which is for English. Patterns that contain <#, >#, <<#, or >># may not work as expected on trees of other languages.

Variable groups

$ tree='(SBAR (WHNP-11 (WP who)) (S (NP-SBJ (-NONE- *T*-11)) (VP (VBD resigned))))' 
$ pattern='@SBAR < /^WH.*-([0-9]+)$/#1%index << (__=empty < (/^-NONE-/ < /^\*T\*-([0-9]+)$/#1%index))' 

$ echo "$tree" | tregex.sh "$pattern" -filter 2>/dev/null
# (SBAR
#   (WHNP-11 (WP who))
#   (S
#     (NP-SBJ (-NONE- *T*-11))
#     (VP (VBD resigned))))

$ echo "$tree" | python -m pytregex pattern "$pattern" -filter
# Tokenization error: Illegal character "#"
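
In the Tregex pattern above, #1%index binds capture group 1 of each regular expression to the variable index, and the pattern matches only when both groups capture the same string. Since PyTregex does not tokenize this syntax, one possible workaround is to compare the captured indices in Python after matching. The sketch below uses only the standard re module on label strings; indices_agree is a hypothetical helper, and how the labels are obtained from matched nodes is left out.

```python
import re

# The same regular expressions as in the Tregex pattern, minus the #1%index suffix.
WH_LABEL = re.compile(r"^WH.*-([0-9]+)$")
TRACE_LEAF = re.compile(r"^\*T\*-([0-9]+)$")


def indices_agree(wh_label, trace_leaf):
    """Return True when both labels match and capture the same index."""
    wh = WH_LABEL.match(wh_label)
    trace = TRACE_LEAF.match(trace_leaf)
    return bool(wh and trace and wh.group(1) == trace.group(1))


print(indices_agree("WHNP-11", "*T*-11"))  # True
print(indices_agree("WHNP-11", "*T*-12"))  # False
```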

Acknowledgments

Thanks to Galen Andrew, Roger Levy, Anna Rafferty, and John Bauer for their work on Tregex. One-third of PyTregex's code is translated directly from Tregex.

This program uses David Beazley's PLY (Python Lex-Yacc) for pattern tokenization and parsing.
