Reverse engineer patterns for use with the SpaCy DependencyTreeMatcher
SpaCy Pattern Builder
Use training examples to build and refine patterns for use with SpaCy's DependencyTreeMatcher.
Motivation
Generating patterns programmatically from training data is more efficient than creating them manually.
Installation
With pip:
pip install spacy-pattern-builder
Usage
# Import a SpaCy model, parse a string to create a Doc object
import en_core_web_sm
text = 'We introduce efficient methods for fitting Boolean models to molecular data.'
nlp = en_core_web_sm.load()
doc = nlp(text)
from spacy_pattern_builder import build_dependency_pattern
# Provide a list of tokens we want to match.
match_tokens = [doc[i] for i in [0, 1, 3]] # [We, introduce, methods]
''' Note that these tokens must be fully connected. That is,
all tokens must have a path to all other tokens in the list,
without needing to traverse tokens outside of the list.
Otherwise, spacy-pattern-builder will raise a TokensNotFullyConnectedError.
You can get a connected set that includes your tokens with the following: '''
from spacy_pattern_builder import util
connected_tokens = util.smallest_connected_subgraph(match_tokens, doc)
assert match_tokens == connected_tokens
# Specify the token attributes / features to use
feature_dict = {  # Equivalent to the default feature_dict
    'DEP': 'dep_',
    'TAG': 'tag_'
}
# Build the pattern
pattern = build_dependency_pattern(doc, match_tokens, feature_dict=feature_dict)
from pprint import pprint
pprint(pattern) # In the format consumed by SpaCy's DependencyTreeMatcher:
'''
[{'PATTERN': {'DEP': 'ROOT', 'TAG': 'VBP'}, 'SPEC': {'NODE_NAME': 'node1'}},
 {'PATTERN': {'DEP': 'nsubj', 'TAG': 'PRP'},
  'SPEC': {'NBOR_NAME': 'node1', 'NBOR_RELOP': '>', 'NODE_NAME': 'node0'}},
 {'PATTERN': {'DEP': 'dobj', 'TAG': 'NNS'},
  'SPEC': {'NBOR_NAME': 'node1', 'NBOR_RELOP': '>', 'NODE_NAME': 'node3'}}]
'''
# Create a matcher and add the newly generated pattern
# (the None argument fills the on_match callback slot)
from spacy.matcher import DependencyTreeMatcher
matcher = DependencyTreeMatcher(doc.vocab)
matcher.add('pattern', None, pattern)
# And match away
matches = matcher(doc)
for match_id, token_idxs in matches:
    tokens = [doc[i] for i in token_idxs]
    tokens = sorted(tokens, key=lambda w: w.i)
    print(tokens)  # [We, introduce, methods]
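The "fully connected" requirement mentioned in the usage example can be illustrated without SpaCy. The following is a minimal sketch, not the library's implementation: the function name `is_fully_connected` and the `heads` list are hypothetical, and the head indices are written by hand for a shortened toy sentence rather than taken from parser output. It assumes each token stores the index of its syntactic head, with the root pointing to itself.

```python
from collections import deque


def is_fully_connected(selected, heads):
    """Return True if the selected token indices form one connected
    subgraph of the dependency tree, using only edges between selected tokens.

    heads[i] is the index of token i's head; the root points to itself.
    (Hypothetical helper for illustration only.)
    """
    selected = set(selected)
    if not selected:
        return True
    # Undirected adjacency, restricted to the selected tokens.
    neighbours = {i: set() for i in selected}
    for i in selected:
        h = heads[i]
        if h != i and h in selected:
            neighbours[i].add(h)
            neighbours[h].add(i)
    # Breadth-first search from an arbitrary selected token.
    start = next(iter(selected))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours[node] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return seen == selected


# Hand-written heads for the toy sentence "We introduce efficient methods":
# We -> introduce, introduce -> itself (root),
# efficient -> methods, methods -> introduce
heads = [1, 1, 3, 1]

print(is_fully_connected([0, 1, 3], heads))  # True: We, introduce, methods
print(is_fully_connected([0, 2], heads))     # False: We, efficient
```

In the second call, the only path between "We" and "efficient" runs through "introduce" and "methods", which are outside the list. That is the situation in which the README says a TokensNotFullyConnectedError is raised.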
Acknowledgements
Uses: