A spaCy Matcher designed specifically for matching English idioms
idiomatch
An implementation of spaCy 3.0's Matcher, designed specifically for identifying English idioms.
Quick Start
Install the library via uv (or whichever package manager you prefer):

```bash
uv add idiomatch
```
```python
import spacy
from idiomatch import Idiomatcher


def main():
    sent = "The floodgates will remain opened for a host of new lawsuits."  # a use case of *open the floodgates*
    nlp = spacy.load("en_core_web_sm")  # the idiom matcher needs an nlp pipeline; currently supports en_core_web_sm only
    idiomatcher = Idiomatcher.from_pretrained(nlp.vocab)  # this takes approximately 50 seconds
    doc = nlp(sent)  # process the sentence with the nlp pipeline
    print(idiomatcher(doc))  # identify the idiom in the sentence


if __name__ == '__main__':
    main()
```

```
adding patterns into idiom_matcher...: 100%|██████████| 2756/2756 [00:52<00:00, 52.83it/s]
[{'idiom': 'open the floodgates', 'span': 'The floodgates will remain opened', 'meta': (13612509636477658373, 0, 5)}]
```
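Each result is a dictionary with the keys `idiom`, `span` and `meta`. The `meta` tuple appears to follow the convention of spaCy's Matcher, that is `(match_id, start, end)` with `start` and `end` as token offsets into the doc. This reading is an assumption based on the output above, not documented behaviour. Under that assumption, a minimal sketch of recovering the matched text from `meta`, with a plain list of tokens standing in for a spaCy `Doc` and a hypothetical helper named `span_from_meta`:

```python
# Result copied from the Quick Start output above.
result = {
    "idiom": "open the floodgates",
    "span": "The floodgates will remain opened",
    "meta": (13612509636477658373, 0, 5),
}

# Stand-in for the tokens of the processed doc (assumption: one token per
# word, with the final full stop split off as its own token).
tokens = ["The", "floodgates", "will", "remain", "opened",
          "for", "a", "host", "of", "new", "lawsuits", "."]


def span_from_meta(tokens, meta):
    """Return the matched text, assuming meta is (match_id, start, end)."""
    _match_id, start, end = meta
    return " ".join(tokens[start:end])


recovered = span_from_meta(tokens, result["meta"])
print(recovered)
```

With a real `Doc`, slicing `doc[start:end]` yields the corresponding spaCy `Span`.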
Supported Idioms
The list of supported idioms can be found in idiomatch/resources/idioms.txt. A total of 2758 idioms are available for
matching. These "target idioms" were extracted from a vocabulary of the 5000 most
frequently used English idioms, made available for open use courtesy of IBM's SLIDE project.
Adding Idioms Yourself
If you have idioms that are not included in the list of supported idioms, you can add them to Idiomatcher
yourself with the add_idioms member method:
```python
import spacy
from idiomatch import Idiomatcher


def main():
    nlp = spacy.load("en_core_web_sm")
    idiomatcher = Idiomatcher.from_pretrained(nlp.vocab)  # instantiate
    # As a placeholder for an open slot, use one of: someone / something / someone's / one's
    idioms = ["have blood on one's hands", "on one's hands"]
    idiomatcher.add_idioms(nlp, idioms)  # this will train idiomatcher to identify the given idioms
    sent = "The leaders of this war have the blood of many thousands of people on their hands."
    doc = nlp(sent)
    print(idiomatcher(doc))


if __name__ == '__main__':
    main()
```

```
100%|██████████| 2/2 [00:00<00:00, 145.62it/s]
adding patterns into idiom_matcher...: 100%|██████████| 2/2 [00:00<00:00, 196.40it/s]
[{'idiom': "have blood on one's hands", 'span': 'have the blood of many thousands of people on their hands', 'meta': (5930902300252675198, 5, 16)}, {'idiom': "on one's hands", 'span': 'on their hands', 'meta': (8246625119345375174, 13, 16)}]
```
Supported Variations
English idioms vary extensively in form, in at least six different ways. Idiomatcher handles all of these
cases, as exemplified below:
| variation | example | result |
|---|---|---|
| modification | He called my blatant bluff | [{'idiom': "call someone's bluff", 'span': 'called my blatant bluff', 'meta': (11321959191976266509, 1, 5)}] |
| openslot | This will keep all of us posted | [{'idiom': 'keep someone posted', 'span': 'keep all of us posted', 'meta': (11722464987668971331, 2, 7)}] |
| hyphenated | That was one balls-out street race! | [{'idiom': 'balls-out', 'span': 'balls - out', 'meta': (2876800142358111704, 3, 6)}] |
| hyphen omitted | That was one balls out street race! | [{'idiom': 'balls-out', 'span': 'balls out', 'meta': (2876800142358111704, 3, 5)}] |
| passivisation (modification) | the floodgates are finally opened | [{'idiom': 'open the floodgates', 'span': 'the floodgates are finally opened', 'meta': (13612509636477658373, 0, 5)}] |
| passivisation (openslot) | my bluff was embarrassingly called by her | [{'idiom': "call someone's bluff", 'span': 'my bluff was embarrassingly called', 'meta': (11321959191976266509, 0, 5)}] |
| inclusion | If she dies, you will have her blood on your hands! | [{'idiom': "have blood on one's hands", 'span': 'have her blood on your hands', 'meta': (5930902300252675198, 6, 12)}, {'idiom': "on one's hands", 'span': 'on your hands', 'meta': (8246625119345375174, 9, 12)}] |
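As the inclusion row shows, a single sentence can yield several matches where a shorter idiom sits inside a longer one. If only the outermost idiom is wanted, the results can be post-filtered. The helper below, `drop_nested`, is a hypothetical sketch and not part of idiomatch; it assumes `meta` is `(match_id, start, end)` in token offsets.

```python
def drop_nested(results):
    """Keep only matches whose token range is not strictly inside another match."""
    kept = []
    for r in results:
        _, start, end = r["meta"]
        nested = any(
            other is not r
            and other["meta"][1] <= start
            and end <= other["meta"][2]
            # Require the enclosing match to be strictly longer, so that two
            # matches over the identical range are both kept.
            and (other["meta"][2] - other["meta"][1]) > (end - start)
            for other in results
        )
        if not nested:
            kept.append(r)
    return kept


# Results copied from the "inclusion" row of the table above.
results = [
    {"idiom": "have blood on one's hands", "span": "have her blood on your hands",
     "meta": (5930902300252675198, 6, 12)},
    {"idiom": "on one's hands", "span": "on your hands",
     "meta": (8246625119345375174, 9, 12)},
]

outermost = drop_nested(results)
print([r["idiom"] for r in outermost])  # ["have blood on one's hands"]
```

When working with spaCy `Span` objects rather than result dictionaries, `spacy.util.filter_spans` offers a comparable longest-span-first filter.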
How Does it Work?
The idiom-matching patterns, which are the foundation of Idiomatcher's flexibility, are heavily inspired by the brilliant work of Hughes et al. (2021) on Flexible Retrieval of Idiomatic Expressions from a Large Text Corpus.
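To give a feel for what a flexible pattern might look like, the sketch below builds a token pattern in the list-of-dictionaries format accepted by spaCy's `Matcher`. It is an illustration only, not Idiomatcher's actual pattern builder. The function name `build_pattern`, the choice to match on `LEMMA` (which assumes the idiom's words are already in lemma form), the treatment of a placeholder as one or more arbitrary tokens, and the optional wildcard between words are all assumptions made for this example.

```python
# Open-slot placeholders, as listed in "Adding Idioms Yourself".
PLACEHOLDERS = {"someone", "something", "someone's", "one's"}


def build_pattern(idiom, max_gap=1):
    """Build a spaCy-Matcher-style token pattern for an idiom (illustrative only).

    Assumptions: words are split on whitespace, each fixed word is matched by
    lemma, a placeholder matches one or more arbitrary tokens, and up to
    `max_gap` optional tokens may intervene between consecutive words.
    """
    words = idiom.lower().split()
    pattern = []
    for i, word in enumerate(words):
        if word in PLACEHOLDERS:
            pattern.append({"OP": "+"})  # open slot: one or more of any token
        else:
            pattern.append({"LEMMA": word})
        if i < len(words) - 1:
            # Optional wildcards leave room for modifiers such as "blatant".
            pattern.extend({"OP": "?"} for _ in range(max_gap))
    return pattern


pattern = build_pattern("call someone's bluff")
print(pattern)
```

A pattern like this could be registered with `Matcher.add`. Handling passivisation, hyphenation and inflected forms such as "posted" would need additional patterns beyond this sketch.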