A spaCy Matcher designed specifically for matching English idioms

idiomatch

An implementation of the spaCy 3.0 Matcher designed specifically for identifying English idioms.

Quick Start

Install the library with uv (or whichever package manager you prefer):

uv add idiomatch

Then run the matcher on a sentence:

import spacy
from idiomatch import Idiomatcher

def main():
    sent = "The floodgates will remain opened for a host of new lawsuits."  # an instance of *open the floodgates*
    nlp = spacy.load("en_core_web_sm")  # the idiom matcher needs an nlp pipeline; currently only en_core_web_sm is supported
    idiomatcher = Idiomatcher.from_pretrained(nlp.vocab)  # this takes approximately 50 seconds
    doc = nlp(sent)  # process the sentence with the nlp pipeline
    print(idiomatcher(doc))  # identify the idioms in the sentence


if __name__ == '__main__':
    main()

Output:

adding patterns into idiom_matcher...: 100%|██████████| 2756/2756 [00:52<00:00, 52.83it/s]
[{'idiom': 'open the floodgates', 'span': 'The floodgates will remain opened', 'meta': (13612509636477658373, 0, 5)}]
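
Each result is a dictionary with `idiom`, `span`, and `meta` keys. Judging from the output above, `meta` looks like spaCy's usual Matcher tuple of (match id, start token, end token) with an exclusive end index; that reading is an inference from the printed result, not documented behaviour. A minimal sketch of unpacking it follows. The `summarize` helper is hypothetical, and the result is pasted in as a literal so the snippet runs without spaCy or idiomatch installed:

```python
# The result printed above, copied in as a literal so this runs on its own.
results = [
    {
        "idiom": "open the floodgates",
        "span": "The floodgates will remain opened",
        "meta": (13612509636477658373, 0, 5),
    }
]


def summarize(results):
    """Return (idiom, start, end, n_tokens) for each match.

    Assumption: meta is (match_id, start, end) with an exclusive end index.
    """
    rows = []
    for match in results:
        _match_id, start, end = match["meta"]
        rows.append((match["idiom"], start, end, end - start))
    return rows


summary = summarize(results)
print(summary)
```

Under that assumption, `doc[start:end]` on the processed document should give back the matched span.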

Supported Idioms

The list of supported idioms can be found in idiomatch/resources/idioms.txt. A total of 2758 idioms are available for matching. These "target idioms" were extracted from a vocabulary of the 5000 most frequently used English idioms, made available for open use courtesy of IBM's SLIDE project.

Adding Idioms Yourself

If you have idioms that are not in the list of supported idioms, you can add them to Idiomatcher yourself with the add_idioms method:

import spacy
from idiomatch import Idiomatcher


def main():
    nlp = spacy.load("en_core_web_sm")
    idiomatcher = Idiomatcher.from_pretrained(nlp.vocab)  # instantiate the matcher
    # For an open slot, use one of these placeholders: someone / something / someone's / one's
    idioms = ["have blood on one's hands", "on one's hands"]
    idiomatcher.add_idioms(nlp, idioms)  # this builds matching patterns for the given idioms
    sent = "The leaders of this war have the blood of many thousands of people on their hands."
    doc = nlp(sent)
    print(idiomatcher(doc))


if __name__ == '__main__':
    main()

Output:

100%|██████████| 2/2 [00:00<00:00, 145.62it/s]
adding patterns into idiom_matcher...: 100%|██████████| 2/2 [00:00<00:00, 196.40it/s]
[{'idiom': "have blood on one's hands", 'span': 'have the blood of many thousands of people on their hands', 'meta': (5930902300252675198, 5, 16)}, {'idiom': "on one's hands", 'span': 'on their hands', 'meta': (8246625119345375174, 13, 16)}]

Supported Variations

English idioms vary extensively in form, in at least six different ways. Idiomatcher handles each of these cases, as the examples below show:

| variation | example | result |
| --- | --- | --- |
| modification | He called my blatant bluff | [{'idiom': "call someone's bluff", 'span': 'called my blatant bluff', 'meta': (11321959191976266509, 1, 5)}] |
| openslot | This will keep all of us posted | [{'idiom': 'keep someone posted', 'span': 'keep all of us posted', 'meta': (11722464987668971331, 2, 7)}] |
| hyphenated | That was one balls-out street race! | [{'idiom': 'balls-out', 'span': 'balls - out', 'meta': (2876800142358111704, 3, 6)}] |
| hyphen omitted | That was one balls out street race! | [{'idiom': 'balls-out', 'span': 'balls out', 'meta': (2876800142358111704, 3, 5)}] |
| passivisation (modification) | the floodgates are finally opened | [{'idiom': 'open the floodgates', 'span': 'the floodgates are finally opened', 'meta': (13612509636477658373, 0, 5)}] |
| passivisation (openslot) | my bluff was embarrassingly called by her | [{'idiom': "call someone's bluff", 'span': 'my bluff was embarrassingly called', 'meta': (11321959191976266509, 0, 5)}] |
| inclusion | If she dies, you will have her blood on your hands! | [{'idiom': "have blood on one's hands", 'span': 'have her blood on your hands', 'meta': (5930902300252675198, 6, 12)}, {'idiom': "on one's hands", 'span': 'on your hands', 'meta': (8246625119345375174, 9, 12)}] |
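
In the inclusion case both the longer idiom and the shorter one nested inside it are returned. If only the outermost match is wanted, the results can be post-filtered. The sketch below is one possible approach and not part of idiomatch: `drop_contained` is a hypothetical helper, and it assumes `meta` holds (match id, start token, end token) with an exclusive end, as the results above suggest.

```python
# Results for the inclusion example above, copied in as a literal.
results = [
    {
        "idiom": "have blood on one's hands",
        "span": "have her blood on your hands",
        "meta": (5930902300252675198, 6, 12),
    },
    {
        "idiom": "on one's hands",
        "span": "on your hands",
        "meta": (8246625119345375174, 9, 12),
    },
]


def drop_contained(results):
    """Keep only matches whose token range is not inside a longer match."""
    kept = []
    for i, match in enumerate(results):
        _, start, end = match["meta"]
        contained = False
        for j, other in enumerate(results):
            if i == j:
                continue
            _, o_start, o_end = other["meta"]
            # The other match covers this one and is strictly longer.
            covers = o_start <= start and end <= o_end
            if covers and (o_end - o_start) > (end - start):
                contained = True
                break
        if not contained:
            kept.append(match)
    return kept


outermost = drop_contained(results)
print([m["idiom"] for m in outermost])
```

Matches with identical ranges are both kept, since neither is strictly longer than the other.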

How Does it Work?

The idiom-matching patterns, which are the foundation of Idiomatcher's flexibility, are heavily inspired by Hughes et al.'s brilliant work (2021) on Flexible Retrieval of Idiomatic Expressions from a Large Text Corpus.
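
To give a rough feel for why a pattern can still match when extra words are inserted into an idiom, here is a toy matcher in plain Python. It is neither Idiomatcher's actual pattern format nor the method from the cited paper. The `match_with_gaps` function and its `max_gap` parameter are hypothetical, and it compares literal lowercase words where a real matcher would more likely compare lemmas.

```python
def match_with_gaps(tokens, idiom_tokens, max_gap=3):
    """Return (start, end) of the first in-order match, or None.

    tokens and idiom_tokens are lists of lowercase strings. Up to max_gap
    extra tokens may sit between consecutive idiom tokens. The end index
    is exclusive.
    """
    for start, token in enumerate(tokens):
        if token != idiom_tokens[0]:
            continue
        pos = start
        matched = True
        for wanted in idiom_tokens[1:]:
            # Look for the next idiom token within the allowed window.
            window = tokens[pos + 1 : pos + 2 + max_gap]
            if wanted not in window:
                matched = False
                break
            pos = pos + 1 + window.index(wanted)
        if matched:
            return (start, pos + 1)
    return None


tokens = "he called my blatant bluff".split()
hit = match_with_gaps(tokens, ["called", "bluff"])
miss = match_with_gaps("he called my name".split(), ["called", "bluff"])
print(hit, miss)
```

The gap between "called" and "bluff" absorbs both the possessive and the modifier. The search is greedy and takes the first occurrence of each word, which is enough for illustration but would miss some matches on longer inputs.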
