A high-performance, multi-functional word matcher

Project description

Matcher Rust Implementation with PyO3 Binding

A high-performance, multi-functional word matcher implemented in Rust.

It is designed to solve AND, OR, NOT, and text-variation problems in word and word-list matching. For implementation details, see the Design Document.

Installation

Use pip

pip install matcher_py

Install pre-built binary

Visit the release page to download the pre-built binary.

Usage

The msgspec library is recommended for serializing the matcher configuration due to its performance benefits. You can also use other msgpack serialization libraries like ormsgpack. All relevant types are defined in extension_types.py.

Explanation of the configuration

  • Matcher's configuration is defined by the type MatchTableMap = Dict[int, List[MatchTable]]. Each key of MatchTableMap is called a match_id. Within one match_id, every table_id should be unique, although this is not required.
  • SimpleMatcher's configuration is defined by the type SimpleMatchTableMap = Dict[SimpleMatchType, Dict[int, str]]. Each key of the inner Dict[int, str] is called a word_id, and every word_id must be globally unique.
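
To make the uniqueness requirement on word_id concrete, here is a minimal sketch in plain Python. It uses ordinary dicts as a stand-in for SimpleMatchTableMap (the integer keys 1 and 2 mirror MatchNone and MatchFanjian), and the helper find_duplicate_word_ids is hypothetical, not part of matcher_py.

```python
from collections import Counter
from typing import Dict, List

# Stand-in for SimpleMatchTableMap: {simple_match_type: {word_id: word}}.
# Plain ints replace the real SimpleMatchType enum for illustration only.
simple_table_map: Dict[int, Dict[int, str]] = {
    1: {1: "hello", 2: "world"},
    2: {3: "你好", 2: "妳好"},  # word_id 2 is reused here, which breaks the rule
}


def find_duplicate_word_ids(table_map: Dict[int, Dict[int, str]]) -> List[int]:
    """Return every word_id that appears under more than one match type."""
    counts = Counter(
        word_id for words in table_map.values() for word_id in words
    )
    return sorted(word_id for word_id, n in counts.items() if n > 1)


duplicates = find_duplicate_word_ids(simple_table_map)
print(duplicates)
```

Running a check like this before encoding the configuration is one way to catch a reused word_id early.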

MatchTable

  • table_id: The unique ID of the match table.
  • match_table_type: The type of the match table.
  • simple_match_type: The type of the simple match (only relevant if match_table_type is "simple").
  • word_list: The word list of the match table.
  • exemption_simple_match_type: The type of the exemption simple match.
  • exemption_word_list: The exemption word list of the match table.

For each match table, word matching is performed over the word_list, and exemption word matching is performed over the exemption_word_list. If the exemption word matching result is True, the word matching result will be False.
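
The exemption rule above can be sketched in a few lines of plain Python. This is only an illustration that uses substring checks in place of the real matching engine; the function table_matches is a hypothetical name, not part of matcher_py.

```python
from typing import List


def table_matches(
    text: str, word_list: List[str], exemption_word_list: List[str]
) -> bool:
    """Substring stand-in for one match table's decision."""
    # A table hits when any word from word_list occurs in the text...
    hit = any(word in text for word in word_list)
    # ...unless any word from exemption_word_list also occurs.
    exempt = any(word in text for word in exemption_word_list)
    return hit and not exempt


print(table_matches("hello", ["hello", "world"], ["word"]))
print(table_matches("hello, word", ["hello", "world"], ["word"]))
```

The first call hits on hello with no exemption present; the second also hits on hello, but the exemption word word cancels the result.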

MatchTableType

  • Simple = "simple": Supports simple multiple patterns matching with text normalization defined by simple_match_type.
    • We offer transformation methods for text normalization, including MatchFanjian, MatchNormalize, MatchPinYin ···.
    • It handles combined patterns and repeat-count-sensitive matching, with sub-words delimited by ,. For example, hello,world,hello matches hellohelloworld and worldhellohello, but not helloworld, because hello must appear twice.
  • SimilarChar = "similar_char": Supports similar character matching using regex.
    • ["hello,hallo,hollo,hi", "word,world,wrd,🌍", "!,?,~"] will match helloworld, hollowrd, hi🌍 ··· any combinations of the words split by , in the list.
  • Acrostic = "acrostic": Supports acrostic matching using regex (currently only supports Chinese and simple English sentences).
    • ["h,e,l,l,o", "你,好"] will match hope, endures, love, lasts, onward. and 你的笑容温暖, 好心情常伴。.
  • SimilarTextLevenshtein = "similar_text_levenshtein": Supports similar text matching based on Levenshtein distance (threshold is 0.8).
    • ["helloworld"] will match helloworld, hellowrld, helloworld! ··· any similar text to the words in the list.
  • Regex = "regex": Supports regex matching.
    • ["h[aeiou]llo", "w[aeiou]rd"] will match hello, world, hillo, wurld ··· any text that matches the regex in the list.
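
The repeat-count-sensitive behaviour of the Simple type can be illustrated with a small sketch. It is not the library's algorithm: it assumes plain, non-overlapping substring counting with no text normalization, and combination_match is a hypothetical helper name.

```python
from collections import Counter


def combination_match(pattern: str, text: str) -> bool:
    """Return True when every comma-separated sub-word of `pattern` occurs
    in `text` at least as often as it is listed in the pattern."""
    # "hello,world,hello" -> {"hello": 2, "world": 1}
    required = Counter(pattern.split(","))
    # str.count counts non-overlapping occurrences.
    return all(text.count(word) >= times for word, times in required.items())


for text in ["hellohelloworld", "worldhellohello", "helloworld"]:
    print(text, combination_match("hello,world,hello", text))
```

The order of the sub-words in the text does not matter here, only how often each one occurs.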

SimpleMatchType

  • MatchNone = 1: No transformation.
  • MatchFanjian = 2: Traditional Chinese to simplified Chinese transformation.
    • 妳好 -> 你好
    • 現⾝ -> 现身
  • MatchDelete = 12: Delete all non-alphanumeric and non-unicode Chinese characters.
    • hello, world! -> helloworld
    • 《你∷好》 -> 你好
  • MatchNormalize = 16: Normalize all English character variations and number variations to basic characters.
    • ℋЀ⒈㈠ϕ -> he11o
    • ⒈Ƨ㊂ -> 123
  • MatchPinYin = 32: Convert all unicode Chinese characters to pinyin with boundaries.
    • 你好 -> ␀ni␀␀hao␀
    • 西安 -> ␀xi␀␀an␀
  • MatchPinYinChar = 64: Convert all unicode Chinese characters to pinyin without boundaries.
    • 你好 -> nihao
    • 西安 -> xian

You can combine these transformations as needed. Pre-defined combinations like MatchDeleteNormalize = 28 and MatchFanjianDeleteNormalize = 30 are provided for convenience.
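
The numeric values above suggest bit flags, so the pre-defined combinations can be reproduced with bitwise OR. The sketch below assumes exactly that; it uses plain integer constants copied from the list above rather than the real SimpleMatchType enum, and describe is a hypothetical helper.

```python
from typing import Dict, List

# Values copied from the SimpleMatchType list in this document.
NAMES: Dict[str, int] = {
    "MatchNone": 1,
    "MatchFanjian": 2,
    "MatchDelete": 12,
    "MatchNormalize": 16,
    "MatchPinYin": 32,
    "MatchPinYinChar": 64,
}


def describe(value: int) -> List[str]:
    """List the named transformations whose bits are all set in `value`."""
    return [name for name, bits in NAMES.items() if value & bits == bits]


delete_normalize = NAMES["MatchDelete"] | NAMES["MatchNormalize"]
fanjian_delete_normalize = NAMES["MatchFanjian"] | delete_normalize

print(delete_normalize, describe(delete_normalize))
print(fanjian_delete_normalize, describe(fanjian_delete_normalize))
```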

Avoid combining MatchPinYin and MatchPinYinChar. MatchPinYin is a more limited version of MatchPinYinChar: in cases like xian, the text can be treated either as two words, xi and an, or as the single word xian.
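
The xian ambiguity is easier to see with a toy conversion. The sketch below uses a three-entry pinyin table and the ␀ marker from the examples above; both the table and the to_pinyin function are hypothetical stand-ins for illustration, not the library's conversion.

```python
# Tiny hypothetical pinyin table, just enough for this example.
PINYIN = {"西": "xi", "安": "an", "先": "xian"}
BOUNDARY = "\u2400"  # the visible "␀" marker used in the examples above


def to_pinyin(text: str, with_boundaries: bool) -> str:
    """Convert known Chinese characters to pinyin, optionally wrapping
    each syllable in boundary markers."""
    out = []
    for ch in text:
        syllable = PINYIN.get(ch)
        if syllable is None:
            out.append(ch)  # leave unknown characters untouched
        elif with_boundaries:
            out.append(f"{BOUNDARY}{syllable}{BOUNDARY}")
        else:
            out.append(syllable)
    return "".join(out)


# With boundaries, 先 (xian) is not found inside 西安 (xi + an).
print(to_pinyin("先", True) in to_pinyin("西安", True))
# Without boundaries, both collapse to the same string "xian".
print(to_pinyin("先", False) in to_pinyin("西安", False))
```

Because the two transformations disagree on whether 先 matches 西安, applying both to the same word list gives results that are hard to reason about.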

Limitations

Simple Match handles at most 32 combined words per pattern (beyond 32, matching on the extra words is not guaranteed) and at most 8 repetitions of a word (counts above 8 are capped at 8).

Matcher Basic Usage

Here’s an example of how to use the Matcher:

import msgspec
import numpy as np
from matcher_py import Matcher
from matcher_py.extension_types import MatchTable, MatchTableType, SimpleMatchType

msgpack_encoder = msgspec.msgpack.Encoder()
matcher = Matcher(
    msgpack_encoder.encode({
        1: [
            MatchTable(
                table_id=1,
                match_table_type=MatchTableType.Simple,
                simple_match_type=SimpleMatchType.MatchFanjianDeleteNormalize,
                word_list=["hello", "world"],
                exemption_simple_match_type=SimpleMatchType.MatchNone,
                exemption_word_list=["word"],
            )
        ]
    })
)
# Check if a text matches
assert matcher.is_match("hello")
assert not matcher.is_match("hello, word")
# Perform word matching as a dict
assert matcher.word_match(r"hello, world")[1]
# Perform word matching as a string
result = matcher.word_match_as_string("hello")
assert result == """{"1":[{"table_id":1,"word":"hello"}]}"""
# Perform batch processing as a dict using a list
text_list = ["hello", "world", "hello,word"]
batch_results = matcher.batch_word_match(text_list)
print(batch_results)
# Perform batch processing as a string using a list
text_list = ["hello", "world", "hello,word"]
batch_results = matcher.batch_word_match_as_string(text_list)
print(batch_results)
# Perform batch processing as a dict using a numpy array
text_array = np.array(["hello", "world", "hello,word"], dtype=np.dtype("object"))
numpy_results = matcher.numpy_word_match(text_array)
print(numpy_results)
# Perform batch processing as a string using a numpy array
text_array = np.array(["hello", "world", "hello,word"], dtype=np.dtype("object"))
numpy_results = matcher.numpy_word_match_as_string(text_array)
print(numpy_results)
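
If you use one of the *_as_string methods, the result can be parsed with the standard json module. The sketch below assumes the returned string is JSON shaped like the word_match_as_string example above; the sample string is hard-coded so the snippet runs without matcher_py installed.

```python
import json

# Assumed sample output, shaped like the word_match_as_string example.
raw = '{"1":[{"table_id":1,"word":"hello"}]}'

parsed = json.loads(raw)
# JSON object keys are always strings, so convert match_id back to int.
by_match_id = {int(match_id): hits for match_id, hits in parsed.items()}

for match_id, hits in by_match_id.items():
    for hit in hits:
        print(match_id, hit["table_id"], hit["word"])
```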

Simple Matcher Basic Usage

Here’s an example of how to use the SimpleMatcher:

import msgspec
import numpy as np
from matcher_py import SimpleMatcher
from matcher_py.extension_types import SimpleMatchType

msgpack_encoder = msgspec.msgpack.Encoder()
simple_matcher = SimpleMatcher(
    msgpack_encoder.encode({SimpleMatchType.MatchNone: {1: "example"}})
)
# Check if a text matches
assert simple_matcher.is_match("example")
# Perform simple processing
results = simple_matcher.simple_process("example")
print(results)
# Perform batch processing using a list
text_list = ["example", "test", "example test"]
batch_results = simple_matcher.batch_simple_process(text_list)
print(batch_results)
# Perform batch processing using a NumPy array
text_array = np.array(["example", "test", "example test"], dtype=np.dtype("object"))
numpy_results = simple_matcher.numpy_simple_process(text_array)
print(numpy_results)

Contributing

Contributions to matcher_py are welcome! If you find a bug or have a feature request, please open an issue on the GitHub repository. If you would like to contribute code, please fork the repository and submit a pull request.

License

matcher_py is licensed under the MIT OR Apache-2.0 license.

More Information

For more details, visit the GitHub repository.

Download files

Download the file for your platform.

Source Distribution

  • matcher_py-0.2.10.tar.gz (300.6 kB): Source

Built Distributions

  • matcher_py-0.2.10-cp38-abi3-win_amd64.whl (1.5 MB): CPython 3.8+, Windows x86-64
  • matcher_py-0.2.10-cp38-abi3-musllinux_1_2_x86_64.whl (1.8 MB): CPython 3.8+, musllinux (musl 1.2+), x86-64
  • matcher_py-0.2.10-cp38-abi3-musllinux_1_2_aarch64.whl (1.8 MB): CPython 3.8+, musllinux (musl 1.2+), ARM64
  • matcher_py-0.2.10-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.6 MB): CPython 3.8+, manylinux (glibc 2.17+), x86-64
  • matcher_py-0.2.10-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.6 MB): CPython 3.8+, manylinux (glibc 2.17+), ARM64
  • matcher_py-0.2.10-cp38-abi3-macosx_11_0_arm64.whl (1.5 MB): CPython 3.8+, macOS 11.0+, ARM64
  • matcher_py-0.2.10-cp38-abi3-macosx_10_12_x86_64.whl (1.5 MB): CPython 3.8+, macOS 10.12+, x86-64

