A high-performance, multi-functional word matcher
Matcher Rust Implementation with PyO3 Binding
A high-performance, multi-functional word matcher implemented in Rust.
Designed to solve AND OR NOT and TEXT VARIATIONS problems in word/word_list matching. For detailed implementation, see the Design Document.
Installation
Use pip
pip install matcher_py
Install pre-built binary
Visit the release page to download the pre-built binary.
Usage
The msgspec library is recommended for serializing the matcher configuration due to its performance benefits. You can also use other msgpack serialization libraries like ormsgpack. All relevant types are defined in extension_types.py.
Explanation of the configuration
- Matcher's configuration is defined by the MatchTableMap = Dict[int, List[MatchTable]] type. The key of MatchTableMap is called match_id. For each match_id, the table_id values inside should be unique, but this is not required.
- SimpleMatcher's configuration is defined by the SimpleMatchTableMap = Dict[SimpleMatchType, Dict[int, str]] type. The key of the inner Dict[int, str] value is called word_id, and word_id is required to be globally unique.
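The uniqueness rule for word_id can be checked before a configuration is serialized. The following is a minimal sketch that uses plain ints and dicts in place of the real SimpleMatchType and SimpleMatchTableMap types; the helper name find_duplicate_word_ids is hypothetical and is not part of matcher_py.

```python
from collections import defaultdict
from typing import Dict, List

# Stand-in for SimpleMatchTableMap = Dict[SimpleMatchType, Dict[int, str]].
# Plain ints replace the SimpleMatchType keys so the sketch has no dependencies.
PlainSimpleMatchTableMap = Dict[int, Dict[int, str]]


def find_duplicate_word_ids(table_map: PlainSimpleMatchTableMap) -> Dict[int, List[int]]:
    """Return {word_id: [match types]} for each word_id used under more than one match type."""
    seen: Dict[int, List[int]] = defaultdict(list)
    for match_type, words in table_map.items():
        for word_id in words:
            seen[word_id].append(match_type)
    return {word_id: types for word_id, types in seen.items() if len(types) > 1}


ok_map = {1: {1: "hello"}, 2: {2: "你好"}}
bad_map = {1: {1: "hello"}, 2: {1: "你好"}}

print(find_duplicate_word_ids(ok_map))   # {}
print(find_duplicate_word_ids(bad_map))  # {1: [1, 2]}
```

Within a single inner dict, Python already guarantees unique keys, so only reuse across match types needs an explicit check.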
MatchTable
- table_id: The unique ID of the match table.
- match_table_type: The type of the match table.
- simple_match_type: The type of the simple match (only relevant if match_table_type is "simple").
- word_list: The word list of the match table.
- exemption_simple_match_type: The type of the exemption simple match.
- exemption_word_list: The exemption word list of the match table.
For each match table, word matching is performed over the word_list, and exemption word matching is performed over the exemption_word_list. If the exemption word matching result is True, the word matching result will be False.
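The interaction between word_list and exemption_word_list can be modelled in a few lines. This is only a sketch of the rule stated above, using naive substring checks and no text normalization; the function name table_matches is hypothetical and the real matcher works differently internally.

```python
from typing import List


def table_matches(text: str, word_list: List[str], exemption_word_list: List[str]) -> bool:
    """Naive substring model of a single match table (no text normalization)."""
    word_hit = any(word in text for word in word_list)
    exemption_hit = any(word in text for word in exemption_word_list)
    # An exemption hit overrides any word hit.
    return word_hit and not exemption_hit


print(table_matches("hello", ["hello", "world"], ["word"]))        # True
print(table_matches("hello, word", ["hello", "world"], ["word"]))  # False
```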
MatchTableType
- Simple = "simple": Supports simple multiple-pattern matching with text normalization defined by simple_match_type.
  - We offer transformation methods for text normalization, including MatchFanjian, MatchNormalize, MatchPinYin ···.
  - It can handle combination patterns and repetition-sensitive matching, with sub-words delimited by ",". For example, hello,world,hello will match hellohelloworld and worldhellohello, but not helloworld, because hello must appear twice.
- SimilarChar = "similar_char": Supports similar character matching using regex.
  - ["hello,hallo,hollo,hi", "word,world,wrd,🌍", "!,?,~"] will match helloworld, hollowrd, hi🌍 ··· any combination of the words split by "," in the list.
- Acrostic = "acrostic": Supports acrostic matching using regex (currently only supports Chinese and simple English sentences).
  - ["h,e,l,l,o", "你,好"] will match "hope, endures, love, lasts, onward." and "你的笑容温暖, 好心情常伴。".
- SimilarTextLevenshtein = "similar_text_levenshtein": Supports similar text matching based on Levenshtein distance (threshold is 0.8).
  - ["helloworld"] will match helloworld, hellowrld, helloworld! ··· any text similar to the words in the list.
- Regex = "regex": Supports regex matching.
  - ["h[aeiou]llo", "w[aeiou]rd"] will match hello, world, hillo, wurld ··· any text that matches a regex in the list.
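The combination and repetition rule of the Simple table type can be illustrated with a small counting model. This is a sketch under the assumption that a pattern matches when every sub-word occurs in the text at least as many times as it is listed; the function name combination_matches is hypothetical, and the library's actual implementation is not shown here.

```python
from collections import Counter


def combination_matches(pattern: str, text: str, delimiter: str = ",") -> bool:
    """Return True if each sub-word occurs in text at least as often as it is listed in pattern."""
    required = Counter(pattern.split(delimiter))
    # str.count counts non-overlapping occurrences, which is enough for this illustration.
    return all(text.count(word) >= times for word, times in required.items())


print(combination_matches("hello,world,hello", "hellohelloworld"))  # True
print(combination_matches("hello,world,hello", "worldhellohello"))  # True
print(combination_matches("hello,world,hello", "helloworld"))       # False
```

The order of the sub-words in the text does not matter in this model, which agrees with the two positive examples above.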
SimpleMatchType
- MatchNone = 1: No transformation.
- MatchFanjian = 2: Traditional Chinese to simplified Chinese transformation.
  - 妳好 -> 你好
  - 現⾝ -> 现身
- MatchDelete = 12: Delete all characters that are neither alphanumeric nor Unicode Chinese characters.
  - hello, world! -> helloworld
  - 《你∷好》 -> 你好
- MatchNormalize = 16: Normalize all English character variations and number variations to basic characters.
  - ℋЀ⒈㈠ϕ -> he11o
  - ⒈Ƨ㊂ -> 123
- MatchPinYin = 32: Convert all Unicode Chinese characters to pinyin with boundaries.
  - 你好 -> ␀ni␀␀hao␀
  - 西安 -> ␀xi␀␀an␀
- MatchPinYinChar = 64: Convert all Unicode Chinese characters to pinyin without boundaries.
  - 你好 -> nihao
  - 西安 -> xian
You can combine these transformations as needed. Pre-defined combinations like MatchDeleteNormalize = 28 and MatchFanjianDeleteNormalize = 30 are provided for convenience.
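The listed values are consistent with bit flags: 12 | 16 is 28 and 2 | 12 | 16 is 30. The sketch below reproduces that arithmetic with plain ints standing in for SimpleMatchType members; the constant names and the includes helper are hypothetical, and how matcher_py exposes the combination operator is an assumption not confirmed here.

```python
# Values copied from the list above; plain ints stand in for SimpleMatchType members.
MATCH_FANJIAN = 2
MATCH_DELETE = 12
MATCH_NORMALIZE = 16

delete_normalize = MATCH_DELETE | MATCH_NORMALIZE
fanjian_delete_normalize = MATCH_FANJIAN | MATCH_DELETE | MATCH_NORMALIZE


def includes(combined: int, flag: int) -> bool:
    """Return True if every bit of flag is set in combined."""
    return (combined & flag) == flag


print(delete_normalize)                                  # 28
print(fanjian_delete_normalize)                          # 30
print(includes(fanjian_delete_normalize, MATCH_DELETE))  # True
print(includes(delete_normalize, MATCH_FANJIAN))         # False
```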
Avoid combining MatchPinYin and MatchPinYinChar, because MatchPinYin is a more limited version of MatchPinYinChar. In some cases, such as xian, the text can be treated either as two words, xi and an, or as a single word, xian.
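The xian ambiguity can be reproduced with a toy conversion. The sketch below assumes a tiny three-character pinyin table (hypothetical; the library ships its own mapping) and uses the ␀ marker exactly as it appears in the MatchPinYin examples.

```python
# Tiny hypothetical pinyin table for illustration only.
PINYIN = {"西": "xi", "安": "an", "先": "xian"}
BOUNDARY = "␀"  # boundary marker as shown in the MatchPinYin examples


def to_pinyin(text: str, with_boundaries: bool) -> str:
    """Convert mapped characters to pinyin and leave every other character unchanged."""
    out = []
    for ch in text:
        syllable = PINYIN.get(ch)
        if syllable is None:
            out.append(ch)
        elif with_boundaries:
            out.append(f"{BOUNDARY}{syllable}{BOUNDARY}")
        else:
            out.append(syllable)
    return "".join(out)


print(to_pinyin("西安", with_boundaries=True))   # ␀xi␀␀an␀
print(to_pinyin("先", with_boundaries=True))     # ␀xian␀
print(to_pinyin("西安", with_boundaries=False))  # xian
print(to_pinyin("先", with_boundaries=False))    # xian
```

With boundaries the two inputs stay distinct; without boundaries both collapse to xian, so the looser transformation matches more text.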
Limitations
Simple Match can handle patterns with at most 32 combined words; beyond 32, matching on the additional combined words is not guaranteed. A word can be repeated at most 8 times; more than 8 repetitions are treated as 8.
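Patterns can be screened against these limits before they are added to a word list. The following is a rough sketch: the helper check_pattern_limits is hypothetical, it assumes "," as the delimiter, and it assumes the 32-word limit applies to distinct sub-words, which the text above does not state explicitly.

```python
from collections import Counter
from typing import List

MAX_COMBINED_WORDS = 32  # limit stated above
MAX_REPEATS = 8          # limit stated above


def check_pattern_limits(pattern: str, delimiter: str = ",") -> List[str]:
    """Return warnings for a combined pattern that exceeds the stated limits."""
    counts = Counter(pattern.split(delimiter))
    warnings = []
    # Assumption: the 32-word limit applies to distinct sub-words.
    if len(counts) > MAX_COMBINED_WORDS:
        warnings.append(f"{len(counts)} combined words (limit {MAX_COMBINED_WORDS})")
    for word, times in counts.items():
        if times > MAX_REPEATS:
            warnings.append(f"'{word}' repeated {times} times (limit {MAX_REPEATS})")
    return warnings


print(check_pattern_limits("hello,world,hello"))  # []
print(check_pattern_limits(",".join(["a"] * 9)))  # ["'a' repeated 9 times (limit 8)"]
```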
Matcher Basic Usage
Here’s an example of how to use the Matcher:
import msgspec
import numpy as np
from matcher_py import Matcher
from matcher_py.extension_types import MatchTable, MatchTableType, SimpleMatchType
msgpack_encoder = msgspec.msgpack.Encoder()
# Build the matcher from a msgpack-encoded MatchTableMap (match_id -> list of MatchTable)
matcher = Matcher(
    msgpack_encoder.encode({
        1: [
            MatchTable(
                table_id=1,
                match_table_type=MatchTableType.Simple,
                simple_match_type=SimpleMatchType.MatchFanjianDeleteNormalize,
                word_list=["hello", "world"],
                exemption_simple_match_type=SimpleMatchType.MatchNone,
                exemption_word_list=["word"],
            )
        ]
    })
)
# Check if a text matches
assert matcher.is_match("hello")
assert not matcher.is_match("hello, word")
# Perform word matching as a dict
assert matcher.word_match(r"hello, world")[1]
# Perform word matching as a string (the result is a serialized form of the dict result)
result = matcher.word_match_as_string("hello")
assert "hello" in result
# Perform batch processing as a dict using a list
text_list = ["hello", "world", "hello,word"]
batch_results = matcher.batch_word_match(text_list)
print(batch_results)
# Perform batch processing as a string using a list
text_list = ["hello", "world", "hello,word"]
batch_results = matcher.batch_word_match_as_string(text_list)
print(batch_results)
# Perform batch processing as a dict using a numpy array
text_array = np.array(["hello", "world", "hello,word"], dtype=np.dtype("object"))
numpy_results = matcher.numpy_word_match(text_array)
print(numpy_results)
# Perform batch processing as a string using a numpy array
text_array = np.array(["hello", "world", "hello,word"], dtype=np.dtype("object"))
numpy_results = matcher.numpy_word_match_as_string(text_array)
print(numpy_results)
Simple Matcher Basic Usage
Here’s an example of how to use the SimpleMatcher:
import msgspec
import numpy as np
from matcher_py import SimpleMatcher
from matcher_py.extension_types import SimpleMatchType
msgpack_encoder = msgspec.msgpack.Encoder()
# Build the simple matcher from a msgpack-encoded SimpleMatchTableMap
simple_matcher = SimpleMatcher(
    msgpack_encoder.encode({SimpleMatchType.MatchNone: {1: "example"}})
)
# Check if a text matches
assert simple_matcher.is_match("example")
# Perform simple processing
results = simple_matcher.simple_process("example")
print(results)
# Perform batch processing using a list
text_list = ["example", "test", "example test"]
batch_results = simple_matcher.batch_simple_process(text_list)
print(batch_results)
# Perform batch processing using a NumPy array
text_array = np.array(["example", "test", "example test"], dtype=np.dtype("object"))
numpy_results = simple_matcher.numpy_simple_process(text_array)
print(numpy_results)
Contributing
Contributions to matcher_py are welcome! If you find a bug or have a feature request, please open an issue on the GitHub repository. If you would like to contribute code, please fork the repository and submit a pull request.
License
matcher_py is licensed under the MIT OR Apache-2.0 license.
More Information
For more details, visit the GitHub repository.