A high-performance, multi-functional word matcher
Matcher Rust Implementation with PyO3 Binding
A high-performance, multi-functional word matcher implemented in Rust.
It is designed to handle AND, OR, and NOT logic as well as text variations in word and word-list matching. For implementation details, see the Design Document.
Features
- Multiple Matching Methods:
  - Simple Word Matching
  - Regex-Based Matching
  - Similarity-Based Matching
- Text Normalization:
  - Fanjian: Simplify traditional Chinese characters to simplified ones. Example: 蟲艸 -> 虫艹
  - Delete: Remove specific characters. Example: *Fu&*iii&^%%*&kkkk -> Fuiiikkkk
  - Normalize: Normalize special characters to identifiable characters. Example: 𝜢𝕰𝕃𝙻Ϙ 𝙒ⓞƦℒ𝒟! -> hello world
  - PinYin: Convert Chinese characters to Pinyin for fuzzy matching. Example: 西安 -> /xi//an/, matches 洗按 -> /xi//an/, but not 先 -> /xian/
  - PinYinChar: Convert Chinese characters to Pinyin. Example: 西安 -> xian, matches 洗按 and 先 -> xian
- Combination and Repeated Word Matching:
  - Takes into account the number of repetitions of words.
  - Example: hello,world matches hello world and world,hello
  - Example: 无,法,无,天 matches 无无法天 (because 无 is repeated twice), but not 无法天
- Customizable Exemption Lists: Exclude specific words from matching.
- Efficient Handling of Large Word Lists: Optimized for performance.
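The combination and repetition rule in the list above can be illustrated with a small pure-Python sketch. It is not the library's implementation: `combination_match` is a hypothetical helper that uses plain substring counting, and it only shows the intended semantics (order does not matter, repetition counts do).

```python
from collections import Counter


def combination_match(pattern: str, text: str) -> bool:
    """Return True if every comma-separated word occurs in text at least
    as many times as it is listed in pattern (order does not matter)."""
    required = Counter(word for word in pattern.split(",") if word)
    # str.count counts non-overlapping occurrences, which is enough for a sketch.
    return all(text.count(word) >= times for word, times in required.items())


print(combination_match("hello,world", "world,hello"))  # True
print(combination_match("无,法,无,天", "无无法天"))  # True
print(combination_match("无,法,无,天", "无法天"))  # False
```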
Installation
Use pip
pip install matcher_py
Install pre-built binary
Visit the release page to download the pre-built binary.
Usage
The msgspec library is recommended for serializing the matcher configuration due to its performance benefits. You can also use other msgpack serialization libraries like ormsgpack. All relevant types are defined in extension_types.py.
Explanation of the configuration
- Matcher's configuration is defined by the MatchTableMap = Dict[int, List[MatchTable]] type. The key of MatchTableMap is called match_id. For each match_id, the table_id values inside should be unique, although this is not required.
- SimpleMatcher's configuration is defined by the SimpleMatchTableMap = Dict[SimpleMatchType, Dict[int, str]] type. The key of the inner Dict[int, str] is called word_id, and word_id is required to be globally unique.
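Because word_id must be globally unique across all match types, it can help to check a configuration before serializing it. The sketch below is an illustration only: `duplicate_word_ids` is a hypothetical helper, and plain strings stand in for SimpleMatchType so the example runs without the package installed.

```python
from typing import Dict, List

# Stand-in for SimpleMatchTableMap; plain strings replace SimpleMatchType here.
SimpleTableMap = Dict[str, Dict[int, str]]


def duplicate_word_ids(table_map: SimpleTableMap) -> List[int]:
    """Return the word_ids that appear under more than one match type, sorted."""
    seen = set()
    duplicates = set()
    for words in table_map.values():
        for word_id in words:
            if word_id in seen:
                duplicates.add(word_id)
            seen.add(word_id)
    return sorted(duplicates)


config = {
    "None": {1: "hello", 2: "world"},
    "FanjianDeleteNormalize": {2: "你好", 3: "example"},
}
print(duplicate_word_ids(config))  # [2]
```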
MatchTable
- table_id: The unique ID of the match table.
- match_table_type: The type of the match table.
- word_list: The word list of the match table.
- exemption_simple_match_type: The type of the exemption simple match.
- exemption_word_list: The exemption word list of the match table.
For each match table, word matching is performed over the word_list, and exemption word matching is performed over the exemption_word_list. If the exemption word matching result is True, the word matching result will be False.
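The exemption rule described above can be sketched as follows. This is a simplified stand-in that uses plain substring checks and ignores text normalization; `table_is_match` is a hypothetical name, not part of matcher_py.

```python
from typing import List


def table_is_match(text: str, word_list: List[str], exemption_word_list: List[str]) -> bool:
    """Plain substring stand-in for the decision made by one match table."""
    hit = any(word in text for word in word_list)
    exempt = any(word in text for word in exemption_word_list)
    # An exemption hit overrides any word hit.
    return hit and not exempt


words, exemptions = ["hello", "world"], ["word"]
print(table_is_match("hello", words, exemptions))  # True
print(table_is_match("hello, word", words, exemptions))  # False
```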
MatchTableType
- Simple: Supports simple multiple-pattern matching with text normalization defined by simple_match_type.
  - We offer transformation methods for text normalization, including Fanjian, Normalize, PinYin ···.
  - It can handle combination patterns and repetition-sensitive matching, delimited by a comma. For example, hello,world,hello will match hellohelloworld and worldhellohello, but not helloworld, because hello must appear twice.
- Regex: Supports regex pattern matching.
  - SimilarChar: Supports similar character matching using regex.
    - ["hello,hallo,hollo,hi", "word,world,wrd,🌍", "!,?,~"] will match helloworld!, hollowrd?, hi🌍~ ··· any combination of the words split by , in the list.
  - Acrostic: Supports acrostic matching using regex (currently only supports Chinese and simple English sentences).
    - ["h,e,l,l,o", "你,好"] will match hope, endures, love, lasts, onward. and 你的笑容温暖, 好心情常伴。.
  - Regex: Supports regex matching.
    - ["h[aeiou]llo", "w[aeiou]rd"] will match hello, word, hillo, wurd ··· any text that matches a regex in the list.
- Similar: Supports similar text matching based on distance and threshold.
  - Levenshtein: Supports similar text matching based on Levenshtein distance.
  - DamerauLevenshtein: Supports similar text matching based on Damerau-Levenshtein distance.
  - Indel: Supports similar text matching based on Indel distance.
  - Jaro: Supports similar text matching based on Jaro distance.
  - JaroWinkler: Supports similar text matching based on Jaro-Winkler distance.
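To make the SimilarChar entry above more concrete, here is one way such a word list could be turned into a regular expression with Python's standard `re` module. This is an assumption about the general idea, not the library's actual construction: `similar_char_regex` is a hypothetical helper that builds one alternation group per list entry and requires the groups to appear back to back.

```python
import re
from typing import List


def similar_char_regex(word_list: List[str]) -> re.Pattern:
    """Turn each comma-separated entry into one alternation group and
    require the groups to appear in order."""
    groups = []
    for entry in word_list:
        alternatives = "|".join(re.escape(word) for word in entry.split(","))
        groups.append(f"({alternatives})")
    return re.compile("".join(groups))


pattern = similar_char_regex(["hello,hallo,hollo,hi", "word,world,wrd,🌍", "!,?,~"])
for text in ["helloworld!", "hollowrd?", "hi🌍~"]:
    print(text, bool(pattern.search(text)))
```

Each entry contributes exactly one alternative to a match, so a text needs one word from every group for this sketch to report a hit.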
SimpleMatchType
- None: No transformation.
- Fanjian: Traditional Chinese to simplified Chinese transformation. Based on FANJIAN and UNICODE.
  - 妳好 -> 你好
  - 現⾝ -> 现身
- Delete: Delete all punctuation, special characters and white spaces.
  - hello, world! -> helloworld
  - 《你∷好》 -> 你好
- Normalize: Normalize all English character variations and number variations to basic characters. Based on UPPER_LOWER, EN_VARIATION and NUM_NORM.
  - ℋЀ⒈㈠ϕ -> he11o
  - ⒈Ƨ㊂ -> 123
- PinYin: Convert all Unicode Chinese characters to pinyin with boundaries. Based on PINYIN.
  - 你好 -> ␀ni␀␀hao␀
  - 西安 -> ␀xi␀␀an␀
- PinYinChar: Convert all Unicode Chinese characters to pinyin without boundaries. Based on PINYIN_CHAR.
  - 你好 -> nihao
  - 西安 -> xian
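The Delete transformation can be approximated in a few lines of standard-library Python. The sketch below assumes that dropping every character in a Unicode punctuation, symbol, or separator category is close enough for illustration; the library itself works from its own tables, and `delete_noise` is a hypothetical name.

```python
import unicodedata


def delete_noise(text: str) -> str:
    """Drop characters whose Unicode category is punctuation (P*),
    symbol (S*), or separator (Z*), plus any other whitespace."""
    return "".join(
        ch
        for ch in text
        if unicodedata.category(ch)[0] not in {"P", "S", "Z"} and not ch.isspace()
    )


print(delete_noise("hello, world!"))  # helloworld
print(delete_noise("《你∷好》"))  # 你好
```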
You can combine these transformations as needed. Pre-defined combinations like DeleteNormalize and FanjianDeleteNormalize are provided for convenience.
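One way to picture how transformations combine is as bit flags, where a pre-defined combination is simply the OR of its parts. The sketch below uses a hypothetical `Step` enum with made-up bit values and toy transformation bodies; the real flag values and behavior are defined in extension_types.py and the Rust core.

```python
from enum import IntFlag


class Step(IntFlag):
    # Hypothetical bit values, chosen only for this sketch.
    FANJIAN = 1
    DELETE = 2
    NORMALIZE = 4
    # A pre-defined combination is just the OR of its parts.
    DELETE_NORMALIZE = DELETE | NORMALIZE


def apply_steps(steps: Step, text: str) -> str:
    """Apply toy versions of the selected steps in a fixed order."""
    if steps & Step.DELETE:
        text = "".join(ch for ch in text if ch.isalnum())
    if steps & Step.NORMALIZE:
        text = text.lower()
    return text


print(apply_steps(Step.DELETE | Step.NORMALIZE, "Hello, World!"))  # helloworld
print(Step.DELETE_NORMALIZE == Step.DELETE | Step.NORMALIZE)  # True
```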
Avoid combining PinYin and PinYinChar, because PinYin is a more limited version of PinYinChar. Without boundaries, a string such as xian can be read either as two syllables, xi and an, or as the single syllable xian.
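The difference between the two pinyin transformations is easiest to see side by side. The sketch below uses a hypothetical five-character table and "/" as the boundary marker (an assumption made for readability); the real matcher ships a full mapping and its own boundary character.

```python
# Hypothetical five-character table; the real matcher ships a full mapping.
PINYIN = {"西": "xi", "安": "an", "洗": "xi", "按": "an", "先": "xian"}


def to_pinyin(text: str, boundary: str = "/") -> str:
    """PinYin-style: wrap every syllable in boundary markers."""
    return "".join(
        f"{boundary}{PINYIN[ch]}{boundary}" if ch in PINYIN else ch for ch in text
    )


def to_pinyin_char(text: str) -> str:
    """PinYinChar-style: concatenate syllables with no markers."""
    return "".join(PINYIN.get(ch, ch) for ch in text)


print(to_pinyin("西安"), to_pinyin("先"))  # /xi//an/ /xian/
print(to_pinyin_char("西安"), to_pinyin_char("先"))  # xian xian
```

With boundaries, 西安 and 先 stay distinct; without them, both collapse to xian, which is why PinYinChar matches more broadly.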
Delete is technically a combination of TextDelete and WordDelete. Different delete methods are implemented for text and for words, because CN_SPECIAL and EN_SPECIAL are considered part of a word but not part of the text. With the text_process and reduce_text_process functions, use TextDelete instead of WordDelete.
- WordDelete: Delete all patterns in PUNCTUATION_SPECIAL.
- TextDelete: Delete all patterns in PUNCTUATION_SPECIAL, CN_SPECIAL, EN_SPECIAL.
Text Process Usage
Here’s an example of how to use the reduce_text_process and text_process functions:
from matcher_py import reduce_text_process, text_process
from matcher_py.extension_types import SimpleMatchType
print(reduce_text_process(SimpleMatchType.MatchTextDelete | SimpleMatchType.MatchNormalize, "hello, world!"))
print(text_process(SimpleMatchType.MatchTextDelete, "hello, world!"))
Matcher Basic Usage
Here’s an example of how to use the Matcher:
import msgspec
import numpy as np
from matcher_py import Matcher
from matcher_py.extension_types import MatchTable, MatchTableType, SimpleMatchType
msgpack_encoder = msgspec.msgpack.Encoder()
matcher = Matcher(
    msgpack_encoder.encode({
        1: [
            MatchTable(
                table_id=1,
                match_table_type=MatchTableType.Simple(simple_match_type=SimpleMatchType.MatchFanjianDeleteNormalize),
                word_list=["hello", "world"],
                exemption_simple_match_type=SimpleMatchType.MatchNone,
                exemption_word_list=["word"],
            )
        ]
    })
)
# Check if a text matches
assert matcher.is_match("hello")
assert not matcher.is_match("hello, word")
# Perform word matching as a dict
assert matcher.word_match(r"hello, world")[1]
# Perform word matching as a string
result = matcher.word_match_as_string("hello")
# The returned string describes the matched table_id and word
assert "hello" in result
# Perform batch processing as a dict using a list
text_list = ["hello", "world", "hello,word"]
batch_results = matcher.batch_word_match(text_list)
print(batch_results)
# Perform batch processing as a string using a list
text_list = ["hello", "world", "hello,word"]
batch_results = matcher.batch_word_match_as_string(text_list)
print(batch_results)
# Perform batch processing as a dict using a numpy array
text_array = np.array(["hello", "world", "hello,word"], dtype=np.dtype("object"))
numpy_results = matcher.numpy_word_match(text_array)
print(numpy_results)
# Perform batch processing as a string using a numpy array
text_array = np.array(["hello", "world", "hello,word"], dtype=np.dtype("object"))
numpy_results = matcher.numpy_word_match_as_string(text_array)
print(numpy_results)
Simple Matcher Basic Usage
Here’s an example of how to use the SimpleMatcher:
import msgspec
import numpy as np
from matcher_py import SimpleMatcher
from matcher_py.extension_types import SimpleMatchType
msgpack_encoder = msgspec.msgpack.Encoder()
simple_matcher = SimpleMatcher(
    msgpack_encoder.encode({SimpleMatchType.MatchNone: {1: "example"}})
)
# Check if a text matches
assert simple_matcher.is_match("example")
# Perform simple processing
results = simple_matcher.simple_process("example")
print(results)
# Perform batch processing using a list
text_list = ["example", "test", "example test"]
batch_results = simple_matcher.batch_simple_process(text_list)
print(batch_results)
# Perform batch processing using a NumPy array
text_array = np.array(["example", "test", "example test"], dtype=np.dtype("object"))
numpy_results = simple_matcher.numpy_simple_process(text_array)
print(numpy_results)
Contributing
Contributions to matcher_py are welcome! If you find a bug or have a feature request, please open an issue on the GitHub repository. If you would like to contribute code, please fork the repository and submit a pull request.
License
matcher_py is licensed under the MIT OR Apache-2.0 license.
More Information
For more details, visit the GitHub repository.