A high-performance, multi-functional word matcher
Matcher Rust Implementation with PyO3 Binding
A high-performance, multi-functional word matcher implemented in Rust.
It is designed to solve AND, OR, NOT, and text-variation problems in word and word-list matching. For the detailed implementation, see the Design Document.
Features
- Multiple Matching Methods:
  - Simple Word Matching
  - Regex-Based Matching
  - Similarity-Based Matching
- Text Normalization:
  - Fanjian: Simplify traditional Chinese characters to simplified ones. Example: 蟲艸 -> 虫艹
  - Delete: Remove specific characters. Example: *Fu&*iii&^%%*&kkkk -> Fuiiikkkk
  - Normalize: Normalize special characters to identifiable characters. Example: 𝜢𝕰𝕃𝙻Ϙ 𝙒ⓞƦℒ𝒟! -> hello world
  - PinYin: Convert Chinese characters to Pinyin for fuzzy matching. Example: 西安 -> /xi//an/, matches 洗按 -> /xi//an/, but not 先 -> /xian/
  - PinYinChar: Convert Chinese characters to Pinyin. Example: 西安 -> xian, matches 洗按 and 先 -> xian
- Combination and Repeated Word Matching:
  - Takes into account the number of repetitions of words.
  - Example: hello,world matches hello world and world,hello
  - Example: 无,法,无,天 matches 无无法天 (because 无 is repeated twice), but not 无法天
- Customizable Exemption Lists: Exclude specific words from matching.
- Efficient Handling of Large Word Lists: Optimized for performance.
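The combination and repetition rule listed above can be illustrated with a minimal pure-Python sketch. It is not the library's Rust implementation: the function name combo_matches is hypothetical, no text normalization is applied, and counting non-overlapping occurrences with str.count is an assumption made for the illustration.

```python
from collections import Counter


def combo_matches(pattern: str, text: str) -> bool:
    """Return True if every comma-separated sub-word of `pattern` occurs in
    `text` at least as many times as it is listed in `pattern`."""
    # "无,法,无,天" becomes {"无": 2, "法": 1, "天": 1}
    required = Counter(part for part in pattern.split(",") if part)
    # str.count counts non-overlapping occurrences; word order is ignored.
    return all(text.count(word) >= times for word, times in required.items())


print(combo_matches("hello,world", "hello world"))   # True
print(combo_matches("hello,world", "world,hello"))   # True
print(combo_matches("无,法,无,天", "无无法天"))        # True
print(combo_matches("无,法,无,天", "无法天"))          # False
```

Treating the pattern as a multiset of sub-words is what makes the match both order-independent (AND logic) and sensitive to how often a sub-word is repeated.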
Installation
Use pip
pip install matcher_py
Install pre-built binary
Visit the release page to download the pre-built binary.
Usage
The msgspec library is recommended for serializing the matcher configuration due to its performance benefits. You can also use other msgpack serialization libraries like ormsgpack. All relevant types are defined in extension_types.py.
Explanation of the configuration
- Matcher's configuration is defined by the MatchTableMap = Dict[int, List[MatchTable]] type. The key of MatchTableMap is called match_id. Within each match_id, table_id should be unique, although this is not required.
- SimpleMatcher's configuration is defined by the SimpleMatchTableMap = Dict[SimpleMatchType, Dict[int, str]] type. The key of the inner Dict[int, str] is called word_id, and word_id is required to be globally unique.
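Because word_id must be globally unique across the whole SimpleMatchTableMap, it can be worth validating a configuration before encoding it. The following is a small sketch under stated assumptions: the helper duplicate_word_ids is hypothetical and not part of matcher_py, and plain string keys stand in for SimpleMatchType members so the snippet runs without the package installed.

```python
from typing import Dict, List

# Stand-in for SimpleMatchTableMap; real keys would be SimpleMatchType members.
SimpleTableSketch = Dict[str, Dict[int, str]]


def duplicate_word_ids(table_map: SimpleTableSketch) -> List[int]:
    """Return every word_id that appears under more than one match type."""
    seen = set()
    duplicates = set()
    for word_map in table_map.values():
        for word_id in word_map:
            if word_id in seen:
                duplicates.add(word_id)
            seen.add(word_id)
    return sorted(duplicates)


good = {"None": {1: "hello", 2: "world"}, "Fanjian": {3: "你好"}}
bad = {"None": {1: "hello"}, "Fanjian": {1: "你好", 4: "现身"}}

print(duplicate_word_ids(good))  # []
print(duplicate_word_ids(bad))   # [1]
```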
MatchTable
- table_id: The unique ID of the match table.
- match_table_type: The type of the match table.
- word_list: The word list of the match table.
- exemption_simple_match_type: The type of the exemption simple match.
- exemption_word_list: The exemption word list of the match table.
For each match table, word matching is performed over the word_list, and exemption word matching is performed over the exemption_word_list. If the exemption word matching result is True, the word matching result will be False.
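The interaction between word_list and exemption_word_list can be sketched in a few lines of Python. This is an illustration of the rule only, assuming plain substring checks with no text normalization; the function table_is_match is hypothetical and does not exist in matcher_py.

```python
from typing import List


def table_is_match(text: str, word_list: List[str], exemption_word_list: List[str]) -> bool:
    """A table matches when some word hits and no exemption word hits."""
    hit = any(word in text for word in word_list)
    exempt = any(word in text for word in exemption_word_list)
    # An exemption hit overrides any word hit.
    return hit and not exempt


words = ["hello", "world"]
exemptions = ["word"]

print(table_is_match("hello", words, exemptions))        # True
print(table_is_match("hello, word", words, exemptions))  # False
```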
MatchTableType
- Simple: Supports simple multiple-pattern matching with text normalization defined by simple_match_type.
  - We offer transformation methods for text normalization, including Fanjian, Normalize, PinYin, and others.
  - It can handle combination patterns and repetition-sensitive matching, delimited by ",". For example, hello,world,hello will match hellohelloworld and worldhellohello, but not helloworld, because hello must appear twice.
- Regex: Supports regex pattern matching.
  - SimilarChar: Supports similar character matching using regex.
    - ["hello,hallo,hollo,hi", "word,world,wrd,🌍", "!,?,~"] will match helloworld, hollowrd, hi🌍, and any other combination of the words split by "," in the list.
  - Acrostic: Supports acrostic matching using regex (currently only supports Chinese and simple English sentences).
    - ["h,e,l,l,o", "你,好"] will match "hope, endures, love, lasts, onward." and "你的笑容温暖, 好心情常伴。".
  - Regex: Supports regex matching.
    - ["h[aeiou]llo", "w[aeiou]rd"] will match hello, word, hillo, wurd, and any other text that matches a regex in the list.
- Similar: Supports similar text matching based on distance and threshold.
  - Levenshtein: Supports similar text matching based on Levenshtein distance.
  - DamerauLevenshtein: Supports similar text matching based on Damerau-Levenshtein distance.
  - Indel: Supports similar text matching based on Indel distance.
  - Jaro: Supports similar text matching based on Jaro distance.
  - JaroWinkler: Supports similar text matching based on Jaro-Winkler distance.
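To make "distance and threshold" concrete for the Similar table type, here is a minimal sketch of Levenshtein-based similarity. It assumes the common normalization 1 - distance / max(len), which may differ from what the library computes; the names levenshtein and is_similar are illustrative only.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def is_similar(a: str, b: str, threshold: float) -> bool:
    """Assumed normalization: similarity = 1 - distance / longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return True
    return 1.0 - levenshtein(a, b) / longest >= threshold


print(levenshtein("kitten", "sitting"))      # 3
print(is_similar("hello", "hallo", 0.75))    # True
print(is_similar("kitten", "sitting", 0.8))  # False
```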
SimpleMatchType
- None: No transformation.
- Fanjian: Traditional Chinese to simplified Chinese transformation. Based on FANJIAN and UNICODE.
  - 妳好 -> 你好
  - 現⾝ -> 现身
- Delete: Delete all punctuation, special characters and white spaces.
  - hello, world! -> helloworld
  - 《你∷好》 -> 你好
- Normalize: Normalize all English character variations and number variations to basic characters. Based on UPPER_LOWER, EN_VARIATION and NUM_NORM.
  - ℋЀ⒈㈠ϕ -> he11o
  - ⒈Ƨ㊂ -> 123
- PinYin: Convert all unicode Chinese characters to pinyin with boundaries. Based on PINYIN.
  - 你好 -> ␀ni␀␀hao␀
  - 西安 -> ␀xi␀␀an␀
- PinYinChar: Convert all unicode Chinese characters to pinyin without boundaries. Based on PINYIN_CHAR.
  - 你好 -> nihao
  - 西安 -> xian
You can combine these transformations as needed. Pre-defined combinations like DeleteNormalize and FanjianDeleteNormalize are provided for convenience.
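Combining transformations behaves like combining bit flags. The sketch below models that idea with Python's IntFlag and two rough stand-in steps. Everything here is an assumption for illustration: Step, delete_step, normalize_step and apply_steps are hypothetical names, and the steps use Unicode categories and NFKC folding rather than the library's own tables, so results will differ from matcher_py on many inputs.

```python
import unicodedata
from enum import IntFlag


class Step(IntFlag):
    """Hypothetical stand-in for combinable SimpleMatchType flags."""
    NONE = 0
    DELETE = 1
    NORMALIZE = 2


def delete_step(text: str) -> str:
    # Drop whitespace, punctuation (P*), symbols (S*) and separators (Z*).
    return "".join(
        ch for ch in text
        if not ch.isspace() and unicodedata.category(ch)[0] not in "PSZ"
    )


def normalize_step(text: str) -> str:
    # Approximation: fold compatibility forms (e.g. full-width letters), then lowercase.
    return unicodedata.normalize("NFKC", text).lower()


def apply_steps(steps: Step, text: str) -> str:
    """Apply each selected step in a fixed order."""
    if steps & Step.DELETE:
        text = delete_step(text)
    if steps & Step.NORMALIZE:
        text = normalize_step(text)
    return text


print(apply_steps(Step.DELETE, "hello, world!"))                    # helloworld
print(apply_steps(Step.DELETE, "《你∷好》"))                          # 你好
print(apply_steps(Step.DELETE | Step.NORMALIZE, "Ｈｅｌｌｏ， Ｗｏｒｌｄ！"))  # helloworld
```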
Avoid combining PinYin and PinYinChar, because PinYin is a more limited version of PinYinChar. In some cases, such as xian, the text can be treated as two words, xi and an, or as a single word, xian.
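The difference between the two pinyin modes comes down to whether syllable boundaries are kept. A minimal sketch, assuming a hypothetical five-entry lookup table (the library ships its own PINYIN data) and using "/" as a visible boundary marker:

```python
# Hypothetical mini table for illustration only.
PINYIN = {"西": "xi", "安": "an", "洗": "xi", "按": "an", "先": "xian"}


def to_pinyin(text: str, boundary: str = "") -> str:
    """Replace each known character by its pinyin, wrapped in `boundary`."""
    return "".join(
        f"{boundary}{PINYIN[ch]}{boundary}" if ch in PINYIN else ch
        for ch in text
    )


# With boundaries (PinYin-like): 西安 and 先 stay distinguishable.
print(to_pinyin("西安", "/"))  # /xi//an/
print(to_pinyin("先", "/"))    # /xian/

# Without boundaries (PinYinChar-like): both collapse to the same string.
print(to_pinyin("西安"))       # xian
print(to_pinyin("先"))         # xian
```

With boundaries, 西安 only matches texts that split into the same syllables, such as 洗按. Without boundaries, it also matches 先, which is why mixing the two modes on the same words gives ambiguous results.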
Delete is technically a combination of TextDelete and WordDelete; we implement different delete methods for text and for words, because we believe CN_SPECIAL and EN_SPECIAL are parts of a word but not of text. For the text_process and reduce_text_process functions, users should use TextDelete instead of WordDelete.
- WordDelete: Delete all patterns in PUNCTUATION_SPECIAL.
- TextDelete: Delete all patterns in PUNCTUATION_SPECIAL, CN_SPECIAL, EN_SPECIAL.
Limitations
Simple Match can handle a maximum of 32 combined words per pattern (beyond 32, the effective combined words are not guaranteed) and 8 repetitions of a word (more than 8 repetitions are capped at 8).
Text Process Usage
Here's an example of how to use the reduce_text_process and text_process functions:
from matcher_py import reduce_text_process, text_process
from matcher_py.extension_types import SimpleMatchType
print(reduce_text_process(SimpleMatchType.MatchTextDelete | SimpleMatchType.MatchNormalize, "hello, world!"))
print(text_process(SimpleMatchType.MatchTextDelete, "hello, world!"))
Matcher Basic Usage
Here's an example of how to use the Matcher:
import msgspec
import numpy as np
from matcher_py import Matcher
from matcher_py.extension_types import MatchTable, MatchTableType, SimpleMatchType
msgpack_encoder = msgspec.msgpack.Encoder()
matcher = Matcher(
    msgpack_encoder.encode({
        1: [
            MatchTable(
                table_id=1,
                match_table_type=MatchTableType.Simple(simple_match_type=SimpleMatchType.MatchFanjianDeleteNormalize),
                word_list=["hello", "world"],
                exemption_simple_match_type=SimpleMatchType.MatchNone,
                exemption_word_list=["word"],
            )
        ]
    })
)
# Check if a text matches
assert matcher.is_match("hello")
assert not matcher.is_match("hello, word")
# Perform word matching as a dict
assert matcher.word_match(r"hello, world")[1]
# Perform word matching as a string
result = matcher.word_match_as_string("hello")
assert result == """{1:[{\"table_id\":1,\"word\":\"hello\"}]"}"""
# Perform batch processing as a dict using a list
text_list = ["hello", "world", "hello,word"]
batch_results = matcher.batch_word_match(text_list)
print(batch_results)
# Perform batch processing as a string using a list
text_list = ["hello", "world", "hello,word"]
batch_results = matcher.batch_word_match_as_string(text_list)
print(batch_results)
# Perform batch processing as a dict using a numpy array
text_array = np.array(["hello", "world", "hello,word"], dtype=np.dtype("object"))
numpy_results = matcher.numpy_word_match(text_array)
print(numpy_results)
# Perform batch processing as a string using a numpy array
text_array = np.array(["hello", "world", "hello,word"], dtype=np.dtype("object"))
numpy_results = matcher.numpy_word_match_as_string(text_array)
print(numpy_results)
Simple Matcher Basic Usage
Here's an example of how to use the SimpleMatcher:
import msgspec
import numpy as np
from matcher_py import SimpleMatcher
from matcher_py.extension_types import SimpleMatchType
msgpack_encoder = msgspec.msgpack.Encoder()
simple_matcher = SimpleMatcher(
    msgpack_encoder.encode({SimpleMatchType.MatchNone: {1: "example"}})
)
# Check if a text matches
assert simple_matcher.is_match("example")
# Perform simple processing
results = simple_matcher.simple_process("example")
print(results)
# Perform batch processing using a list
text_list = ["example", "test", "example test"]
batch_results = simple_matcher.batch_simple_process(text_list)
print(batch_results)
# Perform batch processing using a NumPy array
text_array = np.array(["example", "test", "example test"], dtype=np.dtype("object"))
numpy_results = simple_matcher.numpy_simple_process(text_array)
print(numpy_results)
Contributing
Contributions to matcher_py are welcome! If you find a bug or have a feature request, please open an issue on the GitHub repository. If you would like to contribute code, please fork the repository and submit a pull request.
License
matcher_py is licensed under the MIT OR Apache-2.0 license.
More Information
For more details, visit the GitHub repository.