A high-performance, multi-functional word matcher

Matcher Rust Implementation with PyO3 Binding

Installation

Use pip

pip install matcher_py

Install pre-built binary

Visit the release page to download the pre-built binary.

Usage

The msgspec library is recommended for serializing the matcher configuration due to its performance benefits. You can also use other msgpack serialization libraries like ormsgpack. All relevant types are defined in extension_types.py.

Explanation of the configuration

  • Matcher's configuration is defined by the type MatchTableMap = Dict[int, List[MatchTable]]. Each key of MatchTableMap is called a match_id. Within a match_id, each table_id should be unique, although this is not enforced.
  • SimpleMatcher's configuration is defined by the type SimpleMatchTableMap = Dict[SimpleMatchType, Dict[int, str]]. Each key of the inner Dict[int, str] is called a word_id, and every word_id must be globally unique.
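The shape of a SimpleMatcher configuration can be pictured with plain Python dicts. The sketch below is illustrative only: it uses built-in types in place of the classes from extension_types.py, and duplicate_word_ids is a hypothetical helper (not part of matcher_py) that checks the global-uniqueness rule for word_id before a configuration is serialized.

```python
from collections import Counter
from typing import Dict, List

# Stand-in for SimpleMatchTableMap: {simple_match_type: {word_id: word}}.
# The integer keys mirror SimpleMatchType values (1 = MatchNone, 2 = MatchFanjian).
SimpleTableMap = Dict[int, Dict[int, str]]


def duplicate_word_ids(table_map: SimpleTableMap) -> List[int]:
    """Return the word_ids that appear under more than one match type, sorted."""
    counts = Counter(
        word_id for words in table_map.values() for word_id in words
    )
    return sorted(word_id for word_id, n in counts.items() if n > 1)


valid = {1: {1: "hello", 2: "world"}, 2: {3: "你好"}}
invalid = {1: {1: "hello"}, 2: {1: "你好"}}  # word_id 1 is reused

print(duplicate_word_ids(valid))    # []
print(duplicate_word_ids(invalid))  # [1]
```

Running a check like this before encoding the configuration catches a reused word_id early, instead of leaving the outcome to the matcher.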

MatchTable

  • table_id: The unique ID of the match table.
  • match_table_type: The type of the match table.
  • simple_match_type: The type of the simple match (only relevant if match_table_type is "simple").
  • word_list: The word list of the match table.
  • exemption_simple_match_type: The type of the exemption simple match.
  • exemption_word_list: The exemption word list of the match table.

For each match table, word matching is performed over the word_list, and exemption word matching is performed over the exemption_word_list. If any exemption word matches, the result for that table is False, even when a word in the word_list also matches.
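The exemption rule can be sketched in a few lines of plain Python. This is not how matcher_py is implemented: it uses naive substring checks with no text normalization, and table_matches is a hypothetical name chosen for illustration.

```python
from typing import Sequence


def table_matches(
    text: str,
    word_list: Sequence[str],
    exemption_word_list: Sequence[str],
) -> bool:
    """Match when a listed word occurs and no exemption word occurs."""
    hit = any(word in text for word in word_list)
    exempt = any(word in text for word in exemption_word_list)
    return hit and not exempt


words = ["hello", "world"]
exemptions = ["word"]

print(table_matches("hello", words, exemptions))        # True
print(table_matches("hello, word", words, exemptions))  # False: "word" exempts it
print(table_matches("goodbye", words, exemptions))      # False: nothing matched
```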

MatchTableType

  • Simple = "simple": Supports simple multiple patterns matching with text normalization defined by simple_match_type.
    • Text normalization is done by transformation methods such as MatchFanjian, MatchNormalize, and MatchPinYin.
    • It handles combination patterns whose sub-words are delimited by , and it is sensitive to how many times each sub-word is repeated. For example, hello,world,hello matches hellohelloworld and worldhellohello, but not helloworld, because hello must appear twice.
  • SimilarChar = "similar_char": Supports similar character matching using regex.
    • ["hello,hallo,hollo,hi", "word,world,wrd,🌍", "!,?,~"] will match helloworld, hollowrd, hi🌍, and any other combination of the comma-separated alternatives in the list.
  • Acrostic = "acrostic": Supports acrostic matching using regex (currently only supports Chinese and simple English sentences).
    • ["h,e,l,l,o", "你,好"] will match "hope, endures, love, lasts, onward." and "你的笑容温暖, 好心情常伴。".
  • SimilarTextLevenshtein = "similar_text_levenshtein": Supports similar text matching based on Levenshtein distance (the threshold is 0.8).
    • ["helloworld"] will match helloworld, hellowrld, helloworld!, and any other text similar to the words in the list.
  • Regex = "regex": Supports regex matching.
    • ["h[aeiou]llo", "w[aeiou]rd"] will match hello, world, hillo, wurld, and any other text that matches a regex in the list.
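Two of the behaviors above can be approximated in plain Python to make them concrete. The sketch below is not the library's algorithm. combination_match counts non-overlapping occurrences of each sub-word, and similarity assumes the score is normalized as 1 - distance / length of the longer string; the source states only that the threshold is 0.8, so that normalization is an assumption. All function names are hypothetical.

```python
from collections import Counter


def combination_match(pattern: str, text: str) -> bool:
    """Every comma-separated sub-word must occur at least as often as it is listed."""
    required = Counter(pattern.split(","))
    return all(text.count(word) >= times for word, times in required.items())


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(combination_match("hello,world,hello", "hellohelloworld"))  # True
print(combination_match("hello,world,hello", "worldhellohello"))  # True
print(combination_match("hello,world,hello", "helloworld"))       # False
print(similarity("helloworld", "hellowrld") >= 0.8)               # True
```

Under the assumed normalization, hellowrld is one edit away from helloworld, which gives a similarity of 0.9 and clears the 0.8 threshold.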

SimpleMatchType

  • MatchNone = 1: No transformation.
  • MatchFanjian = 2: Traditional Chinese to simplified Chinese transformation.
    • 妳好 -> 你好
    • 現⾝ -> 现身
  • MatchDelete = 12: Delete all characters that are neither alphanumeric nor Unicode Chinese characters.
    • hello, world! -> helloworld
    • 《你∷好》 -> 你好
  • MatchNormalize = 16: Normalize all English character variations and number variations to basic characters.
    • ℋЀ⒈㈠ϕ -> he11o
    • ⒈Ƨ㊂ -> 123
  • MatchPinYin = 32: Convert all unicode Chinese characters to pinyin with boundaries.
    • 你好 -> ␀ni␀␀hao␀
    • 西安 -> ␀xi␀␀an␀
  • MatchPinYinChar = 64: Convert all unicode Chinese characters to pinyin without boundaries.
    • 你好 -> nihao
    • 西安 -> xian

You can combine these transformations as needed. Pre-defined combinations like MatchDeleteNormalize = 28 and MatchFanjianDeleteNormalize = 30 are provided for convenience.
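The numeric values listed above behave as bit flags, so a combination is the bitwise OR of its parts. The sketch below uses plain integer constants copied from the list rather than importing matcher_py; the constant names and the has_flag helper are hypothetical.

```python
# Values copied from the SimpleMatchType list above.
MATCH_NONE = 1
MATCH_FANJIAN = 2
MATCH_DELETE = 12
MATCH_NORMALIZE = 16
MATCH_PINYIN = 32
MATCH_PINYIN_CHAR = 64


def has_flag(combined: int, flag: int) -> bool:
    """True when every bit of `flag` is set in `combined`."""
    return (combined & flag) == flag


delete_normalize = MATCH_DELETE | MATCH_NORMALIZE
fanjian_delete_normalize = MATCH_FANJIAN | MATCH_DELETE | MATCH_NORMALIZE

print(delete_normalize)                                   # 28
print(fanjian_delete_normalize)                           # 30
print(has_flag(fanjian_delete_normalize, MATCH_FANJIAN))  # True
print(has_flag(delete_normalize, MATCH_FANJIAN))          # False
```

The results 28 and 30 agree with the pre-defined MatchDeleteNormalize and MatchFanjianDeleteNormalize values.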

Avoid combining MatchPinYin and MatchPinYinChar: MatchPinYin is a stricter version of MatchPinYinChar. Without boundaries, a string such as xian is ambiguous, because it can be read either as the two syllables xi and an or as the single syllable xian.
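The ambiguity is easy to reproduce with a toy converter. The sketch below relies on a hypothetical three-entry pinyin table and uses the visible ␀ symbol from the examples above as the boundary marker; it does not reflect the library's real conversion tables.

```python
BOUNDARY = "␀"  # visible boundary marker, as shown in the examples above

# Hypothetical, deliberately tiny pinyin table for illustration only.
PINYIN = {"西": "xi", "安": "an", "先": "xian"}


def to_pinyin(text: str, with_boundaries: bool) -> str:
    """Convert known characters to pinyin, optionally wrapping each syllable."""
    parts = []
    for char in text:
        syllable = PINYIN.get(char)
        if syllable is None:
            parts.append(char)  # characters outside the table pass through
        elif with_boundaries:
            parts.append(f"{BOUNDARY}{syllable}{BOUNDARY}")
        else:
            parts.append(syllable)
    return "".join(parts)


print(to_pinyin("西安", with_boundaries=True))   # ␀xi␀␀an␀
print(to_pinyin("先", with_boundaries=True))     # ␀xian␀
print(to_pinyin("西安", with_boundaries=False))  # xian
print(to_pinyin("先", with_boundaries=False))    # xian
```

With boundaries the two inputs stay distinct; without them both collapse to xian, so a boundary-free pattern matches more text than a boundary-aware one.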

Limitations

Simple Match supports at most 32 combined words per pattern; beyond 32, the effective combined words are not guaranteed. A word can be repeated at most 8 times; more than 8 repetitions are treated as 8.
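A small lint step can flag patterns that exceed these limits before they reach the matcher. The sketch below is a hypothetical helper, not part of matcher_py, and it assumes that "combined words" means the distinct comma-separated sub-words of one pattern.

```python
from collections import Counter
from typing import List

MAX_COMBINED_WORDS = 32  # limit stated above
MAX_REPEATS = 8          # limit stated above


def check_pattern(pattern: str) -> List[str]:
    """Return human-readable warnings for a comma-delimited pattern."""
    counts = Counter(pattern.split(","))
    warnings = []
    if len(counts) > MAX_COMBINED_WORDS:
        warnings.append(
            f"{len(counts)} distinct sub-words exceed the limit of {MAX_COMBINED_WORDS}"
        )
    for word, times in counts.items():
        if times > MAX_REPEATS:
            warnings.append(
                f"{word!r} is repeated {times} times; only {MAX_REPEATS} are counted"
            )
    return warnings


print(check_pattern("hello,world,hello"))  # []
print(check_pattern(",".join(["a"] * 9)))  # one warning about 'a'
```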

Matcher Basic Usage

Here’s an example of how to use the Matcher:

import msgspec
import numpy as np
from matcher_py import Matcher
from matcher_py.extension_types import MatchTable, MatchTableType, SimpleMatchType

msgpack_encoder = msgspec.msgpack.Encoder()
matcher = Matcher(
    msgpack_encoder.encode({
        1: [
            MatchTable(
                table_id=1,
                match_table_type=MatchTableType.Simple,
                simple_match_type=SimpleMatchType.MatchFanjianDeleteNormalize,
                word_list=["hello", "world"],
                exemption_simple_match_type=SimpleMatchType.MatchNone,
                exemption_word_list=["word"],
            )
        ]
    })
)
# Check if a text matches
assert matcher.is_match("hello")
assert not matcher.is_match("hello, word")
# Perform word matching as a dict
assert matcher.word_match(r"hello, world")[1]
# Perform word matching as a string
result = matcher.word_match_as_string("hello")
print(result)
# Perform batch processing as a dict using a list
text_list = ["hello", "world", "hello,word"]
batch_results = matcher.batch_word_match(text_list)
print(batch_results)
# Perform batch processing as a string using a list
text_list = ["hello", "world", "hello,word"]
batch_results = matcher.batch_word_match_as_string(text_list)
print(batch_results)
# Perform batch processing as a dict using a numpy array
text_array = np.array(["hello", "world", "hello,word"], dtype=np.dtype("object"))
numpy_results = matcher.numpy_word_match(text_array)
print(numpy_results)
# Perform batch processing as a string using a numpy array
text_array = np.array(["hello", "world", "hello,word"], dtype=np.dtype("object"))
numpy_results = matcher.numpy_word_match_as_string(text_array)
print(numpy_results)

Simple Matcher Basic Usage

Here’s an example of how to use the SimpleMatcher:

import msgspec
import numpy as np
from matcher_py import SimpleMatcher
from matcher_py.extension_types import SimpleMatchType

msgpack_encoder = msgspec.msgpack.Encoder()
simple_matcher = SimpleMatcher(
    msgpack_encoder.encode({SimpleMatchType.MatchNone: {1: "example"}})
)
# Check if a text matches
assert simple_matcher.is_match("example")
# Perform simple processing
results = simple_matcher.simple_process("example")
print(results)
# Perform batch processing using a list
text_list = ["example", "test", "example test"]
batch_results = simple_matcher.batch_simple_process(text_list)
print(batch_results)
# Perform batch processing using a NumPy array
text_array = np.array(["example", "test", "example test"], dtype=np.dtype("object"))
numpy_results = simple_matcher.numpy_simple_process(text_array)
print(numpy_results)

Contributing

Contributions to matcher_py are welcome! If you find a bug or have a feature request, please open an issue on the GitHub repository. If you would like to contribute code, please fork the repository and submit a pull request.

License

matcher_py is licensed under the MIT OR Apache-2.0 license.

More Information

For more details, visit the GitHub repository.
