A high-performance, multi-functional word matcher

Matcher Rust Implementation with PyO3 Binding

Installation

Use pip

pip install matcher_py

Install pre-built binary

Visit the release page to download the pre-built binary.

Usage

The msgspec library is recommended for serializing the matcher configuration due to its performance benefits. You can also use other msgpack serialization libraries like ormsgpack. All relevant types are defined in extension_types.py.

Explanation of the configuration

  • Matcher's configuration is defined by the MatchTableMap = Dict[int, List[MatchTable]] type. Each key of MatchTableMap is called a match_id. Within a match_id, each table_id should be unique, but this is not required.
  • SimpleMatcher's configuration is defined by the SimpleMatchTableMap = Dict[SimpleMatchType, Dict[int, str]] type. Each key of the inner Dict[int, str] is called a word_id, and every word_id must be globally unique.
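The global uniqueness rule for word_id is easy to violate when word lists are merged from several sources. The following is a minimal sketch of a pre-flight check, assuming plain integers stand in for SimpleMatchType values; the helper name duplicate_word_ids is hypothetical and is not part of matcher_py.

```python
from typing import Dict, List

# Stand-in for the real type in extension_types.py (assumed shape):
# outer key = SimpleMatchType value, inner key = word_id, inner value = word.
SimpleMatchTableMap = Dict[int, Dict[int, str]]


def duplicate_word_ids(table_map: SimpleMatchTableMap) -> List[int]:
    """Return every word_id that appears more than once across all match types."""
    seen = set()
    duplicates = set()
    for words in table_map.values():
        for word_id in words:
            if word_id in seen:
                duplicates.add(word_id)
            seen.add(word_id)
    return sorted(duplicates)


# word_ids 1, 2 and 3 are all distinct, so this map is fine.
good = {1: {1: "hello", 2: "world"}, 2: {3: "你好"}}
# word_id 1 is reused under a second match type, which breaks the rule.
bad = {1: {1: "hello"}, 2: {1: "你好"}}

print(duplicate_word_ids(good))
print(duplicate_word_ids(bad))
```

Running a check like this before encoding the configuration keeps the error close to the data that caused it.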

MatchTable

  • table_id: The unique ID of the match table.
  • match_table_type: The type of the match table.
  • simple_match_type: The type of the simple match (only relevant if match_table_type is "simple").
  • word_list: The word list of the match table.
  • exemption_simple_match_type: The type of the exemption simple match.
  • exemption_word_list: The exemption word list of the match table.

For each match table, word matching is performed over the word_list, and exemption word matching is performed over the exemption_word_list. If the exemption word matching result is True, the word matching result will be False.
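The exemption rule can be sketched in plain Python. This is only an illustration of the logic described above, assuming plain substring search with no text normalization; the function table_matches is hypothetical and does not reflect how matcher_py is implemented internally.

```python
from typing import Sequence


def table_matches(
    text: str,
    word_list: Sequence[str],
    exemption_word_list: Sequence[str],
) -> bool:
    """A table matches when some word hits and no exemption word hits."""
    hit = any(word in text for word in word_list)
    exempt = any(word in text for word in exemption_word_list)
    return hit and not exempt


words = ["hello", "world"]
exemptions = ["word"]

print(table_matches("hello", words, exemptions))         # word hit, no exemption
print(table_matches("hello, word", words, exemptions))   # exemption "word" suppresses the hit
print(table_matches("hello, world", words, exemptions))  # "world" does not contain "word"
```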

MatchTableType

  • Simple = "simple": Supports simple multiple patterns matching with text normalization defined by simple_match_type.
    • We offer transformation methods for text normalization, including MatchFanjian, MatchNormalize, MatchPinYin ···.
    • It handles combination patterns and repeat-count-sensitive matching, with sub-patterns delimited by a comma (,). For example, hello,world,hello matches hellohelloworld and worldhellohello, but not helloworld, because hello must appear twice.
  • SimilarChar = "similar_char": Supports similar character matching using regex.
    • ["hello,hallo,hollo,hi", "word,world,wrd,🌍", "!,?,~"] will match helloworld, hollowrd, hi🌍 ··· any combinations of the words split by , in the list.
  • Acrostic = "acrostic": Supports acrostic matching using regex (currently only supports Chinese and simple English sentences).
    • ["h,e,l,l,o", "你,好"] will match hope, endures, love, lasts, onward. and 你的笑容温暖, 好心情常伴。.
  • SimilarTextLevenshtein = "similar_text_levenshtein": Supports similar text matching based on Levenshtein distance (threshold is 0.8).
    • ["helloworld"] will match helloworld, hellowrld, helloworld! ··· any similar text to the words in the list.
  • Regex = "regex": Supports regex matching.
    • ["h[aeiou]llo", "w[aeiou]rd"] will match hello, world, hillo, wurld ··· any text that matches the regex in the list.
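The combination semantics of the Simple table type can be illustrated with a short sketch. It assumes that each comma-separated sub-pattern must occur at least as many times as it is listed, in any order, and it counts non-overlapping occurrences. The function combination_matches is hypothetical and ignores text normalization.

```python
from collections import Counter


def combination_matches(pattern: str, text: str) -> bool:
    """Check that every sub-pattern occurs at least as often as it is listed."""
    required = Counter(pattern.split(","))
    # str.count counts non-overlapping occurrences of the sub-pattern.
    return all(text.count(word) >= times for word, times in required.items())


pattern = "hello,world,hello"  # requires "hello" twice and "world" once

print(combination_matches(pattern, "hellohelloworld"))
print(combination_matches(pattern, "worldhellohello"))
print(combination_matches(pattern, "helloworld"))  # only one "hello"
```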

SimpleMatchType

  • MatchNone = 1: No transformation.
  • MatchFanjian = 2: Traditional Chinese to simplified Chinese transformation.
    • 妳好 -> 你好
    • 現⾝ -> 现身
  • MatchDelete = 12: Delete all characters that are neither alphanumeric nor Unicode Chinese characters.
    • hello, world! -> helloworld
    • 《你∷好》 -> 你好
  • MatchNormalize = 16: Normalize all English character variations and number variations to basic characters.
    • ℋЀ⒈㈠ϕ -> he11o
    • ⒈Ƨ㊂ -> 123
  • MatchPinYin = 32: Convert all unicode Chinese characters to pinyin with boundaries.
    • 你好 -> ␀ni␀␀hao␀
    • 西安 -> ␀xi␀␀an␀
  • MatchPinYinChar = 64: Convert all unicode Chinese characters to pinyin without boundaries.
    • 你好 -> nihao
    • 西安 -> xian

You can combine these transformations as needed. Pre-defined combinations like MatchDeleteNormalize = 28 and MatchFanjianDeleteNormalize = 30 are provided for convenience.
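The listed values suggest that the transformations behave as bit flags. The sketch below uses plain integer constants copied from the list above as stand-ins for SimpleMatchType, and shows that the pre-defined combinations are the bitwise OR of their parts. Whether the real SimpleMatchType supports the | operator directly is an assumption to verify against extension_types.py.

```python
# Stand-in constants with the values listed above (not the real enum).
MATCH_NONE = 1
MATCH_FANJIAN = 2
MATCH_DELETE = 12
MATCH_NORMALIZE = 16
MATCH_PINYIN = 32
MATCH_PINYIN_CHAR = 64

# Combining transformations is a bitwise OR of their values.
match_delete_normalize = MATCH_DELETE | MATCH_NORMALIZE
match_fanjian_delete_normalize = MATCH_FANJIAN | MATCH_DELETE | MATCH_NORMALIZE


def has_flag(combined: int, flag: int) -> bool:
    """Return True when every bit of flag is set in combined."""
    return combined & flag == flag


print(match_delete_normalize)          # same value as MatchDeleteNormalize
print(match_fanjian_delete_normalize)  # same value as MatchFanjianDeleteNormalize
print(has_flag(match_fanjian_delete_normalize, MATCH_FANJIAN))
print(has_flag(match_delete_normalize, MATCH_FANJIAN))
```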

Avoid combining MatchPinYin and MatchPinYinChar, because MatchPinYin is a more restricted version of MatchPinYinChar. Without boundaries, a string such as xian can be treated either as two words, xi and an, or as a single word, xian.
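The difference between the two pinyin transformations can be seen with a toy example. The sketch below uses a hypothetical three-character lookup table and the ␀ symbol from the examples above as a visible boundary marker; the real library uses its own conversion table, and its actual boundary character may differ.

```python
# Tiny illustrative lookup table, not the library's conversion data.
PINYIN = {"西": "xi", "安": "an", "先": "xian"}
BOUNDARY = "␀"  # visible placeholder for the boundary marker


def to_pinyin(text: str, with_boundaries: bool) -> str:
    """Convert known Chinese characters to pinyin, optionally wrapping each syllable."""
    parts = []
    for ch in text:
        syllable = PINYIN.get(ch)
        if syllable is None:
            parts.append(ch)  # leave unknown characters untouched
        elif with_boundaries:
            parts.append(BOUNDARY + syllable + BOUNDARY)
        else:
            parts.append(syllable)
    return "".join(parts)


# With boundaries, the two inputs stay distinguishable.
print(to_pinyin("西安", with_boundaries=True))
print(to_pinyin("先", with_boundaries=True))

# Without boundaries, both collapse to the same string.
print(to_pinyin("西安", with_boundaries=False))
print(to_pinyin("先", with_boundaries=False))
```

With boundaries, 西安 and 先 produce different strings. Without them, both become xian, which is the ambiguity described above.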

Limitations

Simple Match supports at most 32 combined words per pattern (beyond 32, correct matching of the combined words is not guaranteed) and at most 8 repetitions of a word (more than 8 repetitions are capped at 8).

Matcher Basic Usage

Here’s an example of how to use the Matcher:

import msgspec
import numpy as np
from matcher_py import Matcher
from matcher_py.extension_types import MatchTable, MatchTableType, SimpleMatchType

msgpack_encoder = msgspec.msgpack.Encoder()
matcher = Matcher(
    msgpack_encoder.encode({
        1: [
            MatchTable(
                table_id=1,
                match_table_type=MatchTableType.Simple,
                simple_match_type=SimpleMatchType.MatchFanjianDeleteNormalize,
                word_list=["hello", "world"],
                exemption_simple_match_type=SimpleMatchType.MatchNone,
                exemption_word_list=["word"],
            )
        ]
    })
)
# Check if a text matches
assert matcher.is_match("hello")
assert not matcher.is_match("hello, word")
# Perform word matching as a dict
assert matcher.word_match(r"hello, world")[1]
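# Perform word matching as a string
result = matcher.word_match_as_string("hello")
# The result is a serialized string keyed by match_id; its exact layout may vary by version,
# so only check that the matched word is present.
assert "hello" in result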
# Perform batch processing as a dict using a list
text_list = ["hello", "world", "hello,word"]
batch_results = matcher.batch_word_match(text_list)
print(batch_results)
# Perform batch processing as a string using a list
text_list = ["hello", "world", "hello,word"]
batch_results = matcher.batch_word_match_as_string(text_list)
print(batch_results)
# Perform batch processing as a dict using a numpy array
text_array = np.array(["hello", "world", "hello,word"], dtype=np.dtype("object"))
numpy_results = matcher.numpy_word_match(text_array)
print(numpy_results)
# Perform batch processing as a string using a numpy array
text_array = np.array(["hello", "world", "hello,word"], dtype=np.dtype("object"))
numpy_results = matcher.numpy_word_match_as_string(text_array)
print(numpy_results)

Simple Matcher Basic Usage

Here’s an example of how to use the SimpleMatcher:

import msgspec
import numpy as np
from matcher_py import SimpleMatcher
from matcher_py.extension_types import SimpleMatchType

msgpack_encoder = msgspec.msgpack.Encoder()
simple_matcher = SimpleMatcher(
    msgpack_encoder.encode({SimpleMatchType.MatchNone: {1: "example"}})
)
# Check if a text matches
assert simple_matcher.is_match("example")
# Perform simple processing
results = simple_matcher.simple_process("example")
print(results)
# Perform batch processing using a list
text_list = ["example", "test", "example test"]
batch_results = simple_matcher.batch_simple_process(text_list)
print(batch_results)
# Perform batch processing using a NumPy array
text_array = np.array(["example", "test", "example test"], dtype=np.dtype("object"))
numpy_results = simple_matcher.numpy_simple_process(text_array)
print(numpy_results)

Contributing

Contributions to matcher_py are welcome! If you find a bug or have a feature request, please open an issue on the GitHub repository. If you would like to contribute code, please fork the repository and submit a pull request.

License

matcher_py is licensed under the MIT OR Apache-2.0 license.

More Information

For more details, visit the GitHub repository.
