
A high-performance, multi-functional word matcher


Matcher Rust Implementation with PyO3 Binding

Installation

Use pip

pip install matcher_py

Install pre-built binary

Visit the release page to download the pre-built binary.

Usage

The msgspec library is recommended for serializing the matcher configuration due to its performance benefits. You can also use other msgpack serialization libraries like ormsgpack. All relevant types are defined in extension_types.py.

Explanation of the configuration

  1. Matcher's configuration is defined by the type MatchTableMap = Dict[int, List[MatchTable]]. Each key of MatchTableMap is called a match_id. Within a given match_id, each table_id should be unique, although this is not required.
  2. SimpleMatcher's configuration is defined by the type SimpleMatchTableMap = Dict[SimpleMatchType, Dict[int, str]]. Each key of the inner Dict[int, str] is called a word_id, and every word_id must be globally unique.
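
To make the uniqueness requirement on word_id concrete, here is a minimal sketch that uses plain dictionaries, with strings standing in for SimpleMatchType members. The helper find_duplicate_word_ids is hypothetical and is not part of matcher_py; it only illustrates what "globally unique" means across the inner dictionaries.

```python
from collections import Counter
from typing import Dict, List

# Stand-in for SimpleMatchTableMap. The string keys are placeholders for
# SimpleMatchType members, not the real enum from extension_types.py.
SimpleTable = Dict[str, Dict[int, str]]


def find_duplicate_word_ids(table: SimpleTable) -> List[int]:
    """Return the word_ids that appear under more than one match type, sorted."""
    counts = Counter(word_id for words in table.values() for word_id in words)
    return sorted(word_id for word_id, n in counts.items() if n > 1)


# word_id 1 and 2 are distinct across the whole map: acceptable.
good = {"MatchNone": {1: "hello"}, "MatchFanjian": {2: "你好"}}
# word_id 1 is reused under a second match type: not globally unique.
bad = {"MatchNone": {1: "hello"}, "MatchFanjian": {1: "你好"}}
```

Running such a check before encoding the configuration is one way to catch reused IDs early.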

MatchTable

  • table_id: The unique ID of the match table.
  • match_table_type: The type of the match table.
  • simple_match_type: The type of the simple match (only relevant if match_table_type is "simple").
  • word_list: The word list of the match table.
  • exemption_simple_match_type: The type of the exemption simple match.
  • exemption_word_list: The exemption word list of the match table.

For each match table, word matching is performed over the word_list, and exemption word matching is performed over the exemption_word_list. If the exemption word matching result is True, the word matching result will be False.
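
The interaction between the two lists can be sketched in a few lines. This is an illustration only: plain substring containment stands in for the real matching engine, and table_matches is a hypothetical name, not a matcher_py function.

```python
from typing import Sequence


def table_matches(
    text: str,
    word_list: Sequence[str],
    exemption_word_list: Sequence[str],
) -> bool:
    """Report a match only if a word hits and no exemption word hits.

    Assumption: substring containment is used here as a stand-in for the
    library's own matching and text normalization.
    """
    hit = any(word in text for word in word_list)
    exempt = any(word in text for word in exemption_word_list)
    return hit and not exempt
```

With word_list=["hello", "world"] and exemption_word_list=["word"], the text "hello" matches, while "hello, word" does not, because the exemption word is present.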

MatchTableType

  • Simple: Supports multi-pattern matching with text normalization defined by simple_match_type.
    • We offer transformation methods for text normalization, including MatchFanjian, MatchNormalize, MatchPinYin ···.
    • It supports combination patterns that are sensitive to repeat counts, with sub-words delimited by ,. For example, hello,world,hello will match hellohelloworld and worldhellohello, but not helloworld, because hello must appear twice.
  • SimilarChar: Supports similar character matching using regex.
    • ["hello,hallo,hollo,hi", "word,world,wrd,🌍", "!,?,~"] will match helloworld, hollowrd, hi🌍 ··· any combination of the words split by , in the list.
  • Acrostic: Supports acrostic matching using regex (currently only supports Chinese and simple English sentences).
    • ["h,e,l,l,o", "你,好"] will match hope, endures, love, lasts, onward. and 你的笑容温暖, 好心情常伴。.
  • SimilarTextLevenshtein: Supports similar text matching based on Levenshtein distance (threshold is 0.8).
    • ["helloworld"] will match helloworld, hellowrld, helloworld! ··· any similar text to the words in the list.
  • Regex: Supports regex matching.
    • ["h[aeiou]llo", "w[aeiou]rd"] will match hello, world, hillo, wurld ··· any text that matches the regex in the list.
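
As an illustration of the 0.8 threshold used by SimilarTextLevenshtein, the sketch below computes a normalized similarity between two whole strings. The normalization (1 minus distance divided by the longer length) and the whole-string comparison are assumptions made for this example; the README does not state the exact formula the library uses.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute (or keep) the character
                )
            )
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def is_similar(text: str, word: str, threshold: float = 0.8) -> bool:
    return similarity(text, word) >= threshold
```

Under this assumption, hellowrld and helloworld! are each one edit away from helloworld and clear the threshold, while hello alone does not.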

SimpleMatchType

  • MatchNone: No transformation.
  • MatchFanjian: Traditional Chinese to simplified Chinese transformation.
    • 妳好 -> 你好
    • 現⾝ -> 现身
  • MatchDelete: Delete all characters that are neither alphanumeric nor Unicode Chinese characters.
    • hello, world! -> helloworld
    • 《你∷好》 -> 你好
  • MatchNormalize: Normalize all English character variations and number variations to basic characters.
    • ℋЀ⒈㈠ϕ -> he11o
    • ⒈Ƨ㊂ -> 123
  • MatchPinYin: Convert all unicode Chinese characters to pinyin with boundaries.
    • 你好 -> ␀ni␀␀hao␀
    • 西安 -> ␀xi␀␀an␀
  • MatchPinYinChar: Convert all unicode Chinese characters to pinyin without boundaries.
    • 你好 -> nihao
    • 西安 -> xian

You can combine these transformations as needed. Pre-defined combinations like MatchDeleteNormalize and MatchFanjianDeleteNormalize are provided for convenience.
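
Combining transformations can be pictured as function composition. The sketch below is a toy model, not the library's implementation: delete_step approximates MatchDelete with Python's str.isalnum, normalize_step uses a three-entry mapping taken from the ⒈Ƨ㊂ -> 123 example, and the order in which the steps are applied is an assumption.

```python
from functools import reduce
from typing import Callable

Transform = Callable[[str], str]

# Toy mapping built from the "⒈Ƨ㊂ -> 123" example; a real table is far larger.
TOY_NORMALIZE = str.maketrans({"⒈": "1", "Ƨ": "2", "㊂": "3"})


def delete_step(text: str) -> str:
    """Keep only characters Python treats as alphanumeric (CJK included)."""
    return "".join(ch for ch in text if ch.isalnum())


def normalize_step(text: str) -> str:
    """Map the few variant characters known to the toy table to basic ones."""
    return text.translate(TOY_NORMALIZE)


def compose(*steps: Transform) -> Transform:
    """Chain transformations, applying them left to right."""
    return lambda text: reduce(lambda acc, step: step(acc), steps, text)


# Hypothetical analogue of a pre-defined combination such as MatchDeleteNormalize.
delete_normalize = compose(delete_step, normalize_step)
```

Here delete_step("hello, world!") gives helloworld, and delete_normalize strips the punctuation from "⒈, Ƨ, ㊂!" before mapping the remaining characters to 123.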

Avoid combining MatchPinYin and MatchPinYinChar, because MatchPinYin is a more limited version of MatchPinYinChar. In some cases, such as xian, the text can be treated either as two words, xi and an, or as the single word xian.
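
The xian ambiguity is easy to reproduce with a toy conversion table. The three-character table and the to_pinyin helper below are hypothetical, and the visible ␀ symbol is used as the boundary marker only because that is how the examples above print it.

```python
# Hypothetical three-character table; a real pinyin table covers thousands
# of characters.
TOY_PINYIN = {"西": "xi", "安": "an", "先": "xian"}
BOUNDARY = "␀"  # visible stand-in for the boundary marker in the examples


def to_pinyin(text: str, with_boundaries: bool) -> str:
    """Convert known characters to pinyin, optionally wrapping each syllable."""
    parts = []
    for ch in text:
        syllable = TOY_PINYIN.get(ch)
        if syllable is None:
            parts.append(ch)  # leave characters outside the table untouched
        elif with_boundaries:
            parts.append(f"{BOUNDARY}{syllable}{BOUNDARY}")
        else:
            parts.append(syllable)
    return "".join(parts)
```

With boundaries, 西安 and 先 stay distinguishable (␀xi␀␀an␀ versus ␀xian␀). Without boundaries, both collapse to xian, so a pattern written for one also matches the other.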

Limitations

  • Simple Match can handle patterns with at most 32 combined words; beyond 32, the effective combined words are not guaranteed. A word can be repeated at most 8 times; repeat counts above 8 are limited to 8.
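
The combination patterns described under MatchTableType, together with the repeat limit above, can be sketched as follows. This is an interpretation rather than the library's algorithm: it assumes a sub-word must occur at least as many times as it is listed, counts non-overlapping occurrences with str.count, and does not model the 32-word limit. The names parse_pattern and pattern_matches are hypothetical.

```python
from collections import Counter

MAX_REPEAT = 8  # repeat counts above this are clamped, per the limitation above


def parse_pattern(pattern: str) -> Counter:
    """Split on ',' and count how often each sub-word is required."""
    counts = Counter(part for part in pattern.split(",") if part)
    return Counter({word: min(n, MAX_REPEAT) for word, n in counts.items()})


def pattern_matches(pattern: str, text: str) -> bool:
    """True when every sub-word occurs at least as often as the pattern asks."""
    required = parse_pattern(pattern)
    return all(text.count(word) >= n for word, n in required.items())
```

For the pattern hello,world,hello, both hellohelloworld and worldhellohello satisfy the counts, while helloworld contains hello only once and is rejected.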

Matcher Basic Usage

Here’s an example of how to use the Matcher:

import msgspec
import numpy as np
from matcher_py import Matcher
from matcher_py.extension_types import MatchTable, MatchTableType, SimpleMatchType

msgpack_encoder = msgspec.msgpack.Encoder()
matcher = Matcher(
    msgpack_encoder.encode({
        1: [
            MatchTable(
                table_id=1,
                match_table_type=MatchTableType.Simple,
                simple_match_type=SimpleMatchType.MatchFanjianDeleteNormalize,
                word_list=["hello", "world"],
                exemption_simple_match_type=SimpleMatchType.MatchNone,
                exemption_word_list=["word"],
            )
        ]
    })
)
# Check if a text matches
assert matcher.is_match("hello")
assert not matcher.is_match("hello, word")
# Perform word matching as a dict
assert matcher.word_match(r"hello, world")[1]
# Perform word matching as a string
result = matcher.word_match_as_string("hello")
assert result == """{1:[{\"table_id\":1,\"word\":\"hello\"}]"}"""
# Perform batch processing as a dict using a list
text_list = ["hello", "world", "hello,word"]
batch_results = matcher.batch_word_match(text_list)
print(batch_results)
# Perform batch processing as a string using a list
text_list = ["hello", "world", "hello,word"]
batch_results = matcher.batch_word_match_as_string(text_list)
print(batch_results)
# Perform batch processing as a dict using a numpy array
text_array = np.array(["hello", "world", "hello,word"], dtype=np.dtype("object"))
numpy_results = matcher.numpy_word_match(text_array)
print(numpy_results)
# Perform batch processing as a string using a numpy array
text_array = np.array(["hello", "world", "hello,word"], dtype=np.dtype("object"))
numpy_results = matcher.numpy_word_match_as_string(text_array)
print(numpy_results)

Simple Matcher Basic Usage

Here’s an example of how to use the SimpleMatcher:

import msgspec
import numpy as np
from matcher_py import SimpleMatcher
from matcher_py.extension_types import SimpleMatchType

msgpack_encoder = msgspec.msgpack.Encoder()
simple_matcher = SimpleMatcher(
    msgpack_encoder.encode({SimpleMatchType.MatchNone: {1: "example"}})
)
# Check if a text matches
assert simple_matcher.is_match("example")
# Perform simple processing
results = simple_matcher.simple_process("example")
print(results)
# Perform batch processing using a list
text_list = ["example", "test", "example test"]
batch_results = simple_matcher.batch_simple_process(text_list)
print(batch_results)
# Perform batch processing using a NumPy array
text_array = np.array(["example", "test", "example test"], dtype=np.dtype("object"))
numpy_results = simple_matcher.numpy_simple_process(text_array)
print(numpy_results)

Contributing

Contributions to matcher_py are welcome! If you find a bug or have a feature request, please open an issue on the GitHub repository. If you would like to contribute code, please fork the repository and submit a pull request.

License

matcher_py is licensed under the MIT OR Apache-2.0 license.

For more details, visit the GitHub repository.

