A high-performance, multi-functional word matcher

Matcher Rust Implementation with PyO3 Binding

A high-performance, multi-functional word matcher implemented in Rust.

It is designed to handle AND, OR, and NOT logic as well as text variations in word and word-list matching. For implementation details, see the Design Document.

Features

  • Multiple Matching Methods:
    • Simple Word Matching
    • Regex-Based Matching
    • Similarity-Based Matching
  • Text Normalization:
    • Fanjian: Simplify traditional Chinese characters to simplified ones. Example: 蟲艸 -> 虫艹
    • Delete: Remove specific characters. Example: *Fu&*iii&^%%*&kkkk -> Fuiiikkkk
    • Normalize: Normalize special characters to identifiable characters. Example: 𝜢𝕰𝕃𝙻Ϙ 𝙒ⓞƦℒ𝒟! -> hello world
    • PinYin: Convert Chinese characters to Pinyin for fuzzy matching. Example: 西安 -> /xi//an/, which matches 洗按 -> /xi//an/, but not a single character whose Pinyin is /xian/
    • PinYinChar: Convert Chinese characters to Pinyin without boundaries. Example: 西安 -> xian, which matches both 洗按 and a single character whose Pinyin is xian
  • Combination and Repeated Word Matching:
    • Takes into account the number of repetitions of words.
    • Example: hello,world matches hello world and world,hello
    • Example: 无,法,无,天 matches 无无法天 (because 无 is repeated twice), but not 无法天
  • Customizable Exemption Lists: Exclude specific words from matching.
  • Efficient Handling of Large Word Lists: Optimized for performance.
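
The combination and repetition rule listed above can be illustrated with a minimal pure-Python sketch. This is not the library's implementation (which is written in Rust); combination_match is a hypothetical helper that splits a pattern on commas, counts how often each sub-word is required, and checks that the text contains at least that many occurrences.

```python
from collections import Counter


def combination_match(pattern: str, text: str) -> bool:
    """Return True if every comma-separated sub-word of `pattern` occurs in
    `text` at least as many times as it is listed in the pattern."""
    required = Counter(part for part in pattern.split(",") if part)
    # str.count counts non-overlapping occurrences, which is enough for a sketch.
    return all(text.count(word) >= times for word, times in required.items())


# Order of the sub-words does not matter, only their presence and count.
print(combination_match("hello,world", "hello world"))
print(combination_match("无,法,无,天", "无无法天"))
print(combination_match("无,法,无,天", "无法天"))
```

A Counter keeps the required number of repetitions per sub-word, so the order in which sub-words appear in the text is irrelevant.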

Installation

Use pip

pip install matcher_py

Install pre-built binary

Visit the release page to download the pre-built binary.

Usage

The msgspec library is recommended for serializing the matcher configuration due to its performance benefits. You can also use other msgpack serialization libraries like ormsgpack. All relevant types are defined in extension_types.py.

Explanation of the configuration

  • Matcher's configuration is defined by the MatchTableMap = Dict[int, List[MatchTable]] type. Each key of MatchTableMap is called a match_id. Within a match_id, each table_id should be unique, although this is not required.
  • SimpleMatcher's configuration is defined by the SimpleMatchTableMap = Dict[SimpleMatchType, Dict[int, str]] type. Each key of the inner Dict[int, str] is called a word_id, and every word_id must be globally unique.
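
Because word_id must be globally unique across all match types, it can help to validate a configuration before serializing it. The following is a small sketch under stated assumptions: plain strings stand in for SimpleMatchType members so the code runs without matcher_py, and duplicate_word_ids is a hypothetical helper, not part of the library.

```python
from typing import Dict, List

# Hypothetical stand-in for SimpleMatchTableMap: string keys replace the
# SimpleMatchType enum members so the sketch has no third-party imports.
SimpleTableSketch = Dict[str, Dict[int, str]]


def duplicate_word_ids(table_map: SimpleTableSketch) -> List[int]:
    """Return the word_ids that appear more than once across all match types."""
    seen = set()
    duplicates = set()
    for words in table_map.values():
        for word_id in words:
            if word_id in seen:
                duplicates.add(word_id)
            seen.add(word_id)
    return sorted(duplicates)


good = {"None": {1: "hello"}, "Fanjian": {2: "你好"}}
bad = {"None": {1: "hello"}, "Fanjian": {1: "你好"}}
print(duplicate_word_ids(good))  # no duplicates expected
print(duplicate_word_ids(bad))   # word_id 1 is reused across match types
```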

MatchTable

  • table_id: The unique ID of the match table.
  • match_table_type: The type of the match table.
  • word_list: The word list of the match table.
  • exemption_simple_match_type: The type of the exemption simple match.
  • exemption_word_list: The exemption word list of the match table.

For each match table, word matching is performed over the word_list, and exemption word matching is performed over the exemption_word_list. If the exemption word matching result is True, the word matching result will be False.
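
The exemption rule can be sketched in a few lines of plain Python. This is an illustration only: table_matches is a hypothetical function that uses simple substring checks and ignores text normalization, whereas the library applies the configured match types.

```python
from typing import Iterable


def table_matches(
    text: str,
    word_list: Iterable[str],
    exemption_word_list: Iterable[str],
) -> bool:
    """Substring version of the rule: a hit in the exemption list vetoes the match."""
    if any(word in text for word in exemption_word_list):
        return False
    return any(word in text for word in word_list)


print(table_matches("hello", ["hello", "world"], ["word"]))
print(table_matches("hello, word", ["hello", "world"], ["word"]))
```

The exemption check runs first, so a text containing an exempted word never counts as a match, regardless of how many words from word_list it contains.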

MatchTableType

  • Simple: Supports simple multi-pattern matching, with text normalization defined by simple_match_type.
    • Transformation methods for text normalization include Fanjian, Normalize, PinYin, and others.
    • It handles combination patterns and repetition-sensitive matching, with sub-words delimited by a comma (,). For example, hello,world,hello matches hellohelloworld and worldhellohello, but not helloworld, because hello must appear twice.
  • Regex: Supports regex pattern matching.
    • SimilarChar: Supports similar character matching using regex.
      • ["hello,hallo,hollo,hi", "word,world,wrd,🌍", "!,?,~"] will match helloworld, hollowrd, hi🌍 ··· any combinations of the words split by , in the list.
    • Acrostic: Supports acrostic matching using regex (currently only supports Chinese and simple English sentences).
      • ["h,e,l,l,o", "你,好"] will match hope, endures, love, lasts, onward. and 你的笑容温暖, 好心情常伴。.
    • Regex: Supports regex matching.
      • ["h[aeiou]llo", "w[aeiou]rd"] will match hello, world, hillo, wurld ··· any text that matches the regex in the list.
  • Similar: Supports similar text matching based on distance and threshold.
    • Levenshtein: Supports similar text matching based on Levenshtein distance.
    • DamerauLevenshtein: Supports similar text matching based on Damerau-Levenshtein distance.
    • Indel: Supports similar text matching based on Indel distance.
    • Jaro: Supports similar text matching based on Jaro distance.
    • JaroWinkler: Supports similar text matching based on Jaro-Winkler distance.
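
To make "distance and threshold" concrete, here is a minimal sketch of Levenshtein-based similarity in pure Python. It assumes the threshold is compared against a normalized similarity (1 minus distance divided by the longer length); how the library normalizes its scores is not stated here, so treat that as an assumption. Both levenshtein and is_similar are hypothetical helper names.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


def is_similar(word: str, text: str, threshold: float) -> bool:
    """Assumed rule: normalized similarity must reach the threshold."""
    longest = max(len(word), len(text))
    if longest == 0:
        return True
    similarity = 1.0 - levenshtein(word, text) / longest
    return similarity >= threshold


print(levenshtein("kitten", "sitting"))
print(is_similar("hello", "hallo", 0.75))
print(is_similar("hello", "world", 0.75))
```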

SimpleMatchType

  • None: No transformation.
  • Fanjian: Traditional Chinese to simplified Chinese transformation. Based on FANJIAN and UNICODE.
    • 妳好 -> 你好
    • 現⾝ -> 现身
  • Delete: Delete all punctuation, special characters and white spaces.
    • hello, world! -> helloworld
    • 《你∷好》 -> 你好
  • Normalize: Normalize all English character variations and number variations to basic characters. Based on UPPER_LOWER, EN_VARIATION and NUM_NORM.
    • ℋЀ⒈㈠ϕ -> he11o
    • ⒈Ƨ㊂ -> 123
  • PinYin: Convert all unicode Chinese characters to pinyin with boundaries. Based on PINYIN.
    • 你好 -> ␀ni␀␀hao␀
    • 西安 -> ␀xi␀␀an␀
  • PinYinChar: Convert all unicode Chinese characters to pinyin without boundaries. Based on PINYIN_CHAR.
    • 你好 -> nihao
    • 西安 -> xian

You can combine these transformations as needed. Pre-defined combinations like DeleteNormalize and FanjianDeleteNormalize are provided for convenience.
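
Combining transformations can be pictured as bit flags applied in sequence. The sketch below is a loose analogy rather than the library's behavior: Transform, delete, normalize, and apply_transforms are hypothetical names, Delete is approximated by keeping only letters and digits, and Normalize is approximated by Unicode NFKC folding plus lowercasing instead of the library's own tables.

```python
import unicodedata
from enum import IntFlag


class Transform(IntFlag):
    """Hypothetical flags, loosely modeled on combining SimpleMatchType values."""
    NONE = 0
    DELETE = 1
    NORMALIZE = 2


def delete(text: str) -> str:
    # Stand-in for Delete: keep letters and digits (CJK characters count as letters).
    return "".join(ch for ch in text if ch.isalnum())


def normalize(text: str) -> str:
    # Stand-in for Normalize: Unicode compatibility folding plus lowercasing.
    return unicodedata.normalize("NFKC", text).lower()


def apply_transforms(flags: Transform, text: str) -> str:
    """Apply each enabled transformation in a fixed order."""
    if flags & Transform.DELETE:
        text = delete(text)
    if flags & Transform.NORMALIZE:
        text = normalize(text)
    return text


print(apply_transforms(Transform.DELETE, "hello, world!"))
print(apply_transforms(Transform.DELETE | Transform.NORMALIZE, "Ｈｅｌｌｏ, Ｗｏｒｌｄ!"))
```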

Avoid combining PinYin and PinYinChar. PinYin is a more restrictive version of PinYinChar: without boundaries, a string such as xian can be read either as two words, xi and an, or as the single word xian.
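
The difference between the two Pinyin modes can be shown with a tiny sketch. The five-entry PINYIN table, the BOUNDARY marker, and the functions pinyin and pinyin_char are all hypothetical and chosen for illustration; the character 先 (read xian) is added here as an example of a single-syllable reading, and the library ships its own full data.

```python
# Tiny illustrative table; not the library's PINYIN data.
PINYIN = {"西": "xi", "安": "an", "洗": "xi", "按": "an", "先": "xian"}
BOUNDARY = "\u2400"  # the "␀" symbol, used here only as a visible boundary marker


def pinyin(text: str) -> str:
    """Convert with boundaries: each syllable is wrapped in boundary markers."""
    return "".join(
        f"{BOUNDARY}{PINYIN[ch]}{BOUNDARY}" if ch in PINYIN else ch for ch in text
    )


def pinyin_char(text: str) -> str:
    """Convert without boundaries: syllables run together."""
    return "".join(PINYIN.get(ch, ch) for ch in text)


# With boundaries, the two-syllable 西安 stays distinct from the one-syllable 先.
print(pinyin("西安") == pinyin("洗按"))
print(pinyin("西安") == pinyin("先"))
# Without boundaries, both collapse to the same string.
print(pinyin_char("西安") == pinyin_char("先"))
```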

Delete is technically a combination of TextDelete and WordDelete, because different delete methods are implemented for text and for words: CN_SPECIAL and EN_SPECIAL are considered part of a word, but not part of the text. With the text_process and reduce_text_process functions, use TextDelete rather than WordDelete.

Limitations

Simple Match supports at most 32 combined words per pattern (beyond 32, the additional combined words are not guaranteed to take effect) and at most 8 repetitions of a word (more than 8 repetitions are capped at 8).

Text Process Usage

Here’s an example of how to use the reduce_text_process and text_process functions:

from matcher_py import reduce_text_process, text_process
from matcher_py.extension_types import SimpleMatchType

print(reduce_text_process(SimpleMatchType.MatchTextDelete | SimpleMatchType.MatchNormalize, "hello, world!"))
print(text_process(SimpleMatchType.MatchTextDelete, "hello, world!"))

Matcher Basic Usage

Here’s an example of how to use the Matcher:

import msgspec
import numpy as np
from matcher_py import Matcher
from matcher_py.extension_types import MatchTable, MatchTableType, SimpleMatchType

msgpack_encoder = msgspec.msgpack.Encoder()
matcher = Matcher(
    msgpack_encoder.encode({
        1: [
            MatchTable(
                table_id=1,
                match_table_type=MatchTableType.Simple(simple_match_type = SimpleMatchType.MatchFanjianDeleteNormalize),
                word_list=["hello", "world"],
                exemption_simple_match_type=SimpleMatchType.MatchNone,
                exemption_word_list=["word"],
            )
        ]
    })
)
# Check if a text matches
assert matcher.is_match("hello")
assert not matcher.is_match("hello, word")
# Perform word matching as a dict
assert matcher.word_match(r"hello, world")[1]
# Perform word matching as a string
result = matcher.word_match_as_string("hello")
# The result is a serialized string whose exact layout may vary by version,
# so print it rather than compare it against a literal.
print(result)
# Perform batch processing as a dict using a list
text_list = ["hello", "world", "hello,word"]
batch_results = matcher.batch_word_match(text_list)
print(batch_results)
# Perform batch processing as a string using a list
text_list = ["hello", "world", "hello,word"]
batch_results = matcher.batch_word_match_as_string(text_list)
print(batch_results)
# Perform batch processing as a dict using a numpy array
text_array = np.array(["hello", "world", "hello,word"], dtype=np.dtype("object"))
numpy_results = matcher.numpy_word_match(text_array)
print(numpy_results)
# Perform batch processing as a string using a numpy array
text_array = np.array(["hello", "world", "hello,word"], dtype=np.dtype("object"))
numpy_results = matcher.numpy_word_match_as_string(text_array)
print(numpy_results)

Simple Matcher Basic Usage

Here’s an example of how to use the SimpleMatcher:

import msgspec
import numpy as np
from matcher_py import SimpleMatcher
from matcher_py.extension_types import SimpleMatchType

msgpack_encoder = msgspec.msgpack.Encoder()
simple_matcher = SimpleMatcher(
    msgpack_encoder.encode({SimpleMatchType.MatchNone: {1: "example"}})
)
# Check if a text matches
assert simple_matcher.is_match("example")
# Perform simple processing
results = simple_matcher.simple_process("example")
print(results)
# Perform batch processing using a list
text_list = ["example", "test", "example test"]
batch_results = simple_matcher.batch_simple_process(text_list)
print(batch_results)
# Perform batch processing using a NumPy array
text_array = np.array(["example", "test", "example test"], dtype=np.dtype("object"))
numpy_results = simple_matcher.numpy_simple_process(text_array)
print(numpy_results)

Contributing

Contributions to matcher_py are welcome! If you find a bug or have a feature request, please open an issue on the GitHub repository. If you would like to contribute code, please fork the repository and submit a pull request.

License

matcher_py is licensed under the MIT OR Apache-2.0 license.

More Information

For more details, visit the GitHub repository.
