Blazingly fast Word Matcher

Matcher Rust Implementation with PyO3 Binding

A high-performance, multi-functional word matcher implemented in Rust.

It is designed to solve AND, OR, NOT and text-variation problems in word and word-list matching. For implementation details, see the Design Document.

Features

  • Multiple Matching Methods:
    • Simple Word Matching
    • Regex-Based Matching
    • Similarity-Based Matching
  • Text Normalization:
    • Fanjian: Simplify traditional Chinese characters to simplified ones. Example: 蟲艸 -> 虫草
    • Delete: Remove specific characters. Example: *Fu&*iii&^%%*&kkkk -> Fuiiikkkk
    • Normalize: Normalize special characters to identifiable characters. Example: 𝜢𝕰𝕃𝙻Ϙ 𝙒ⓞƦℒ𝒟! -> hello world!
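    • PinYin: Convert Chinese characters to Pinyin, keeping syllable boundaries, for fuzzy matching. Example: 西安 -> /xi//an/, matches 洗按 -> /xi//an/, but not a single character read as xian, such as 先 -> /xian/
    • PinYinChar: Convert Chinese characters to Pinyin without syllable boundaries. Example: 西安 -> xian, matches both 洗按 and a single character read as xian, such as 先 -> xian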
  • AND OR NOT Word Matching:
    • Takes into account the number of repetitions of words.
    • Example: hello&world matches hello world and world,hello
    • Example: 无&法&无&天 matches 无无法天 (because 无 is repeated twice), but not 无法天
    • Example: hello~helloo~hhello matches hello, but not helloo or hhello
  • Customizable Exemption Lists: Exclude specific words from matching.
  • Efficient Handling of Large Word Lists: Optimized for performance.
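
The AND / NOT examples above can be reproduced with a few lines of plain Python. This is only a minimal sketch of the semantics, not the library's implementation: the function name toy_match is hypothetical, it uses naive substring counting, and it assumes that everything before the first ~ is the AND part and every piece after a ~ is a NOT word.

```python
from collections import Counter


def toy_match(pattern: str, text: str) -> bool:
    """Toy AND/NOT matcher: 'a&b~c' means a AND b AND NOT c."""
    # Assumption: the part before the first '~' holds the AND words,
    # and every piece after a '~' is a NOT word.
    and_part, *not_words = pattern.split("~")
    # Counting duplicates makes 'x&x' require two occurrences of x.
    required = Counter(and_part.split("&"))
    if any(text.count(word) < times for word, times in required.items()):
        return False
    # Any NOT word found in the text vetoes the match.
    return not any(word in text for word in not_words)


print(toy_match("hello&world", "world,hello"))     # True
print(toy_match("无&法&无&天", "无法天"))            # False
print(toy_match("hello~helloo~hhello", "helloo"))  # False
```

Using a Counter keeps the repetition rule explicit: each distinct word must occur at least as many times as it is listed in the pattern.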

Installation

Use pip

pip install matcher_py

Install pre-built binary

Visit the release page to download the pre-built binary.

Usage

The msgspec library is recommended for serializing the matcher configuration due to its performance benefits. You can also use other msgpack serialization libraries like ormsgpack. All relevant types are defined in extension_types.py.

Explanation of the configuration

  • Matcher's configuration is defined by the type MatchTableMap = Dict[int, List[MatchTable]]. Each key of MatchTableMap is called a match_id. Within a match_id, each table_id should be unique, although this is not required.
  • SimpleMatcher's configuration is defined by the type SimpleMatchTableMap = Dict[SimpleMatchType, Dict[int, str]]. Each key of the inner Dict[int, str] is called a word_id, and every word_id must be globally unique.
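
Because word_id must be unique across every match type, it can help to check a configuration before handing it to SimpleMatcher. The sketch below is a hypothetical helper that is not part of matcher_py. It uses plain strings in place of SimpleMatchType members so that it runs without the library installed.

```python
from typing import Dict, List

# Stand-in for SimpleMatchTableMap; plain strings replace SimpleMatchType members.
ToySimpleTableMap = Dict[str, Dict[int, str]]


def duplicate_word_ids(table_map: ToySimpleTableMap) -> List[int]:
    """Return the word_ids that appear under more than one match type, sorted."""
    seen = set()
    duplicates = set()
    for words in table_map.values():
        for word_id in words:
            if word_id in seen:
                duplicates.add(word_id)
            seen.add(word_id)
    return sorted(duplicates)


good = {"None": {1: "hello"}, "Fanjian": {2: "你好"}}
bad = {"None": {1: "hello"}, "Fanjian": {1: "你好"}}
print(duplicate_word_ids(good))  # []
print(duplicate_word_ids(bad))   # [1]
```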

MatchTable

  • table_id: The unique ID of the match table.
  • match_table_type: The type of the match table.
  • word_list: The word list of the match table.
  • exemption_simple_match_type: The type of the exemption simple match.
  • exemption_word_list: The exemption word list of the match table.

For each match table, word matching is performed over the word_list, and exemption word matching is performed over the exemption_word_list. If the exemption word matching result is True, the word matching result will be False.
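
The exemption rule can be pictured as "a word hit that is not vetoed by an exemption hit". The following is a rough sketch under simplifying assumptions: ToyTable is a hypothetical class, matching is plain substring search, and no text normalization is applied.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyTable:
    """Hypothetical stand-in for a MatchTable with naive substring matching."""

    word_list: List[str]
    exemption_word_list: List[str] = field(default_factory=list)

    def is_match(self, text: str) -> bool:
        hit = any(word in text for word in self.word_list)
        exempt = any(word in text for word in self.exemption_word_list)
        # An exemption hit overrides any word hit.
        return hit and not exempt


table = ToyTable(word_list=["hello", "world"], exemption_word_list=["word"])
print(table.is_match("hello"))        # True
print(table.is_match("hello, word"))  # False
```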

MatchTableType

  • Simple: Supports simple multi-pattern matching with the text normalization defined by simple_match_type.
    • Text normalization methods include Fanjian, Normalize, PinYin ···.
    • It handles combination patterns, delimited by &, and is sensitive to how many times a word is repeated: hello&world&hello matches hellohelloworld and worldhellohello, but not helloworld, because hello must appear twice.
  • Regex: Supports regex patterns matching.
    • SimilarChar: Supports similar character matching using regex.
      • ["hello,hallo,hollo,hi", "word,world,wrd,🌍", "!,?,~"] will match helloworld!, hollowrd?, hi🌍~ ··· that is, any combination that takes one comma-separated alternative from each entry in the list, in order.
    • Acrostic: Supports acrostic matching using regex (currently only supports Chinese and simple English sentences).
      • ["h,e,l,l,o", "你,好"] will match hope, endures, love, lasts, onward. and 你的笑容温暖, 好心情常伴。.
    • Regex: Supports regex matching.
      • ["h[aeiou]llo", "w[aeiou]rd"] will match hello, world, hillo, wurld ··· any text that matches the regex in the list.
  • Similar: Supports similar text matching based on distance and threshold.
    • Levenshtein: Supports similar text matching based on Levenshtein distance.
    • DamerauLevenshtein: Supports similar text matching based on Damerau-Levenshtein distance.
    • Indel: Supports similar text matching based on Indel distance.
    • Jaro: Supports similar text matching based on Jaro distance.
    • JaroWinkler: Supports similar text matching based on Jaro-Winkler distance.
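
Two of the table types above are easy to illustrate with the standard library. The sketch below shows one plausible way to expand a SimilarChar word list into a regex, and a threshold check built on Levenshtein distance. Both helpers are hypothetical, and how the library builds its regexes or normalizes distances internally is an assumption here. The similarity is taken as 1 - distance / max(len(a), len(b)).

```python
import re
from typing import List


def similar_char_regex(groups: List[str]) -> re.Pattern:
    """Turn ["a,b", "c,d"] into the regex (a|b)(c|d)."""
    parts = []
    for group in groups:
        alternatives = "|".join(re.escape(item) for item in group.split(","))
        parts.append(f"({alternatives})")
    return re.compile("".join(parts))


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def is_similar(word: str, text: str, threshold: float) -> bool:
    """Assumed normalization: similarity = 1 - distance / longest length."""
    longest = max(len(word), len(text)) or 1
    return 1 - levenshtein(word, text) / longest >= threshold


pattern = similar_char_regex(["hello,hallo,hollo,hi", "word,world,wrd,🌍", "!,?,~"])
print(bool(pattern.search("hollowrd?")))  # True
print(levenshtein("kitten", "sitting"))   # 3
print(is_similar("hello", "help", 0.75))  # False
```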

SimpleMatchType

  • None: No transformation.
  • Fanjian: Traditional Chinese to simplified Chinese transformation. Based on FANJIAN.
    • 妳好 -> 你好
    • 現⾝ -> 现身
  • Delete: Delete all punctuation, special characters and white spaces.
    • hello, world! -> helloworld
    • 《你∷好》 -> 你好
  • Normalize: Normalize all English character variations and number variations to basic characters. Based on SYMBOL_NORM, NORM and NUM_NORM.
    • ℋЀ⒈㈠Õ -> he11o
    • ⒈Ƨ㊂ -> 123
  • PinYin: Convert all unicode Chinese characters to pinyin with boundaries. Based on PINYIN.
    • 你好 -> ␀ni␀␀hao␀
    • 西安 -> ␀xi␀␀an␀
  • PinYinChar: Convert all unicode Chinese characters to pinyin without boundaries. Based on PINYIN_CHAR.
    • 你好 -> nihao
    • 西安 -> xian

You can combine these transformations as needed. Pre-defined combinations like DeleteNormalize and FanjianDeleteNormalize are provided for convenience.
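
Combining transformations behaves like combining bit flags. The toy model below illustrates that idea with enum.IntFlag. The names ToyType and toy_process are hypothetical, the conversion table holds only four characters, and the deletion rule (drop Unicode punctuation, symbols and separators) is an assumption standing in for the library's real tables.

```python
import unicodedata
from enum import IntFlag


class ToyType(IntFlag):
    """Hypothetical flags mirroring the idea of combinable match types."""

    NONE = 0
    FANJIAN = 1
    DELETE = 2


# Tiny illustrative table; the real FANJIAN table is far larger.
FANJIAN_TABLE = str.maketrans({"妳": "你", "現": "现", "蟲": "虫", "艸": "草"})


def toy_process(kind: ToyType, text: str) -> str:
    if kind & ToyType.FANJIAN:
        text = text.translate(FANJIAN_TABLE)
    if kind & ToyType.DELETE:
        # Drop punctuation (P*), symbols (S*) and separators (Z*).
        text = "".join(
            ch for ch in text if unicodedata.category(ch)[0] not in "PSZ"
        )
    return text


print(toy_process(ToyType.DELETE, "hello, world!"))               # helloworld
print(toy_process(ToyType.FANJIAN | ToyType.DELETE, "《妳∷好》"))  # 你好
```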

Avoid combining PinYin and PinYinChar. PinYin is a more restrictive version of PinYinChar: under PinYinChar, a string such as xian can be read either as two syllables, xi and an, or as the single syllable xian, whereas PinYin keeps the syllable boundaries.
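
The difference between the two Pinyin modes comes down to whether syllable boundaries survive the conversion. A small sketch with a hypothetical five-character lookup table makes this concrete. The character 先 (read as xian) is chosen here purely for illustration, and the boundary marker ␀ follows the examples in the list above.

```python
# Hypothetical miniature Pinyin table; the real PINYIN tables are much larger.
TOY_PINYIN = {"西": "xi", "安": "an", "洗": "xi", "按": "an", "先": "xian"}
BOUNDARY = "␀"


def toy_pinyin(text: str) -> str:
    """Convert with a boundary marker on both sides of every syllable."""
    return "".join(
        f"{BOUNDARY}{TOY_PINYIN[ch]}{BOUNDARY}" if ch in TOY_PINYIN else ch
        for ch in text
    )


def toy_pinyin_char(text: str) -> str:
    """Convert without boundaries, so adjacent syllables run together."""
    return "".join(TOY_PINYIN.get(ch, ch) for ch in text)


print(toy_pinyin("西安") == toy_pinyin("洗按"))           # True
print(toy_pinyin("西安") == toy_pinyin("先"))             # False
print(toy_pinyin_char("西安") == toy_pinyin_char("先"))   # True
```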

Delete is technically a combination of TextDelete and WordDelete: text and words are deleted by different rules, because CN_SPECIAL and EN_SPECIAL are treated as part of a word but not as part of the text. When calling the text_process and reduce_text_process functions, use TextDelete rather than WordDelete.

  • WordDelete: Delete all patterns in WHITE_SPACE.
  • TextDelete: Delete all patterns in TEXT_DELETE.

Text Process Usage

Here’s an example of how to use the reduce_text_process and text_process functions:

from matcher_py import reduce_text_process, text_process
from matcher_py.extension_types import SimpleMatchType

print(reduce_text_process(SimpleMatchType.MatchTextDelete | SimpleMatchType.MatchNormalize, "hello, world!"))
print(text_process(SimpleMatchType.MatchTextDelete, "hello, world!"))

Matcher Basic Usage

Here’s an example of how to use the Matcher:

import msgspec
import numpy as np
from matcher_py import Matcher
from matcher_py.extension_types import MatchTable, MatchTableType, SimpleMatchType

msgpack_encoder = msgspec.msgpack.Encoder()
matcher = Matcher(
    msgpack_encoder.encode({
        1: [
            MatchTable(
                table_id=1,
                match_table_type=MatchTableType.Simple(simple_match_type=SimpleMatchType.MatchFanjianDeleteNormalize),
                word_list=["hello", "world"],
                exemption_simple_match_type=SimpleMatchType.MatchNone,
                exemption_word_list=["word"],
            )
        ]
    })
)
# Check if a text matches
assert matcher.is_match("hello")
assert not matcher.is_match("hello, word")
# Perform word matching as a dict
assert matcher.word_match(r"hello, world")[1]
# Perform word matching as a string
result = matcher.word_match_as_string("hello")
assert "hello" in result
print(result)
# Perform batch processing as a dict using a list
text_list = ["hello", "world", "hello,word"]
batch_results = matcher.batch_word_match(text_list)
print(batch_results)
# Perform batch processing as a string using a list
text_list = ["hello", "world", "hello,word"]
batch_results = matcher.batch_word_match_as_string(text_list)
print(batch_results)
# Perform batch processing as a dict using a numpy array
text_array = np.array(["hello", "world", "hello,word"], dtype=np.dtype("object"))
numpy_results = matcher.numpy_word_match(text_array)
print(numpy_results)
# Perform batch processing as a string using a numpy array
text_array = np.array(["hello", "world", "hello,word"], dtype=np.dtype("object"))
numpy_results = matcher.numpy_word_match_as_string(text_array)
print(numpy_results)

Simple Matcher Basic Usage

Here’s an example of how to use the SimpleMatcher:

import msgspec
import numpy as np
from matcher_py import SimpleMatcher
from matcher_py.extension_types import SimpleMatchType

msgpack_encoder = msgspec.msgpack.Encoder()
simple_matcher = SimpleMatcher(
    msgpack_encoder.encode({SimpleMatchType.MatchNone: {1: "example"}})
)
# Check if a text matches
assert simple_matcher.is_match("example")
# Perform simple processing
results = simple_matcher.simple_process("example")
print(results)
# Perform batch processing using a list
text_list = ["example", "test", "example test"]
batch_results = simple_matcher.batch_simple_process(text_list)
print(batch_results)
# Perform batch processing using a NumPy array
text_array = np.array(["example", "test", "example test"], dtype=np.dtype("object"))
numpy_results = simple_matcher.numpy_simple_process(text_array)
print(numpy_results)

Contributing

Contributions to matcher_py are welcome! If you find a bug or have a feature request, please open an issue on the GitHub repository. If you would like to contribute code, please fork the repository and submit a pull request.

License

matcher_py is licensed under the MIT OR Apache-2.0 license.

More Information

For more details, visit the GitHub repository.
