Blazingly fast Word Matcher

Matcher Rust Implementation with PyO3 Binding

A high-performance, multi-functional word matcher implemented in Rust.

Designed to solve AND OR NOT and TEXT VARIATIONS problems in word/word_list matching. For detailed implementation, see the Design Document.

Features

  • Multiple Matching Methods:
    • Simple Word Matching
    • Regex-Based Matching
    • Similarity-Based Matching
  • Text Normalization:
    • Fanjian: Simplify traditional Chinese characters to simplified ones. Example: 蟲艸 -> 虫艹
    • Delete: Remove specific characters. Example: *Fu&*iii&^%%*&kkkk -> Fuiiikkkk
    • Normalize: Normalize special characters to identifiable characters. Example: 𝜢𝕰𝕃𝙻Ϙ 𝙒ⓞƦℒ𝒟! -> hello world
    • PinYin: Convert Chinese characters to Pinyin with syllable boundaries for fuzzy matching. Example: 西安 -> /xi//an/, matches 洗按 -> /xi//an/, but not 先 -> /xian/
    • PinYinChar: Convert Chinese characters to Pinyin without syllable boundaries. Example: 西安 -> xian, matches both 洗按 and 先 -> xian
  • AND OR NOT Word Matching:
    • Takes into account the number of repetitions of words.
    • Example: hello&world matches hello world and world,hello
    • Example: 无&法&无&天 matches 无无法天 (because 无 is repeated twice), but not 无法天
    • Example: hello~helloo~hhello matches hello, but not helloo or hhello
  • Customizable Exemption Lists: Exclude specific words from matching.
  • Efficient Handling of Large Word Lists: Optimized for performance.
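
The AND/NOT semantics and the repetition counting described above can be illustrated with a small pure-Python sketch. It is not the library's implementation: the helper name combined_match is hypothetical, plain substring search stands in for the real matcher, and it assumes that each word is an AND term or a NOT term depending on the operator (& or ~) that precedes it.

```python
import re
from collections import Counter


def combined_match(pattern: str, text: str) -> bool:
    """Return True if `text` satisfies an AND/NOT pattern such as 'a&b~c'."""
    # Split on the operators but keep them, so each word knows its operator.
    tokens = re.split(r"([&~])", pattern)
    required = Counter()  # word -> minimum number of occurrences
    forbidden = set()     # words that must not appear at all
    operator = "&"        # assumption: the first word is an AND term
    for token in tokens:
        if token in ("&", "~"):
            operator = token
        elif token:
            if operator == "&":
                required[token] += 1
            else:
                forbidden.add(token)
    if any(word in text for word in forbidden):
        return False
    # str.count counts non-overlapping occurrences of each required word.
    return all(text.count(word) >= n for word, n in required.items())


print(combined_match("hello&world", "world,hello"))
print(combined_match("无&法&无&天", "无法天"))
print(combined_match("hello~helloo~hhello", "helloo"))
```

Counting how often each AND term appears in the pattern is what makes 无&法&无&天 require two occurrences of 无 in the text.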

Installation

Use pip

pip install matcher_py

Install pre-built binary

Visit the release page to download the pre-built binary.

Usage

The msgspec library is recommended for serializing the matcher configuration due to its performance benefits. You can also use other msgpack serialization libraries like ormsgpack. All relevant types are defined in extension_types.py.

Explanation of the configuration

  • Matcher's configuration is defined by the type MatchTableMap = Dict[int, List[MatchTable]]. Each key of MatchTableMap is called a match_id. Within a match_id, each table_id should be unique, although this is not required.
  • SimpleMatcher's configuration is defined by the type SimpleMatchTableMap = Dict[SimpleMatchType, Dict[int, str]]. Each key of the inner Dict[int, str] is called a word_id, and every word_id must be globally unique.
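
Because word_id must be unique across the whole SimpleMatchTableMap, it can help to check a configuration before encoding it. The following is a minimal sketch of such a check. The helper duplicate_word_ids is hypothetical and not part of matcher_py, and plain strings stand in for SimpleMatchType members so that the snippet runs without the library installed.

```python
from collections import defaultdict
from typing import Dict, List


def duplicate_word_ids(table_map: Dict[str, Dict[int, str]]) -> Dict[int, List[str]]:
    """Map each word_id used under more than one match type to those types."""
    seen = defaultdict(list)
    for match_type, words in table_map.items():
        for word_id in words:
            seen[word_id].append(match_type)
    return {word_id: types for word_id, types in seen.items() if len(types) > 1}


# String keys are stand-ins for SimpleMatchType members (an assumption).
table_map = {
    "MatchNone": {1: "hello", 2: "world"},
    "MatchFanjian": {2: "你好", 3: "西安"},  # word_id 2 is reused here
}
print(duplicate_word_ids(table_map))
```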

MatchTable

  • table_id: The unique ID of the match table.
  • match_table_type: The type of the match table.
  • word_list: The word list of the match table.
  • exemption_simple_match_type: The type of the exemption simple match.
  • exemption_word_list: The exemption word list of the match table.

For each match table, word matching is performed over the word_list, and exemption word matching is performed over the exemption_word_list. If the exemption word matching result is True, the word matching result will be False.
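
The interaction between word_list and exemption_word_list can be sketched in a few lines. This is only an illustration of the rule stated above: the function table_matches is hypothetical, and plain substring search replaces the library's real matching and normalization.

```python
from typing import List


def table_matches(text: str, word_list: List[str], exemption_word_list: List[str]) -> bool:
    """Substring-based sketch of one match table with an exemption list."""
    if any(word in text for word in exemption_word_list):
        return False  # an exemption hit overrides any word hit
    return any(word in text for word in word_list)


print(table_matches("hello", ["hello", "world"], ["word"]))
print(table_matches("hello, word", ["hello", "world"], ["word"]))
```

With word_list ["hello", "world"] and exemption_word_list ["word"], the text hello matches, while hello, word does not, because the exemption word is present.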

MatchTableType

  • Simple: Supports simple multiple patterns matching with text normalization defined by simple_match_type.
    • We offer transformation methods for text normalization, including Fanjian, Normalize, PinYin ···.
    • It handles combination patterns delimited by & and is sensitive to repetition counts: hello&world&hello matches hellohelloworld and worldhellohello, but not helloworld, because hello must appear twice.
  • Regex: Supports regex patterns matching.
    • SimilarChar: Supports similar character matching using regex.
      • ["hello,hallo,hollo,hi", "word,world,wrd,🌍", "!,?,~"] will match helloworld, hollowrd, hi🌍 ··· any combinations of the words split by , in the list.
    • Acrostic: Supports acrostic matching using regex (currently only supports Chinese and simple English sentences).
      • ["h,e,l,l,o", "你,好"] will match hope, endures, love, lasts, onward. and 你的笑容温暖, 好心情常伴。.
    • Regex: Supports regex matching.
      • ["h[aeiou]llo", "w[aeiou]rld"] will match hello, world, hillo, wurld ··· any text that matches the regex in the list.
  • Similar: Supports similar text matching based on distance and threshold.
    • Levenshtein: Supports similar text matching based on Levenshtein distance.
    • DamerauLevenshtein: Supports similar text matching based on Damerau-Levenshtein distance.
    • Indel: Supports similar text matching based on Indel distance.
    • Jaro: Supports similar text matching based on Jaro distance.
    • JaroWinkler: Supports similar text matching based on Jaro-Winkler distance.
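
To make "distance and threshold" concrete, here is a minimal sketch of threshold-based matching with Levenshtein distance. The normalization (one minus the distance divided by the longer length) and the function is_similar are assumptions chosen for illustration; the library may compute similarity differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def is_similar(word: str, text: str, threshold: float) -> bool:
    """Assumed normalization: similarity = 1 - distance / longer length."""
    longest = max(len(word), len(text))
    if longest == 0:
        return True  # two empty strings are treated as identical
    return 1.0 - levenshtein(word, text) / longest >= threshold


print(levenshtein("kitten", "sitting"))
print(is_similar("hello", "hallo", 0.75))
```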

SimpleMatchType

  • None: No transformation.
  • Fanjian: Traditional Chinese to simplified Chinese transformation. Based on FANJIAN and UNICODE.
    • 妳好 -> 你好
    • 現⾝ -> 现身
  • Delete: Delete all punctuation, special characters and white spaces.
    • hello, world! -> helloworld
    • 《你∷好》 -> 你好
  • Normalize: Normalize all English character variations and number variations to basic characters. Based on UPPER_LOWER, EN_VARIATION and NUM_NORM.
    • ℋЀ⒈㈠ϕ -> he11o
    • ⒈Ƨ㊂ -> 123
  • PinYin: Convert all unicode Chinese characters to pinyin with boundaries. Based on PINYIN.
    • 你好 -> ␀ni␀␀hao␀
    • 西安 -> ␀xi␀␀an␀
  • PinYinChar: Convert all unicode Chinese characters to pinyin without boundaries. Based on PINYIN_CHAR.
    • 你好 -> nihao
    • 西安 -> xian

You can combine these transformations as needed. Pre-defined combinations like DeleteNormalize and FanjianDeleteNormalize are provided for convenience.
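
As a rough illustration of how two transformations compose, the sketch below chains a Delete-like step and a Normalize-like step using only the standard library. It is an approximation under stated assumptions: Unicode general categories stand in for the library's deletion tables, and NFKC folding plus lowercasing stands in for its UPPER_LOWER, EN_VARIATION and NUM_NORM tables, so results will differ from matcher_py on many inputs.

```python
import unicodedata


def delete(text: str) -> str:
    """Drop whitespace, punctuation (P*), symbols (S*) and separators (Z*)."""
    return "".join(
        ch
        for ch in text
        if not ch.isspace() and unicodedata.category(ch)[0] not in "PSZ"
    )


def normalize(text: str) -> str:
    """Fold compatibility variants such as fullwidth letters, then lowercase."""
    return unicodedata.normalize("NFKC", text).lower()


def delete_normalize(text: str) -> str:
    """Apply the two steps in order: delete first, then normalize."""
    return normalize(delete(text))


print(delete("hello, world!"))
print(delete("《你∷好》"))
print(delete_normalize("Ｈｅｌｌｏ， Ｗｏｒｌｄ！"))
```

Order matters in a chain like this. In this sketch, circled letters are classified as symbols and would be removed by delete before normalize could fold them, which is one reason dedicated tables are preferable to general categories.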

Avoid combining PinYin and PinYinChar. PinYin is a more limited version of PinYinChar: without syllable boundaries, a string such as xian can be treated either as two words (xi and an) or as a single word (xian).
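
The difference between the two modes can be seen with a toy lookup table. This is a sketch only: the five-entry PINYIN dictionary below is hypothetical and stands in for the library's PINYIN and PINYIN_CHAR tables, and the NUL character is assumed as the boundary marker because the examples above render it as ␀.

```python
# Hypothetical five-character table; the real tables cover far more characters.
PINYIN = {"西": "xi", "安": "an", "洗": "xi", "按": "an", "先": "xian"}
BOUNDARY = "\0"  # shown as ␀ in the examples above


def pinyin(text: str) -> str:
    """Convert with a boundary on both sides of every syllable."""
    return "".join(
        f"{BOUNDARY}{PINYIN[ch]}{BOUNDARY}" if ch in PINYIN else ch for ch in text
    )


def pinyin_char(text: str) -> str:
    """Convert without boundaries, so adjacent syllables run together."""
    return "".join(PINYIN.get(ch, ch) for ch in text)


print(pinyin("西安") == pinyin("洗按"))
print(pinyin("先") in pinyin("西安"))
print(pinyin_char("先") == pinyin_char("西安"))
```

With boundaries, 先 cannot be found inside 西安. Without them, both collapse to xian and become indistinguishable.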

Delete is technically a combination of TextDelete and WordDelete. Different delete methods are implemented for text and for words because CN_SPECIAL and EN_SPECIAL are considered part of a word but not part of a text. For the text_process and reduce_text_process functions, use TextDelete instead of WordDelete.

Text Process Usage

Here’s an example of how to use the reduce_text_process and text_process functions:

from matcher_py import reduce_text_process, text_process
from matcher_py.extension_types import SimpleMatchType

print(reduce_text_process(SimpleMatchType.MatchTextDelete | SimpleMatchType.MatchNormalize, "hello, world!"))
print(text_process(SimpleMatchType.MatchTextDelete, "hello, world!"))

Matcher Basic Usage

Here’s an example of how to use the Matcher:

import msgspec
import numpy as np
from matcher_py import Matcher
from matcher_py.extension_types import MatchTable, MatchTableType, SimpleMatchType

msgpack_encoder = msgspec.msgpack.Encoder()
matcher = Matcher(
    msgpack_encoder.encode({
        1: [
            MatchTable(
                table_id=1,
                match_table_type=MatchTableType.Simple(simple_match_type = SimpleMatchType.MatchFanjianDeleteNormalize),
                word_list=["hello", "world"],
                exemption_simple_match_type=SimpleMatchType.MatchNone,
                exemption_word_list=["word"],
            )
        ]
    })
)
# Check if a text matches
assert matcher.is_match("hello")
assert not matcher.is_match("hello, word")
# Perform word matching as a dict
assert matcher.word_match(r"hello, world")[1]
# Perform word matching as a string
result = matcher.word_match_as_string("hello")
print(result)  # serialized result for match_id 1: table_id 1, word "hello"
# Perform batch processing as a dict using a list
text_list = ["hello", "world", "hello,word"]
batch_results = matcher.batch_word_match(text_list)
print(batch_results)
# Perform batch processing as a string using a list
text_list = ["hello", "world", "hello,word"]
batch_results = matcher.batch_word_match_as_string(text_list)
print(batch_results)
# Perform batch processing as a dict using a numpy array
text_array = np.array(["hello", "world", "hello,word"], dtype=np.dtype("object"))
numpy_results = matcher.numpy_word_match(text_array)
print(numpy_results)
# Perform batch processing as a string using a numpy array
text_array = np.array(["hello", "world", "hello,word"], dtype=np.dtype("object"))
numpy_results = matcher.numpy_word_match_as_string(text_array)
print(numpy_results)

Simple Matcher Basic Usage

Here’s an example of how to use the SimpleMatcher:

import msgspec
import numpy as np
from matcher_py import SimpleMatcher
from matcher_py.extension_types import SimpleMatchType

msgpack_encoder = msgspec.msgpack.Encoder()
simple_matcher = SimpleMatcher(
    msgpack_encoder.encode({SimpleMatchType.MatchNone: {1: "example"}})
)
# Check if a text matches
assert simple_matcher.is_match("example")
# Perform simple processing
results = simple_matcher.simple_process("example")
print(results)
# Perform batch processing using a list
text_list = ["example", "test", "example test"]
batch_results = simple_matcher.batch_simple_process(text_list)
print(batch_results)
# Perform batch processing using a NumPy array
text_array = np.array(["example", "test", "example test"], dtype=np.dtype("object"))
numpy_results = simple_matcher.numpy_simple_process(text_array)
print(numpy_results)

Contributing

Contributions to matcher_py are welcome! If you find a bug or have a feature request, please open an issue on the GitHub repository. If you would like to contribute code, please fork the repository and submit a pull request.

License

matcher_py is licensed under the MIT OR Apache-2.0 license.

More Information

For more details, visit the GitHub repository.
