A high-performance, multi-functional word matcher

Project description

Matcher Rust Implementation with PyO3 Binding

Installation

Use pip

pip install matcher_py

Install pre-built binary

Visit the release page to download the pre-built binary.

Usage

The msgspec library is recommended for serializing the matcher configuration due to its performance benefits. You can also use other msgpack serialization libraries like ormsgpack. All relevant types are defined in extension_types.py.

Explanation of the configuration

Matcher's configuration is defined by the type MatchTableMap = Dict[int, List[MatchTable]]. Each key of MatchTableMap is called a match_id. Within a given match_id, each table_id should be unique, although this is not required.
SimpleMatcher's configuration is defined by the type SimpleMatchTableMap = Dict[SimpleMatchType, Dict[int, str]]. Each key of the inner Dict[int, str] is called a word_id, and every word_id must be globally unique.
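To make the "globally unique" requirement concrete, here is a minimal sketch that checks a SimpleMatchTableMap-shaped dict for word_ids reused across match types. It uses plain ints in place of SimpleMatchType members, and the helper name duplicate_word_ids is hypothetical, not part of matcher_py.

```python
from typing import Dict, List

# Hypothetical stand-in for the library type: plain dicts keyed the same way.
# The outer keys are SimpleMatchType values written as ints (1 = MatchNone, 2 = MatchFanjian).
SimpleMatchTableMap = Dict[int, Dict[int, str]]


def duplicate_word_ids(table_map: SimpleMatchTableMap) -> List[int]:
    """Return the word_ids that appear under more than one simple match type."""
    seen = set()
    duplicates = set()
    for word_map in table_map.values():
        for word_id in word_map:
            if word_id in seen:
                duplicates.add(word_id)
            seen.add(word_id)
    return sorted(duplicates)


good = {1: {1: "hello", 2: "world"}, 2: {3: "你好"}}
bad = {1: {1: "hello"}, 2: {1: "你好"}}  # word_id 1 is reused across types

print(duplicate_word_ids(good))  # []
print(duplicate_word_ids(bad))   # [1]
```

Running a check like this before encoding the configuration can catch id collisions early; what the library does on a collision is not covered here.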

MatchTable

table_id: The unique ID of the match table.
match_table_type: The type of the match table.
simple_match_type: The type of the simple match (only relevant if match_table_type is "simple").
word_list: The word list of the match table.
exemption_simple_match_type: The type of the exemption simple match.
exemption_word_list: The exemption word list of the match table.

For each match table, word matching is performed over the word_list, and exemption word matching is performed over the exemption_word_list. If the exemption word matching result is True, the word matching result will be False.
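The exemption rule can be sketched with plain substring checks. This is only an illustration of the rule as stated, assuming no text normalization; the function table_matches is hypothetical and is not the library's implementation.

```python
from typing import Iterable


def table_matches(text: str, word_list: Iterable[str], exemption_word_list: Iterable[str]) -> bool:
    """A hit on any exemption word vetoes a hit on the word list."""
    hit = any(word in text for word in word_list)
    exempt = any(word in text for word in exemption_word_list)
    return hit and not exempt


print(table_matches("hello", ["hello", "world"], ["word"]))        # True
print(table_matches("hello, word", ["hello", "world"], ["word"]))  # False
```

The second call returns False because the text contains the exemption word "word", even though "hello" is in the word list.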

MatchTableType

Simple = "simple": Supports simple multiple patterns matching with text normalization defined by simple_match_type.
- We offer transformation methods for text normalization, including MatchFanjian, MatchNormalize, MatchPinYin ···.
- It can handle combination patterns and repeat-count-sensitive matching, with sub-words delimited by a comma (,). For example, hello,world,hello will match hellohelloworld and worldhellohello, but not helloworld, because hello must appear twice.
SimilarChar = "similar_char": Supports similar character matching using regex.
- ["hello,hallo,hollo,hi", "word,world,wrd,🌍", "!,?,~"] will match helloworld, hollowrd, hi🌍 ··· any combinations of the words split by , in the list.
Acrostic = "acrostic": Supports acrostic matching using regex (currently only supports Chinese and simple English sentences).
- ["h,e,l,l,o", "你,好"] will match hope, endures, love, lasts, onward. and 你的笑容温暖, 好心情常伴。.
SimilarTextLevenshtein = "similar_text_levenshtein": Supports similar text matching based on Levenshtein distance (threshold is 0.8).
- ["helloworld"] will match helloworld, hellowrld, helloworld! ··· any similar text to the words in the list.
Regex = "regex": Supports regex matching.
- ["h[aeiou]llo", "w[aeiou]rd"] will match hello, world, hillo, wurld ··· any text that matches the regex in the list.
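For the SimilarTextLevenshtein entry above, the following sketch shows one common way to turn an edit distance into a similarity score and compare it with the 0.8 threshold. The normalization (1 minus distance divided by the longer length) and the whole-string comparison are assumptions; the library may normalize or window the text differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


THRESHOLD = 0.8  # value quoted in the list above

for candidate in ["helloworld", "hellowrld", "helloworld!", "hi"]:
    score = similarity("helloworld", candidate)
    print(candidate, round(score, 3), score >= THRESHOLD)
```

Under this normalization the first three candidates clear the threshold and "hi" does not.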

SimpleMatchType

MatchNone = 1: No transformation.
MatchFanjian = 2: Traditional Chinese to simplified Chinese transformation.
- 妳好 -> 你好
- 現⾝ -> 现身
MatchDelete = 12: Delete all non-alphanumeric and non-unicode Chinese characters.
- hello, world! -> helloworld
- 《你∷好》 -> 你好
MatchNormalize = 16: Normalize all English character variations and number variations to basic characters.
- ℋЀ⒈㈠ϕ -> he11o
- ⒈Ƨ㊂ -> 123
MatchPinYin = 32: Convert all unicode Chinese characters to pinyin with boundaries.
- 你好 -> ␀ni␀␀hao␀
- 西安 -> ␀xi␀␀an␀
MatchPinYinChar = 64: Convert all unicode Chinese characters to pinyin without boundaries.
- 你好 -> nihao
- 西安 -> xian

You can combine these transformations as needed. Pre-defined combinations like MatchDeleteNormalize = 28 and MatchFanjianDeleteNormalize = 30 are provided for convenience.
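The numeric values listed above are consistent with bit flags combined by bitwise OR. A small sketch using local integer constants (copied from the list, not imported from matcher_py) reproduces the two pre-defined combinations:

```python
# Values copied from the list above; these are local constants, not library imports.
MATCH_NONE = 1
MATCH_FANJIAN = 2
MATCH_DELETE = 12
MATCH_NORMALIZE = 16
MATCH_PINYIN = 32
MATCH_PINYIN_CHAR = 64

delete_normalize = MATCH_DELETE | MATCH_NORMALIZE
fanjian_delete_normalize = MATCH_FANJIAN | MATCH_DELETE | MATCH_NORMALIZE

print(delete_normalize)          # 28
print(fanjian_delete_normalize)  # 30


def has_flag(combined: int, flag: int) -> bool:
    """True when every bit of `flag` is set in `combined`."""
    return (combined & flag) == flag


print(has_flag(fanjian_delete_normalize, MATCH_FANJIAN))  # True
print(has_flag(delete_normalize, MATCH_FANJIAN))          # False
```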

Avoid combining MatchPinYin and MatchPinYinChar: MatchPinYin is a more restrictive version of MatchPinYinChar. Without boundaries, a string such as xian can be treated either as two words (xi and an) or as a single word (xian).
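The xian ambiguity can be shown with a toy conversion. The five-entry mapping and the to_pinyin helper below are hand-written for illustration only; the library uses its own conversion data, and the "␀" boundary marker is taken from the examples above.

```python
BOUNDARY = "\u2400"  # the "␀" marker shown in the examples above

# Tiny hand-written table for illustration only.
PINYIN = {"你": "ni", "好": "hao", "西": "xi", "安": "an", "先": "xian"}


def to_pinyin(text: str, with_boundaries: bool) -> str:
    """Replace each known character with its pinyin, optionally wrapped in boundary markers."""
    parts = []
    for char in text:
        syllable = PINYIN.get(char)
        if syllable is None:
            parts.append(char)  # leave unknown characters untouched
        elif with_boundaries:
            parts.append(BOUNDARY + syllable + BOUNDARY)
        else:
            parts.append(syllable)
    return "".join(parts)


word = "先"    # xian, one syllable
text = "西安"  # xi + an, two syllables

# Without boundaries the two spellings collide.
print(to_pinyin(word, False) in to_pinyin(text, False))  # True
# With boundaries they stay distinct.
print(to_pinyin(word, True) in to_pinyin(text, True))    # False
```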

Limitations

Simple Match can handle a word made of at most 32 combined sub-words (beyond 32, matching on the extra sub-words is not guaranteed) and at most 8 repetitions of the same sub-word (more than 8 repetitions are treated as 8).
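As a rough model of repeat-sensitive combination matching and the repeat cap, the sketch below counts non-overlapping occurrences of each comma-separated sub-word. The function combination_matches is hypothetical, it ignores text normalization and the 32 sub-word limit, and the library's actual counting rules may differ.

```python
from collections import Counter

MAX_REPEAT = 8  # repeat cap quoted above


def combination_matches(pattern: str, text: str) -> bool:
    """True when every sub-word occurs in `text` at least as often as the pattern lists it."""
    required = Counter(pattern.split(","))
    for word, times in required.items():
        needed = min(times, MAX_REPEAT)  # repeats beyond the cap are not required
        if text.count(word) < needed:    # str.count counts non-overlapping occurrences
            return False
    return True


print(combination_matches("hello,world,hello", "hellohelloworld"))  # True
print(combination_matches("hello,world,hello", "worldhellohello"))  # True
print(combination_matches("hello,world,hello", "helloworld"))       # False

ten_times = ",".join(["a"] * 10)
print(combination_matches(ten_times, "a" * 8))  # True, capped at 8
```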

Matcher Basic Usage

Here’s an example of how to use the Matcher:

import msgspec
import numpy as np
from matcher_py import Matcher
from matcher_py.extension_types import MatchTable, MatchTableType, SimpleMatchType

msgpack_encoder = msgspec.msgpack.Encoder()
matcher = Matcher(
    msgpack_encoder.encode({
        1: [
            MatchTable(
                table_id=1,
                match_table_type=MatchTableType.Simple,
                simple_match_type=SimpleMatchType.MatchFanjianDeleteNormalize,
                word_list=["hello", "world"],
                exemption_simple_match_type=SimpleMatchType.MatchNone,
                exemption_word_list=["word"],
            )
        ]
    })
)
# Check if a text matches
assert matcher.is_match("hello")
assert not matcher.is_match("hello, word")
# Perform word matching as a dict
assert matcher.word_match(r"hello, world")[1]
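# Perform word matching as a string (JSON); the exact serialized form below is
# an assumption and may differ between versions
result = matcher.word_match_as_string("hello")
assert result == """{"1":"[{\\"table_id\\":1,\\"word\\":\\"hello\\"}]"}"""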
# Perform batch processing as a dict using a list
text_list = ["hello", "world", "hello,word"]
batch_results = matcher.batch_word_match(text_list)
print(batch_results)
# Perform batch processing as a string using a list
text_list = ["hello", "world", "hello,word"]
batch_results = matcher.batch_word_match_as_string(text_list)
print(batch_results)
# Perform batch processing as a dict using a numpy array
text_array = np.array(["hello", "world", "hello,word"], dtype=np.dtype("object"))
numpy_results = matcher.numpy_word_match(text_array)
print(numpy_results)
# Perform batch processing as a string using a numpy array
text_array = np.array(["hello", "world", "hello,word"], dtype=np.dtype("object"))
numpy_results = matcher.numpy_word_match_as_string(text_array)
print(numpy_results)

Simple Matcher Basic Usage

Here’s an example of how to use the SimpleMatcher:

import msgspec
import numpy as np
from matcher_py import SimpleMatcher
from matcher_py.extension_types import SimpleMatchType

msgpack_encoder = msgspec.msgpack.Encoder()
simple_matcher = SimpleMatcher(
    msgpack_encoder.encode({SimpleMatchType.MatchNone: {1: "example"}})
)
# Check if a text matches
assert simple_matcher.is_match("example")
# Perform simple processing
results = simple_matcher.simple_process("example")
print(results)
# Perform batch processing using a list
text_list = ["example", "test", "example test"]
batch_results = simple_matcher.batch_simple_process(text_list)
print(batch_results)
# Perform batch processing using a NumPy array
text_array = np.array(["example", "test", "example test"], dtype=np.dtype("object"))
numpy_results = simple_matcher.numpy_simple_process(text_array)
print(numpy_results)

Contributing

Contributions to matcher_py are welcome! If you find a bug or have a feature request, please open an issue on the GitHub repository. If you would like to contribute code, please fork the repository and submit a pull request.

License

matcher_py is licensed under the MIT OR Apache-2.0 license.

More Information

For more details, visit the GitHub repository.

Release history

0.3.4

Jun 27, 2024

0.3.3

Jun 26, 2024

0.3.2

Jun 26, 2024

0.3.1

Jun 23, 2024

0.2.10

Jun 20, 2024

This version

0.2.9

Jun 19, 2024

0.2.8

Jun 18, 2024

0.2.7

Jun 17, 2024

0.2.6

Jun 16, 2024

0.2.5

Jun 15, 2024

0.2.4

Jun 14, 2024

0.2.3

Jun 14, 2024

0.2.2

Jun 13, 2024

0.2.1

Jun 13, 2024

0.2.0

Jun 12, 2024

0.1.7

Jun 12, 2024

0.1.6

Jun 12, 2024

0.1.5

Jun 12, 2024

0.1.4

Jun 12, 2024

0.1.3

Jun 11, 2024

0.1.1

Jun 11, 2024

0.1.0

Jun 11, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

matcher_py-0.2.9.tar.gz (298.7 kB)

Uploaded Jun 19, 2024 Source

Built Distributions

matcher_py-0.2.9-cp38-abi3-win_amd64.whl (1.5 MB)

Uploaded Jun 19, 2024 CPython 3.8+ Windows x86-64

matcher_py-0.2.9-cp38-abi3-musllinux_1_2_x86_64.whl (1.8 MB)

Uploaded Jun 19, 2024 CPython 3.8+ musllinux: musl 1.2+ x86-64

matcher_py-0.2.9-cp38-abi3-musllinux_1_2_aarch64.whl (1.8 MB)

Uploaded Jun 19, 2024 CPython 3.8+ musllinux: musl 1.2+ ARM64

matcher_py-0.2.9-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.6 MB)

Uploaded Jun 19, 2024 CPython 3.8+ manylinux: glibc 2.17+ x86-64

matcher_py-0.2.9-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.6 MB)

Uploaded Jun 19, 2024 CPython 3.8+ manylinux: glibc 2.17+ ARM64

matcher_py-0.2.9-cp38-abi3-macosx_11_0_arm64.whl (1.5 MB)

Uploaded Jun 19, 2024 CPython 3.8+ macOS 11.0+ ARM64

matcher_py-0.2.9-cp38-abi3-macosx_10_12_x86_64.whl (1.5 MB)

Uploaded Jun 19, 2024 CPython 3.8+ macOS 10.12+ x86-64

Hashes for matcher_py-0.2.9.tar.gz
Algorithm	Hash digest
SHA256	`62d2e09ce30fa713f4d6fa468de870a661df6d76d3daee634068b4564a9de39a`
MD5	`9a3d10e8fc5fec083d1f7344fa4ec36a`
BLAKE2b-256	`1ecdcc80a52f7ea0a057c846ecc711d0cdb95d2f66ee1b183f486bd3187c4caa`

Hashes for matcher_py-0.2.9-cp38-abi3-win_amd64.whl
Algorithm	Hash digest
SHA256	`12485833fcbedcdb085e0bc0623a183043516cd163bcfc275691f13301f03b00`
MD5	`c6e56af4c4df427dab2ff7d6d261694b`
BLAKE2b-256	`54f1846bce3bc8041289e8d8ee3a15c7510c4a4c9380df63a12fd40443d1fad6`

Hashes for matcher_py-0.2.9-cp38-abi3-musllinux_1_2_x86_64.whl
Algorithm	Hash digest
SHA256	`a73424827242970c783d91cb5c36ab9e5e202ca9c08adf8f895f5ddd1d410f3d`
MD5	`2c4464fa6e9c4bc3c7d541a2faae5ddb`
BLAKE2b-256	`4d373794ef3caad6cc3843e88c152c1b697c5aa294465e504044329723a1e55c`

Hashes for matcher_py-0.2.9-cp38-abi3-musllinux_1_2_aarch64.whl
Algorithm	Hash digest
SHA256	`d1e7eec7724b3db8654307bd6140f84f5793c8983abc82cf509e561776978567`
MD5	`23c1cecfc3869f8393240465961546fb`
BLAKE2b-256	`ff6595aaa40018271e69b7f8a33385d79d0e364217f9634dfac8a72d5037387c`

Hashes for matcher_py-0.2.9-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`f08933892ad061478261a3ec5ec206d464d4a3fa8d39337d73ec7e3780b2f23e`
MD5	`2fbf0872473718378627cef8bc9798ab`
BLAKE2b-256	`bd985693eda0e4fee67909521a266668151d2d7b55b921d38a1f1e407fdd2071`

Hashes for matcher_py-0.2.9-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm	Hash digest
SHA256	`75d683252d63094fba67a986c24ac9cab5fc5e29ad7b5fa3c2f3a267b619519d`
MD5	`8f4efae5072bf192c4897d8ada7142ee`
BLAKE2b-256	`7726222f4e7ac9b626954c6f4c9578bd2c263ef3862a4f3b65dd33496e65ccbd`

Hashes for matcher_py-0.2.9-cp38-abi3-macosx_11_0_arm64.whl
Algorithm	Hash digest
SHA256	`f6b1f61c345f0727df74e31d8c1b263656dc5de0d472d37e805f1cf0a65faca2`
MD5	`be7fe82e615b972824865bd6dbc4a17b`
BLAKE2b-256	`af00d0c15a7cec4596601bc7a6fd947f092a805d61e6c8ea48c910d767de71e0`

Hashes for matcher_py-0.2.9-cp38-abi3-macosx_10_12_x86_64.whl
Algorithm	Hash digest
SHA256	`6ab9ab22bff4405db3ba51044c50dff88556e541afc3353cdc6adfb06f8f8f3b`
MD5	`72ce1fe539d63bc54f434475c5fe148e`
BLAKE2b-256	`cd00517b6fb53ef39cc41c49580d9e4c4b75c53bc688cddb13f35800a00ff637`