
A high-performance, multi-functional word matcher

Matcher Rust Implementation with PyO3 Binding

Installation

To install the matcher_py package, use pip:

pip install matcher_py

Usage

Python Usage

Refer to the test.ipynb file for Python usage examples.

The msgspec library is used to serialize the matcher configuration. Other MessagePack serialization libraries, such as ormsgpack, also work, but msgspec is recommended for performance. All types are defined in extension_types.py.
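The examples in this README combine match types with `|`. The following is a minimal sketch of how such a flag type composes, assuming `SimpleMatchType` behaves like a standard-library `IntFlag`. The class `DemoMatchType` and its bit values are hypothetical stand-ins for illustration, not the values defined in extension_types.py.

```python
from enum import IntFlag


class DemoMatchType(IntFlag):
    """Hypothetical stand-in for a flag enum; the bit values are illustrative only."""

    FANJIAN = 0b001
    DELETE = 0b010
    NORMALIZE = 0b100


# Combining members with | yields a single value carrying both bits.
combined = DemoMatchType.FANJIAN | DemoMatchType.NORMALIZE

# Membership tests check whether a given bit is set.
has_fanjian = DemoMatchType.FANJIAN in combined
has_delete = DemoMatchType.DELETE in combined

# An IntFlag member is also an int, so the combined value reduces to
# one plain integer.
combined_as_int = int(combined)
```

Because the combined value is a single integer, it can presumably serve as one mapping key in the encoded configuration, which is how the SimpleMatcher example uses it.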

Matcher

Here’s an example of how to use the Matcher:

import msgspec
import numpy as np
from matcher_py import Matcher # type: ignore
from matcher_py.extension_types import MatchTableType, SimpleMatchType, MatchTable

msgpack_encoder = msgspec.msgpack.Encoder()

matcher = Matcher(
    msgpack_encoder.encode(
        {
            "test": [
                MatchTable(
                    table_id=1,
                    match_table_type=MatchTableType.Simple,
                    simple_match_type=SimpleMatchType.MatchFanjian | SimpleMatchType.MatchDeleteNormalize,
                    word_list=["蔔", "你好"],
                    exemption_simple_match_type=SimpleMatchType.MatchFanjian | SimpleMatchType.MatchDeleteNormalize,
                    exemption_word_list=[],
                )
            ]
        }
    )
)

# Perform matching
matcher.is_match(r"卜")
matcher.word_match(r"你,好")
matcher.word_match_as_string("你好")
matcher.batch_word_match_as_string(["你好", "你好", "你真棒"])

# Numpy integration for batch processing
text_array = np.array(
    [
        "Laborum eiusmod anim aliqua non veniam laboris officia dolor. Adipisicing sit est irure Lorem duis adipisicing exercitation. Cillum excepteur non anim ipsum eiusmod deserunt veniam. Nulla veniam sunt sint ad velit occaecat in deserunt nulla nisi excepteur. Cillum veniam Lorem aute eu. Nisi voluptate laboris quis sint pariatur ullamco minim pariatur officia non anim nisi nulla ipsum ad. Veniam pariatur ut occaecat ut veniam velit aliquip commodo culpa elit eu eiusmod."
    ]
    * 10000,
    dtype=np.dtype("object")
)
matcher.numpy_word_match_as_string(text_array)
matcher.numpy_word_match_as_string(text_array, inplace=True)
print(text_array)

Simple Matcher

Here’s an example of how to use the SimpleMatcher:

import msgspec
import numpy as np
from matcher_py import SimpleMatcher # type: ignore
from matcher_py.extension_types import SimpleMatchType

msgpack_encoder = msgspec.msgpack.Encoder()

simple_matcher = SimpleMatcher(
    msgpack_encoder.encode(
        {
            SimpleMatchType.MatchFanjian | SimpleMatchType.MatchDeleteNormalize: {
                1: "无,法,无,天",
                2: "xxx",
                3: "你好",
                6: r"It's /\/\y duty",
                4: "xxx,yyy",
            },
            SimpleMatchType.MatchFanjian: {
                4: "xxx,yyy",
            },
            SimpleMatchType.MatchNone: {
                5: "xxxxx,xxxxyyyyxxxxx",
            },
        }
    )
)

# Perform matching
simple_matcher.is_match("xxx")
simple_matcher.simple_process(r"It's /\/\y duty")
simple_matcher.batch_simple_process([r"It's /\/\y duty", "你好", "xxxxxxx"])

# Numpy integration for batch processing
text_array = np.array(
    [
        "Laborum eiusmod anim aliqua non veniam laboris officia dolor. Adipisicing sit est irure Lorem duis adipisicing exercitation. Cillum excepteur non anim ipsum eiusmod deserunt veniam. Nulla veniam sunt sint ad velit occaecat in deserunt nulla nisi excepteur. Cillum veniam Lorem aute eu. Nisi voluptate laboris quis sint pariatur ullamco minim pariatur officia non anim nisi nulla ipsum ad. Veniam pariatur ut occaecat ut veniam velit aliquip commodo culpa elit eu eiusmod."
    ]
    * 10000,
    dtype=np.dtype("object"),
)
simple_matcher.numpy_simple_process(text_array)
simple_matcher.numpy_simple_process(text_array, inplace=True)
print(text_array)
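The SimpleMatcher configuration above is a nested mapping of the form {match type: {word id: word}}. When the word list comes from rows of data rather than a literal dictionary, a small helper can build that shape. This is only a sketch: `build_word_table` is a hypothetical helper that is not part of matcher_py, and the plain integers used as outer keys are placeholders for real `SimpleMatchType` values.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def build_word_table(
    entries: Iterable[Tuple[int, int, str]],
) -> Dict[int, Dict[int, str]]:
    """Group (match_type, word_id, word) triples into {match_type: {word_id: word}}.

    Hypothetical helper; match_type is a plain int standing in for a
    SimpleMatchType value.
    """
    table: Dict[int, Dict[int, str]] = defaultdict(dict)
    for match_type, word_id, word in entries:
        # A dict literal would silently keep only the last duplicate key,
        # so this sketch rejects duplicates explicitly instead.
        if word_id in table[match_type]:
            raise ValueError(
                f"duplicate word_id {word_id} for match type {match_type}"
            )
        table[match_type][word_id] = word
    return dict(table)


# Placeholder integers 1 and 2 stand in for two different match types.
word_table = build_word_table(
    [
        (1, 1, "xxx"),
        (1, 3, "你好"),
        (2, 4, "xxx,yyy"),
    ]
)
```

The resulting mapping could then be passed to `msgpack_encoder.encode(...)` in place of the literal dictionary, with real `SimpleMatchType` values as the outer keys.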

Contributing

Contributions to matcher_py are welcome! If you find a bug or have a feature request, please open an issue on the GitHub repository. If you would like to contribute code, please fork the repository and submit a pull request.

License

matcher_py is licensed under either the MIT license or the Apache-2.0 license (MIT OR Apache-2.0). See the LICENSE file for more information.

For more details, visit the GitHub repository.
