A high-performance, multi-functional word matcher

Project description

PyO3 bindings for the Rust implementation of Matcher

Installation

pip install matcher_py

Usage

  • Python usage examples are in the test.ipynb file.
  • The matcher config is serialized with msgspec. You can use ormsgpack or another msgpack serialization library instead; all the types are defined in extension_types.py. For performance reasons, I recommend msgspec.

Matcher

import msgspec
import numpy as np

from matcher_py import Matcher  # type: ignore
from matcher_py.extension_types import MatchTableType, SimpleMatchType, MatchTable

msgpack_encoder = msgspec.msgpack.Encoder()

matcher = Matcher(
    msgpack_encoder.encode(
        {
            "test": [
                MatchTable(
                    table_id=1,
                    match_table_type=MatchTableType.Simple,
                    simple_match_type=SimpleMatchType.MatchFanjian | SimpleMatchType.MatchDeleteNormalize,
                    word_list=["蔔", "你好"],
                    exemption_simple_match_type=SimpleMatchType.MatchFanjian | SimpleMatchType.MatchDeleteNormalize,
                    exemption_word_list=[],
                )
            ]
        }
    )
)

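# Check whether the text matches any word in the configured tables.
matcher.is_match(r"卜")

# Match the text against the word lists and return the matches.
matcher.word_match(r"你,好")

# Same as word_match, with the result returned as a string.
matcher.word_match_as_string("你好")

# Match a list of texts in one call.
matcher.batch_word_match_as_string(["你好", "你好", "你真棒"])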

text_array = np.array(
    [
        "Laborum eiusmod anim aliqua non veniam laboris officia dolor. Adipisicing sit est irure Lorem duis adipisicing exercitation. Cillum excepteur non anim ipsum eiusmod deserunt veniam. Nulla veniam sunt sint ad velit occaecat in deserunt nulla nisi excepteur. Cillum veniam Lorem aute eu. Nisi voluptate laboris quis sint pariatur ullamco minim pariatur officia non anim nisi nulla ipsum ad. Veniam pariatur ut occaecat ut veniam velit aliquip commodo culpa elit eu eiusmod."
    ]
    * 10000,
    dtype=np.dtype("object")
)
matcher.numpy_word_match_as_string(text_array)

text_array = np.array(
    [
        "Laborum eiusmod anim aliqua non veniam laboris officia dolor. Adipisicing sit est irure Lorem duis adipisicing exercitation. Cillum excepteur non anim ipsum eiusmod deserunt veniam. Nulla veniam sunt sint ad velit occaecat in deserunt nulla nisi excepteur. Cillum veniam Lorem aute eu. Nisi voluptate laboris quis sint pariatur ullamco minim pariatur officia non anim nisi nulla ipsum ad. Veniam pariatur ut occaecat ut veniam velit aliquip commodo culpa elit eu eiusmod."
    ]
    * 10000,
    dtype=np.dtype("object")
)
# With inplace=True the results are written back into text_array, which is then displayed.
matcher.numpy_word_match_as_string(text_array, inplace=True)
text_array

Simple Matcher

import msgspec
import numpy as np

from matcher_py import SimpleMatcher  # type: ignore
from matcher_py.extension_types import SimpleMatchType

msgpack_encoder = msgspec.msgpack.Encoder()

simple_matcher = SimpleMatcher(
    msgpack_encoder.encode(
        {
            SimpleMatchType.MatchFanjian | SimpleMatchType.MatchDeleteNormalize: {
                1: "无,法,无,天",
                2: "xxx",
                3: "你好",
                6: r"It's /\/\y duty",
                4: "xxx,yyy",
            },
            SimpleMatchType.MatchFanjian: {
                4: "xxx,yyy",
            },
            SimpleMatchType.MatchNone: {
                5: "xxxxx,xxxxyyyyxxxxx",
            },
        }
    )
)

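# Check whether the text matches any configured word.
simple_matcher.is_match("xxx")

# Process a single text against the configured words.
simple_matcher.simple_process(r"It's /\/\y duty")

# Process a list of texts in one call.
simple_matcher.batch_simple_process([r"It's /\/\y duty", "你好", "xxxxxxx"])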

text_array = np.array(
    [
        "Laborum eiusmod anim aliqua non veniam laboris officia dolor. Adipisicing sit est irure Lorem duis adipisicing exercitation. Cillum excepteur non anim ipsum eiusmod deserunt veniam. Nulla veniam sunt sint ad velit occaecat in deserunt nulla nisi excepteur. Cillum veniam Lorem aute eu. Nisi voluptate laboris quis sint pariatur ullamco minim pariatur officia non anim nisi nulla ipsum ad. Veniam pariatur ut occaecat ut veniam velit aliquip commodo culpa elit eu eiusmod."
    ]
    * 10000,
    dtype=np.dtype("object"),
)
simple_matcher.numpy_simple_process(text_array)

text_array = np.array(
    [
        "Laborum eiusmod anim aliqua non veniam laboris officia dolor. Adipisicing sit est irure Lorem duis adipisicing exercitation. Cillum excepteur non anim ipsum eiusmod deserunt veniam. Nulla veniam sunt sint ad velit occaecat in deserunt nulla nisi excepteur. Cillum veniam Lorem aute eu. Nisi voluptate laboris quis sint pariatur ullamco minim pariatur officia non anim nisi nulla ipsum ad. Veniam pariatur ut occaecat ut veniam velit aliquip commodo culpa elit eu eiusmod."
    ]
    * 10000,
    dtype=np.dtype("object"),
)
# With inplace=True the results are written back into text_array, which is then displayed.
simple_matcher.numpy_simple_process(text_array, inplace=True)
text_array
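
Both configs above combine SimpleMatchType members with `|` and use the result to select how words are processed. The sketch below shows how that kind of flag combination behaves, using a stand-in IntFlag. The class DemoMatchType, its member values, and the build_table helper are all hypothetical and do not reflect the real definitions in extension_types.py; it also assumes the flags serialize as their integer values.

```python
from enum import IntFlag


class DemoMatchType(IntFlag):
    """Hypothetical stand-in for SimpleMatchType; values are illustrative only."""

    NONE = 1
    FANJIAN = 2
    DELETE = 4
    NORMALIZE = 8
    DELETE_NORMALIZE = DELETE | NORMALIZE


def build_table(entries):
    """Group (flags, word_id, word) triples into {flags_as_int: {word_id: word}}."""
    table = {}
    for flags, word_id, word in entries:
        table.setdefault(int(flags), {})[word_id] = word
    return table


# Combining flags with | yields a single value carrying every set bit.
combined = DemoMatchType.FANJIAN | DemoMatchType.DELETE_NORMALIZE

config = build_table(
    [
        (combined, 1, "无,法,无,天"),
        (combined, 3, "你好"),
        (DemoMatchType.FANJIAN, 4, "xxx,yyy"),
    ]
)
```

Because each combination is a distinct integer, the same word can appear under several keys, as "xxx,yyy" does in the SimpleMatcher config above.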

Download files

Download the file for your platform.

Source Distribution

  • matcher_py-0.1.7.tar.gz (432.6 kB): source

Built Distributions

  • matcher_py-0.1.7-cp38-abi3-win_amd64.whl (1.2 MB): CPython 3.8+, Windows x86-64
  • matcher_py-0.1.7-cp38-abi3-musllinux_1_2_x86_64.whl (1.5 MB): CPython 3.8+, musllinux (musl 1.2+) x86-64
  • matcher_py-0.1.7-cp38-abi3-musllinux_1_2_aarch64.whl (1.6 MB): CPython 3.8+, musllinux (musl 1.2+) ARM64
  • matcher_py-0.1.7-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB): CPython 3.8+, manylinux (glibc 2.17+) x86-64
  • matcher_py-0.1.7-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.3 MB): CPython 3.8+, manylinux (glibc 2.17+) ARM64
  • matcher_py-0.1.7-cp38-abi3-macosx_11_0_arm64.whl (1.2 MB): CPython 3.8+, macOS 11.0+ ARM64
  • matcher_py-0.1.7-cp38-abi3-macosx_10_12_x86_64.whl (1.2 MB): CPython 3.8+, macOS 10.12+ x86-64
