A Python implementation of the Aho-Corasick (AC) automaton for multi-pattern text matching

AC automaton

1. Installation

pip install ac_auto

2. Classes in this package

AhoCorasickAutomation

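An Aho-Corasick (AC) automaton for multi-pattern matching. A trie must be built from the dictionary before use, so constructing the object takes some time. Search runs in O(N+M) on average, where N is the length of the text to scan and M is the total length of all pattern strings. Even so, very long inputs are not recommended.

Note that the AC automaton cannot avoid word-segmentation errors. For example, in “佳保安全”, “保安” will still be matched if it is a keyword. Confirm your actual requirements before use.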

ac_auto_entity = AhoCorasickAutomation(["关键词1", "关键词2"])
ac_auto_entity.search("需要搜索的文本,其中可能包含关键词1")

Output:

{'关键词1': [(14, 17)], '关键词2': [(28, 31)]}

The output format can also be specified:

ac_auto_entity.search("需要搜索的文本,其中可能包含关键词1", output_mode=AhoCorasickAutomation.OUTPUT_LIST_ONLY_KEY)

Output:

['关键词1', '关键词2']
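The trie-plus-failure-link construction described above can be sketched in a few lines. This is an illustrative, standalone implementation and not the code shipped in ac_auto: the names build_automaton and search_text are hypothetical, and the inclusive (start, end) index convention is an assumption inferred from the sample output.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links, and output lists (hypothetical helper)."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix that is in the trie
    out = [[]]    # out[state] -> patterns that end at this state

    # Step 1: insert every non-empty pattern into the trie.
    for pattern in patterns:
        if not pattern:
            continue
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)

    # Step 2: breadth-first pass to compute failure links.
    queue = deque(goto[0].values())  # depth-1 states fall back to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit patterns that end at the suffix state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def search_text(text, automaton):
    """Return {pattern: [(start, end), ...]} with inclusive end indices (assumed convention)."""
    goto, fail, out = automaton
    hits = {}
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            hits.setdefault(pattern, []).append((i - len(pattern) + 1, i))
    return hits


automaton = build_automaton(["关键词1", "关键词2"])
print(search_text("需要搜索的文本,其中可能包含关键词1", automaton))
# → {'关键词1': [(14, 17)]}

# The segmentation caveat: "保安" is reported even though it straddles two words.
print(search_text("佳保安全", build_automaton(["保安"])))
# → {'保安': [(1, 2)]}
```

Because the automaton only follows character transitions, it has no notion of word boundaries, which is why the second call reports a match inside “佳保安全”.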

AhoCorasickAutomationConditionalFilter

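Applies conditional filtering to the AC automaton's match results. You can require that the text within a given distance before or after a match does, or does not, contain certain keywords.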

ac_auto_entity = AhoCorasickAutomation(["关键词1", "关键词2"])
text_to_scan = "需要搜索的文本,其中可能包含关键词1,要求附近有条件1"
hits = ac_auto_entity.search(text_to_scan)
ac_filter_entity = AhoCorasickAutomationConditionalFilter({
    "关键词1": ["条件1"]
}, distance=10, mode=AhoCorasickAutomationConditionalFilter.FILTER_MODE_WHITE)
hits = ac_filter_entity.filter(text_to_scan, hits)

Output:

{'关键词1': [(14, 17)]} # 关键词2 does not meet the condition, so it is filtered out
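The distance-based filtering idea can be illustrated with a small standalone function. This is only a sketch under assumptions and not the package's implementation: filter_hits and its whitelist flag are hypothetical names, hits are assumed to use inclusive (start, end) indices, and the window is assumed to cover `distance` characters on each side of a match. In this sketch a keyword with no listed condition words is dropped in whitelist mode, which mirrors the sample output above.

```python
def filter_hits(text, hits, conditions, distance=10, whitelist=True):
    """Filter {keyword: [(start, end), ...]} by nearby condition words (hypothetical helper).

    whitelist=True keeps a hit only if a condition word occurs within
    `distance` characters before or after it; whitelist=False keeps a hit
    only if no condition word occurs in that window.
    """
    filtered = {}
    for keyword, spans in hits.items():
        words = conditions.get(keyword, [])
        kept = []
        for start, end in spans:
            # Context windows on each side of the hit (end index is inclusive).
            before = text[max(0, start - distance):start]
            after = text[end + 1:end + 1 + distance]
            nearby = any(w in before or w in after for w in words)
            if nearby == whitelist:
                kept.append((start, end))
        if kept:
            filtered[keyword] = kept
    return filtered


text_to_scan = "需要搜索的文本,其中可能包含关键词1,要求附近有条件1"
hits = {"关键词1": [(14, 17)], "关键词2": [(28, 31)]}
print(filter_hits(text_to_scan, hits, {"关键词1": ["条件1"]}, distance=10))
# → {'关键词1': [(14, 17)]}
```

Shrinking `distance` narrows the window, so with distance=3 the condition word falls outside it and the same hit would be dropped in whitelist mode.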
