A Python implementation of the Aho-Corasick (AC) automaton for multi-pattern text matching
Project description
AC automaton
1. Installation
pip install ac_auto
2. Classes in this package
AhoCorasickAutomation
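An AC automaton for multi-pattern matching. Before use, a trie must be built from the keyword dictionary, so constructing the object is somewhat slow. Search runs in O(N+M) time on average, where N is the length of the text to scan and M is the total length of all pattern strings; even so, very long inputs are not recommended.
Note that the AC automaton does not guard against word-segmentation errors: in "佳保安全", if "保安" is a keyword it will still be matched, even though it straddles a word boundary. Confirm that this behavior fits your use case before relying on it.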
ac_auto_entity = AhoCorasickAutomation(["关键词1", "关键词2"])
ac_auto_entity.search("需要搜索的文本,其中可能包含关键词1")
Output:
{'关键词1': [(14, 17)], '关键词2': [(28, 31)]}
You can also specify the output format:
ac_auto_entity.search("需要搜索的文本,其中可能包含关键词1", output_mode=AhoCorasickAutomation.OUTPUT_LIST_ONLY_KEY)
Output:
['关键词1', '关键词2']
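To make the trie-plus-failure-link idea behind this class concrete, here is a minimal, self-contained sketch of Aho-Corasick matching. It is not the package's source code: the names `build_automaton` and `find_all` are hypothetical, and the inclusive `(start, end)` index convention is an assumption chosen to resemble the sample output above.

```python
from collections import deque


def build_automaton(keywords):
    """Build trie transitions, failure links and output lists (illustrative only)."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix in the trie
    out = [[]]    # out[state] -> keywords that end at this state

    # Phase 1: insert every keyword into the trie.
    for word in keywords:
        if not word:
            continue  # an empty keyword would match everywhere, so skip it
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        if word not in out[state]:
            out[state].append(word)

    # Phase 2: breadth-first pass, so shallower failure links are ready first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit keywords that end at the suffix state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Return {keyword: [(start, end), ...]} with inclusive end indices (assumed)."""
    goto, fail, out = automaton
    hits = {}
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.setdefault(word, []).append((i - len(word) + 1, i))
    return hits


automaton = build_automaton(["保安", "安全"])
print(find_all("佳保安全", automaton))
# {'保安': [(1, 2)], '安全': [(2, 3)]}
```

The usage lines also illustrate the segmentation caveat: both overlapping keywords are reported because the automaton matches substrings and knows nothing about word boundaries.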
AhoCorasickAutomationConditionalFilter
Applies conditional filtering to the AC automaton's matches: you can require that the text within a given distance before and after a match does, or does not, contain certain keywords.
ac_auto_entity = AhoCorasickAutomation(["关键词1", "关键词2"])
text_to_scan = "需要搜索的文本,其中可能包含关键词1,要求附近有条件1"
hits = ac_auto_entity.search(text_to_scan)
ac_filter_entity = AhoCorasickAutomationConditionalFilter(
    {"关键词1": ["条件1"]},
    distance=10,
    mode=AhoCorasickAutomationConditionalFilter.FILTER_MODE_WHITE,
)
hits = ac_filter_entity.filter(text_to_scan, hits)
Output:
{'关键词1': [(14, 17)]} # 关键词2 does not meet the condition, so it is filtered out
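The distance-based condition check can be sketched roughly as follows. This is an illustration under stated assumptions, not the package's implementation: `filter_hits` and its `require` flag are hypothetical names, spans are assumed to use inclusive end indices, and keywords without an entry in `conditions` are passed through unchanged, which may differ from how the package treats them.

```python
def filter_hits(text, hits, conditions, distance, require=True):
    """Keep a match only if its surrounding context satisfies the condition.

    require=True  -> whitelist-style: a condition word must appear nearby.
    require=False -> blacklist-style: no condition word may appear nearby.
    (Both modes are assumptions about the intended semantics.)
    """
    kept = {}
    for word, spans in hits.items():
        wanted = conditions.get(word)
        if wanted is None:
            # Assumption: keywords with no configured condition are left alone.
            kept[word] = list(spans)
            continue
        surviving = []
        for start, end in spans:
            # Context window: `distance` characters on each side of the match,
            # not counting the matched keyword itself.
            before = text[max(0, start - distance):start]
            after = text[end + 1:end + 1 + distance]
            found = any(c in before or c in after for c in wanted)
            if found == require:
                surviving.append((start, end))
        if surviving:
            kept[word] = surviving
    return kept


text = "需要搜索的文本,其中可能包含关键词1,要求附近有条件1"
hits = {"关键词1": [(14, 17)]}
print(filter_hits(text, hits, {"关键词1": ["条件1"]}, distance=10))
# {'关键词1': [(14, 17)]}
print(filter_hits(text, hits, {"关键词1": ["条件1"]}, distance=3))
# {}
```

Checking the two sides separately means a condition word that overlaps the matched keyword is not counted; whether the package behaves the same way is not stated in its description.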
Download files
Source Distribution
ac_auto-0.1.2.tar.gz (5.3 kB)
Built Distribution
ac_auto-0.1.2-py3-none-any.whl (5.7 kB)
File details
Details for the file ac_auto-0.1.2.tar.gz.
File metadata
- Download URL: ac_auto-0.1.2.tar.gz
- Upload date:
- Size: 5.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.0 CPython/3.8.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | 34151ea1053929f193df343f7458e2d8e8eabaee1104e85cfde604ce326a0be8
MD5 | ae783e8d30fd53a9bfa6a507cae2910e
BLAKE2b-256 | 98fbc6ad7305da5c493b0ede5caf9166e7b07847cb87bcaf30b790a0f9332336
File details
Details for the file ac_auto-0.1.2-py3-none-any.whl.
File metadata
- Download URL: ac_auto-0.1.2-py3-none-any.whl
- Upload date:
- Size: 5.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.0 CPython/3.8.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | a4b486b7e3a63670c34cdd84238cd4b74efc2fb1fd9ddfdc2e45ba9765581d25
MD5 | 7a6f50c87fd3d94bcba6fb83cde1c070
BLAKE2b-256 | 0f8b9ffe9e8cc671fa46365d316619e2c7ab2943118f755b790927bf0cfef2cf