darmatch Python bindings
https://github.com/zejunwang1/darmatch
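darmatch is a highly efficient string matching tool. It supports forward and backward maximum matching word segmentation as well as exact multi-pattern string matching:
Header-only
Pattern matching based on a double-array trie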
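To illustrate what forward and backward maximum matching mean, here is a minimal pure-Python sketch. It is not darmatch's implementation: it looks words up in a plain `set` instead of a double-array trie, and the function name `max_match` and the small vocabulary are made up for this example.

```python
def max_match(text, vocab, forward=True):
    """Greedy maximum matching over a plain set of words (illustration only)."""
    max_len = max((len(w) for w in vocab), default=1)
    out = []
    if forward:
        i = 0
        while i < len(text):
            # Try the longest candidate first; fall back to a single character.
            for n in range(min(max_len, len(text) - i), 0, -1):
                if n == 1 or text[i:i + n] in vocab:
                    out.append((i, text[i:i + n]))
                    i += n
                    break
    else:
        j = len(text)
        while j > 0:
            # Same idea, but candidates end at position j and grow leftwards.
            for n in range(min(max_len, j), 0, -1):
                if n == 1 or text[j - n:j] in vocab:
                    out.append((j - n, text[j - n:j]))
                    j -= n
                    break
        out.reverse()
    return out


# Hypothetical vocabulary chosen so that the two directions disagree.
vocab = {"俄罗斯", "俄罗斯联邦", "联邦总统", "总统", "普京"}
text = "俄罗斯联邦总统普京"
forward_words = max_match(text, vocab, forward=True)
backward_words = max_match(text, vocab, forward=False)
print(forward_words)   # [(0, '俄罗斯联邦'), (5, '总统'), (7, '普京')]
print(backward_words)  # [(0, '俄罗斯'), (3, '联邦总统'), (7, '普京')]
```

The two directions split "俄罗斯联邦总统" differently because each one greedily commits to the longest word at its own end of the text.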
C++
For a usage example, see test.cpp in the tests folder:
#include <cstdlib>
#include <iostream>
#include <string>
#include <utility>
#include <vector>
#include <darmatch.h>

int main(int argc, char** argv) {
  std::vector<std::string> args(argv, argv + argc);
  std::string dict_path, user_dict_path;
  // Arguments come in "--name value" pairs; at() throws if a value is missing.
  for (size_t i = 1; i < args.size(); i += 2) {
    if (args[i] == "--dict_path") {
      dict_path = std::string(args.at(i + 1));
    } else if (args[i] == "--user_dict_path") {
      user_dict_path = std::string(args.at(i + 1));
    } else {
      std::cout << "Unknown argument: " << args[i] << std::endl;
      std::cout << "Supported argument: --dict_path --user_dict_path" << std::endl;
      exit(EXIT_FAILURE);
    }
  }
  /*
  initialization methods:
  darmatch::DarMatch da;
  darmatch::DarMatch da(dict_path, user_dict_path = "");
  */
  darmatch::DarMatch da(dict_path, user_dict_path);
  std::string text = "俄罗斯联邦总统普京决定在顿巴斯地区开展特别军事行动。";
  /*
  maximum forward matching:
  std::vector<std::pair<size_t, std::string>> fwords = da.seg(text);
  ----------------------------------------------
  std::vector<std::pair<size_t, std::string>> fwords;
  da.seg(text, fwords);
  */
  std::vector<std::pair<size_t, std::string>> fwords = da.seg(text);
  std::cout << "The Chinese word segmentation based on Maximum Forward Matching: " << std::endl;
  for (size_t i = 0; i < fwords.size(); i++) {
    std::cout << fwords[i].second << " ";
  }
  std::cout << std::endl;
  /*
  maximum backward matching:
  std::vector<std::pair<size_t, std::string>> bwords = da.seg(text, false);
  ------------------------------------------------------
  std::vector<std::pair<size_t, std::string>> bwords;
  da.seg(text, bwords, false);
  */
  std::vector<std::pair<size_t, std::string>> bwords = da.seg(text, false);
  std::cout << "The Chinese word segmentation based on Maximum Backward Matching: " << std::endl;
  for (size_t i = 0; i < bwords.size(); i++) {
    std::cout << bwords[i].second << " ";
  }
  std::cout << std::endl;
  /*
  update the double-array trie by insert:
  da.insert(const std::string&);
  da.insert(const std::vector<std::string>&);
  */
  da.insert("俄罗斯联邦总统");
  // multi-pattern string matching: prints each match as "offset<TAB>word"
  std::vector<std::pair<size_t, std::string>> result = da.parse(text);
  std::cout << "The result of multi-pattern string matching: " << std::endl;
  for (size_t i = 0; i < result.size(); i++) {
    std::cout << result[i].first << "\t" << result[i].second << std::endl;
  }
  return 0;
}
Build with cmake:
git clone https://github.com/zejunwang1/darmatch
cd darmatch
mkdir build
cd build
cmake ..
# cmake -DUSE_PREFIX_TRIE=ON ..
make
After running the commands above, the executable test is generated in the darmatch/build folder.
./test --dict_path ../tests/dict.txt
The output is as follows:
The Chinese word segmentation based on Maximum Forward Matching:
俄罗斯联邦 总统 普京 决定 在 顿巴斯地区 开展 特别 军事行动 。
The Chinese word segmentation based on Maximum Backward Matching:
俄罗斯 联邦总统 普京 决定 在 顿巴斯地区 开展 特别 军事行动 。
The result of multi-pattern string matching:
0	俄罗斯联邦
0	俄罗斯联邦总统
9	联邦总统
21	普京
27	决定
36	顿巴斯地区
51	开展
63	军事行动
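The offsets printed by the C++ example look like UTF-8 byte offsets rather than character offsets. This is an inference from the numbers (each Chinese character in the sample text takes 3 bytes), not something the project states. Under that assumption, a small sketch for converting such offsets to character positions could look like this; `byte_to_char_offset` is a hypothetical helper, not part of darmatch.

```python
def byte_to_char_offset(text, byte_offset):
    """Convert a UTF-8 byte offset into a character (code point) offset.

    Raises UnicodeDecodeError if byte_offset falls inside a multi-byte character.
    """
    return len(text.encode("utf-8")[:byte_offset].decode("utf-8"))


text = "俄罗斯联邦总统普京决定在顿巴斯地区开展特别军事行动。"
# Distinct byte offsets taken from the multi-pattern matching output above.
byte_offsets = [0, 9, 21, 27, 36, 51, 63]
char_offsets = [byte_to_char_offset(text, b) for b in byte_offsets]
print(char_offsets)  # [0, 3, 7, 9, 12, 17, 21]
```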
Python
Requirements
Python version >= 3.6
pybind11 >= 2.2
setuptools >= 0.7.0
typing
Installation
Install directly with pip:
pip install darmatch
Or install the latest version from the GitHub repository:
git clone https://github.com/zejunwang1/darmatch
cd darmatch
pip install .
# or:
python setup.py install
Demo
from darmatch import DarMatch
da = DarMatch()
# da = DarMatch(dict_path, user_dict_path="")
words = ["俄罗斯联邦", "联邦总统", "普京", "决定", "顿巴斯地区", "开展", "军事行动"]
da.insert(words)
text = "俄罗斯联邦总统普京决定在顿巴斯地区开展特别军事行动。"
# maximum forward matching
word_list = da.seg(text, forward=True, return_loc=True)
print("The Chinese word segmentation based on Maximum Forward Matching:")
print(word_list)
# maximum backward matching
word_list = da.seg(text, forward=False, return_loc=True)
print("The Chinese word segmentation based on Maximum Backward Matching:")
print(word_list)
# multi-pattern string matching
da.insert("俄罗斯联邦总统")
word_list = da.parse(text, char_loc=True)
print("The result of multi-pattern string matching:")
print(word_list)
The output is as follows:
The Chinese word segmentation based on Maximum Forward Matching:
[(0, '俄罗斯联邦'), (5, '总统'), (7, '普京'), (9, '决定'), (11, '在'), (12, '顿巴斯地区'), (17, '开展'), (19, '特别'), (21, '军事行动'), (25, '。')]
The Chinese word segmentation based on Maximum Backward Matching:
[(0, '俄罗斯'), (3, '联邦总统'), (7, '普京'), (9, '决定'), (11, '在'), (12, '顿巴斯地区'), (17, '开展'), (19, '特别'), (21, '军事行动'), (25, '。')]
The result of multi-pattern string matching:
[(0, '俄罗斯联邦'), (0, '俄罗斯联邦总统'), (3, '联邦总统'), (7, '普京'), (9, '决定'), (12, '顿巴斯地区'), (17, '开展'), (21, '军事行动')]
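Unlike segmentation, multi-pattern matching reports every dictionary word found in the text, including overlapping ones. The sketch below shows the idea with a nested-dict trie in pure Python. It is only an illustration: darmatch uses a double-array trie in C++, and `build_trie`, `find_all` and the `"$end"` marker are names invented here.

```python
def build_trie(words):
    """Build a nested-dict trie; "$end" marks the end of a complete word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$end"] = True
    return root


def find_all(text, trie):
    """Return (char_offset, word) for every dictionary word occurring in text."""
    hits = []
    for start in range(len(text)):
        node = trie
        for end in range(start, len(text)):
            node = node.get(text[end])
            if node is None:
                break  # no dictionary word continues with this character
            if "$end" in node:
                hits.append((start, text[start:end + 1]))
    return hits


words = ["俄罗斯联邦", "联邦总统", "普京", "决定", "顿巴斯地区",
         "开展", "军事行动", "俄罗斯联邦总统"]
text = "俄罗斯联邦总统普京决定在顿巴斯地区开展特别军事行动。"
matches = find_all(text, build_trie(words))
print(matches)
```

With the same words and text as the demo, this produces the same list as the multi-pattern matching result shown above.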
Speed
Processing speed is compared against esmre, a regular-expression tool built on Aho-Corasick string matching.
esmre can be installed with pip:
pip install esmre
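The tests folder contains the keyword dictionary file used for string matching, string_match_dict.txt, with 348982 keywords in total, and the text file to be matched, check_text.txt, with 273864 characters in total.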
python test_speed.py
The output is as follows:
the number of matching results by esm: 343623
esm time usage: 0.4515085220336914s
----------------------------------------------------
the number of matching results by darmatch: 343623
darmatch time usage: 0.1248319149017334s
In this run, darmatch is roughly 3 to 4 times faster than esm (about 0.12 s versus 0.45 s, a ratio of about 3.6).
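The speedup can be checked directly from the two timings reported above. The snippet only redoes the arithmetic on those published numbers; it does not rerun the benchmark, and the variable names are chosen for this example.

```python
# Timings and input size copied from the benchmark output and description above.
esm_seconds = 0.4515085220336914
darmatch_seconds = 0.1248319149017334
text_chars = 273864

speedup = esm_seconds / darmatch_seconds
chars_per_second = text_chars / darmatch_seconds

print(f"speedup: {speedup:.2f}x")
print(f"darmatch throughput: {chars_per_second / 1e6:.2f} million chars/s")
```

A single run like this is only indicative; timings will vary with hardware and dictionary contents.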
Contact
WeChat: autonlp