CPPJieba Python wrapper
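Python wrapper for cppjieba with no dependencies and no need to compile anything with gcc/g++.
It wraps the C++ version of the jieba word segmenter, performs well, and ships precompiled, so nothing is built at install time and it works out of the box.
Tested on CentOS 7 and macOS 10.13.6 (it may misbehave on Windows in some cases).
The C++ code comes from CPPJieba, the C++ version of the "Jieba" Chinese word segmenter.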
Usage
Install with pip
pip install pyjieba
Manual installation
cd pyjieba
python setup.py install
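API usage
# Import the package
import pyjieba
# Initialize [optional]
pyjieba.initialize()
# Custom dictionaries can be specified at initialization
pyjieba.initialize(dictPath='yourpath', # main dictionary, defaults to dict/jieba.dict.utf8 inside the package
                   userPath='yourpath', # user dictionary, defaults to dict/user.dict.utf8 inside the package; separate multiple dictionaries with semicolons (;)
                   idfPath='yourpath', # IDF dictionary, defaults to dict/idf.utf8 inside the package; used for keyword extraction
                   stopwordsPath='yourpath' # stop-word dictionary, defaults to dict/stop_words.utf8 inside the package
)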
# Word segmentation
sentence = '视觉中国再次致歉'
pyjieba.cut(sentence)
# Output: ['视觉', '中国', '再次', '致歉']
# Part-of-speech tagging
pyjieba.tag(sentence)
# Output: ['视觉/n', '中国/ns', '再次/d', '致歉/v']
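# Extract keywords with TF-IDF
pyjieba.keywordsTFIDF(sentence, # the sentence: either raw text or a segmentation result such as ['视觉', '中国', '再次', '致歉']
                      topN=5, # maximum number of keywords to return, optional
                      allowedPOS='ns,n,vn,v,x' # part-of-speech tags allowed for keywords, optional
)
# Output: [('致歉', 3.618718), ('视觉', 2.680915), ('中国', 1.009107)]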
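# Extract keywords with the TextRank algorithm
pyjieba.keywordsTextrank(sentence, # the sentence: either raw text or a segmentation result such as ['视觉', '中国', '再次', '致歉']
                         topN=5, # maximum number of keywords to return, optional
                         allowedPOS='ns,n,vn,v,x' # part-of-speech tags allowed for keywords, optional
)
# Output: [('视觉', 1.0), ('致歉', 0.996685), ('中国', 0.992994)]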
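The calls above take and return plain strings and lists, so a little glue code is often useful around them. The sketch below is not part of pyjieba: `join_user_dicts` and `split_tags` are hypothetical helper names. The first builds the semicolon-separated `userPath` value described above, and the second splits the `word/pos` strings into pairs, assuming `pyjieba.tag` returns the format shown in the sample output.

```python
# Hypothetical helpers; they do not import or call pyjieba themselves.

def join_user_dicts(paths):
    """Join several user dictionary paths with ';' for the userPath argument."""
    cleaned = [p.strip() for p in paths if p and p.strip()]
    return ';'.join(cleaned)


def split_tags(tagged):
    """Turn ['视觉/n', ...] into [('视觉', 'n'), ...].

    The split happens on the last '/', so a word that itself
    contains '/' is kept whole.
    """
    pairs = []
    for item in tagged:
        word, sep, pos = item.rpartition('/')
        if not sep:
            # No '/' in the item: treat the whole string as the word.
            word, pos = item, ''
        pairs.append((word, pos))
    return pairs


if __name__ == '__main__':
    print(join_user_dicts(['dict/a.utf8', 'dict/b.utf8']))
    print(split_tags(['视觉/n', '中国/ns', '再次/d', '致歉/v']))
```

The joined string could then be passed as `userPath` to `pyjieba.initialize`, and the pairs make it easy to filter words by tag, for example keeping only `n` and `ns`.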
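Modifying the source and build commands
If you need to modify the cppjieba source, you must rebuild on each of the three platforms to produce new .so and .dll files.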
> cd cppjieba_src
# Windows
> g++ jiebaapi.cpp -fPIC -I deps -I include -std=c++11 -shared -o ../pyjieba/libs/cppjieba_API_win64.dll
# macOS
> g++ jiebaapi.cpp -fPIC -I deps -I include -std=c++11 -shared -o ../pyjieba/libs/cppjieba_API_osx64.dll
# Linux
> g++ jiebaapi.cpp -fPIC -I deps -I include -std=c++11 -shared -o ../pyjieba/libs/cppjieba_API_linux64.so
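Each build command writes a differently named library into `pyjieba/libs`, so a wrapper has to pick the right file at import time. How pyjieba does this is not shown here; the following is only a sketch of one plausible approach, with `LIB_NAMES` and `library_name` as hypothetical names. The file names are taken from the build commands above.

```python
import platform

# Output file names from the build commands, keyed by the value
# that platform.system() returns on each operating system.
LIB_NAMES = {
    'Windows': 'cppjieba_API_win64.dll',
    'Darwin': 'cppjieba_API_osx64.dll',
    'Linux': 'cppjieba_API_linux64.so',
}


def library_name(system=None):
    """Return the prebuilt library file name for the given or current platform."""
    system = system or platform.system()
    try:
        return LIB_NAMES[system]
    except KeyError:
        raise OSError('no prebuilt cppjieba library for platform: %s' % system)


if __name__ == '__main__':
    print(library_name())
```

A loader could then join this name with the `libs` directory and hand the full path to something like `ctypes.CDLL`, assuming the library exposes a C-compatible API.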
Benchmark
Platform: CentOS 7, 8 cores, 16 GB RAM, Python 3.6
Length of the novel text: 83791
Segmentation loop iterations: 10
Run | pyjieba | jieba |
---|---|---|
1 | 3147.3ms | 11137.5ms |
2 | 4692.9ms | 12792.7ms |
3 | 3257.1ms | 10830.7ms |
Overall, pyjieba's average time is about 1/3 of jieba's.
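The one-third figure can be checked from the three runs in the table. The sketch below recomputes the means and their ratio, which comes out at roughly 0.32. It also includes `time_ms`, a hypothetical timing helper showing how such a measurement might be taken; the original benchmark script is not shown, so this is an assumption, not a reconstruction of it.

```python
import time

# Timings in milliseconds, copied from the benchmark table.
pyjieba_ms = [3147.3, 4692.9, 3257.1]
jieba_ms = [11137.5, 12792.7, 10830.7]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def time_ms(func, text, loops=10):
    """Call func(text) `loops` times and return the elapsed wall time in ms."""
    start = time.perf_counter()
    for _ in range(loops):
        func(text)
    return (time.perf_counter() - start) * 1000.0


ratio = mean(pyjieba_ms) / mean(jieba_ms)

if __name__ == '__main__':
    print('pyjieba mean: %.1f ms' % mean(pyjieba_ms))
    print('jieba mean:   %.1f ms' % mean(jieba_ms))
    print('ratio:        %.2f' % ratio)
```

To repeat the measurement, `time_ms(pyjieba.cut, text)` and `time_ms(jieba.lcut, text)` could be compared on the same text, assuming both packages are installed.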
Acknowledgements