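A feature extraction tool for Chinese characters. It extracts pinyin, tones, radical decomposition, and four-corner codes from Chinese characters, and can convert them into tensors for use as model input.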
char_featurizer
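char_featurizer is a feature extraction tool for Chinese characters. It extracts a character's pronunciation (initial, final, and tone), glyph structure (components and radicals), four-corner code, and related information. It can also convert these features into tensors to serve as model input features. The project builds on 安德森's character extraction tool, optimizing and consolidating it.
char_featurizer currently supports:
1. Glyph feature extraction
2. Pronunciation feature extraction
3. Four-corner code extraction
4. Tensor conversion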
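To make the pronunciation features concrete, here is a minimal sketch of splitting a numbered-pinyin syllable into initial, final, and tone. It is an illustration only and does not use char_featurizer; the helper `split_pinyin` and the `INITIALS` table are hypothetical names, and treating `y` and `w` as initials is a simplifying assumption.

```python
# Hypothetical helper, not part of char_featurizer.
# Two-letter initials come first so they match before their one-letter prefixes.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")


def split_pinyin(syllable):
    """Split a numbered-pinyin syllable such as 'ming2' into (initial, final, tone)."""
    tone = 0  # 0 stands for a neutral or unmarked tone
    if syllable and syllable[-1].isdigit():
        tone = int(syllable[-1])
        syllable = syllable[:-1]
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):], tone
    # Syllables without an initial, e.g. 'an1'
    return "", syllable, tone


# Numbered pinyin for the characters in '明天去你家玩'
for s in ["ming2", "tian1", "qu4", "ni3", "jia1", "wan2"]:
    print(s, split_pinyin(s))
```

Each character then contributes three categorical values (initial, final, tone) that can be treated as separate features.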
II. Installation and Usage
1. Installation
pip install char_featurizer
2. Usage
1. Character feature extraction
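from char_featurizer import Featurizer

# Create the featurizer
featurizer = Featurizer()
# Input text: "Going to your place to play tomorrow"
data = '明天去你家玩'
# Extract the character features for the input text
result = featurizer.featurize(data)
print(result)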
2. Using the features as model input
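The source does not show the library's own tensor conversion call, so the following is a generic sketch of the usual preparatory step: mapping categorical feature values to integer ids and padding them to a fixed length. The functions `build_vocab` and `encode`, the special tokens, and the sample feature values are all assumptions made for illustration, not part of char_featurizer.

```python
def build_vocab(sequences, pad_token="<pad>", unk_token="<unk>"):
    """Assign an integer id to every distinct feature value (hypothetical helper)."""
    vocab = {pad_token: 0, unk_token: 1}
    for seq in sequences:
        for item in seq:
            # setdefault only inserts when the item is new
            vocab.setdefault(item, len(vocab))
    return vocab


def encode(sequences, vocab, max_len, pad_token="<pad>", unk_token="<unk>"):
    """Convert feature sequences to equal-length lists of ids (hypothetical helper)."""
    pad_id, unk_id = vocab[pad_token], vocab[unk_token]
    batch = []
    for seq in sequences:
        ids = [vocab.get(item, unk_id) for item in seq[:max_len]]
        ids += [pad_id] * (max_len - len(ids))  # right-pad short sequences
        batch.append(ids)
    return batch


# Assumed sample data: pinyin initials for two short sentences
initials = [["m", "t", "q", "n", "j", "w"], ["n", "h"]]
vocab = build_vocab(initials)
ids = encode(initials, vocab, max_len=6)
print(ids)
```

The resulting rectangular list of integers can be handed to a tensor constructor such as `torch.tensor` and fed to an embedding layer, with one id sequence per feature type.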
3. Related resources
III. Update News
2020.5.4: V1 released
IV. Resources
Built Distribution
File details
Details for the file char_featurizer-1.0.0-py3-none-any.whl.
File metadata
- Download URL: char_featurizer-1.0.0-py3-none-any.whl
- Upload date:
- Size: 978.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/46.0.0.post20200309 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.7.6
File hashes
Algorithm | Hash digest
---|---
SHA256 | 4a973271cd1270998b01928a99e0c1d10dca030d4fcf3635da6f5114d0ff4be7
MD5 | c3b977510fa011ed748205806941a91a
BLAKE2b-256 | 9a8727a3e0e89b719c525189d610a983216e381490769a8b24877cdaa09ce6bf