char-similar: Toolkit for Chinese Character Similarity, Especially the Shape of the Characters.
char-similar
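Shape / pinyin / semantic similarity for Chinese characters (single characters; useful for data augmentation and for building confusion sets in CSC typo detection and recognition tasks)
I. Installation
0. Notes
No numpy version is pinned by default (the reference environment uses numpy==1.22.4); versions that are much newer or much older may not be supported.
The dependencies of the reference environment are listed in requirements-all.txt
1. Install from PyPI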
pip install char-similar
Or use a mirror, for example:
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple char-similar
II. Usage
2.1 Quick start
from char_similar import std_cal_sim
char1 = "我"
char2 = "他"
res = std_cal_sim(char1, char2)
print(res)
# output:
# 0.5821
2.2 Detailed usage
from char_similar import std_cal_sim
# "all" (shape:pinyin:meaning = 1:1:1) # "w2v" (shape:meaning = 1:1) # "pinyin" (shape:pinyin = 1:1) # "shape" (shape only)
kind = "shape"
rounded = 4 # number of decimal places to keep
char1 = "我"
char2 = "他"
res = std_cal_sim(char1, char2, rounded=rounded, kind=kind)
print(res)
# output:
# 0.5821
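The `kind` ratios in the comment above can be read as equal weights over per-dimension scores. The following is a minimal sketch of that reading, not the library's implementation: `KIND_WEIGHTS`, `combine_scores`, and the input scores are hypothetical names and values chosen only for illustration.

```python
# Hypothetical weights mirroring the ratios listed for each `kind`.
KIND_WEIGHTS = {
    "all": {"shape": 1, "pinyin": 1, "w2v": 1},
    "w2v": {"shape": 1, "w2v": 1},
    "pinyin": {"shape": 1, "pinyin": 1},
    "shape": {"shape": 1},
}


def combine_scores(scores, kind="shape", rounded=4):
    """Weighted average of per-dimension scores for the given kind."""
    weights = KIND_WEIGHTS[kind]
    total = sum(weights.values())
    value = sum(scores[name] * w for name, w in weights.items()) / total
    return round(value, rounded)


# Illustrative per-dimension scores (made up, not produced by char-similar).
example = {"shape": 0.6, "pinyin": 0.3, "w2v": 0.9}
for kind in ("all", "w2v", "pinyin", "shape"):
    print(kind, combine_scores(example, kind))
```

Under this assumption, `kind="shape"` simply returns the shape score, while `kind="all"` averages the three dimensions.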
2.3 Multithreaded usage
from char_similar import pool_cal_sim
# "all" (shape:pinyin:meaning = 1:1:1) # "w2v" (shape:meaning = 1:1) # "pinyin" (shape:pinyin = 1:1) # "shape" (shape only)
kind = "shape"
rounded = 4 # number of decimal places to keep
char1 = "我"
char2 = "他"
res = pool_cal_sim(char1, char2, rounded=rounded, kind=kind)
print(res)
# output:
# 0.5821
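The call above scores one pair of characters. When many candidates have to be scored against one character, for example while building a confusion set, one option is to fan the calls out over a thread pool. The sketch below is an assumption-laden illustration, not part of char-similar: `build_confusion_set`, `toy_sim`, and the threshold are hypothetical, and the scorer is passed in so the example runs without the package installed.

```python
from concurrent.futures import ThreadPoolExecutor


def build_confusion_set(target, candidates, sim_fn, threshold=0.5, workers=4):
    """Return (candidate, score) pairs with score >= threshold, best first."""
    others = [c for c in candidates if c != target]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps the results in the same order as `others`.
        scores = list(pool.map(lambda c: sim_fn(target, c), others))
    scored = [(c, s) for c, s in zip(others, scores) if s >= threshold]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_sim(a, b):
    """Stand-in scorer for the demo: shrinks as the code points move apart."""
    return 1.0 / (1.0 + abs(ord(a) - ord(b)))


print(build_confusion_set("b", ["a", "b", "c", "z"], toy_sim, threshold=0.4))
```

To score real characters, `std_cal_sim` from section 2.1 could be passed as `sim_fn`, since it takes two characters and returns a number; `functools.partial(std_cal_sim, kind="shape")` would fix the `kind` argument shown in section 2.2. Whether threads give a speed-up depends on how much of the scoring work releases the interpreter lock, so measuring against a plain loop is advisable.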
2.4 Multiprocess usage (not recommended; this implementation is slow)
if __name__ == '__main__':
    from char_similar import multi_cal_sim
    # "all" (shape:pinyin:meaning = 1:1:1) # "w2v" (shape:meaning = 1:1) # "pinyin" (shape:pinyin = 1:1) # "shape" (shape only)
    kind = "shape"
    rounded = 4 # number of decimal places to keep
    char1 = "我"
    char2 = "他"
    res = multi_cal_sim(char1, char2, rounded=rounded, kind=kind)
    print(res)
    # output:
    # 0.5821
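III. How it works
char-similar was originally built to compute the shape similarity of two Chinese characters (for building CSC confusion sets). Pinyin similarity, meaning similarity, character-frequency similarity and other measures were added later; see the source code for details.
# Four-corner code (code=4, 5 digits in total): number of matching digits among the four digits / 4
# Radical: 1 if identical
# Frequency (log10): computed on the log10 frequencies from the large-scale macropodus corpus as 1 - (absolute difference / the larger of the two values)
# Stroke count: 1 - (absolute difference / the larger of the two values)
# Character decomposition: set intersection / set union
# Structure (layout): 1 if identical
# Stroke order (in practice the smallest set): set intersection / set union
# Pinyin (code=4, 4 positions): number of matching items among the four (pinyin / initial / final / tone) / 4
# Character vector: char-word2vec, cosine similarity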
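The formulas in the list above can be sketched in a few lines of Python. This is an illustration of the stated formulas only, not the library's code: the function names are hypothetical, the sample inputs are placeholders rather than entries from the package dictionaries, and the handling of empty or zero inputs is an assumption.

```python
import math


def digit_match_ratio(code_a, code_b, length=4):
    """Share of the first `length` positions where the two codes agree."""
    matches = sum(1 for x, y in zip(code_a[:length], code_b[:length]) if x == y)
    return matches / length


def ratio_similarity(a, b):
    """1 - |a - b| / max(a, b) for non-negative values; 1.0 if both are zero."""
    largest = max(a, b)
    if largest == 0:
        return 1.0
    return 1 - abs(a - b) / largest


def jaccard(set_a, set_b):
    """Intersection size over union size; 0.0 when both sets are empty."""
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    if norm == 0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / norm


print(digit_match_ratio("12345", "12045"))         # 0.75 (3 of 4 digits match)
print(ratio_similarity(8, 6))                      # 0.75
print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))   # 0.5
print(cosine([3.0, 4.0], [3.0, 4.0]))              # 1.0
```

The same `ratio_similarity` shape covers both the stroke-count and the log10-frequency items, and `jaccard` covers both the decomposition and the stroke-order items.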
IV. References (some of the dictionaries come from the following projects)
Reference
For citing this work, you can refer to the present GitHub project. For example, with BibTeX:
@misc{Macropodus,
howpublished = {https://github.com/yongzhuo/char-similar},
title = {char-similar},
author = {Yongzhuo Mo},
publisher = {GitHub},
year = {2024}
}