
char-similar: Toolkit for Chinese character similarity, especially character shape.


char-similar

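Similarity between Chinese characters by shape (glyph), pinyin, and meaning. It works on single characters and can be used for data augmentation and for building confusion sets in CSC (Chinese spelling correction) typo detection tasks.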

I. Installation



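0. Notes

   The numpy version is not pinned by default (the reference setup uses numpy==1.22.4); versions that are much newer or much older may not be supported.

   The full dependency list of the reference setup is in requirements-all.txt.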
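
   Because the numpy version is not pinned, it can help to check which version is actually installed before reporting a problem. The following is a small sketch that uses only the standard library (Python 3.8+); `installed_version` is a helper name introduced here for illustration and is not part of char-similar.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Print what is installed in the current environment (None means "not installed").
for name in ("numpy", "char-similar"):
    print(name, installed_version(name))
```

   If the reported numpy version is far from 1.22.4 and imports fail, installing the versions listed in requirements-all.txt is the first thing to try.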


   


1. Install from PyPI

   pip install char-similar

   Or use a mirror, for example:

   pip install -i https://pypi.tuna.tsinghua.edu.cn/simple char-similar


II. Usage

2.1 Quick start

from char_similar import std_cal_sim

char1 = "我"
char2 = "他"
res = std_cal_sim(char1, char2)
print(res)
# output:
# 0.5821

2.2 Detailed usage

from char_similar import std_cal_sim

# kind: "all" (shape:pinyin:meaning = 1:1:1), "w2v" (shape:meaning = 1:1), "pinyin" (shape:pinyin = 1:1), "shape" (shape only)
kind = "shape"
rounded = 4  # number of decimal places to keep
char1 = "我"
char2 = "他"
res = std_cal_sim(char1, char2, rounded=rounded, kind=kind)
print(res)
# output:
# 0.5821
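
The ratios in the `kind` comment above can be read as weights on per-dimension scores. The sketch below shows one plausible reading, a weighted average. It is an illustration only: `KIND_WEIGHTS`, `combine`, and the scores are made up here, and the library's actual combination logic may differ (see the source).

```python
# Hypothetical weight table mirroring the ratios listed for `kind`.
KIND_WEIGHTS = {
    "all": {"shape": 1, "pinyin": 1, "w2v": 1},
    "w2v": {"shape": 1, "w2v": 1},
    "pinyin": {"shape": 1, "pinyin": 1},
    "shape": {"shape": 1},
}


def combine(scores, kind="shape", rounded=4):
    """Weighted average of the per-dimension scores selected by `kind`."""
    weights = KIND_WEIGHTS[kind]
    total = sum(weights.values())
    value = sum(scores[name] * weight for name, weight in weights.items()) / total
    return round(value, rounded)


# Made-up per-dimension scores for one character pair.
scores = {"shape": 0.5, "pinyin": 0.25, "w2v": 0.75}
for kind in ("all", "w2v", "pinyin", "shape"):
    print(kind, combine(scores, kind))
```

With these made-up inputs the four kinds give 0.5, 0.625, 0.375, and 0.5, which shows how the chosen `kind` shifts the result even when the underlying scores stay the same.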

2.3 Multithreaded usage

from char_similar import pool_cal_sim

# kind: "all" (shape:pinyin:meaning = 1:1:1), "w2v" (shape:meaning = 1:1), "pinyin" (shape:pinyin = 1:1), "shape" (shape only)
kind = "shape"
rounded = 4  # number of decimal places to keep
char1 = "我"
char2 = "他"
res = pool_cal_sim(char1, char2, rounded=rounded, kind=kind)
print(res)
# output:
# 0.5821
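
When many pairs need scoring, a thread pool can also be driven from the caller's side. The sketch below uses the standard library's `ThreadPoolExecutor` with a stand-in similarity function; `score_pairs` and `toy_sim` are names introduced here, and nothing in it describes how `pool_cal_sim` is implemented internally.

```python
from concurrent.futures import ThreadPoolExecutor


def score_pairs(pairs, sim, max_workers=4):
    """Score (char1, char2) pairs concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda pair: sim(*pair), pairs))


def toy_sim(char1, char2):
    """Stand-in similarity: 1.0 for identical characters, else 0.0."""
    return 1.0 if char1 == char2 else 0.0


pairs = [("我", "他"), ("他", "他")]
print(score_pairs(pairs, toy_sim))  # [0.0, 1.0]
```

To score real pairs, `toy_sim` could be swapped for a wrapper around `std_cal_sim`, assuming that function is safe to call from several threads at once.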

2.4 Multiprocess usage (not recommended; this implementation is slow)

if __name__ == '__main__':
    from char_similar import multi_cal_sim

    # kind: "all" (shape:pinyin:meaning = 1:1:1), "w2v" (shape:meaning = 1:1), "pinyin" (shape:pinyin = 1:1), "shape" (shape only)
    kind = "shape"
    rounded = 4  # number of decimal places to keep
    char1 = "我"
    char2 = "他"
    res = multi_cal_sim(char1, char2, rounded=rounded, kind=kind)
    print(res)
    # output:
    # 0.5821

III. How it works



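char-similar was originally built to compute the shape similarity of two Chinese characters (for building CSC confusion sets). Pinyin similarity, meaning similarity, character-frequency similarity, and other measures were added later. See the source code for details.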
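
As an illustration of the confusion-set use case, the sketch below keeps, for each character, the most similar other characters above a threshold. `build_confusion_set`, `toy_sim`, the threshold, and the scores are all made up for this example; in practice `sim` could be a wrapper around `std_cal_sim`.

```python
def build_confusion_set(chars, sim, threshold=0.5, top_k=5):
    """Map each character to its most similar candidates, best first."""
    confusion = {}
    for char in chars:
        scored = [(other, sim(char, other)) for other in chars if other != char]
        kept = [(other, score) for other, score in scored if score >= threshold]
        kept.sort(key=lambda item: item[1], reverse=True)
        confusion[char] = [other for other, _ in kept[:top_k]]
    return confusion


# Made-up symmetric scores standing in for a real similarity function.
TOY_SCORES = {
    frozenset("己已"): 0.9,
    frozenset("己巳"): 0.8,
    frozenset("已巳"): 0.85,
}


def toy_sim(char1, char2):
    return TOY_SCORES.get(frozenset((char1, char2)), 0.0)


print(build_confusion_set("己已巳我", toy_sim))
# {'己': ['已', '巳'], '已': ['己', '巳'], '巳': ['已', '己'], '我': []}
```

The pairwise loop is quadratic in the number of characters, so for a full character inventory it may be worth restricting candidates first, for example to characters that share a radical.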





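# Four-corner code (code=4, 5 digits in total): number of matching digits among the four digits / 4
# Radical: 1 if identical
# Frequency (log10), counted on the large-scale macropodus corpus: 1 - (absolute difference / larger of the two values)
# Stroke count: 1 - (absolute difference / larger of the two values)
# Component decomposition: set intersection / set union
# Structure (composition layout): 1 if identical
# Stroke order (actually the smallest set): set intersection / set union
# Pinyin (code=4, 4 items): number of matching items among the four (full pinyin / initial / final / tone) / 4
# Word vectors: char-word2vec, cosine similarity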
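
The measures listed above reduce to four small formulas. The sketch below writes them out in plain Python under stated assumptions: the function names are introduced here, the input values are illustrative rather than taken from the library's dictionaries, and the way the library encodes codes, components, and pinyin may differ.

```python
import math


def positional_match(code1, code2, length=4):
    """Fraction of the first `length` positions where two codes agree
    (four-corner digits, or pinyin / initial / final / tone)."""
    return sum(a == b for a, b in zip(code1[:length], code2[:length])) / length


def ratio_sim(x, y):
    """1 - |x - y| / max(x, y); assumes non-negative stroke counts or log10 frequencies."""
    biggest = max(x, y)
    return 1.0 if biggest == 0 else 1 - abs(x - y) / biggest


def jaccard(set1, set2):
    """Size of the intersection over size of the union (components, strokes)."""
    union = set1 | set2
    return len(set1 & set2) / len(union) if union else 1.0


def cosine(vec1, vec2):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm = math.sqrt(sum(a * a for a in vec1)) * math.sqrt(sum(b * b for b in vec2))
    return dot / norm if norm else 0.0


# Illustrative inputs only.
print(positional_match("23550", "24212"))                           # 0.25
print(positional_match(("ta", "t", "a", "1"), ("da", "d", "a", "4")))  # 0.25
print(round(ratio_sim(7, 5), 4))                                    # 0.7143
print(round(jaccard({"亻", "也"}, {"氵", "也"}), 4))                  # 0.3333
print(round(cosine([1.0, 0.0], [1.0, 1.0]), 4))                     # 0.7071
```

The "1 if identical" measures (radical, structure) are plain equality checks and need no helper.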


IV. References (some of the dictionaries come from the following projects)

Reference

For citing this work, you can refer to the present GitHub project. For example, with BibTeX:



@misc{Macropodus,
    howpublished = {https://github.com/yongzhuo/char-similar},
    title = {char-similar},
    author = {Yongzhuo Mo},
    publisher = {GitHub},
    year = {2024}
}

