
char-similar: Toolkit for Chinese character similarity, especially the shape of the characters.


char-similar

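Shape, pinyin, and semantic similarity for Chinese characters (single characters). It can be used for data augmentation and for Chinese Spelling Correction (CSC) typo detection and recognition tasks, for example to build confusion sets.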

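I. Installation

0. Notes

   The numpy version is not pinned by default (the reference environment uses numpy==1.22.4); versions that are much newer or much older may not be supported.

   The full dependency list for the reference environment is in requirements-all.txt.

1. Install from PyPI

   pip install char-similar

   To use a mirror, for example:

   pip install -i https://pypi.tuna.tsinghua.edu.cn/simple char-similar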


II. Usage

2.1 Quick start

from char_similar import std_cal_sim

char1 = "我"
char2 = "他"
res = std_cal_sim(char1, char2)
print(res)
# output:
# 0.5821

2.2 Detailed usage

from char_similar import std_cal_sim

# kind options:
# "all"    (shape : pinyin : semantics = 1:1:1)
# "w2v"    (shape : semantics = 1:1)
# "pinyin" (shape : pinyin = 1:1)
# "shape"  (shape only)
kind = "shape"
rounded = 4  # number of decimal places to keep
char1 = "我"
char2 = "他"
res = std_cal_sim(char1, char2, rounded=rounded, kind=kind)
print(res)
# output:
# 0.5821
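The ratios listed for `kind` suggest that the per-aspect scores are blended with equal weights. The sketch below illustrates one way such a blend could work, assuming a plain weighted mean; `KIND_WEIGHTS` and `combine_scores` are hypothetical names for illustration and are not part of the char-similar API, and the library's actual combination may differ (see its source).

```python
# Hypothetical weight table mirroring the ratios in the comments above.
KIND_WEIGHTS = {
    "all": {"shape": 1, "pinyin": 1, "w2v": 1},
    "w2v": {"shape": 1, "w2v": 1},
    "pinyin": {"shape": 1, "pinyin": 1},
    "shape": {"shape": 1},
}


def combine_scores(scores, kind="shape", rounded=4):
    """Blend per-aspect scores with a weighted mean (an assumed scheme)."""
    weights = KIND_WEIGHTS[kind]
    total = sum(weights.values())
    blended = sum(scores[name] * w for name, w in weights.items()) / total
    return round(blended, rounded)


# Made-up per-aspect scores, purely for illustration.
example_scores = {"shape": 0.6, "pinyin": 0.3, "w2v": 0.9}
for kind in ("all", "w2v", "pinyin", "shape"):
    print(kind, combine_scores(example_scores, kind=kind))
```

With equal weights, each selected aspect contributes the same share, so switching `kind` only changes which aspects take part in the average.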

2.3 Multithreaded usage

from char_similar import pool_cal_sim

# kind options:
# "all"    (shape : pinyin : semantics = 1:1:1)
# "w2v"    (shape : semantics = 1:1)
# "pinyin" (shape : pinyin = 1:1)
# "shape"  (shape only)
kind = "shape"
rounded = 4  # number of decimal places to keep
char1 = "我"
char2 = "他"
res = pool_cal_sim(char1, char2, rounded=rounded, kind=kind)
print(res)
# output:
# 0.5821

2.4 Multiprocess usage (not recommended; the implementation is slow)

if __name__ == '__main__':
    from char_similar import multi_cal_sim

    # kind options:
    # "all"    (shape : pinyin : semantics = 1:1:1)
    # "w2v"    (shape : semantics = 1:1)
    # "pinyin" (shape : pinyin = 1:1)
    # "shape"  (shape only)
    kind = "shape"
    rounded = 4  # number of decimal places to keep
    char1 = "我"
    char2 = "他"
    res = multi_cal_sim(char1, char2, rounded=rounded, kind=kind)
    print(res)
    # output:
    # 0.5821

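III. How it works

char-similar was originally built to compute the shape similarity of two Chinese characters (for building CSC confusion sets). Pinyin similarity, semantic similarity, frequency similarity, and other measures were added later; see the source code for details.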
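As an illustration of the confusion-set use case, the sketch below collects, for each character, the candidates whose similarity reaches a threshold. `build_confusion_set` and `toy_sim` are hypothetical helpers written for this example, and the threshold and `top_k` values are arbitrary; the toy scorer uses a tiny hand-made component table so the block runs without the library installed. In practice, a function with the same two-character signature, such as `std_cal_sim` from the quick start, could be passed as `sim_fn`.

```python
def build_confusion_set(chars, sim_fn, threshold=0.5, top_k=5):
    """Map each character to its most similar candidates (illustrative)."""
    chars = list(chars)
    confusion = {}
    for char in chars:
        scored = [(other, sim_fn(char, other)) for other in chars if other != char]
        kept = [(other, score) for other, score in scored if score >= threshold]
        kept.sort(key=lambda pair: pair[1], reverse=True)  # most similar first
        confusion[char] = kept[:top_k]
    return confusion


# Toy stand-in scorer: Jaccard overlap on a small hand-made component table.
TOY_COMPONENTS = {
    "我": {"我"},
    "他": {"亻", "也"},
    "她": {"女", "也"},
    "地": {"土", "也"},
}


def toy_sim(char1, char2):
    parts1, parts2 = TOY_COMPONENTS[char1], TOY_COMPONENTS[char2]
    return len(parts1 & parts2) / len(parts1 | parts2)


confusion_set = build_confusion_set(TOY_COMPONENTS, toy_sim, threshold=0.3)
for char, candidates in confusion_set.items():
    print(char, [other for other, _ in candidates])
```

Comparing every pair costs quadratic time in the number of characters, so for a large vocabulary it may be worth restricting candidates first (for example, to characters sharing a radical).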





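# Four-corner code (code=4, 5 digits in total): number of matching digits among the four, divided by 4
# Radical: 1 if identical
# Frequency (log10): computed on log10 frequencies from the large-scale Macropodus corpus, 1 - (absolute difference / larger of the two values)
# Stroke count: 1 - (absolute difference / larger of the two values)
# Component decomposition: set intersection / set union
# Structure: 1 if identical
# Stroke order (in practice, the smallest set): set intersection / set union
# Pinyin (code=4, 4 components): number of matching components among the four (pinyin / initial / final / tone), divided by 4
# Word vector: char-word2vec, cosine similarity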
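The measures listed above fall into four generic formulas. The sketch below shows one possible reading of each; the function names are hypothetical, the position-by-position digit comparison is an assumption, and the inputs are illustrative rather than taken from the library's dictionaries. The library's own implementation may differ in detail, so treat this as a sketch and refer to the source.

```python
import math


def positional_match(code1, code2, n=4):
    """Share of the first n positions that agree (four-corner code, pinyin parts)."""
    return sum(a == b for a, b in zip(code1[:n], code2[:n])) / n


def ratio_similarity(a, b):
    """1 - |a - b| / max(a, b) (stroke counts, log10 frequencies)."""
    largest = max(a, b)
    if largest == 0:
        return 1.0  # assumption: two zero values count as identical
    return 1.0 - abs(a - b) / largest


def jaccard_similarity(parts1, parts2):
    """Set intersection over set union (component decomposition, stroke order)."""
    set1, set2 = set(parts1), set(parts2)
    if not set1 and not set2:
        return 1.0  # assumption: two empty sets count as identical
    return len(set1 & set2) / len(set1 | set2)


def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two vectors (character embeddings)."""
    dot = sum(x * y for x, y in zip(vec1, vec2))
    norm = math.sqrt(sum(x * x for x in vec1)) * math.sqrt(sum(y * y for y in vec2))
    return dot / norm if norm else 0.0


# Illustrative inputs only.
print(positional_match("12345", "12645"))                 # placeholder 5-digit codes
print(ratio_similarity(7, 5))                             # e.g. stroke counts 7 and 5
print(jaccard_similarity(["亻", "也"], ["女", "也"]))      # component sets
print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))          # toy 2-d vectors
```

The radical and structure checks need no helper: each is simply 1 when the two values are equal and 0 otherwise.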


IV. References (some of the dictionaries come from the projects below)

Reference

For citing this work, you can refer to the present GitHub project. For example, with BibTeX:



@misc{Macropodus,
    howpublished = {https://github.com/yongzhuo/char-similar},
    title = {char-similar},
    author = {Yongzhuo Mo},
    publisher = {GitHub},
    year = {2024}
}

