Compute the similarity between two characters

Project description

char-similar-z

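Glyph-shape, pinyin, and semantic similarity for Chinese characters (single characters only; useful for data augmentation and for building confusion sets in CSC typo detection and recognition tasks)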

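  • Note: this package is based entirely on the https://github.com/yongzhuo/char-similar project. Only part of the code was changed (multithreading and multiprocessing were removed, leaving only the basic functionality) so that it supports Python 3.10+. All other features and usage are unchanged.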

1. Installation

pip install char_similar_z

2. Usage

2.1 Detailed usage

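import time
from char_similar_z import CharSimilarity

# "shape": glyph shape; "all": combines glyph shape, meaning, and pinyin;
# "w2v": meaning first, plus glyph shape; "pinyin": pinyin first, plus glyph shape
# kind: choose one of "shape", "all", "w2v", "pinyin"
# To use "w2v" or "all", install the extra dependency first: pip install xiangsi
# For single characters, "w2v" and "all" are not meaningful; "pinyin" is recommended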

sim = CharSimilarity()
char1 = "我"
char2 = "他"
for kind in ["shape", "pinyin"]:
    t0 = time.time()
    score = sim.std_cal_sim(char1, char2, kind=kind)
    t1 = time.time()
    print(f"similarity({char1}, {char2})[{kind}]: {score}, elapsed: {round(t1 - t0, 4)}s")
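Since the stated use case is building CSC confusion sets, the pairwise score above can be wrapped in a small ranking helper. The following is a minimal sketch, not part of the package: `build_confusion_set`, `threshold`, `top_k`, and the toy scorer with its made-up values are all hypothetical names introduced for illustration. The helper takes any scoring callable so that it runs without the package installed.

```python
from typing import Callable, Iterable


def build_confusion_set(
    target: str,
    candidates: Iterable[str],
    score_fn: Callable[[str, str], float],
    threshold: float = 0.5,
    top_k: int = 5,
) -> list[tuple[str, float]]:
    """Return up to top_k candidates whose score against target is >= threshold."""
    scored = []
    for cand in candidates:
        if cand == target:
            continue  # a character is not its own confusion candidate
        score = score_fn(target, cand)
        if score >= threshold:
            scored.append((cand, score))
    # Highest score first; ties are broken by the character itself for stable output.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_k]


# Stand-in scorer with made-up values (NOT real char-similar-z output),
# used only so the sketch can run on its own.
_TOY_SCORES = {("我", "找"): 0.8, ("我", "戈"): 0.6, ("我", "他"): 0.2}


def toy_score(a: str, b: str) -> float:
    return _TOY_SCORES.get((a, b), 0.0)


if __name__ == "__main__":
    print(build_confusion_set("我", ["找", "戈", "他", "我"], toy_score))
```

With the package installed, `score_fn` could presumably be `lambda a, b: sim.std_cal_sim(a, b, kind="shape")`, reusing the call shown above. The default threshold of 0.5 is an arbitrary assumption and would need tuning against real scores.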

3. How it works

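char-similar was originally built to compute the glyph-shape similarity of two Chinese characters (for building CSC confusion sets). Pinyin similarity, semantic similarity, character-frequency similarity, and others were added later. See the source code for details.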

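# Four-corner code (code=4, 5 digits in total): number of matching digits among the four, divided by 4
# Radical: 1 if identical
# Frequency (log10): 1 - (absolute difference / larger of the two values), using log10 frequencies from the large-scale macropodus corpus
# Stroke count: 1 - (absolute difference / larger of the two values)
# Character decomposition: set intersection / set union
# Structure type: 1 if identical
# Stroke order (in practice, the smallest set): set intersection / set union
# Pinyin (code=4, 4 positions in total): number of matching positions among the four (pinyin / initial / final / tone), divided by 4
# Word vector: char-word2vec, cosine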
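The formulas listed above can be sketched as a few small functions. This is an illustrative reading of the comments, not the package's actual code: the function names, the handling of empty or zero inputs, and the sample values are assumptions, and how the package weights and combines the individual scores is not shown here.

```python
import math


def position_match_ratio(code_a: str, code_b: str, length: int = 4) -> float:
    """Share of the first `length` positions where two codes agree
    (the four-corner code and pinyin rules)."""
    matches = sum(1 for x, y in zip(code_a[:length], code_b[:length]) if x == y)
    return matches / length


def ratio_similarity(a: float, b: float) -> float:
    """1 - |a - b| / max(a, b) (the stroke-count and log10-frequency rules).
    Returning 1.0 when the larger value is 0 is an assumption."""
    largest = max(a, b)
    if largest == 0:
        return 1.0
    return 1.0 - abs(a - b) / largest


def jaccard(set_a: set, set_b: set) -> float:
    """Set intersection / set union (the decomposition and stroke-order rules).
    Returning 0.0 for two empty sets is an assumption."""
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (the word-vector rule)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # All inputs below are made-up illustrative values, not real dictionary entries.
    print(position_match_ratio("1234", "1244"))
    print(ratio_similarity(7, 5))
    print(jaccard({"a", "b"}, {"b", "c"}))
    print(cosine([1.0, 0.0], [0.0, 1.0]))
```

Each function returns a value between 0 and 1 for non-negative inputs, which makes the per-feature scores straightforward to average or weight.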

4. References (some dictionaries come from the following projects)

Reference

For citing this work, you can refer to the present GitHub project. For example, with BibTeX:

@misc{Macropodus,
    howpublished = {https://github.com/yongzhuo/char-similar},
    title = {char-similar},
    author = {Yongzhuo Mo},
    publisher = {GitHub},
    year = {2024}
}
