Compute the similarity between two characters

char-similar-z

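Similarity between Chinese characters by glyph shape, pinyin, and semantics (single characters only). Useful for data augmentation and for building confusion sets in Chinese spelling correction (CSC) typo detection and recognition tasks.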

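  • Note: this package is based entirely on https://github.com/yongzhuo/char-similar. Only part of the code was modified (multithreading and multiprocessing were removed, leaving only the basic functionality) so that it supports Python 3.10+. All other features and usage remain the same.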

1. Installation

pip install char_similar_z

2. Usage

2.1 Detailed usage

import time
from char_similar_z import CharSimilarity

# "shape": glyph shape; "all": combines shape, semantics, and pinyin;
# "w2v": semantics first, plus shape; "pinyin": pinyin first, plus shape
# kind = "shape"  # "all"  # "w2v"  # "pinyin"  # "shape"
# For single characters, "w2v" and "all" are not meaningful; "pinyin" is recommended
sim = CharSimilarity()
char1 = "我"
char2 = "他"
for kind in ["shape", "pinyin", "w2v", "all"]:
    # Time each similarity call separately
    t0 = time.perf_counter()
    score = sim.std_cal_sim(char1, char2, kind=kind)
    t1 = time.perf_counter()
    print(f"similarity({char1}, {char2})[{kind}]: {score}, elapsed: {round(t1 - t0, 4)}s")

3. How it works

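char-similar was originally built to compute the glyph-shape similarity of two Chinese characters (for building CSC confusion sets). Pinyin similarity, semantic similarity, frequency similarity, and others were added later; see the source code for details.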
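As a rough illustration of the confusion-set use case, the sketch below ranks, for each character, the other characters whose pairwise score passes a threshold. The names `build_confusion_set`, `toy_score`, `threshold`, and `top_k` are hypothetical and not part of the package; `toy_score` is a hand-written stand-in with made-up scores so that the example runs without the package installed.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def build_confusion_set(
    chars: Iterable[str],
    score_fn: Callable[[str, str], float],
    threshold: float = 0.5,
    top_k: int = 5,
) -> Dict[str, List[Tuple[str, float]]]:
    """For each character, keep up to top_k other characters scoring >= threshold."""
    unique_chars = list(dict.fromkeys(chars))  # de-duplicate, keep input order
    confusion: Dict[str, List[Tuple[str, float]]] = {}
    for a in unique_chars:
        scored = [(b, score_fn(a, b)) for b in unique_chars if b != a]
        kept = [(b, s) for b, s in scored if s >= threshold]
        kept.sort(key=lambda pair: pair[1], reverse=True)  # most similar first
        confusion[a] = kept[:top_k]
    return confusion


def toy_score(a: str, b: str) -> float:
    """Hypothetical scorer with made-up values, used only for illustration."""
    table = {
        frozenset("己已"): 0.9,
        frozenset("己巳"): 0.8,
        frozenset("已巳"): 0.85,
    }
    return table.get(frozenset((a, b)), 0.1)


confusion = build_confusion_set("己已巳我", toy_score, threshold=0.5)
for char, candidates in confusion.items():
    print(char, candidates)
```

With the package installed, `score_fn` could presumably wrap the call shown in section 2.1, for example `lambda a, b: sim.std_cal_sim(a, b, kind="shape")`. Comparing every pair is quadratic in the number of characters, so a large character set may need pre-filtering.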

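# Four-corner code (code=4, 5 digits in total): number of matching digits among the four digits / 4
# Radical: 1 if identical
# Word frequency (log10): log10 frequencies from the large-scale macropodus corpus; 1 - (absolute difference / larger of the two values)
# Stroke count: 1 - (absolute difference / larger of the two values)
# Character decomposition: set intersection / set union
# Compositional structure: 1 if identical
# Stroke order (in practice, the smallest component set): set intersection / set union
# Pinyin (code=4, 4 positions in total): number of matching items among the four (pinyin / initial / final / tone) / 4
# Word vectors: char-word2vec, cosine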
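The per-feature formulas listed above can be sketched generically as follows. This is a minimal illustration of the four formula shapes (positional match ratio, relative difference, intersection over union, cosine), not the package's actual implementation; all function names are hypothetical, and how the real code looks up codes, strokes, or vectors and weights the features is not shown here.

```python
import math
from typing import Iterable, Sequence


def positional_match_ratio(code_a: Sequence, code_b: Sequence, length: int = 4) -> float:
    """Share of the first `length` positions at which the two codes agree."""
    matches = sum(1 for x, y in zip(code_a[:length], code_b[:length]) if x == y)
    return matches / length


def relative_difference_similarity(x: float, y: float) -> float:
    """1 - |x - y| / max(x, y), assuming non-negative inputs; 1.0 when both are zero."""
    largest = max(x, y)
    if largest == 0:
        return 1.0
    return 1 - abs(x - y) / largest


def jaccard(items_a: Iterable, items_b: Iterable) -> float:
    """Size of the set intersection divided by the size of the set union."""
    a, b = set(items_a), set(items_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Illustrative inputs only; these are not values taken from the package's dictionaries.
print(positional_match_ratio("2355", "2421"))  # one of four positions matches
print(relative_difference_similarity(7, 5))    # e.g. two stroke counts
print(jaccard({"a", "b"}, {"b", "c"}))         # e.g. two component sets
print(cosine([1.0, 0.0], [1.0, 0.0]))          # identical directions
```

The two "1 if identical" features (radical and compositional structure) reduce to an equality check and are omitted.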

4. References (some of the dictionaries come from the following projects)

Reference

For citing this work, you can refer to the present GitHub project. For example, with BibTeX:

@misc{Macropodus,
    howpublished = {https://github.com/yongzhuo/char-similar},
    title = {char-similar},
    author = {Yongzhuo Mo},
    publisher = {GitHub},
    year = {2024}
}
