Compute the similarity between two characters
char-similar-z
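Glyph, pinyin, and semantic similarity for Chinese characters (single characters only; useful for data augmentation and for CSC typo detection and recognition tasks, e.g. building confusion sets)
- Note: this package is based entirely on the https://github.com/yongzhuo/char-similar project. Only part of the code was modified (multithreading and multiprocessing were removed, keeping just the basic functionality) so that it supports Python 3.10+. All other features and usage remain the same.
1. Installation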
pip install char_similar_z
2. Usage
2.1 Detailed usage
import time
from char_similar_z import CharSimilarity
# "shape": glyph shape; "all": combines glyph, semantics, and pinyin; "w2v": semantics first, plus glyph; "pinyin": pinyin first, plus glyph
# kind = "shape"  # "all" # "w2v" # "pinyin" # "shape"
# For single characters, "w2v" and "all" are not meaningful; "pinyin" is recommended
sim = CharSimilarity()
char1 = "我"
char2 = "他"
for kind in ["shape", "pinyin", "w2v", "all"]:
    t0 = time.time()
    score = sim.std_cal_sim(char1, char2, kind=kind)
    t1 = time.time()
    print(f"similarity({char1}, {char2})[{kind}]: {score}, elapsed: {round(t1 - t0, 4)}s")
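Since the stated use case is building CSC confusion sets, the sketch below shows one way pairwise scores could be turned into such a set. It is an illustration, not part of the package: `build_confusion_set`, `toy_score`, `top_k`, and `threshold` are hypothetical names, and the toy scores are made up so the example runs without the library installed.

```python
from typing import Callable, Dict, Iterable, List, Tuple

def build_confusion_set(
    chars: Iterable[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 3,
    threshold: float = 0.5,
) -> Dict[str, List[Tuple[str, float]]]:
    """For each character, keep the top_k other characters scoring >= threshold."""
    chars = list(chars)
    confusion: Dict[str, List[Tuple[str, float]]] = {}
    for a in chars:
        # Score every other candidate against the current character.
        scored = [(b, score_fn(a, b)) for b in chars if b != a]
        kept = [(b, s) for b, s in scored if s >= threshold]
        # Highest similarity first.
        kept.sort(key=lambda pair: pair[1], reverse=True)
        confusion[a] = kept[:top_k]
    return confusion

# Made-up symmetric scores standing in for a real similarity function.
TOY_SCORES = {
    frozenset("未末"): 0.9,
    frozenset("未木"): 0.6,
    frozenset("末木"): 0.55,
    frozenset("未我"): 0.1,
}

def toy_score(a: str, b: str) -> float:
    """Look up a toy score; unknown pairs score 0.0."""
    return TOY_SCORES.get(frozenset((a, b)), 0.0)

result = build_confusion_set(["未", "末", "木", "我"], toy_score, top_k=2)
for char, candidates in result.items():
    print(char, candidates)
```

With the package installed, `score_fn` could presumably be replaced by something like `lambda a, b: sim.std_cal_sim(a, b, kind="shape")`, assuming that call returns a numeric score as in the usage example above.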
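3. Technical principles
char-similar was originally built to compute the glyph similarity of two Chinese characters (for building CSC confusion sets). Pinyin similarity, semantic similarity, and character-frequency similarity were added later. See the source code for details.
# Four-corner code (code=4, 5 digits in total): number of matching digits among the four digits / 4
# Radical: 1 if identical
# Frequency (log10): log10 of the frequency counted on the large-scale Macropodus corpus, scored as 1 - (absolute difference / larger of the two values)
# Stroke count: 1 - (absolute difference / larger of the two values)
# Component decomposition: set intersection / set union
# Structure: 1 if identical
# Stroke order (in practice the minimal set): set intersection / set union
# Pinyin (code=4, 4 fields in total): number of matching fields among the four (pinyin / initial / final / tone) / 4
# Character vectors: char-word2vec, cosine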
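The per-feature formulas listed above can be sketched as four small functions. This is a minimal illustration of the formulas as described here, not the package's actual code: the function names, the handling of empty or zero inputs, and the example codes are assumptions, and how the library treats the fifth four-corner digit or weights the features is not covered.

```python
import math
from typing import Sequence, Set

def positional_match_ratio(code_a: Sequence[str], code_b: Sequence[str], n: int = 4) -> float:
    """Share of the first n positions that match (four-corner digits, pinyin fields)."""
    return sum(x == y for x, y in zip(code_a[:n], code_b[:n])) / n

def relative_closeness(x: float, y: float) -> float:
    """1 - |x - y| / max(x, y), for stroke counts and log10 frequencies.

    Assumes non-negative inputs; two zeros are treated as identical.
    """
    biggest = max(x, y)
    if biggest == 0:
        return 1.0
    return 1.0 - abs(x - y) / biggest

def jaccard(a: Set[str], b: Set[str]) -> float:
    """Set intersection / set union, for components and stroke sets."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets carry no evidence of similarity
    return len(a & b) / len(union)

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(p * q for p, q in zip(u, v))
    norm = math.sqrt(sum(p * p for p in u)) * math.sqrt(sum(q * q for q in v))
    return dot / norm if norm else 0.0

# Made-up inputs, chosen only to exercise each formula.
print(positional_match_ratio("5033", "5023"))
print(relative_closeness(7, 5))
print(jaccard({"木", "一"}, {"木", "丨"}))
print(cosine([1.0, 0.0], [0.0, 1.0]))
```

Binary features such as radical and structure reduce to an equality check that yields 1 or 0, so they need no helper of their own.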
4. References (some of the dictionaries come from the following projects)
- https://github.com/contr4l/SimilarCharacter
- https://github.com/houbb/nlp-hanzi-similar
- https://github.com/mozillazg/python-pinyin
- https://github.com/CNMan/UnicodeCJK-WuBi
- https://github.com/yongzhuo/Macropodus
- https://github.com/kfcd/chaizi
Reference
For citing this work, you can refer to the present GitHub project. For example, with BibTeX:
@misc{Macropodus,
howpublished = {https://github.com/yongzhuo/char-similar},
title = {char-similar},
author = {Yongzhuo Mo},
publisher = {GitHub},
year = {2024}
}