An easy-to-use Python package for fuzzy matching Chinese (simplified and traditional), Japanese, and Korean text, using character similarity learned with a Vision Transformer (ViT).

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

HomoglyphsCJK

An efficient and useful tool to fuzzy match Japanese, Korean, Simplified Chinese or Traditional Chinese words, using character visual similarity.

Installation

pip install HomoglyphsCJK

Usage

The package provides two functions: hg_distance computes the homoglyphic edit distance between two strings, and hg_merge merges two dataframes on their key columns using that distance. Homoglyphic edit distance is an edit distance whose substitution cost reflects how visually similar the two characters are.

When you call hg_merge or hg_distance for a given language, the homoglyph dict for that language is loaded from your current directory if it is already there; otherwise it is downloaded automatically. Run your script from a folder where you have write permission. The available languages are zhs, zht, ko, and ja, for Simplified Chinese, Traditional Chinese, Korean, and Japanese respectively.
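Because the dict may need to be written to disk on first use, it can help to check up front that the working directory is writable. The snippet below is a minimal sketch using only the standard library; the helper name can_write_here is hypothetical and not part of HomoglyphsCJK.

```python
import tempfile


def can_write_here(directory="."):
    """Return True if a file can be created (and removed) in `directory`."""
    try:
        # The temporary file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=directory):
            pass
        return True
    except OSError:
        # Covers missing directories and permission errors alike.
        return False


if not can_write_here():
    print("Current directory is not writable; run the script from another folder.")
```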
hg_merge merges two dataframes. Set the parallel argument to use multiprocessing; if you do not also set num_workers, all detected CPU cores are used.
Note that hg_merge de-duplicates the key columns you pass in, so the result contains each key value only once. If you need to merge, for instance, a panel dataset with a cross-sectional dataset, de-duplicate the panel dataset's key before passing it in, then merge the result back onto your panel data using the matched key.

import pandas as pd
from HomoglyphsCJK import hg_distance, hg_merge

df_1 = pd.DataFrame(['苏萃乡', '办雄', '虐格给', '雪拉普岗'], columns=['query'])
df_2 = pd.DataFrame(['雪拉普岗日', '小苏莽乡', '协雄', '唐格给', '太阳村', '月亮湾'], columns=['key'])

# Merge two dataframes. The homoglyph dict for the chosen language is downloaded automatically on the first run.
# Run in parallel with a pool of 4 workers; if num_workers is not specified, all available CPU cores are used.
dataframe_merged = hg_merge('zhs', df_1, df_2, 'query', 'key', homo_lambda=1, insertion=1, deletion=1, parallel=True, num_workers=4)

# Run without multiprocessing.
dataframe_merged = hg_merge('zhs', df_1, df_2, 'query', 'key', homo_lambda=1, insertion=1, deletion=1)

# Arguments, in order:
#   lang: one of zhs, zht, ja, ko, or the path to your own trained homoglyph dict pickle file
#   dataframe 1: the first dataframe
#   dataframe 2: the second dataframe
#   key column of dataframe 1
#   key column of dataframe 2
#   homo_lambda: weight on the homoglyph substitution distance, default is 1
#   insertion: weight on the insertion cost, default is 1
#   deletion: weight on the deletion cost, default is 1

ocred_text	homo_matched_truth_text	homo_dist
苏萃乡	小苏莽乡	1.88
办雄	协雄	0.15
虐格给	唐格给	0.87
雪拉普岗	雪拉普岗日	1.0
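For the panel-data case mentioned above, the de-duplicate, match, and merge-back steps can be sketched with plain pandas. This is an illustrative sketch only: the panel data is made up, the hg_merge call is left as a comment, and the frame standing in for its result uses assumed column names that may differ from what the package actually returns.

```python
import pandas as pd

# Hypothetical panel data: each place name appears once per year.
panel = pd.DataFrame({
    "name": ["办雄", "办雄", "苏萃乡", "苏萃乡"],
    "year": [2020, 2021, 2020, 2021],
    "value": [1.0, 1.5, 2.0, 2.5],
})

# Step 1: de-duplicate the key column before matching.
unique_names = panel[["name"]].drop_duplicates().reset_index(drop=True)

# Step 2 (not run here): match the unique keys against the cross-section, e.g.
#   matched = hg_merge('zhs', unique_names, cross_section, 'name', 'key')
# The frame below stands in for that result; its column names are assumptions.
matched = pd.DataFrame({
    "name": ["办雄", "苏萃乡"],
    "key": ["协雄", "小苏莽乡"],
    "homo_dist": [0.15, 1.88],
})

# Step 3: merge the matches back onto the full panel using the matched key.
panel_matched = panel.merge(matched, on="name", how="left")
print(panel_matched)
```

A left merge keeps every panel row, so each year of a given name carries the same matched key and distance.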

hg_distance calculates the homoglyphic edit distance between two strings. The default weights on substitution, insertion, and deletion are all 1.

hg_distance('苏萃乡','小苏莽乡','zhs',homo_lambda=1, insertion=1, deletion=1)
# 1.88
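To make the idea of a homoglyphic edit distance concrete, here is a minimal sketch of a weighted edit distance in which substituting visually similar characters is cheap. It is not the package's implementation: the function name, the toy cost table, and the fallback cost of 1 for pairs missing from the table are all assumptions chosen for illustration.

```python
def weighted_edit_distance(a, b, sub_cost, homo_lambda=1.0, insertion=1.0, deletion=1.0):
    """Edit distance with per-pair substitution costs.

    sub_cost maps (char_a, char_b) pairs to a cost; pairs missing from the
    mapping fall back to a full cost of 1 (an assumption of this sketch).
    """
    rows, cols = len(a) + 1, len(b) + 1
    # dist[i][j] is the cost of turning a[:i] into b[:j].
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * deletion
    for j in range(1, cols):
        dist[0][j] = j * insertion
    for i in range(1, rows):
        for j in range(1, cols):
            ca, cb = a[i - 1], b[j - 1]
            if ca == cb:
                substitution = 0.0
            else:
                # Look the pair up in either order before falling back to 1.
                pair = sub_cost.get((ca, cb), sub_cost.get((cb, ca), 1.0))
                substitution = homo_lambda * pair
            dist[i][j] = min(
                dist[i - 1][j] + deletion,          # delete ca
                dist[i][j - 1] + insertion,         # insert cb
                dist[i - 1][j - 1] + substitution,  # substitute ca with cb
            )
    return dist[-1][-1]


# Hypothetical similarity-derived costs, for illustration only.
toy_costs = {("办", "协"): 0.15, ("虐", "唐"): 0.87}

print(round(weighted_edit_distance("办雄", "协雄", toy_costs), 2))         # 0.15
print(round(weighted_edit_distance("雪拉普岗", "雪拉普岗日", toy_costs), 2))  # 1.0
```

With these toy costs, one cheap substitution gives a small distance, while one extra character costs a full insertion.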

Contributing

We encourage you to contribute to HomoglyphsCJK!

Questions

If you have any questions about using this package, you can open an issue in our GitHub repository. We are maintaining and updating this package, so stay tuned!

Citation

@misc{yang2023quantifying,
      title={Quantifying Character Similarity with Vision Transformers}, 
      author={Xinmei Yang and Abhishek Arora and Shao-Yu Jheng and Melissa Dell},
      year={2023},
      eprint={2305.14672},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Release history

- 0.1.6: Oct 13, 2023 (this version)
- 0.1.5: Jun 22, 2023
- 0.1.4: Jun 6, 2023
- 0.1.3: May 27, 2023
- 0.1.2: May 27, 2023
- 0.1.1: May 25, 2023
- 0.1.0: May 25, 2023
- 0.0.9: May 22, 2023
- 0.0.8: May 22, 2023
- 0.0.7: May 22, 2023
- 0.0.6: May 22, 2023
- 0.0.5: May 22, 2023
- 0.0.4: May 22, 2023
- 0.0.3: May 22, 2023
- 0.0.2: May 22, 2023
- 0.0.1: May 22, 2023

Download files

Source Distribution

HomoglyphsCJK-0.1.6.tar.gz (6.8 kB)

Uploaded Oct 13, 2023 Source

File details

Details for the file HomoglyphsCJK-0.1.6.tar.gz.

File metadata

Download URL: HomoglyphsCJK-0.1.6.tar.gz
Upload date: Oct 13, 2023
Size: 6.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.10.5

File hashes

Hashes for HomoglyphsCJK-0.1.6.tar.gz
Algorithm	Hash digest
SHA256	`c8922276a2ad5e9d62c8630b41455014e09f57323331ceb5df058c86c9bec4eb`
MD5	`7d5bf5b787e0d85b4cb31ce9ea090dc6`
BLAKE2b-256	`75abdff9ab43580e9df67d20b86de5d089ef75c3dea9c181e0b0d153dd35cd49`
