An easy-to-use Python package for fuzzy matching Chinese (Simplified and Traditional), Japanese, and Korean text, using character similarity learned with a Vision Transformer (ViT)

Project description

HomoglyphsCJK

An efficient tool for fuzzy matching Japanese, Korean, Simplified Chinese, or Traditional Chinese words, particularly useful for record linkage on OCRed text.

Installation

pip install HomoglyphsCJK==0.0.3

Usage

This package provides two functions: computing the homoglyph distance between two strings, and merging two dataframes on key columns using homoglyph distance.

The first time you use the package with a given language, the homoglyph dict for that language is downloaded automatically into the directory you run your script from. Make sure you run the script from a folder you have write permission for.
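
Since the download needs a writable working directory, a quick pre-flight check can save a confusing failure later. The snippet below is only a sketch using the standard library; `can_write_here` is a hypothetical helper name, not part of HomoglyphsCJK.

```python
import os


def can_write_here(directory="."):
    """Return True if `directory` exists and the current user may write to it.

    Hypothetical helper for illustration; it is not provided by HomoglyphsCJK.
    """
    return os.path.isdir(directory) and os.access(directory, os.W_OK)


if can_write_here():
    print("Current directory is writable; the homoglyph dict can be saved here.")
else:
    print("Current directory is not writable; switch to another folder first.")
```

If the check fails, changing into a writable folder (for example with `os.chdir`) before calling the package should be enough.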

Merge two dataframes. To use multiprocessing during the merge, set the parallel argument. Mac users probably want to use Python 3.7 for multiprocessing.

    import pandas as pd
    from homo import homoglyph_distance, homoglyph_merge, download_dict

    df_1 = pd.DataFrame(['苏萃乡', '办雄', '虐格给', '雪拉普岗'], columns=['ocred_text'])
    df_2 = pd.DataFrame(['雪拉普岗日', '小苏莽乡', '协雄', '唐格给'], columns=['truth_text'])

    # Merge the two dataframes. The homoglyph dict for the specified language
    # is downloaded automatically on the first run.

    # Run in parallel with a pool of 4 workers. If num_workers is not specified,
    # all available CPU cores are used.
    dataframe_merged = homoglyph_merge('zhs', df_1, df_2, 'ocred_text', 'truth_text',
                                       homo_lambda=1, insertion=1, deletion=1,
                                       parallel=True, num_workers=4)

    # Run without multiprocessing.
    dataframe_merged = homoglyph_merge('zhs', df_1, df_2, 'ocred_text', 'truth_text',
                                       homo_lambda=1, insertion=1, deletion=1)

    # Arguments, in order:
    #   lang: choose from 'zhs', 'zht', 'ja', 'ko'
    #   dataframe 1: the first dataframe
    #   dataframe 2: the second dataframe
    #   key from dataframe 1
    #   key from dataframe 2
    #   homo_lambda: weight on substitution homoglyph distance, default is 1
    #   insertion: weight on insertion cost, default is 1
    #   deletion: weight on deletion cost, default is 1

ocred_text	homo_matched_truth_text	homo_dist
苏萃乡	小苏莽乡	1.88
办雄	协雄	0.15
虐格给	唐格给	0.87
雪拉普岗	雪拉普岗日	1.0
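
The output above suggests that each key in the first dataframe is paired with its closest key in the second. A rough sketch of that nearest-match idea, in plain Python, is shown below. It is an assumption about the general approach, not the package's code: `nearest_match` and `stand_in_distance` are hypothetical names, and the metric is a `difflib` placeholder rather than the homoglyph distance, so the distances it produces differ from the `homo_dist` column.

```python
from difflib import SequenceMatcher


def stand_in_distance(a, b):
    """Placeholder metric (1 - difflib similarity), NOT the homoglyph distance."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def nearest_match(left_keys, right_keys, distance=stand_in_distance):
    """Pair every left key with the right key that minimises `distance`."""
    rows = []
    for left in left_keys:
        best = min(right_keys, key=lambda right: distance(left, right))
        rows.append((left, best, round(distance(left, best), 2)))
    return rows


ocred = ['苏萃乡', '办雄', '虐格给', '雪拉普岗']
truth = ['雪拉普岗日', '小苏莽乡', '协雄', '唐格给']

for ocred_text, matched_text, dist in nearest_match(ocred, truth):
    print(ocred_text, matched_text, dist)
```

Passing a different `distance` callable swaps the metric without touching the matching loop, which is where a homoglyph-aware distance would plug in.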

Homoglyph distance between two strings. The weights on substitution, insertion, and deletion each default to 1.

    from homo import homoglyph_distance, download_dict

    download_dict('zhs')  # fetch the Simplified Chinese homoglyph dict
    homoglyph_distance('苏萃乡', '小苏莽乡', homo_lambda=1, insertion=1, deletion=1)
    # 1.88
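
To make the three weights concrete, here is a minimal sketch of a weighted edit distance in which substituting one character for another costs `homo_lambda` times a per-pair visual dissimilarity. It assumes the package works along these lines, which the source does not confirm. The function name `weighted_edit_distance`, the `TOY_COSTS` table and its value, and the fallback cost of 1.0 for unlisted pairs are all made up for illustration.

```python
def weighted_edit_distance(a, b, sub_cost, homo_lambda=1.0, insertion=1.0, deletion=1.0):
    """Edit distance from `a` to `b` with per-pair substitution costs.

    `sub_cost` maps (char, char) pairs to a dissimilarity; pairs that are
    not listed fall back to 1.0 (an assumption made for this sketch).
    """
    # prev[j] holds the cost of turning the processed prefix of `a` into b[:j].
    prev = [j * insertion for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * deletion]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                substitute = prev[j - 1]
            else:
                pair = sub_cost.get((ca, cb), sub_cost.get((cb, ca), 1.0))
                substitute = prev[j - 1] + homo_lambda * pair
            curr.append(min(substitute,              # replace ca with cb
                            prev[j] + deletion,      # drop ca
                            curr[j - 1] + insertion))  # add cb
        prev = curr
    return prev[-1]


# Made-up dissimilarity for one look-alike pair; real values would come
# from the downloaded homoglyph dict.
TOY_COSTS = {('办', '协'): 0.15}

print(weighted_edit_distance('办雄', '协雄', TOY_COSTS))      # 0.15
print(weighted_edit_distance('雪拉普岗', '雪拉普岗日', TOY_COSTS))  # 1.0
```

Raising `insertion` or `deletion` penalises length differences more heavily, while lowering `homo_lambda` makes visually similar substitutions cheaper relative to them.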

Release history

Version	Release date
0.1.6	Oct 13, 2023
0.1.5	Jun 22, 2023
0.1.4	Jun 6, 2023
0.1.3	May 27, 2023
0.1.2	May 27, 2023
0.1.1	May 25, 2023
0.1.0	May 25, 2023
0.0.9	May 22, 2023
0.0.8	May 22, 2023
0.0.7	May 22, 2023
0.0.6	May 22, 2023
0.0.5	May 22, 2023
0.0.4 (this version)	May 22, 2023
0.0.3	May 22, 2023
0.0.2	May 22, 2023
0.0.1	May 22, 2023

Download files

Source Distribution

HomoglyphsCJK-0.0.4.tar.gz (5.8 kB)

Uploaded May 22, 2023 Source

Hashes for HomoglyphsCJK-0.0.4.tar.gz
Algorithm	Hash digest
SHA256	`aafa48b3a5e1c4ba9694f3b217fdd6d68a0aa5e24e3651e8dc0b9993725d32d6`
MD5	`3f7d665ed18b74be31e5d76ba3925669`
BLAKE2b-256	`c0f9c9cbc5eb252fc033fbb7e3c6cdc1c65bc800e4bd30f2f37028ffe7435882`