A small package for fuzzy matching of Chinese words

fuzzychinese

Fuzzy matching for visually similar Chinese words

A simple tool for fuzzy matching of Chinese words, particularly useful for matching proper names and addresses.

Installation

pip install fuzzychinese

Quickstart

First, fit a model on the target list of words you want to match against.

Then use FuzzyChineseMatch.transform(raw_words, n) to find the top n most similar words in the target list for each word in raw_words.

There are three analyzers to choose from when training a model: stroke, radical, and char. You can also change ngram_range to fine-tune the model.

After matching, you can retrieve the matched words, their similarity scores, and their indices in the original target list.

    import pandas as pd
    from fuzzychinese import FuzzyChineseMatch

    # Target list to match against, and the raw words to look up.
    test_dict = pd.Series(['长白朝鲜族自治县', '长阳土家族自治县', '城步苗族自治县', '达尔罕茂明安联合旗', '汨罗市'])
    raw_word = pd.Series(['达茂联合旗', '长阳县', '汩罗市'])
    assert '汩罗市' != '汨罗市'  # 汩 and 汨 are different characters!

    # Fit on the target list using stroke 3-grams, then fetch the top 2 matches.
    fcm = FuzzyChineseMatch(ngram_range=(3, 3), analyzer='stroke')
    fcm.fit(test_dict)
    top2_similar = fcm.transform(raw_word, n=2)

    # Combine raw words, matches, scores, and target indices into one table.
    res = pd.concat([
        raw_word,
        pd.DataFrame(top2_similar, columns=['top1', 'top2']),
        pd.DataFrame(fcm.get_similarity_score(), columns=['top1_score', 'top2_score']),
        pd.DataFrame(fcm.get_index(), columns=['top1_index', 'top2_index']),
    ], axis=1)
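
| raw_word | top1 | top2 | top1_score | top2_score | top1_index | top2_index |
|---|---|---|---|---|---|---|
| 达茂联合旗 | 达尔罕茂明安联合旗 | 长白朝鲜族自治县 | 0.824751 | 0.287237 | 3 | 0 |
| 长阳县 | 长阳土家族自治县 | 长白朝鲜族自治县 | 0.610285 | 0.475000 | 1 | 0 |
| 汩罗市 | 汨罗市 | 长白朝鲜族自治县 | 1.000000 | 0.152093 | 4 | 0 |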
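
To make the roles of the analyzer and ngram_range more concrete, here is a minimal, library-independent sketch of n-gram matching. It assumes a plain bag of character n-grams compared by cosine similarity, which roughly corresponds to the idea behind the char analyzer. It is an illustration only, not fuzzychinese's actual implementation, and the helper names `ngrams`, `cosine`, and `top_n` are hypothetical.

```python
from collections import Counter
from math import sqrt


def ngrams(text, n_min, n_max):
    """Count all contiguous substrings of length n_min..n_max in text."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_n(raw_word, targets, n, ngram_range=(1, 2)):
    """Return the n targets most similar to raw_word as (index, word, score)."""
    query = ngrams(raw_word, *ngram_range)
    scored = [(i, w, cosine(query, ngrams(w, *ngram_range)))
              for i, w in enumerate(targets)]
    return sorted(scored, key=lambda t: t[2], reverse=True)[:n]


targets = ['长白朝鲜族自治县', '长阳土家族自治县', '城步苗族自治县', '达尔罕茂明安联合旗', '汨罗市']
for word in ['达茂联合旗', '长阳县', '汩罗市']:
    print(word, top_n(word, targets, n=2))
```

At the character level, 汩罗市 and 汨罗市 share only 罗, 市, and 罗市, so this sketch scores the pair at about 0.6. The stroke analyzer in the example above reports 1.000000 for the same pair, presumably because 汩 and 汨 decompose into the same stroke sequence, which is the kind of visual similarity that character-level matching misses.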

Other uses

  • Use Stroke and Radical directly to decompose a Chinese character into strokes or radicals.

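    from fuzzychinese import Stroke, Radical

    stroke = Stroke()
    radical = Radical()
    print("像", stroke.get_stroke("像"))    # stroke sequence of the character
    print("像", radical.get_radical("像"))  # radicals of the character

    # Output:
    # 像 ㇒〡㇒㇇〡㇕一㇒㇁㇒㇒㇒㇏
    # 像 人象
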
    
  • Use FuzzyChineseMatch.compare_two_columns(X, Y) to compare the pair of words in each row and get a similarity score for each pair.

  • See the documentation for details.
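
The effect of decomposition on a row-wise comparison can be pictured with a small, library-independent sketch. It assumes a tiny hand-written decomposition table (`DECOMP`, hypothetical and far from complete; only the entry for 像 is taken from the output above) and swaps in `difflib.SequenceMatcher` from the standard library as the similarity measure. The helper names `decompose` and `compare_rows` are also hypothetical, and none of this reflects how fuzzychinese computes its scores.

```python
from difflib import SequenceMatcher

# Hypothetical, hand-written decomposition table, for illustration only.
DECOMP = {'汩': '氵曰', '汨': '氵日', '像': '人象'}


def decompose(word):
    """Replace each character by its components; unknown characters stay as-is."""
    return ''.join(DECOMP.get(ch, ch) for ch in word)


def compare_rows(xs, ys):
    """Score each (x, y) pair, row by row, on the decomposed strings."""
    return [SequenceMatcher(None, decompose(x), decompose(y)).ratio()
            for x, y in zip(xs, ys)]


X = ['汩罗市', '像', '长阳县']
Y = ['汨罗市', '象', '长阳县']
for x, y, score in zip(X, Y, compare_rows(X, Y)):
    print(x, y, round(score, 3))
```

Comparing decomposed strings lets 汩 and 汨 contribute their shared component 氵 to the score, whereas a comparison on whole characters would treat them as entirely different.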

Credits

Data for Chinese radicals are from 漢語拆字字典 by 開放詞典網.
