Project description

fuzzychinese

Fuzzy matching for visually similar Chinese words

A simple tool for fuzzy matching Chinese words, particularly useful for matching proper names and addresses.

Installation

pip install fuzzychinese

Quickstart

First, fit a model on the list of target words you want to match against.

Then use FuzzyChineseMatch.transform(raw_words, n) to find the top n most similar target words for each word in raw_words.

There are two analyzers to choose from when training a model: stroke and character. You can also change ngram_range to fine-tune the model.

    import pandas as pd
    from fuzzychinese import FuzzyChineseMatch

    # Target words to match against
    test_dict = pd.Series(['长白朝鲜族自治县', '长阳土家族自治县', '城步苗族自治县', '达尔罕茂明安联合旗', '汨罗市'])
    # Raw words to look up
    raw_word = pd.Series(['达茂联合旗', '长阳县', '汩罗市'])
    assert '汩罗市' != '汨罗市'  # 汩 and 汨 are different characters

    # Fit a model with the stroke analyzer
    fcm = FuzzyChineseMatch(ngram_range=(3, 3), analyzer='stroke')
    fcm.fit(test_dict)
    # Top 2 most similar target words for each raw word
    top2_similar = fcm.transform(raw_word, n=2)
    # Combine raw words, matches, similarity scores, and target indices
    res = pd.concat([
        raw_word,
        pd.DataFrame(top2_similar, columns=['top1', 'top2']),
        pd.DataFrame(
            fcm.get_similarity_score(),
            columns=['top1_score', 'top2_score']),
        pd.DataFrame(
            fcm.get_index(),
            columns=['top1_index', 'top2_index']),
    ], axis=1)

raw_word	top1	top2	top1_score	top2_score	top1_index
达茂联合旗	达尔罕茂明安联合旗	长白朝鲜族自治县	0.824751	0.287237	3
长阳县	长阳土家族自治县	长白朝鲜族自治县	0.610285	0.475000	1
汩罗市	汨罗市	长白朝鲜族自治县	1.000000	0.152093	4
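
To give a rough feel for what n-gram matching at the character level does, here is a minimal standalone sketch using only the standard library. It is not the library's implementation: the helper names `char_ngrams`, `cosine_similarity`, and `top_n` are hypothetical, and plain n-gram counts with cosine similarity are an assumed simplification of whatever weighting fuzzychinese applies internally.

```python
from collections import Counter
from math import sqrt


def char_ngrams(word, n):
    """Return the list of contiguous character n-grams in word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def cosine_similarity(a, b, n=1):
    """Cosine similarity between the n-gram count vectors of a and b."""
    va, vb = Counter(char_ngrams(a, n)), Counter(char_ngrams(b, n))
    dot = sum(va[g] * vb[g] for g in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def top_n(raw, targets, n=2, ngram=1):
    """Rank targets by similarity to raw and return the best n as (word, score) pairs."""
    scored = [(t, cosine_similarity(raw, t, ngram)) for t in targets]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


targets = ['长白朝鲜族自治县', '长阳土家族自治县', '城步苗族自治县', '达尔罕茂明安联合旗', '汨罗市']
best = top_n('长阳县', targets, n=2, ngram=1)
typo_score = cosine_similarity('汩罗市', '汨罗市', n=1)
```

In this sketch, `best` ranks 长阳土家族自治县 first and 长白朝鲜族自治县 second, the same order as the table above. The look-alike pair 汩罗市 / 汨罗市 only reaches about 0.67, because a character-level comparison treats 汩 and 汨 as unrelated symbols. The table above reports a score of 1.0 for the same pair with analyzer='stroke', which suggests why the stroke analyzer is the better choice for catching visually similar typos.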

Project details

Intended Audience
- Developers
License
- OSI Approved :: BSD License
Natural Language
- Chinese (Simplified)
- Chinese (Traditional)
Operating System
- OS Independent
Programming Language
- Python :: 3
Topic
- Text Processing
- Text Processing :: Linguistic

Release history

- 0.1.5 (Apr 29, 2019)
- 0.1.4 (Apr 18, 2019), this version
- 0.1.3 (Apr 16, 2019)

Download files

Source Distribution

fuzzychinese-0.1.4.tar.gz (155.9 kB)

Uploaded Apr 18, 2019 Source

Built Distribution

fuzzychinese-0.1.4-py3-none-any.whl (170.5 kB)

Uploaded Apr 18, 2019 Python 3

Hashes for fuzzychinese-0.1.4.tar.gz
Algorithm	Hash digest
SHA256	`d7fb4a12dacfc280b3d0df3472fb89e13ff3a98e0feff28456a22d5014744039`
MD5	`cb484ceb7809ef47082c700c452e2a19`
BLAKE2b-256	`43e2281dda9d5f791eee44fe42eceec8fc27fd76929f9eaa41d9f3e7cc763832`

Hashes for fuzzychinese-0.1.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`74c320f91c6cfe62f07a766fdec9526b49ece5afa2f0e38c83743e6d6e33387c`
MD5	`8f3842cda13b3500949a96970e52f6db`
BLAKE2b-256	`c4da200767c139c205e2fd691e242e2a201ea6b54fbbf25c1adecab9fa6e540f`
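
The published digests can be used to check a downloaded file. The following is a minimal sketch using Python's standard `hashlib` module. The helper names `sha256_of_file` and `matches_published_hash` are hypothetical, and the file path is assumed to point at a copy you downloaded yourself.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        # Read fixed-size chunks so large files are never loaded whole.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 digest with a published digest, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, assuming the source distribution was saved to the current directory, `matches_published_hash('fuzzychinese-0.1.4.tar.gz', 'd7fb4a12dacfc280b3d0df3472fb89e13ff3a98e0feff28456a22d5014744039')` should return True for an intact download.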