kuro2sudachi

kuro2sudachi converts a kuromoji user dictionary into a Sudachi user dictionary.

Usage

$ pip install kuro2sudachi
$ kuro2sudachi kuromoji_dict.txt -o sudachi_user_dict.txt

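Custom POS conversion dict

You can override the default conversion config with a JSON settings file. Each key is a kuromoji part-of-speech tag, and its value gives the Sudachi POS, connection IDs, and cost to emit for that tag.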

{
    "固有名詞": {
        "sudachi_pos": "名詞,固有名詞,一般,*,*,*",
        "left_id": 4786,
        "right_id": 4786,
        "cost": 5000
    },
    "名詞": {
        "sudachi_pos": "名詞,普通名詞,一般,*,*,*",
        "left_id": 5146,
        "right_id": 5146,
        "cost": 5000
    }
}
$ kuro2sudachi kuromoji_dict.txt -o sudachi_user_dict.txt -c kuro2sudachi.json

To ignore unsupported POS errors and invalid-format lines, use the --ignore flag.
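If the mapping is maintained programmatically, the config file can be generated rather than written by hand. The following is a minimal sketch using only the Python standard library; the helper name write_config is hypothetical and not part of kuro2sudachi, and the IDs and cost are simply copied from the example above.

```python
import json
from pathlib import Path

# Mapping from kuromoji POS tag to Sudachi settings, mirroring the JSON above.
# The IDs and cost are copied from the README example, not looked up.
config = {
    "固有名詞": {
        "sudachi_pos": "名詞,固有名詞,一般,*,*,*",
        "left_id": 4786,
        "right_id": 4786,
        "cost": 5000,
    },
    "名詞": {
        "sudachi_pos": "名詞,普通名詞,一般,*,*,*",
        "left_id": 5146,
        "right_id": 5146,
        "cost": 5000,
    },
}


def write_config(path, mapping):
    """Write the mapping as UTF-8 JSON, keeping the Japanese keys readable."""
    text = json.dumps(mapping, ensure_ascii=False, indent=4)
    Path(path).write_text(text + "\n", encoding="utf-8")
    return text


if __name__ == "__main__":
    write_config("kuro2sudachi.json", config)
```

The resulting file can then be passed to the CLI with -c kuro2sudachi.json as shown above.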

Dictionary type

You can choose the Sudachi dictionary used for tokenization with the -s option (default: core).

$ pip install sudachidict_full
$ kuro2sudachi kuromoji_dict.txt -o sudachi_user_dict.txt -s full

Auto Splitting

kuro2sudachi supports auto splitting. Set split_mode and unit_div_mode in the config entry for a POS tag:

{
    "名詞": {
        "sudachi_pos": "名詞,普通名詞,一般,*,*,*",
        "left_id": 5146,
        "right_id": 5146,
        "cost": 5000,
        "split_mode": "C",
        "unit_div_mode": [
            "A", "B"
        ]
    }
}

The output then includes unit division info.

$ cat kuromoji_dict.txt
融合たんぱく質,融合たんぱく質,融合たんぱく質,名詞
発作性心房細動,発作性心房細動,発作性心房細動,名詞

$ kuro2sudachi kuromoji_dict.txt -o sudachi_user_dict.txt -c kuro2sudachi.json --ignore

$ cat sudachi_user_dict.txt
融合たんぱく質,4786,4786,5000,融合たんぱく質,名詞,普通名詞,一般,*,*,*,,融合たんぱく質,*,C,*,660881/810248,*
発作性心房細動,4786,4786,5000,発作性心房細動,名詞,普通名詞,一般,*,*,*,,発作性心房細動,*,C,584006/434835/428494/619020,2756385/428494/619020,*
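To inspect the unit division info in a generated line, the split columns can be pulled out with the standard csv module. This is a sketch under an assumption the README does not state: that the output follows the 18-column Sudachi user dictionary CSV layout, with the split type and the A-unit and B-unit split info at 0-based columns 14, 15 and 16. The function names are hypothetical.

```python
import csv
import io

# Assumed 0-based column positions in the 18-column Sudachi user dictionary CSV.
SPLIT_TYPE, A_UNIT_SPLIT, B_UNIT_SPLIT = 14, 15, 16


def unit_ids(field):
    """Return the slash-separated entries of a split column ("*" means none)."""
    return [] if field == "*" else field.split("/")


def split_info(line):
    """Extract the surface form and split columns from one output line."""
    row = next(csv.reader(io.StringIO(line)))
    if len(row) != 18:
        raise ValueError(f"expected 18 columns, got {len(row)}")
    return {
        "surface": row[0],
        "split_type": row[SPLIT_TYPE],
        "a_units": unit_ids(row[A_UNIT_SPLIT]),
        "b_units": unit_ids(row[B_UNIT_SPLIT]),
    }


line = (
    "発作性心房細動,4786,4786,5000,発作性心房細動,名詞,普通名詞,一般,*,*,*,,"
    "発作性心房細動,*,C,584006/434835/428494/619020,2756385/428494/619020,*"
)
info = split_info(line)
print(info["split_type"], len(info["a_units"]), len(info["b_units"]))  # C 4 3
```

The entries are kept as strings rather than converted to integers, so the sketch makes no assumption about what form the identifiers take.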

Splitting Words defined by kuromoji

Currently, the CLI does not support word splitting defined by kuromoji. Therefore, the split representation of kuromoji is ignored.

中咽頭ガン,中咽頭 ガン,チュウイントウ ガン,カスタム名詞
↓
中咽頭ガン,4786,4786,7000,中咽頭ガン,名詞,固有名詞,一般,*,*,*,チュウイントウガン,中咽頭ガン,*,*,*,*,*
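The effect on the example above can be illustrated with a few lines of Python. This is only a sketch of the flattening visible in that example (the space-separated segments are dropped and the reading is joined into one string); it is not kuro2sudachi's own code, the function name is hypothetical, and it does not reproduce how the tool decides whether to emit a reading at all.

```python
def flatten_kuromoji_entry(line):
    """Parse a 4-column kuromoji user dict line and join its split reading."""
    surface, segmentation, readings, pos = line.strip().split(",")
    return {
        "surface": surface,
        # The kuromoji segmentation is parsed here only to show what is discarded.
        "ignored_segments": segmentation.split(),
        # Joining the per-segment readings gives a single reading for the word.
        "reading": "".join(readings.split()),
        "pos": pos,
    }


entry = flatten_kuromoji_entry("中咽頭ガン,中咽頭 ガン,チュウイントウ ガン,カスタム名詞")
print(entry["surface"], entry["reading"])  # 中咽頭ガン チュウイントウガン
```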

For Developers

Run the tests:

$ poetry install
$ poetry run pytest

Run the kuro2sudachi command:

$ poetry run kuro2sudachi tests/kuromoji_dict_test.txt -o sudachi_user_dict.txt
