
kuro2sudachi


kuro2sudachi converts a kuromoji user dictionary into a Sudachi user dictionary.

Usage

$ pip install kuro2sudachi
$ kuro2sudachi kuromoji_dict.txt -o sudachi_user_dict.txt

Custom POS conversion dictionary

You can override the default conversion config by supplying a JSON file with the -c option.

{
    "固有名詞": {
        "sudachi_pos": "名詞,固有名詞,一般,*,*,*",
        "left_id": 4786,
        "right_id": 4786,
        "cost": 5000
    },
    "名詞": {
        "sudachi_pos": "名詞,普通名詞,一般,*,*,*",
        "left_id": 5146,
        "right_id": 5146,
        "cost": 5000
    }
}
$ kuro2sudachi kuromoji_dict.txt -o sudachi_user_dict.txt -c kuro2sudachi.json
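A malformed config is easier to catch before running the CLI. The following is a minimal sketch and not part of kuro2sudachi: the helper name `validate_config` and the set of required keys are assumptions inferred from the example config above, not a documented schema.

```python
import json

# Keys that every entry in the example config carries.
# Treating them as required is an assumption, not a documented rule.
REQUIRED_KEYS = {"sudachi_pos", "left_id", "right_id", "cost"}


def validate_config(text: str) -> list[str]:
    """Return a list of problems found in a kuro2sudachi-style JSON config."""
    problems = []
    config = json.loads(text)
    for kuromoji_pos, entry in config.items():
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"{kuromoji_pos}: missing {sorted(missing)}")
        # The example uses six comma-separated POS levels per entry.
        pos = entry.get("sudachi_pos", "")
        if pos and len(pos.split(",")) != 6:
            problems.append(f"{kuromoji_pos}: sudachi_pos should have 6 fields")
    return problems


EXAMPLE = """
{
    "固有名詞": {
        "sudachi_pos": "名詞,固有名詞,一般,*,*,*",
        "left_id": 4786,
        "right_id": 4786,
        "cost": 5000
    },
    "名詞": {
        "sudachi_pos": "名詞,普通名詞,一般,*,*,*",
        "left_id": 5146,
        "right_id": 5146,
        "cost": 5000
    }
}
"""

if __name__ == "__main__":
    for problem in validate_config(EXAMPLE):
        print(problem)
```

An empty result means every entry has the four keys and a six-level POS string; it does not check that the connection IDs exist in the target Sudachi dictionary.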

To skip entries with an unsupported POS or an invalid format instead of raising an error, use the --ignore flag.

Dictionary type

You can specify the Sudachi dictionary used for tokenization with the -s option (default: core).

$ pip install sudachidict_full
$ kuro2sudachi kuromoji_dict.txt -o sudachi_user_dict.txt -s full

Auto Splitting

kuro2sudachi supports auto splitting. Set split_mode and unit_div_mode in the config:

{
    "名詞": {
        "sudachi_pos": "名詞,普通名詞,一般,*,*,*",
        "left_id": 5146,
        "right_id": 5146,
        "cost": 5000,
        "split_mode": "C",
        "unit_div_mode": [
            "A", "B"
        ]
    }
}

The output then includes unit division info.

$ cat kuromoji_dict.txt
融合たんぱく質,融合たんぱく質,融合たんぱく質,名詞
発作性心房細動,発作性心房細動,発作性心房細動,名詞

$ kuro2sudachi kuromoji_dict.txt -o sudachi_user_dict.txt -c kuro2sudachi.json --ignore

$ cat sudachi_user_dict.txt
融合たんぱく質,4786,4786,5000,融合たんぱく質,名詞,普通名詞,一般,*,*,*,,融合たんぱく質,*,C,*,660881/810248,*
発作性心房細動,4786,4786,5000,発作性心房細動,名詞,普通名詞,一般,*,*,*,,発作性心房細動,*,C,584006/434835/428494/619020,2756385/428494/619020,*
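To inspect the division info in the generated rows, the 18 comma-separated fields can be given names. This is a sketch only: the column labels below are illustrative names following the layout described in the Sudachi user dictionary documentation, and `parse_row` and `split_ids` are hypothetical helpers, not part of kuro2sudachi.

```python
import csv

# Illustrative labels for the 18 columns of a Sudachi user dictionary row.
COLUMNS = [
    "surface", "left_id", "right_id", "cost", "display_surface",
    "pos1", "pos2", "pos3", "pos4", "pos5", "pos6",
    "reading", "normalized_form", "dictionary_form_id",
    "split_type", "a_unit_split", "b_unit_split", "unused",
]


def parse_row(line: str) -> dict:
    """Map one CSV row of the output file onto the column labels."""
    fields = next(csv.reader([line]))
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    return dict(zip(COLUMNS, fields))


def split_ids(value: str) -> list[int]:
    """Turn '660881/810248' into [660881, 810248]; '*' means no split."""
    return [] if value == "*" else [int(part) for part in value.split("/")]


ROW = ("融合たんぱく質,4786,4786,5000,融合たんぱく質,名詞,普通名詞,一般,"
       "*,*,*,,融合たんぱく質,*,C,*,660881/810248,*")

if __name__ == "__main__":
    row = parse_row(ROW)
    print(row["split_type"], split_ids(row["a_unit_split"]),
          split_ids(row["b_unit_split"]))
```

In the first sample row only the B-unit column carries word IDs, while the second row has IDs in both the A-unit and B-unit columns.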

Splitting Words defined by kuromoji

Currently, the CLI does not support word splitting defined by kuromoji, so the split representation in a kuromoji entry is ignored: the segmentation column is dropped and the space-separated readings are joined.

中咽頭ガン,中咽頭 ガン,チュウイントウ ガン,カスタム名詞
↓
中咽頭ガン,4786,4786,7000,中咽頭ガン,名詞,固有名詞,一般,*,*,*,チュウイントウガン,中咽頭ガン,*,*,*,*,*
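The mapping shown above can be reproduced with a few lines of Python. This is a rough sketch of the visible behaviour, not the tool's actual implementation: `convert_line` is a hypothetical helper, and the `カスタム名詞` config values are assumptions copied from the sample output line.

```python
# Hypothetical POS mapping; the values mirror the sample output above.
POS_CONFIG = {
    "カスタム名詞": {
        "sudachi_pos": "名詞,固有名詞,一般,*,*,*",
        "left_id": 4786,
        "right_id": 4786,
        "cost": 7000,
    }
}


def convert_line(line: str, pos_config: dict) -> str:
    """Convert one kuromoji user dict line, ignoring its segmentation."""
    surface, _segmentation, reading, pos = line.strip().split(",")
    entry = pos_config[pos]
    # Join the space-separated readings into a single reading.
    joined_reading = "".join(reading.split())
    fields = [
        surface,
        str(entry["left_id"]),
        str(entry["right_id"]),
        str(entry["cost"]),
        surface,
        entry["sudachi_pos"],
        joined_reading,
        surface,
        "*", "*", "*", "*", "*",  # no dictionary form ID or split info
    ]
    return ",".join(fields)


if __name__ == "__main__":
    print(convert_line("中咽頭ガン,中咽頭 ガン,チュウイントウ ガン,カスタム名詞", POS_CONFIG))
```

The segmentation column is parsed but deliberately unused, which matches the statement that the kuromoji split representation is ignored.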

For Developers

Run the tests:

$ poetry install
$ poetry run pytest

Run the kuro2sudachi command:

$ poetry run kuro2sudachi tests/kuromoji_dict_test.txt -o sudachi_user_dict.txt
