kuro2sudachi
kuro2sudachi lets you convert a kuromoji user dictionary to a Sudachi user dictionary.
Usage
$ pip install kuro2sudachi
$ kuro2sudachi kuromoji_dict.txt -o sudachi_user_dict.txt
Custom pos convert dict
You can override the default conversion settings by supplying a JSON file with the -c option.
{
"固有名詞": {
"sudachi_pos": "名詞,固有名詞,一般,*,*,*",
"left_id": 4786,
"right_id": 4786,
"cost": 5000
},
"名詞": {
"sudachi_pos": "名詞,普通名詞,一般,*,*,*",
"left_id": 5146,
"right_id": 5146,
"cost": 5000
}
}
$ kuro2sudachi kuromoji_dict.txt -o sudachi_user_dict.txt -c kuro2sudachi.json
If you want to ignore unsupported POS errors and invalid-format errors, use the --ignore flag.
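As a rough illustration of the config format above, the following Python sketch builds the same mapping, checks each entry, and prints it as JSON. The helper names `make_entry` and `validate_config` are hypothetical and are not part of kuro2sudachi. Treating the four keys as required is an assumption based only on the example above.

```python
import json

# Keys carried by every entry in the example config above.
# Treating them as required is an assumption made for this sketch.
REQUIRED_KEYS = {"sudachi_pos", "left_id", "right_id", "cost"}


def make_entry(sudachi_pos, connection_id, cost=5000):
    """Build one config entry that uses the same ID on both sides (hypothetical helper)."""
    return {
        "sudachi_pos": sudachi_pos,
        "left_id": connection_id,
        "right_id": connection_id,
        "cost": cost,
    }


def validate_config(config):
    """Raise ValueError if an entry lacks a key or its POS does not have six fields."""
    for kuromoji_pos, entry in config.items():
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"{kuromoji_pos}: missing keys {sorted(missing)}")
        # A Sudachi POS is written as six comma-separated fields.
        if len(entry["sudachi_pos"].split(",")) != 6:
            raise ValueError(f"{kuromoji_pos}: sudachi_pos must have six fields")


# The kuromoji POS tag is the key; the Sudachi settings are the value.
config = {
    "固有名詞": make_entry("名詞,固有名詞,一般,*,*,*", 4786),
    "名詞": make_entry("名詞,普通名詞,一般,*,*,*", 5146),
}
validate_config(config)

if __name__ == "__main__":
    # ensure_ascii=False keeps the Japanese keys readable in the output.
    print(json.dumps(config, ensure_ascii=False, indent=2))
```

Saving the printed output as kuro2sudachi.json gives a file that can be passed with -c as shown above.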
Dictionary type
You can specify the Sudachi dictionary used for tokenization with the -s option (default: core).
$ pip install sudachidict_full
$ kuro2sudachi kuromoji_dict.txt -o sudachi_user_dict.txt -s full
Auto Splitting
kuro2sudachi supports auto splitting.
{
"名詞": {
"sudachi_pos": "名詞,普通名詞,一般,*,*,*",
"left_id": 5146,
"right_id": 5146,
"cost": 5000,
"split_mode": "C",
"unit_div_mode": [
"A", "B"
]
}
}
The output includes unit division info.
$ cat kuromoji_dict.txt
融合たんぱく質,融合たんぱく質,融合たんぱく質,名詞
発作性心房細動,発作性心房細動,発作性心房細動,名詞
$ kuro2sudachi kuromoji_dict.txt -o sudachi_user_dict.txt -c kuro2sudachi.json --ignore
$ cat sudachi_user_dict.txt
融合たんぱく質,4786,4786,5000,融合たんぱく質,名詞,普通名詞,一般,*,*,*,,融合たんぱく質,*,C,*,660881/810248,*
発作性心房細動,4786,4786,5000,発作性心房細動,名詞,普通名詞,一般,*,*,*,,発作性心房細動,*,C,584006/434835/428494/619020,2756385/428494/619020,*
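To make the output rows above easier to read, here is a minimal Python sketch that splits one row into named fields. It assumes the standard 18-column Sudachi user dictionary layout, where column 14 holds the split type and columns 15 and 16 hold the A-unit and B-unit split info. The names `SudachiEntry`, `parse_split`, and `parse_row` are hypothetical and are not part of kuro2sudachi.

```python
import csv
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SudachiEntry:
    surface: str
    left_id: int
    right_id: int
    cost: int
    pos: Tuple[str, ...]
    reading: str
    normalized: str
    split_type: str
    a_unit: List[str]
    b_unit: List[str]


def parse_split(field):
    """Return the '/'-separated references of a split column, or [] for '*'."""
    return [] if field == "*" else field.split("/")


def parse_row(line):
    """Parse one row of a Sudachi user dictionary CSV (18 columns assumed)."""
    fields = next(csv.reader([line.strip()]))
    if len(fields) != 18:
        raise ValueError(f"expected 18 columns, got {len(fields)}")
    return SudachiEntry(
        surface=fields[4],          # headword as displayed
        left_id=int(fields[1]),
        right_id=int(fields[2]),
        cost=int(fields[3]),
        pos=tuple(fields[5:11]),    # six POS fields
        reading=fields[11],
        normalized=fields[12],
        split_type=fields[14],
        a_unit=parse_split(fields[15]),
        b_unit=parse_split(fields[16]),
    )


row = ("発作性心房細動,4786,4786,5000,発作性心房細動,名詞,普通名詞,一般,*,*,*,,"
       "発作性心房細動,*,C,584006/434835/428494/619020,2756385/428494/619020,*")
entry = parse_row(row)

if __name__ == "__main__":
    print(entry.split_type, entry.a_unit, entry.b_unit)
```

The split references are kept as strings because the sketch does not try to resolve them against a dictionary.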
Splitting Words defined by kuromoji
Currently, the CLI does not support word splitting defined by kuromoji. Therefore, the split representation of kuromoji is ignored.
中咽頭ガン,中咽頭 ガン,チュウイントウ ガン,カスタム名詞
↓
中咽頭ガン,4786,4786,7000,中咽頭ガン,名詞,固有名詞,一般,*,*,*,チュウイントウガン,中咽頭ガン,*,*,*,*,*
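The effect shown above can be sketched in Python as follows. This is only an illustration of what "ignored" means for the example row, not the tool's actual implementation: the space-separated segmentation is discarded and the space-separated readings are joined into one reading. The function name `flatten_kuromoji_row` is hypothetical, and the sketch assumes a simple four-column kuromoji row with no quoted commas.

```python
def flatten_kuromoji_row(line):
    """Split a kuromoji user dictionary row and collapse its split representation."""
    # kuromoji user dictionary columns: surface, segmentation, readings, POS tag
    surface, segmentation, readings, pos = line.strip().split(",")
    return {
        "surface": surface,
        # The segmentation ("中咽頭 ガン") is parsed but not carried over.
        "ignored_segments": segmentation.split(" "),
        # The per-segment readings are joined into a single reading.
        "reading": "".join(readings.split(" ")),
        "pos": pos,
    }


flat = flatten_kuromoji_row("中咽頭ガン,中咽頭 ガン,チュウイントウ ガン,カスタム名詞")

if __name__ == "__main__":
    print(flat["surface"], flat["reading"], flat["pos"])
```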
For Developers
Run the tests
$ poetry install
$ poetry run pytest
Run the kuro2sudachi command
$ poetry run kuro2sudachi tests/kuromoji_dict_test.txt -o sudachi_user_dict.txt