kuro2sudachi
kuro2sudachi lets you convert a kuromoji user dictionary to a Sudachi user dictionary.
Usage
$ pip install kuro2sudachi
$ kuro2sudachi kuromoji_dict.txt -o sudachi_user_dict.txt
Custom POS conversion dictionary
You can override the default conversion config by supplying a JSON file.
{
"固有名詞": {
"sudachi_pos": "名詞,固有名詞,一般,*,*,*",
"left_id": 4786,
"right_id": 4786,
"cost": 5000
},
"名詞": {
"sudachi_pos": "名詞,普通名詞,一般,*,*,*",
"left_id": 5146,
"right_id": 5146,
"cost": 5000
}
}
$ kuro2sudachi kuromoji_dict.txt -o sudachi_user_dict.txt -c convert_config.json
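A config like the one above can also be generated from a script. The following is a minimal sketch using only the Python standard library. The helper name write_convert_config is hypothetical and not part of kuro2sudachi, and the six-level check simply reflects the shape of the sudachi_pos strings shown above.

```python
import json
from pathlib import Path

# Mapping from kuromoji POS labels to Sudachi settings,
# copied from the example config above.
CONVERT_CONFIG = {
    "固有名詞": {
        "sudachi_pos": "名詞,固有名詞,一般,*,*,*",
        "left_id": 4786,
        "right_id": 4786,
        "cost": 5000,
    },
    "名詞": {
        "sudachi_pos": "名詞,普通名詞,一般,*,*,*",
        "left_id": 5146,
        "right_id": 5146,
        "cost": 5000,
    },
}


def write_convert_config(path, config=CONVERT_CONFIG):
    """Write the config as UTF-8 JSON (hypothetical helper, not part of kuro2sudachi)."""
    for pos, entry in config.items():
        # Each sudachi_pos string in the example has six comma-separated levels.
        if len(entry["sudachi_pos"].split(",")) != 6:
            raise ValueError(f"sudachi_pos for {pos!r} must have six levels")
    out = Path(path)
    # ensure_ascii=False keeps the Japanese keys readable in the file.
    out.write_text(json.dumps(config, ensure_ascii=False, indent=2), encoding="utf-8")
    return out
```

Calling write_convert_config("convert_config.json") produces a file that can be passed to the -c option shown above.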
To ignore unsupported-POS errors and invalid-format errors, use the --ignore flag.
Dictionary type
You can specify the Sudachi dictionary used for tokenization with the -s option (default: core).
$ pip install sudachidict_full
$ kuro2sudachi kuromoji_dict.txt -o sudachi_user_dict.txt -s full
Auto Splitting
kuro2sudachi supports auto splitting. Set split_mode and unit_div_mode in the convert config.
{
"名詞": {
"sudachi_pos": "名詞,普通名詞,一般,*,*,*",
"left_id": 5146,
"right_id": 5146,
"cost": 5000,
"split_mode": "C",
"unit_div_mode": [
"A", "B"
]
}
}
The output includes unit division info.
$ cat kuromoji_dict.txt
融合たんぱく質,融合たんぱく質,融合たんぱく質,名詞
発作性心房細動,発作性心房細動,発作性心房細動,名詞
$ kuro2sudachi kuromoji_dict.txt -o sudachi_user_dict.txt -c convert_config.json --ignore
$ cat sudachi_user_dict.txt
融合たんぱく質,4786,4786,5000,融合たんぱく質,名詞,普通名詞,一般,*,*,*,,融合たんぱく質,*,C,"融合,名詞,普通名詞,サ変可能,*,*,*,ユウゴウ/たんぱく,名詞,普通名詞,一般,*,*,*,タンパク/質,接尾辞,名詞的,一般,*,*,*,シツ","融合,名詞,普通名詞,サ変可能,*,*,*,ユウゴウ/たんぱく質,名詞,普通名詞,一般,*,*,*,タンパクシツ",*
発作性心房細動,4786,4786,5000,発作性心房細動,名詞,普通名詞,一般,*,*,*,,発作性心房細動,*,C,"発作,名詞,普通名詞,一般,*,*,*,ホッサ/性,接尾辞,名詞的,一般,*,*,*,セイ/心房,名詞,普通名詞,一般,*,*,*,シンボウ/細動,名詞,普通名詞,一般,*,*,*,サイドウ","発作,名詞,普通名詞,一般,*,*,*,ホッサ/性,接尾辞,名詞的,一般,*,*,*,セイ/心房,名詞,普通名詞,一般,*,*,*,シンボウ/細動,名詞,普通名詞,一般,*,*,*,サイドウ",*
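To inspect the unit division info in the output, a line can be read with Python's csv module, which handles the quoted fields that contain commas. This is a minimal sketch, not part of kuro2sudachi: the function names and dictionary keys are hypothetical, and the column positions (split type in the 15th column, A-unit and B-unit splits in the 16th and 17th) are read off the sample output above.

```python
import csv
import io

# First line of the sample output above, copied verbatim.
LINE = (
    '融合たんぱく質,4786,4786,5000,融合たんぱく質,名詞,普通名詞,一般,*,*,*,,融合たんぱく質,*,C,'
    '"融合,名詞,普通名詞,サ変可能,*,*,*,ユウゴウ/たんぱく,名詞,普通名詞,一般,*,*,*,タンパク/質,接尾辞,名詞的,一般,*,*,*,シツ",'
    '"融合,名詞,普通名詞,サ変可能,*,*,*,ユウゴウ/たんぱく質,名詞,普通名詞,一般,*,*,*,タンパクシツ",'
    '*'
)


def unit_surfaces(field):
    """Return the surface of each unit in a split field ('*' means no split)."""
    if field == "*":
        return []
    # Units are separated by '/'; the first comma-separated item is the surface.
    return [unit.split(",")[0] for unit in field.split("/")]


def parse_entry(line):
    """Parse one output line into a small dict (hypothetical helper)."""
    row = next(csv.reader(io.StringIO(line)))
    return {
        "columns": len(row),
        "surface": row[0],
        "pos": ",".join(row[5:11]),
        "split_type": row[14],
        "a_units": unit_surfaces(row[15]),
        "b_units": unit_surfaces(row[16]),
    }


entry = parse_entry(LINE)
print(entry["split_type"], entry["a_units"], entry["b_units"])
```

Splitting on '/' is enough for the sample lines, but it would misparse a unit whose surface itself contains a slash.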
Splitting Words defined by kuromoji
Currently, the CLI does not support word splitting defined by kuromoji. Therefore, the split representation of kuromoji is ignored.
中咽頭ガン,中咽頭 ガン,チュウイントウ ガン,カスタム名詞
↓
中咽頭ガン,4786,4786,7000,中咽頭ガン,名詞,固有名詞,一般,*,*,*,チュウイントウガン,中咽頭ガン,*,*,*,*,*
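The effect on the example above can be illustrated with a short Python sketch. It is not the tool's implementation: the function names are hypothetical, and it only assumes the four-column kuromoji layout visible in the sample line (surface, space-separated segments, space-separated readings, POS).

```python
def parse_kuromoji_entry(line):
    """Split a kuromoji user dict line into its four columns (hypothetical helper)."""
    surface, segmentation, readings, pos = line.strip().split(",")
    return {
        "surface": surface,
        "segments": segmentation.split(" "),
        "readings": readings.split(" "),
        "pos": pos,
    }


def flatten(entry):
    """Drop the kuromoji split: keep the whole surface and join the readings."""
    return entry["surface"], "".join(entry["readings"])


entry = parse_kuromoji_entry("中咽頭ガン,中咽頭 ガン,チュウイントウ ガン,カスタム名詞")
print(entry["segments"])
print(flatten(entry))
```

The joined reading matches the reading column of the converted line above, while the segment boundary between 中咽頭 and ガン is discarded.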
For Developer
Test kuro2sudachi
$ poetry install
$ poetry run pytest
Run the kuro2sudachi command
$ poetry run kuro2sudachi tests/kuromoji_dict_test.txt -o sudachi_user_dict.txt
TODO
- split mode
- default rewrite