kuro2sudachi
kuro2sudachi converts a kuromoji user dictionary into a Sudachi user dictionary.
Usage
$ pip install kuro2sudachi
# prepare rewrite.def
# https://github.com/WorksApplications/Sudachi/blob/develop/src/main/resources/rewrite.def
$ ls
rewrite.def
$ kuro2sudachi kuromoji_dict.txt -o sudachi_user_dict.txt
Custom POS conversion dict
You can override the default conversion config with a JSON file.
{
"固有名詞": {
"sudachi_pos": "名詞,固有名詞,一般,*,*,*",
"left_id": 4786,
"right_id": 4786,
"cost": 5000
},
"名詞": {
"sudachi_pos": "名詞,普通名詞,一般,*,*,*",
"left_id": 5146,
"right_id": 5146,
"cost": 5000
}
}
$ kuro2sudachi kuromoji_dict.txt -o sudachi_user_dict.txt -c convert_config.json
To ignore unsupported-POS errors and invalid-format lines, use the --ignore flag.
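The config file can also be generated from Python rather than written by hand. The following is a minimal sketch, assuming only the JSON layout shown above; the names `CONFIG`, `render_config`, and `write_config` are hypothetical helpers for illustration and are not part of kuro2sudachi.

```python
import json
from pathlib import Path

# Mapping from kuromoji POS labels to Sudachi settings,
# copied from the JSON example above.
CONFIG = {
    "固有名詞": {
        "sudachi_pos": "名詞,固有名詞,一般,*,*,*",
        "left_id": 4786,
        "right_id": 4786,
        "cost": 5000,
    },
    "名詞": {
        "sudachi_pos": "名詞,普通名詞,一般,*,*,*",
        "left_id": 5146,
        "right_id": 5146,
        "cost": 5000,
    },
}


def render_config(mapping):
    """Serialize the mapping as JSON, keeping Japanese keys readable."""
    return json.dumps(mapping, ensure_ascii=False, indent=2)


def write_config(path, mapping=CONFIG):
    """Write the config as UTF-8 so it can be passed to the -c option."""
    Path(path).write_text(render_config(mapping), encoding="utf-8")
```

Calling `write_config("convert_config.json")` would produce a file suitable for the `-c convert_config.json` invocation shown above.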
Dictionary type
Specify the Sudachi dictionary used for tokenization with the -s option (default: core).
$ pip install sudachidict_full
$ kuro2sudachi kuromoji_dict.txt -o sudachi_user_dict.txt -s full
Splitting Words
Currently, the CLI does not support word splitting. Therefore, the split representation of kuromoji is ignored.
中咽頭ガン,中咽頭 ガン,チュウイントウ ガン,カスタム名詞
↓
中咽頭ガン,4786,4786,7000,中咽頭ガン,名詞,固有名詞,一般,*,*,*,チュウイントウガン,中咽頭ガン,*,*,*,*,*
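The mapping in the example above can be sketched in Python. This is an illustration only, not the kuro2sudachi implementation: the function name `convert_line` is hypothetical, and the `カスタム名詞` config entry (including its cost of 7000) is an assumption chosen to reproduce the example output. The 18-column layout is taken from the output line shown above.

```python
import csv

# Hypothetical config entry for the custom POS used in the example;
# the values are assumptions chosen to match the output line above.
EXAMPLE_CONFIG = {
    "カスタム名詞": {
        "sudachi_pos": "名詞,固有名詞,一般,*,*,*",
        "left_id": 4786,
        "right_id": 4786,
        "cost": 7000,
    }
}


def convert_line(line, config=EXAMPLE_CONFIG):
    """Convert one kuromoji user-dict line to a Sudachi user-dict line."""
    # kuromoji format: surface, segmentation, readings, POS
    surface, _segmentation, readings, pos = next(csv.reader([line]))
    entry = config[pos]  # raises KeyError for an unsupported POS
    # The split representation is ignored, so the readings are joined.
    reading = readings.replace(" ", "")
    columns = [
        surface,
        str(entry["left_id"]),
        str(entry["right_id"]),
        str(entry["cost"]),
        surface,                          # surface form for display
        *entry["sudachi_pos"].split(","),  # six POS columns
        reading,
        surface,                          # normalized form
        "*", "*", "*", "*", "*",          # remaining columns left unset
    ]
    return ",".join(columns)


print(convert_line("中咽頭ガン,中咽頭 ガン,チュウイントウ ガン,カスタム名詞"))
```

Because the segmentation column is discarded, `中咽頭 ガン` has no effect on the output; only the joined reading `チュウイントウガン` is carried over.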
Develop
Run the tests:
$ poetry install
$ poetry run pytest
Run the kuro2sudachi command:
$ poetry run kuro2sudachi tests/kuromoji_dict_test.txt -o sudachi_user_dict.txt
TODO
- split mode
- change connection cost
- support more POS
- support POS conversion with a custom dict