Python version of Sudachi, the Japanese Morphological Analyzer
Project description
SudachiPy
SudachiPy is a Python version of Sudachi, a Japanese morphological analyzer.
TL;DR
$ pip install sudachipy sudachidict_core
$ echo "高輪ゲートウェイ駅" | sudachipy
高輪ゲートウェイ駅 名詞,固有名詞,一般,*,*,* 高輪ゲートウェイ駅
EOS
$ echo "高輪ゲートウェイ駅" | sudachipy -m A
高輪 名詞,固有名詞,地名,一般,*,* 高輪
ゲートウェイ 名詞,普通名詞,一般,*,*,* ゲートウェー
駅 名詞,普通名詞,一般,*,*,* 駅
EOS
$ echo "空缶空罐空きカン" | sudachipy -a
空缶 名詞,普通名詞,一般,*,*,* 空き缶 空缶 アキカン 0
空罐 名詞,普通名詞,一般,*,*,* 空き缶 空罐 アキカン 0
空きカン 名詞,普通名詞,一般,*,*,* 空き缶 空きカン アキカン 0
EOS
Setup
You need SudachiPy and a dictionary.
Step 1. Install SudachiPy
$ pip install sudachipy
Step 2. Get a Dictionary
You can get dictionary as a Python package. It make take a while to download the dictionary file (around 70MB for the core
edition).
$ pip install sudachidict_core
Alternatively, you can choose other dictionary editions. See this section for the detail.
Usage: As a command
There is a CLI command sudachipy
.
$ echo "外国人参政権" | sudachipy
外国人参政権 名詞,普通名詞,一般,*,*,* 外国人参政権
EOS
$ echo "外国人参政権" | sudachipy -m A
外国 名詞,普通名詞,一般,*,*,* 外国
人 接尾辞,名詞的,一般,*,*,* 人
参政 名詞,普通名詞,一般,*,*,* 参政
権 接尾辞,名詞的,一般,*,*,* 権
EOS
$ sudachipy tokenize -h
usage: sudachipy tokenize [-h] [-r file] [-m {A,B,C}] [-o file] [-a] [-d] [-v]
[file [file ...]]
Tokenize Text
positional arguments:
file text written in utf-8
optional arguments:
-h, --help show this help message and exit
-r file the setting file in JSON format
-m {A,B,C} the mode of splitting
-o file the output file
-a print all of the fields
-d print the debug information
-v, --version print sudachipy version
Output
Columns are tab separated.
- Surface
- Part-of-Speech Tags (comma separated)
- Normalized Form
When you add the -a
option, it additionally outputs
- Dictionary Form
- Reading Form
- Dictionary ID
0
for the system dictionary1
and above for the user dictionaries-1\t(OOV)
if a word is Out-of-Vocabulary (not in the dictionary)
$ echo "外国人参政権" | sudachipy -a
外国人参政権 名詞,普通名詞,一般,*,*,* 外国人参政権 外国人参政権 ガイコクジンサンセイケン 0
EOS
echo "阿quei" | sudachipy -a
阿 名詞,普通名詞,一般,*,*,* 阿 阿 -1 (OOV)
quei 名詞,普通名詞,一般,*,*,* quei quei -1 (OOV)
EOS
Usage: As a Python package
Here is an example;
from sudachipy import tokenizer
from sudachipy import dictionary
tokenizer_obj = dictionary.Dictionary().create()
# Multi-granular Tokenization
mode = tokenizer.Tokenizer.SplitMode.C
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
# => ['国家公務員']
mode = tokenizer.Tokenizer.SplitMode.B
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
# => ['国家', '公務員']
mode = tokenizer.Tokenizer.SplitMode.A
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
# => ['国家', '公務', '員']
# Morpheme information
m = tokenizer_obj.tokenize("食べ", mode)[0]
m.surface() # => '食べ'
m.dictionary_form() # => '食べる'
m.reading_form() # => 'タベ'
m.part_of_speech() # => ['動詞', '一般', '*', '*', '下一段-バ行', '連用形-一般']
# Normalization
tokenizer_obj.tokenize("附属", mode)[0].normalized_form()
# => '付属'
tokenizer_obj.tokenize("SUMMER", mode)[0].normalized_form()
# => 'サマー'
tokenizer_obj.tokenize("シュミレーション", mode)[0].normalized_form()
# => 'シミュレーション'
(With 20200330
core
dictionary. The results may change when you use other versions)
Dictionary Edition
There are three editions of Sudachi Dictionary, namely, small
, core
, and full
. See WorksApplications/SudachiDict for the detail.
SudachiPy uses sudachidict_core
by default. You can specify the dictionary with the link -t
command.
$ pip install sudachidict_small
$ sudachipy link -t small
$ pip install sudachidict_full
$ sudachipy link -t full
You can remove the dictionary link with the link -u
commnad.
$ sudachipy link -u
Dictionaries are installed as Python packages sudachidict_small
, sudachidict_core
, and sudachidict_full
. SudachiPy tries to refer sudachidict
package to use a dictionary. The link
subcommand creates a symbolic link of sudachidict_*
as sudachidict
, to switch the packages.
The dictionary files are not in the package itself, but it is downloaded upon installation.
Dictionary in The Setting File
Alternatively, if the dictionary file is specified in the setting file, sudachi.json
, SudachiPy will use that file.
{
"systemDict" : "relative/path/to/system.dic",
...
}
The default setting file is sudachipy/resources/sudachi.json. You can specify your sudachi.json
with the -r
option.
$ sudachipy -r path/to/sudachi.json
User Dictionary
To use a user dictionary, user.dic
, place sudachi.json to anywhere you like, and add userDict
value with the relative path from sudachi.json
to your user.dic
.
{
"userDict" : ["relative/path/to/user.dic"],
...
}
Then specify your sudachi.json
with the -r
option.
$ sudachipy -r path/to/sudachi.json
You can build a user dictionary with the subcommand ubuild
.
WARNING: v0.3.* ubuild contains bug.
$ sudachipy ubuild -h
usage: sudachipy ubuild [-h] [-d string] [-o file] [-s file] file [file ...]
Build User Dictionary
positional arguments:
file source files with CSV format (one or more)
optional arguments:
-h, --help show this help message and exit
-d string description comment to be embedded on dictionary
-o file output file (default: user.dic)
-s file system dictionary (default: linked system_dic, see link -h)
About the dictionary file format, please refer to this document (written in Japanese, English version is not available yet).
Customized System Dictionary
$ sudachipy build -h
usage: sudachipy build [-h] [-o file] [-d string] -m file file [file ...]
Build Sudachi Dictionary
positional arguments:
file source files with CSV format (one of more)
optional arguments:
-h, --help show this help message and exit
-o file output file (default: system.dic)
-d string description comment to be embedded on dictionary
required named arguments:
-m file connection matrix file with MeCab's matrix.def format
To use your customized system.dic
, place sudachi.json to anywhere you like, and overwrite systemDict
value with the relative path from sudachi.json
to your system.dic
.
{
"systemDict" : "relative/path/to/system.dic",
...
}
Then specify your sudachi.json
with the -r
option.
$ sudachipy -r path/to/sudachi.json
For Developers
Code Format
Run scripts/format.sh
to check if your code is formatted correctly.
You need packages flake8
flake8-import-order
flake8-buitins
(See requirements.txt
).
Test
Run scripts/test.sh
to run the tests.
Contact
Sudachi and SudachiPy are developed by WAP Tokushima Laboratory of AI and NLP.
Open an issue, or come to our Slack workspace for questions and discussion.
https://sudachi-dev.slack.com/ (Get invitation here)
Enjoy tokenization!
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for SudachiPy-0.4.5-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | bb5fc2d9d2dddfeabc6901b9e4adb81e09037d18c3852357525d817bd774cf31 |
|
MD5 | 2974225fd8dbe475eb32bfa222d60545 |
|
BLAKE2b-256 | c1c2891aee144e98bbfe331f02558d5b3608731ca70926657cbe9e80737b57d5 |