Python version of Sudachi, the Japanese Morphological Analyzer
SudachiPy
SudachiPy is a Python version of Sudachi, a Japanese morphological analyzer.
Sudachi and SudachiPy are developed at WAP Tokushima Laboratory of AI and NLP, an institute under Works Applications that focuses on Natural Language Processing (NLP).
Warning: SudachiPy is still under development, and some of the functions are still not complete. Please use it at your own risk.
Breaking changes
v0.3.0
- The resources/ directory was moved to sudachipy/.
v0.2.2
- Distribute SudachiPy package via PyPI
pip install SudachiPy
v0.2.0
- User dictionary feature added
Easy Setup
SudachiPy requires Python 3.5+.
Step 1: Install SudachiPy
SudachiPy is distributed from PyPI. You can install it by executing pip install SudachiPy from the command line.
$ pip install SudachiPy
By default, SudachiPy (>= v0.3.0) uses the system.dic of the SudachiDict_core package, which is not included in SudachiPy. Please proceed to Step 2 to install the dict package.
Step 2: Install SudachiDict_core
The default dict package SudachiDict_core is distributed from our download site. Run pip install as shown below:
$ pip install https://object-storage.tyo2.conoha.io/v1/nc_2520839e1f9641b08211a5c85243124a/sudachi/SudachiDict_core-20190718.tar.gz
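After both steps, it can be useful to confirm that the packages are importable from the current environment. The following is a minimal sketch using only the standard library; the helper name is_module_installed is hypothetical, and the module name sudachidict_core is an assumption about what the SudachiDict_core package installs (it is not stated in this document), so adjust it if your installation differs.

```python
import importlib.util


def is_module_installed(module_name):
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


# "sudachidict_core" is an assumed module name for the SudachiDict_core package.
for name in ("sudachipy", "sudachidict_core"):
    status = "found" if is_module_installed(name) else "missing"
    print("{}: {}".format(name, status))
```

If either line reports missing, repeat the corresponding pip install step above.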
Usage
As a command
After installing SudachiPy, you can also use it from the terminal via the sudachipy command.
You can run sudachipy on standard input like this:
$ sudachipy
sudachipy has 4 subcommands (the default is tokenize).
$ sudachipy tokenize -h
usage: sudachipy tokenize [-h] [-r file] [-m {A,B,C}] [-o file] [-a] [-d] [-v]
[file [file ...]]
Tokenize Text
positional arguments:
file text written in utf-8
optional arguments:
-h, --help show this help message and exit
-r file the setting file in JSON format
-m {A,B,C} the mode of splitting
-o file the output file
-a print all of the fields
-d print the debug information
-v, --version print sudachipy version
$ sudachipy link -h
usage: sudachipy link [-h] [-t {small,core,full}] [-u]
Link Default Dict Package
optional arguments:
-h, --help show this help message and exit
-t {small,core,full} dict dict
-u unlink sudachidict
$ sudachipy build -h
usage: sudachipy build [-h] [-o file] [-d string] -m file file [file ...]
Build Sudachi Dictionary
positional arguments:
file source files with CSV format (one of more)
optional arguments:
-h, --help show this help message and exit
-o file output file (default: system.dic)
-d string description comment to be embedded on dictionary
required named arguments:
-m file connection matrix file with MeCab's matrix.def format
$ sudachipy ubuild -h
usage: sudachipy ubuild [-h] [-d string] [-o file] [-s file] file [file ...]
Build User Dictionary
positional arguments:
file source files with CSV format (one or more)
optional arguments:
-h, --help show this help message and exit
-d string description comment to be embedded on dictionary
-o file output file (default: user.dic)
-s file system dictionary (default: ${SUDACHIPY}/resouces/system.dic)
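If you want to drive the tokenize subcommand from a script, the options listed in its help text can be assembled into an argument list. The following is a rough sketch, not part of SudachiPy: the function names build_tokenize_command and run_tokenize are hypothetical, and only the flags shown in the help output above (-m, -r, -o, -a) are used.

```python
import subprocess


def build_tokenize_command(mode="C", settings=None, output=None,
                           all_fields=False, files=()):
    """Build the argument list for `sudachipy tokenize` (hypothetical helper)."""
    if mode not in ("A", "B", "C"):
        raise ValueError("mode must be one of A, B, C")
    cmd = ["sudachipy", "tokenize", "-m", mode]
    if settings:
        cmd += ["-r", settings]   # the setting file in JSON format
    if output:
        cmd += ["-o", output]     # the output file
    if all_fields:
        cmd.append("-a")          # print all of the fields
    cmd.extend(files)             # input files written in UTF-8
    return cmd


def run_tokenize(text, **options):
    """Pipe `text` to sudachipy on standard input and return its standard output."""
    result = subprocess.run(
        build_tokenize_command(**options),
        input=text.encode("utf-8"),
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("utf-8")
```

Calling run_tokenize requires the sudachipy command and a dictionary to be installed; build_tokenize_command alone has no such requirement.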
As a Python package
Here is an example usage:
from sudachipy import tokenizer
from sudachipy import dictionary
tokenizer_obj = dictionary.Dictionary().create()
# Multi-granular tokenization
# (following results are w/ `system_full.dic`
# you may not be able to replicate this particular example w/ `system_core.dic`)
mode = tokenizer.Tokenizer.SplitMode.C
[m.surface() for m in tokenizer_obj.tokenize("医薬品安全管理責任者", mode)]
# => ['医薬品安全管理責任者']
mode = tokenizer.Tokenizer.SplitMode.B
[m.surface() for m in tokenizer_obj.tokenize("医薬品安全管理責任者", mode)]
# => ['医薬品', '安全', '管理', '責任者']
mode = tokenizer.Tokenizer.SplitMode.A
[m.surface() for m in tokenizer_obj.tokenize("医薬品安全管理責任者", mode)]
# => ['医薬', '品', '安全', '管理', '責任', '者']
# Morpheme information
m = tokenizer_obj.tokenize("食べ", mode)[0]
m.surface() # => '食べ'
m.dictionary_form() # => '食べる'
m.reading_form() # => 'タベ'
m.part_of_speech() # => ['動詞', '一般', '*', '*', '下一段-バ行', '連用形-一般']
# Normalization
tokenizer_obj.tokenize("附属", mode)[0].normalized_form()
# => '付属'
tokenizer_obj.tokenize("SUMMER", mode)[0].normalized_form()
# => 'サマー'
tokenizer_obj.tokenize("シュミレーション", mode)[0].normalized_form()
# => 'シミュレーション'
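The normalization calls above look at one morpheme at a time. To normalize a whole string, the normalized forms of all morphemes can be joined. The following is a small sketch; normalize_text is a hypothetical helper name, and it relies only on the tokenize() and normalized_form() calls shown in the example above.

```python
def normalize_text(tokenizer_obj, text, mode):
    """Join the normalized form of every morpheme found in `text`."""
    morphemes = tokenizer_obj.tokenize(text, mode)
    return "".join(m.normalized_form() for m in morphemes)


# Usage with SudachiPy and a dictionary installed:
#   from sudachipy import dictionary, tokenizer
#   tokenizer_obj = dictionary.Dictionary().create()
#   normalize_text(tokenizer_obj, "附属", tokenizer.Tokenizer.SplitMode.C)
```

The result depends on the dictionary and split mode in use, so no expected output is given here.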
Install dict packages
You can download and install the built dictionaries from Python packages · WorksApplications/SudachiDict.
$ pip install SudachiDict_full-20190531.tar.gz
You can change the default dict package by executing the link command.
$ sudachipy link -t full
You can also remove the default dict setting.
$ sudachipy link -u
Customized dictionary
If you need to apply a customized system.dic, place sudachi.json anywhere you like and overwrite the systemDict value with the relative path from sudachi.json to your system.dic.
{
"systemDict" : "relative/path/to/system.dic",
...
}
Then you can specify sudachi.json with the -r option.
$ sudachipy -r path/to/sudachi.json
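Because systemDict must be relative to the location of sudachi.json, it is easy to get the path wrong by hand. The following is a minimal sketch that computes the relative path and writes the settings file. The function name write_settings and its base_settings parameter are hypothetical; the other keys of sudachi.json are elided in this document, so pass in the settings you already have rather than treating this as a complete file.

```python
import json
import os


def write_settings(settings_path, system_dict_path, base_settings=None):
    """Write sudachi.json with systemDict relative to the settings file."""
    # Start from the caller's existing settings (the other keys are not shown here).
    settings = dict(base_settings or {})
    settings_dir = os.path.dirname(os.path.abspath(settings_path))
    relative = os.path.relpath(os.path.abspath(system_dict_path), settings_dir)
    # Use forward slashes so the value looks the same on every platform.
    settings["systemDict"] = relative.replace(os.sep, "/")
    with open(settings_path, "w", encoding="utf-8") as f:
        json.dump(settings, f, ensure_ascii=False, indent=4)
    return settings["systemDict"]
```

The returned value is the path that was written, which you can check before passing the file to sudachipy with -r.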
Eventually, we would like to provide a way to fetch these resources from code, as NLTK (e.g., import nltk; nltk.download()) and spaCy (e.g., $ python -m spacy download en) do.
For developers
Code format
You can use ./scripts/format.sh to check whether your code follows the formatting rules. flake8, flake8-import-order, and flake8-builtins are required; see requirements.txt.
Test
You can use ./script/test.sh to check that your change does not cause a regression.