Python version of Sudachi, the Japanese Morphological Analyzer
Project description
SudachiPy
SudachiPy is a Python version of Sudachi, a Japanese morphological analyzer.
Sudachi & SudachiPy are developed in WAP Tokushima Laboratory of AI and NLP, an institute under Works Applications that focuses on Natural Language Processing (NLP).
Warning: SudachiPy is still under development, and some of the functions are still not complete. Please use it at your own risk.
Breaking changes
v0.3.0
resources/
directory was moved tosudachipy/
.
V0.2.2
- Distribute SudachiPy package via PyPI
pip install SudachiPy
v0.2.0
- User dictionary feature added
Easy Setup
SudachiPy requires Python3.5+.
Step 1: Install SudachiPy
SudachiPy is distributed from PyPI. You can install SudachiPy by executing pip install SudachiPy
from the command line.
$ pip install SudachiPy
SudachiPy(>=v0.3.0) refers to system.dic of SudachiDict_core (not included in SudachiPy) package by default. Please proceed to Step 2 to install the dict package.
Step 2: Install SudachiDict_core
The default dict package SudachiDict_core
is distributed from our download site.
Run pip install
like below:
$ pip install https://object-storage.tyo2.conoha.io/v1/nc_2520839e1f9641b08211a5c85243124a/sudachi/SudachiDict_core-20190531.tar.gz
Usage
As a command
After installing SudachiPy, you may also use it in the terminal via command sudachipy
.
You can excute sudachipy
with standard input by this way:
$ sudachipy
sudachipy
has 4 subcommands (in default tokenize
)
$ sudachipy tokenize -h
usage: sudachipy tokenize [-h] [-r file] [-m {A,B,C}] [-o file] [-a] [-d] [-v]
[file [file ...]]
Tokenize Text
positional arguments:
file text written in utf-8
optional arguments:
-h, --help show this help message and exit
-r file the setting file in JSON format
-m {A,B,C} the mode of splitting
-o file the output file
-a print all of the fields
-d print the debug information
-v, --version print sudachipy version
$ sudachipy link -h
usage: sudachipy link [-h] [-t {small,core,full}] [-u]
Link Default Dict Package
optional arguments:
-h, --help show this help message and exit
-t {small,core,full} dict dict
-u unlink sudachidict
$ sudachipy build -h
usage: sudachipy build [-h] [-o file] [-d string] -m file file [file ...]
Build Sudachi Dictionary
positional arguments:
file source files with CSV format (one of more)
optional arguments:
-h, --help show this help message and exit
-o file output file (default: system.dic)
-d string description comment to be embedded on dictionary
required named arguments:
-m file connection matrix file with MeCab's matrix.def format
$ sudachipy ubuild -h
usage: sudachipy ubuild [-h] [-d string] [-o file] [-s file] file [file ...]
Build User Dictionary
positional arguments:
file source files with CSV format (one or more)
optional arguments:
-h, --help show this help message and exit
-d string description comment to be embedded on dictionary
-o file output file (default: user.dic)
-s file system dictionary (default: ${SUDACHIPY}/resouces/system.dic)
As a Python package
Here is an example usage;
from sudachipy import tokenizer
from sudachipy import dictionary
tokenizer_obj = dictionary.Dictionary().create()
# Multi-granular tokenization
# (following results are w/ `system_full.dic`
# you may not be able to replicate this particular example w/ `system_core.dic`)
mode = tokenizer.Tokenizer.SplitMode.C
[m.surface() for m in tokenizer_obj.tokenize("医薬品安全管理責任者", mode)]
# => ['医薬品安全管理責任者']
mode = tokenizer.Tokenizer.SplitMode.B
[m.surface() for m in tokenizer_obj.tokenize("医薬品安全管理責任者", mode)]
# => ['医薬品', '安全', '管理', '責任者']
mode = tokenizer.Tokenizer.SplitMode.A
[m.surface() for m in tokenizer_obj.tokenize("医薬品安全管理責任者", mode)]
# => ['医薬', '品', '安全', '管理', '責任', '者']
# Morpheme information
m = tokenizer_obj.tokenize("食べ", mode)[0]
m.surface() # => '食べ'
m.dictionary_form() # => '食べる'
m.reading_form() # => 'タベ'
m.part_of_speech() # => ['動詞', '一般', '*', '*', '下一段-バ行', '連用形-一般']
# Normalization
tokenizer_obj.tokenize("附属", mode)[0].normalized_form()
# => '付属'
tokenizer_obj.tokenize("SUMMER", mode)[0].normalized_form()
# => 'サマー'
tokenizer_obj.tokenize("シュミレーション", mode)[0].normalized_form()
# => 'シミュレーション'
Install dict packages
You can download and install the built dictionaries from Python packages · WorksApplications/SudachiDict.
$ pip install SudachiDict_full-20190531.tar.gz
You can change the default dict package by executing link command.
$ sudachipy link -t full
You can remove default dict setting.
$ sudachipy link -u
Customized dictionary
If you need to apply customized system.dic
,
place sudachi.json to anywhere you like,
and overwrite systemDict
value with the relative path from sudachi.json
to your system.dic
.
{
"systemDict" : "relative/path/to/system.dic",
...
}
Then you can specify sudachi.json
with -r
option.
$ sudachipy -r path/to/sudachi.json
In the end, we would like to make a flow to get these resources via the code, like NLTK (e.g., import nltk; nltk.download()
) or spaCy (e.g., $python -m spacy download en
).
For developer
Code format
You can use ./scripts/format.sh
and check if your code is in rule. flake8
flake8-import-order
flake8-buitins
is required. See requirements.txt
Test
You can use ./script/test.sh
and check if not your change cause regression.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for SudachiPy-0.3.1-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 608cfcb87c163852c4576970d199de63d6cf92108e710ba3cef9ead72b198928 |
|
MD5 | 5bb353599d2901dadfedfc71c97329b4 |
|
BLAKE2b-256 | d305dcf7d90ed7c7eeb016361fecbb21ce6d88bfd1a0e36eb01cb3431f3410de |