Python version of Sudachi, the Japanese Morphological Analyzer

SudachiPy

SudachiPy is a Python version of Sudachi, a Japanese morphological analyzer.

This is not a pure Python implementation, but bindings for Sudachi.rs.

Binary wheels

We provide binary builds for macOS (10.14+), Windows, and Linux, for the x86_64 architecture only. The 32-bit x86 architecture is neither supported nor tested. macOS source builds seem to work on ARM-based (AArch64) Macs, but this architecture is also untested, and building from source requires installing the Rust toolchain and Cargo.

More information here.
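
As a rough illustration of the platform statement above, the following sketch guesses whether a prebuilt wheel is likely to exist for a given platform. The function name binary_wheel_expected and the two constant sets are hypothetical and not part of SudachiPy; the strings are the values that Python's platform module commonly reports, and the result is only a hint based on the paragraph above, not a guarantee.

```python
import platform

# Assumption: platform.machine() reports "x86_64" on Linux and Intel macOS
# and "AMD64" on 64-bit Windows; both denote the x86_64 architecture.
X86_64_NAMES = {"x86_64", "amd64"}

# platform.system() values for Linux, macOS, and Windows.
SUPPORTED_SYSTEMS = {"Linux", "Darwin", "Windows"}


def binary_wheel_expected(system: str, machine: str) -> bool:
    """Return True if the platform matches the binary wheel description above."""
    return system in SUPPORTED_SYSTEMS and machine.lower() in X86_64_NAMES


# Check the interpreter that is running this script.
if binary_wheel_expected(platform.system(), platform.machine()):
    print("A binary wheel is probably available.")
else:
    print("Expect a source build (Rust toolchain and Cargo required).")
```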

TL;DR

$ pip install sudachipy sudachidict_core

$ echo "高輪ゲートウェイ駅" | sudachipy
高輪ゲートウェイ駅	名詞,固有名詞,一般,*,*,*	高輪ゲートウェイ駅
EOS

$ echo "高輪ゲートウェイ駅" | sudachipy -m A
高輪	名詞,固有名詞,地名,一般,*,*	高輪
ゲートウェイ	名詞,普通名詞,一般,*,*,*	ゲートウェー
駅	名詞,普通名詞,一般,*,*,*	駅
EOS

$ echo "空缶空罐空きカン" | sudachipy -a
空缶	名詞,普通名詞,一般,*,*,*	空き缶	空缶	アキカン	0
空罐	名詞,普通名詞,一般,*,*,*	空き缶	空罐	アキカン	0
空きカン	名詞,普通名詞,一般,*,*,*	空き缶	空きカン	アキカン	0
EOS

from sudachipy import Dictionary, SplitMode

tokenizer = Dictionary().create()

morphemes = tokenizer.tokenize("国会議事堂前駅")
print(morphemes[0].surface())  # '国会議事堂前駅'
print(morphemes[0].reading_form())  # 'コッカイギジドウマエエキ'
print(morphemes[0].part_of_speech())  # ['名詞', '固有名詞', '一般', '*', '*', '*']

morphemes = tokenizer.tokenize("国会議事堂前駅", SplitMode.A)
print([m.surface() for m in morphemes])  # ['国会', '議事', '堂', '前', '駅']

Setup

You need SudachiPy and a dictionary.

Step 1. Install SudachiPy

$ pip install sudachipy

Step 2. Get a Dictionary

You can get a dictionary as a Python package. It may take a while to download the dictionary file (around 70MB for the core edition).

$ pip install sudachidict_core

Alternatively, you can choose other dictionary editions. See the Dictionary Edition section for details.

Usage: As a command

SudachiPy provides a CLI command, sudachipy.

$ echo "外国人参政権" | sudachipy
外国人参政権	名詞,普通名詞,一般,*,*,*	外国人参政権
EOS
$ echo "外国人参政権" | sudachipy -m A
外国	名詞,普通名詞,一般,*,*,*	外国
人	接尾辞,名詞的,一般,*,*,*	人
参政	名詞,普通名詞,一般,*,*,*	参政
権	接尾辞,名詞的,一般,*,*,*	権
EOS
$ sudachipy tokenize -h
usage: sudachipy tokenize [-h] [-r file] [-m {A,B,C}] [-o file] [-s string]
                          [-a] [-d] [-v]
                          [file [file ...]]

Tokenize Text

positional arguments:
  file           text written in utf-8

optional arguments:
  -h, --help     show this help message and exit
  -r file        the setting file in JSON format
  -m {A,B,C}     the mode of splitting
  -o file        the output file
  -s string      sudachidict type
  -a             print all of the fields
  -d             print the debug information
  -v, --version  print sudachipy version

Note: The Debug option (-d) is disabled in version 0.6.0.

Output

Columns are tab separated.

  • Surface
  • Part-of-Speech Tags (comma separated)
  • Normalized Form

With the -a option, the output additionally contains the following columns:

  • Dictionary Form
  • Reading Form
  • Dictionary ID
    • 0 for the system dictionary
    • 1 and above for the user dictionaries
    • -1 if a word is Out-of-Vocabulary (not in the dictionary)
  • Synonym group IDs
  • (OOV) if a word is Out-of-Vocabulary (not in the dictionary)

$ echo "外国人参政権" | sudachipy -a
外国人参政権	名詞,普通名詞,一般,*,*,*	外国人参政権	外国人参政権	ガイコクジンサンセイケン	0	[]
EOS
$ echo "阿quei" | sudachipy -a
阿	名詞,普通名詞,一般,*,*,*				-1	[]	(OOV)
quei	名詞,普通名詞,一般,*,*,*	quei	quei		-1	[]	(OOV)
EOS
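
The column layout described above can be consumed from another program. The following is a minimal sketch that parses one line of sudachipy output into a dictionary; parse_line and the field names are hypothetical helpers, not part of SudachiPy. The synonym group IDs are kept as the raw string, since only the empty form [] appears in the examples above.

```python
from typing import Any, Dict, Optional

# Field names mirror the column list above; the names are illustrative only.
BASE_FIELDS = ["surface", "part_of_speech", "normalized_form"]
EXTRA_FIELDS = ["dictionary_form", "reading_form", "dictionary_id", "synonym_group_ids"]


def parse_line(line: str) -> Optional[Dict[str, Any]]:
    """Parse one tab-separated output line; return None for the EOS marker."""
    line = line.rstrip("\n")
    if line == "EOS":
        return None
    columns = line.split("\t")
    if len(columns) < len(BASE_FIELDS):
        raise ValueError("expected at least %d columns" % len(BASE_FIELDS))
    # zip stops at the shorter sequence, so output without -a yields only
    # the three base fields.
    record: Dict[str, Any] = dict(zip(BASE_FIELDS + EXTRA_FIELDS, columns))
    record["part_of_speech"] = record["part_of_speech"].split(",")
    if "dictionary_id" in record:
        record["dictionary_id"] = int(record["dictionary_id"])
        # The trailing (OOV) column is present only for unknown words.
        record["is_oov"] = columns[-1] == "(OOV)"
    return record


sample = "外国\t名詞,普通名詞,一般,*,*,*\t外国"
print(parse_line(sample)["part_of_speech"][:2])  # ['名詞', '普通名詞']
```

Lines read from the command's standard output can be passed to parse_line one at a time; a None result marks the end of a sentence.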

Usage: As a Python package

API

See the API reference page.

Example

from sudachipy import Dictionary, SplitMode

tokenizer_obj = Dictionary().create()
# Multi-granular Tokenization

# SplitMode.C is the default mode
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", SplitMode.C)]
# => ['国家公務員']

[m.surface() for m in tokenizer_obj.tokenize("国家公務員", SplitMode.B)]
# => ['国家', '公務員']

[m.surface() for m in tokenizer_obj.tokenize("国家公務員", SplitMode.A)]
# => ['国家', '公務', '員']

# Morpheme information

m = tokenizer_obj.tokenize("食べ")[0]

m.surface() # => '食べ'
m.dictionary_form() # => '食べる'
m.reading_form() # => 'タベ'
m.part_of_speech() # => ['動詞', '一般', '*', '*', '下一段-バ行', '連用形-一般']

# Normalization

# SplitMode.C (the default mode) is passed explicitly here
tokenizer_obj.tokenize("附属", SplitMode.C)[0].normalized_form()
# => '付属'
tokenizer_obj.tokenize("SUMMER", SplitMode.C)[0].normalized_form()
# => 'サマー'
tokenizer_obj.tokenize("シュミレーション", SplitMode.C)[0].normalized_form()
# => 'シミュレーション'

(Results obtained with the 20210802 core dictionary; they may differ with other versions.)

Dictionary Edition

There are three editions of Sudachi Dictionary: small, core, and full. See WorksApplications/SudachiDict for details.

SudachiPy uses sudachidict_core by default.

Dictionaries are installed as Python packages sudachidict_small, sudachidict_core, and sudachidict_full.

The dictionary files are not included in the packages themselves; they are downloaded upon installation.

Dictionary option: command line

You can specify the dictionary with the tokenize option -s.

$ pip install sudachidict_small
$ echo "外国人参政権" | sudachipy -s small
$ pip install sudachidict_full
$ echo "外国人参政権" | sudachipy -s full

Dictionary option: Python package

You can specify the dictionary with the Dictionary() arguments config_path or dict_type.

class Dictionary(config_path=None, resource_dir=None, dict_type=None)
  1. config_path
    • You can specify the path to the setting file with config_path (see the section "Dictionary in The Setting File" for details).
    • If the dictionary file is specified in the setting file as systemDict, SudachiPy will use that dictionary.
  2. dict_type
    • You can also specify the dictionary type with dict_type.
    • The available values are small, core, and full.
    • If config_path and dict_type specify different dictionaries, the one given by dict_type overrides the one defined in the setting file.

from sudachipy import Dictionary

# default: sudachidict_core
tokenizer_obj = Dictionary().create()

# The dictionary given by the `systemDict` key in the config file (/path/to/sudachi.json) will be used
tokenizer_obj = Dictionary(config_path="/path/to/sudachi.json").create()

# The dictionary specified by `dict_type` will be set.
tokenizer_obj = Dictionary(dict_type="core").create()  # sudachidict_core (same as default)
tokenizer_obj = Dictionary(dict_type="small").create()  # sudachidict_small
tokenizer_obj = Dictionary(dict_type="full").create()  # sudachidict_full

# The dictionary specified by `dict_type` overrides those defined in the config path.
# In the following code, `sudachidict_full` will be used regardless of a dictionary defined in the config file.
tokenizer_obj = Dictionary(config_path="/path/to/sudachi.json", dict_type="full").create()

Dictionary in The Setting File

Alternatively, if the dictionary file is specified in the setting file, sudachi.json, SudachiPy will use that file.

{
    "systemDict" : "relative/path/from/resourceDir/to/system.dic",
    ...
}

The default setting file is sudachi.json. You can specify your sudachi.json with the -r option.

$ sudachipy -r path/to/sudachi.json

User Dictionary

To use a user dictionary, user.dic, place sudachi.json anywhere you like, and add a userDict value with the relative path from sudachi.json to your user.dic.

{
    "userDict" : ["relative/path/to/user.dic"],
    ...
}

Then specify your sudachi.json with the -r option.

$ sudachipy -r path/to/sudachi.json
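
If the setting file is generated by a script, the dictionary keys can be added programmatically. The following is a minimal sketch, assuming the other settings are already available as a Python dict; the helper with_dictionaries is hypothetical and not part of SudachiPy, and only the systemDict and userDict keys shown in this document are touched.

```python
import json
from typing import Any, Dict, Optional, Sequence


def with_dictionaries(
    base: Dict[str, Any],
    user_dicts: Sequence[str] = (),
    system_dict: Optional[str] = None,
) -> Dict[str, Any]:
    """Return a copy of the settings with the dictionary keys filled in."""
    settings = dict(base)  # shallow copy, so the caller's dict is not modified
    if system_dict is not None:
        settings["systemDict"] = system_dict
    if user_dicts:
        settings["userDict"] = list(user_dicts)
    return settings


# An empty base stands in for the other settings, which are elided above.
settings = with_dictionaries({}, user_dicts=["relative/path/to/user.dic"])
text = json.dumps(settings, ensure_ascii=False, indent=4)
print(text)
```

The resulting text can be saved as sudachi.json and passed with the -r option, or with the config_path argument of Dictionary().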

You can build a user dictionary with the subcommand ubuild.

$ sudachipy ubuild -h
usage: sudachipy ubuild [-h] [-d string] [-o file] [-s file] file [file ...]

Build User Dictionary

positional arguments:
  file        source files with CSV format (one or more)

optional arguments:
  -h, --help  show this help message and exit
  -d string   description comment to be embedded on dictionary
  -o file     output file (default: user.dic)
  -s file     system dictionary path (default: system core dictionary path)

For the dictionary file format, please refer to this document (written in Japanese; an English version is not available yet).
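
When the user dictionary is built from a script, the ubuild invocation can be assembled as an argument list. This is a sketch based only on the options listed in the help text above; the helper ubuild_command and the CSV file names are hypothetical.

```python
from typing import List, Optional, Sequence


def ubuild_command(
    sources: Sequence[str],
    output: str = "user.dic",
    description: Optional[str] = None,
    system_dict: Optional[str] = None,
) -> List[str]:
    """Build the argument list for `sudachipy ubuild` from the documented options."""
    if not sources:
        raise ValueError("at least one CSV source file is required")
    command = ["sudachipy", "ubuild", "-o", output]
    if description is not None:
        command += ["-d", description]
    if system_dict is not None:
        command += ["-s", system_dict]
    command += list(sources)  # positional CSV files come last
    return command


command = ubuild_command(["user_words.csv"], description="my user dictionary")
print(command)
# subprocess.run(command, check=True) would execute it; this requires the
# sudachipy command on PATH and an existing user_words.csv.
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names and descriptions that contain spaces.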

Customized System Dictionary

You can build a customized system dictionary with the subcommand build.

$ sudachipy build -h
usage: sudachipy build [-h] [-o file] [-d string] -m file file [file ...]

Build Sudachi Dictionary

positional arguments:
  file        source files with CSV format (one of more)

optional arguments:
  -h, --help  show this help message and exit
  -o file     output file (default: system.dic)
  -d string   description comment to be embedded on dictionary

required named arguments:
  -m file     connection matrix file with MeCab's matrix.def format

To use your customized system.dic, place sudachi.json anywhere you like, and overwrite the systemDict value with the relative path from sudachi.json to your system.dic.

{
    "systemDict" : "relative/path/to/system.dic",
    ...
}

Then specify your sudachi.json with the -r option.

$ sudachipy -r path/to/sudachi.json

For Developers

Build from source

Install sdist via pip

  1. Install the Python modules setuptools and setuptools-rust.
  2. Run ./build-sdist.sh in the python directory.
    • The source distribution will be generated under the python/dist/ directory.
  3. Install it via pip: pip install ./python/dist/SudachiPy-[version].tar.gz

Install develop build

  1. Install the Python modules setuptools and setuptools-rust.
  2. Run python3 setup.py develop.
    • develop creates a debug build, while install creates a release build.
  3. You can now import the module with import sudachipy.

ref: setuptools-rust

Test

Run build_and_test.sh to run the tests.

Contact

Sudachi and SudachiPy are developed by WAP Tokushima Laboratory of AI and NLP.

Open an issue, or come to our Slack workspace for questions and discussion.

https://sudachi-dev.slack.com/ (Get invitation here)

Enjoy tokenization!
