
Python version of Sudachi, the Japanese Morphological Analyzer


SudachiPy


SudachiPy is a Python version of Sudachi, a Japanese morphological analyzer.

Sudachi and SudachiPy are developed at WAP Tokushima Laboratory of AI and NLP, an institute under Works Applications that focuses on Natural Language Processing (NLP).

Warning: some functions are still incompatible with Java Sudachi.

Easy Setup

Step 1: Install SudachiPy

SudachiPy is distributed via PyPI. Install it with pip from the command line:

$ pip install SudachiPy

SudachiPy (>= v0.3.0) uses system.dic from the SudachiDict_core package by default. That package is not included in SudachiPy, so proceed to Step 2 to install it.

Step 2: Install SudachiDict_core

The default dict package, SudachiDict_core, is distributed from our download site. Install it with pip as follows:

$ pip install https://object-storage.tyo2.conoha.io/v1/nc_2520839e1f9641b08211a5c85243124a/sudachi/SudachiDict_core-20200330.tar.gz
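
After both steps, it can be useful to confirm that the packages are visible to the interpreter you intend to use. The following is a minimal sketch, not part of SudachiPy itself. It assumes the two packages install importable modules named sudachipy and sudachidict_core; the helper name is_module_available is made up for illustration.

```python
import importlib.util


def is_module_available(module_name: str) -> bool:
    """Return True if `module_name` can be found by the import system."""
    return importlib.util.find_spec(module_name) is not None


# Assumption: the pip packages expose modules with these lowercase names.
for name in ("sudachipy", "sudachidict_core"):
    status = "found" if is_module_available(name) else "missing"
    print(f"{name}: {status}")
```

If either module is reported as missing, re-run the corresponding pip install step in the same environment.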

Usage

As a command

After installing SudachiPy, you can also use it from the terminal via the sudachipy command.

To run sudachipy on standard input, execute it without arguments:

$ sudachipy

sudachipy has four subcommands: tokenize (the default), link, build, and ubuild.

$ sudachipy tokenize -h
usage: sudachipy tokenize [-h] [-r file] [-m {A,B,C}] [-o file] [-a] [-d] [-v]
                          [file [file ...]]

Tokenize Text

positional arguments:
  file           text written in utf-8

optional arguments:
  -h, --help     show this help message and exit
  -r file        the setting file in JSON format
  -m {A,B,C}     the mode of splitting
  -o file        the output file
  -a             print all of the fields
  -d             print the debug information
  -v, --version  print sudachipy version
$ sudachipy link -h
usage: sudachipy link [-h] [-t {small,core,full}] [-u]

Link Default Dict Package

optional arguments:
  -h, --help            show this help message and exit
  -t {small,core,full}  dict dict
  -u                    unlink sudachidict
$ sudachipy build -h
usage: sudachipy build [-h] [-o file] [-d string] -m file file [file ...]

Build Sudachi Dictionary

positional arguments:
  file        source files with CSV format (one of more)

optional arguments:
  -h, --help  show this help message and exit
  -o file     output file (default: system.dic)
  -d string   description comment to be embedded on dictionary

required named arguments:
  -m file     connection matrix file with MeCab's matrix.def format

WARNING: ubuild in v0.3.* contains a bug.

$ sudachipy ubuild -h
usage: sudachipy ubuild [-h] [-d string] [-o file] [-s file] file [file ...]

Build User Dictionary

positional arguments:
  file        source files with CSV format (one or more)

optional arguments:
  -h, --help  show this help message and exit
  -d string   description comment to be embedded on dictionary
  -o file     output file (default: user.dic)
  -s file     system dictionary (default: linked system_dic, see link -h)
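
When driving the command from a script, it can help to assemble the argument list in one place. The sketch below builds a sudachipy tokenize invocation from the options shown in the tokenize help text above (-r, -m, -o, -a). The function name build_tokenize_command and the file name input.txt are hypothetical, chosen only for this example.

```python
from typing import List, Optional, Sequence


def build_tokenize_command(
    files: Sequence[str] = (),
    mode: Optional[str] = None,
    settings: Optional[str] = None,
    output: Optional[str] = None,
    all_fields: bool = False,
) -> List[str]:
    """Build an argument list for `sudachipy tokenize` (hypothetical helper)."""
    if mode is not None and mode not in ("A", "B", "C"):
        raise ValueError("mode must be one of 'A', 'B', 'C'")
    cmd = ["sudachipy", "tokenize"]
    if settings:
        cmd += ["-r", settings]  # the setting file in JSON format
    if mode:
        cmd += ["-m", mode]  # the mode of splitting
    if output:
        cmd += ["-o", output]  # the output file
    if all_fields:
        cmd.append("-a")  # print all of the fields
    cmd.extend(files)  # input files (text written in UTF-8)
    return cmd


print(build_tokenize_command(["input.txt"], mode="A", all_fields=True))
```

The resulting list can be passed to subprocess.run(cmd, check=True) in an environment where sudachipy is installed.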

As a Python package

Here is an example:

from sudachipy import tokenizer
from sudachipy import dictionary


tokenizer_obj = dictionary.Dictionary().create()


# Multi-granular tokenization
# using `system_core.dic` or `system_full.dic` version 20190718
# you may not be able to reproduce this exact output, depending on the dictionary you use

mode = tokenizer.Tokenizer.SplitMode.C
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
# => ['国家公務員']

mode = tokenizer.Tokenizer.SplitMode.B
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
# => ['国家', '公務員']

mode = tokenizer.Tokenizer.SplitMode.A
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
# => ['国家', '公務', '員']


# Morpheme information

m = tokenizer_obj.tokenize("食べ", mode)[0]

m.surface() # => '食べ'
m.dictionary_form() # => '食べる'
m.reading_form() # => 'タベ'
m.part_of_speech() # => ['動詞', '一般', '*', '*', '下一段-バ行', '連用形-一般']


# Normalization

tokenizer_obj.tokenize("附属", mode)[0].normalized_form()
# => '付属'
tokenizer_obj.tokenize("SUMMER", mode)[0].normalized_form()
# => 'サマー'
tokenizer_obj.tokenize("シュミレーション", mode)[0].normalized_form()
# => 'シミュレーション'
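
To compare the three split modes side by side, the calls above can be wrapped in a small helper. This is only a sketch: surfaces_by_mode and demo are hypothetical names, and the helper relies only on the tokenize(text, mode) and surface() calls shown in the example. The demo imports SudachiPy lazily, so it should be called only where SudachiPy and a dictionary are installed.

```python
from typing import Any, Dict, List


def surfaces_by_mode(
    tokenizer_obj: Any, text: str, modes: Dict[str, Any]
) -> Dict[str, List[str]]:
    """Tokenize `text` once per mode and collect the surface forms."""
    return {
        name: [m.surface() for m in tokenizer_obj.tokenize(text, mode)]
        for name, mode in modes.items()
    }


def demo() -> None:
    """Print the segmentation of one word under split modes A, B, and C."""
    # Imported here so the helper above stays usable without SudachiPy.
    from sudachipy import dictionary, tokenizer

    split = tokenizer.Tokenizer.SplitMode
    tokenizer_obj = dictionary.Dictionary().create()
    modes = {"A": split.A, "B": split.B, "C": split.C}
    for name, surfaces in surfaces_by_mode(tokenizer_obj, "国家公務員", modes).items():
        print(name, surfaces)
```

As noted in the example, the exact segmentation depends on the dictionary in use.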

Install dict packages

You can download the built dictionaries from the "Python packages" section of WorksApplications/SudachiDict and install them with pip.

$ pip install SudachiDict_full-20190718.tar.gz

You can change the default dict package with the link command.

$ sudachipy link -t full

You can also remove the default dict setting.

$ sudachipy link -u

Customized dictionary

If you need to use a customized system.dic, place sudachi.json anywhere you like and overwrite its systemDict value with the relative path from sudachi.json to your system.dic.

{
    "systemDict" : "relative/path/to/system.dic",
    ...
}

Then specify sudachi.json with the -r option.

$ sudachipy -r path/to/sudachi.json
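
Because systemDict must be relative to the location of sudachi.json, computing the path by hand is easy to get wrong. The following sketch updates the systemDict key in a settings file while leaving any other keys (elided as ... above) untouched. The function name set_system_dict and the directory layout in the demo are assumptions for illustration; the demo only writes to a temporary directory.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict


def set_system_dict(settings_path: Path, system_dic: Path) -> Dict[str, Any]:
    """Set `systemDict` to the path of `system_dic` relative to `settings_path`."""
    settings_path = Path(settings_path)
    settings: Dict[str, Any] = {}
    if settings_path.exists():
        # Keep whatever other settings the file already contains.
        settings = json.loads(settings_path.read_text(encoding="utf-8"))
    relative = os.path.relpath(Path(system_dic), start=settings_path.parent)
    settings["systemDict"] = Path(relative).as_posix()
    settings_path.parent.mkdir(parents=True, exist_ok=True)
    settings_path.write_text(
        json.dumps(settings, indent=4, ensure_ascii=False), encoding="utf-8"
    )
    return settings


# Demo with a hypothetical layout: <tmp>/config/sudachi.json and <tmp>/dict/system.dic
with tempfile.TemporaryDirectory() as tmp:
    settings_file = Path(tmp) / "config" / "sudachi.json"
    updated = set_system_dict(settings_file, Path(tmp) / "dict" / "system.dic")
    print(updated["systemDict"])
```

Note that a settings file containing only systemDict may not be sufficient on its own; the sketch is meant for editing an existing sudachi.json.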

Eventually, we would like to provide a way to fetch these resources from code, as NLTK (e.g., import nltk; nltk.download()) and spaCy (e.g., $ python -m spacy download en) do.

User-defined dictionary

If you need to use a customized user dictionary (user.dic), place sudachi.json anywhere you like and add a userDict value containing the relative path from sudachi.json to your user.dic.

{
    "userDict" : ["relative/path/to/user.dic"],
    ...
}
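
Since userDict is a list, adding an entry amounts to appending to that list without disturbing the other settings. A minimal sketch follows; add_user_dict is a hypothetical helper operating on an already-loaded settings dictionary, and the paths are the placeholders used above.

```python
import json
from typing import Any, Dict


def add_user_dict(settings: Dict[str, Any], relative_path: str) -> Dict[str, Any]:
    """Return a copy of `settings` with `relative_path` listed under `userDict`."""
    updated = dict(settings)
    user_dicts = list(updated.get("userDict", []))
    if relative_path not in user_dicts:  # avoid duplicate entries
        user_dicts.append(relative_path)
    updated["userDict"] = user_dicts
    return updated


settings = {"systemDict": "relative/path/to/system.dic"}
settings = add_user_dict(settings, "relative/path/to/user.dic")
print(json.dumps(settings, indent=4))
```

The helper returns a new dictionary rather than mutating its argument, so the original settings remain available if the change needs to be discarded.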

You can also build a user dictionary with the ubuild subcommand.

For the file format, see here (written in Japanese; an English document is not available yet).

For developers

Code format

You can use ./scripts/format.sh to check that your code follows the style rules. flake8, flake8-import-order, and flake8-builtins are required; see requirements.txt.

Test

You can use ./scripts/test.sh to check that your changes do not cause regressions.

Contact

We have a Slack workspace for developers and users to ask questions and discuss a variety of topics.
