Python version of Sudachi, the Japanese Morphological Analyzer

Project description

SudachiPy

SudachiPy is a Python version of Sudachi, a Japanese morphological analyzer.

Sudachi and SudachiPy are developed at the WAP Tokushima Laboratory of AI and NLP, an institute under Works Applications that focuses on Natural Language Processing (NLP).

Warning: SudachiPy is still under development, and some features are not yet complete. Please use it at your own risk.

Breaking changes

v0.3.0

The resources/ directory was moved to sudachipy/.

v0.2.2

The SudachiPy package is distributed via PyPI.
- pip install SudachiPy

v0.2.0

A user dictionary feature was added.

Easy Setup

SudachiPy requires Python 3.5+.

Step 1: Install SudachiPy

SudachiPy is distributed via PyPI. Install it from the command line with pip:

$ pip install SudachiPy

By default, SudachiPy (>= v0.3.0) uses the system.dic from the SudachiDict_core package, which is not included in SudachiPy. Proceed to Step 2 to install the dictionary package.

Step 2: Install SudachiDict_core

The default dictionary package, SudachiDict_core, is distributed from our download site. Install it with pip as follows:

$ pip install https://object-storage.tyo2.conoha.io/v1/nc_2520839e1f9641b08211a5c85243124a/sudachi/SudachiDict_core-20190718.tar.gz
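After both steps, it can be handy to confirm from Python that the two packages are visible to the interpreter. The following is a minimal sketch, not part of SudachiPy: `missing_modules` is a hypothetical helper, and the import names `sudachipy` and `sudachidict_core` are assumptions about how the installed packages are named.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


if __name__ == "__main__":
    # Assumed import names: "sudachipy" for SudachiPy and
    # "sudachidict_core" for the SudachiDict_core package.
    missing = missing_modules(["sudachipy", "sudachidict_core"])
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("SudachiPy and the default dictionary package are importable.")
```

If the dictionary package is reported as missing, repeat Step 2 in the same environment that runs this check.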

Usage

As a command

After installing SudachiPy, you can also use it from the terminal via the sudachipy command.

You can run sudachipy on standard input as follows:

$ sudachipy

sudachipy has four subcommands; tokenize is the default.

$ sudachipy tokenize -h
usage: sudachipy tokenize [-h] [-r file] [-m {A,B,C}] [-o file] [-a] [-d] [-v]
                          [file [file ...]]

Tokenize Text

positional arguments:
  file           text written in utf-8

optional arguments:
  -h, --help     show this help message and exit
  -r file        the setting file in JSON format
  -m {A,B,C}     the mode of splitting
  -o file        the output file
  -a             print all of the fields
  -d             print the debug information
  -v, --version  print sudachipy version

$ sudachipy link -h
usage: sudachipy link [-h] [-t {small,core,full}] [-u]

Link Default Dict Package

optional arguments:
  -h, --help            show this help message and exit
  -t {small,core,full}  dict dict
  -u                    unlink sudachidict

$ sudachipy build -h
usage: sudachipy build [-h] [-o file] [-d string] -m file file [file ...]

Build Sudachi Dictionary

positional arguments:
  file        source files with CSV format (one of more)

optional arguments:
  -h, --help  show this help message and exit
  -o file     output file (default: system.dic)
  -d string   description comment to be embedded on dictionary

required named arguments:
  -m file     connection matrix file with MeCab's matrix.def format

$ sudachipy ubuild -h
usage: sudachipy ubuild [-h] [-d string] [-o file] [-s file] file [file ...]

Build User Dictionary

positional arguments:
  file        source files with CSV format (one or more)

optional arguments:
  -h, --help  show this help message and exit
  -d string   description comment to be embedded on dictionary
  -o file     output file (default: user.dic)
  -s file     system dictionary (default: ${SUDACHIPY}/resouces/system.dic)
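The tokenize subcommand can also be driven from a script. The sketch below builds an argument list from the -m, -a, and -o options listed in the help text above and pipes text to the command on standard input. `build_tokenize_command` and `run_tokenize` are hypothetical helper names, and the sketch always passes -m explicitly rather than assuming what the command's default mode is.

```python
import subprocess


def build_tokenize_command(mode="C", all_fields=False, output_file=None):
    """Build an argument list for `sudachipy tokenize`.

    Only the -m, -a and -o options from the help text are used.
    """
    if mode not in ("A", "B", "C"):
        raise ValueError("mode must be 'A', 'B', or 'C'")
    command = ["sudachipy", "tokenize", "-m", mode]
    if all_fields:
        command.append("-a")  # print all of the fields
    if output_file is not None:
        command.extend(["-o", output_file])  # write the result to a file
    return command


def run_tokenize(text, **options):
    """Send text to sudachipy on standard input and return its standard output."""
    result = subprocess.run(
        build_tokenize_command(**options),
        input=text.encode("utf-8"),  # the command expects UTF-8 text
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout.decode("utf-8")
```

With SudachiPy and a dictionary installed, `run_tokenize("医薬品安全管理責任者", mode="A")` would return whatever the command prints for split mode A.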

As a Python package

Here is an example usage:

from sudachipy import tokenizer
from sudachipy import dictionary


tokenizer_obj = dictionary.Dictionary().create()


# Multi-granular tokenization
# (the following results were obtained with `system_full.dic`;
# you may not be able to reproduce this particular example with `system_core.dic`)

mode = tokenizer.Tokenizer.SplitMode.C
[m.surface() for m in tokenizer_obj.tokenize("医薬品安全管理責任者", mode)]
# => ['医薬品安全管理責任者']

mode = tokenizer.Tokenizer.SplitMode.B
[m.surface() for m in tokenizer_obj.tokenize("医薬品安全管理責任者", mode)]
# => ['医薬品', '安全', '管理', '責任者']

mode = tokenizer.Tokenizer.SplitMode.A
[m.surface() for m in tokenizer_obj.tokenize("医薬品安全管理責任者", mode)]
# => ['医薬', '品', '安全', '管理', '責任', '者']


# Morpheme information

m = tokenizer_obj.tokenize("食べ", mode)[0]

m.surface() # => '食べ'
m.dictionary_form() # => '食べる'
m.reading_form() # => 'タベ'
m.part_of_speech() # => ['動詞', '一般', '*', '*', '下一段-バ行', '連用形-一般']


# Normalization

tokenizer_obj.tokenize("附属", mode)[0].normalized_form()
# => '付属'
tokenizer_obj.tokenize("SUMMER", mode)[0].normalized_form()
# => 'サマー'
tokenizer_obj.tokenize("シュミレーション", mode)[0].normalized_form()
# => 'シミュレーション'
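The calls above can be wrapped in small helpers when the same fields are needed for every morpheme. This is a sketch under stated assumptions: `describe_morphemes` and `split_in_all_modes` are hypothetical names, and they rely only on the tokenize() method and the morpheme accessors already shown in the example.

```python
def describe_morphemes(tokenizer_obj, text, mode):
    """Tokenize text and collect the fields shown above for each morpheme."""
    return [
        {
            "surface": m.surface(),
            "dictionary_form": m.dictionary_form(),
            "reading_form": m.reading_form(),
            "normalized_form": m.normalized_form(),
            "part_of_speech": m.part_of_speech(),
        }
        for m in tokenizer_obj.tokenize(text, mode)
    ]


def split_in_all_modes(tokenizer_obj, text, modes):
    """Return the surface forms per split mode, keyed by a caller-chosen label."""
    return {
        label: [m.surface() for m in tokenizer_obj.tokenize(text, mode)]
        for label, mode in modes.items()
    }


# Usage with the objects from the example above (requires SudachiPy and a dictionary):
#
#   from sudachipy import dictionary, tokenizer
#   tokenizer_obj = dictionary.Dictionary().create()
#   modes = {
#       "A": tokenizer.Tokenizer.SplitMode.A,
#       "B": tokenizer.Tokenizer.SplitMode.B,
#       "C": tokenizer.Tokenizer.SplitMode.C,
#   }
#   split_in_all_modes(tokenizer_obj, "医薬品安全管理責任者", modes)
```

Passing the tokenizer in as an argument keeps the helpers independent of which dictionary package is linked.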

Install dict packages

You can download and install prebuilt dictionaries from the Python packages section of WorksApplications/SudachiDict.

$ pip install SudachiDict_full-20190531.tar.gz

You can change the default dict package with the link command.

$ sudachipy link -t full

You can also remove the default dict setting.

$ sudachipy link -u

Customized dictionary

If you need to use a customized system.dic, place sudachi.json anywhere you like and set its systemDict value to the relative path from sudachi.json to your system.dic.

{
    "systemDict" : "relative/path/to/system.dic",
    ...
}

Then specify sudachi.json with the -r option.

$ sudachipy -r path/to/sudachi.json
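Computing the relative path by hand is easy to get wrong, so it may help to generate the settings file. The sketch below writes sudachi.json with systemDict expressed relative to the directory that holds the settings file. `write_settings` and its `base_settings` parameter are hypothetical; `base_settings` stands in for the other settings elided as "..." above, which this sketch copies through unchanged.

```python
import json
import os


def write_settings(settings_path, system_dic_path, base_settings=None):
    """Write sudachi.json with systemDict relative to the settings file."""
    # Copy the other settings so the caller's dict is not modified
    settings = dict(base_settings or {})
    settings_dir = os.path.dirname(os.path.abspath(settings_path))
    # Path from the directory containing sudachi.json to the dictionary
    settings["systemDict"] = os.path.relpath(
        os.path.abspath(system_dic_path), settings_dir
    )
    with open(settings_path, "w", encoding="utf-8") as f:
        json.dump(settings, f, ensure_ascii=False, indent=4)
    return settings
```

For example, `write_settings("conf/sudachi.json", "dicts/system.dic")` would store `../dicts/system.dic` on a POSIX system, provided the conf directory already exists.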

Eventually, we would like to provide a way to fetch these resources from code, as in NLTK (e.g., import nltk; nltk.download()) or spaCy (e.g., $ python -m spacy download en).

For developers

Code format

Run ./scripts/format.sh to check that your code follows the style rules. flake8, flake8-import-order, and flake8-builtins are required; see requirements.txt.

Test

Run ./scripts/test.sh to check that your change does not cause a regression.

Project details

Release history

0.6.10

Jan 10, 2025

0.6.9

Nov 20, 2024

0.6.8

Dec 14, 2023

0.6.7

Feb 15, 2023

0.6.6

Jul 25, 2022

0.6.5

Jun 17, 2022

0.6.4

Jun 16, 2022

0.6.3

Feb 10, 2022

0.6.2

Dec 9, 2021

0.6.1

Dec 8, 2021

0.6.0

Nov 11, 2021

0.6.0rc1 pre-release

Oct 26, 2021

0.5.4

Sep 27, 2021

0.5.3

Sep 10, 2021

0.5.2

Mar 26, 2021

0.5.1

Jan 4, 2021

0.5.0

Dec 25, 2020

0.4.9

Jun 19, 2020

0.4.8

Jun 18, 2020

0.4.7

Jun 15, 2020

0.4.6

Jun 11, 2020

0.4.5

Jun 2, 2020

0.4.4

Apr 30, 2020

0.4.3

Feb 26, 2020

0.4.2

Dec 6, 2019

0.4.1

Nov 26, 2019

0.4.0

Sep 7, 2019

0.3.13

Aug 31, 2019

0.3.12

Aug 27, 2019

0.3.11

Aug 8, 2019

0.3.10

Aug 7, 2019

0.3.9

Aug 1, 2019

0.3.8

Aug 1, 2019

This version

0.3.6

Jul 20, 2019

0.3.5

Jul 18, 2019

0.3.4

Jul 17, 2019

0.3.3

Jul 9, 2019

0.3.2

Jul 7, 2019

0.3.1

Jul 7, 2019

0.2.1

Jul 5, 2019

0.2.0.1

Jul 5, 2019

Download files

Source Distribution

SudachiPy-0.3.6.tar.gz (46.9 kB)

Uploaded Jul 20, 2019 Source

Built Distribution

SudachiPy-0.3.6-py3-none-any.whl (62.5 kB)

Uploaded Jul 20, 2019 Python 3

File details

Details for the file SudachiPy-0.3.6.tar.gz.

File metadata

Download URL: SudachiPy-0.3.6.tar.gz
Upload date: Jul 20, 2019
Size: 46.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.6.7

File hashes

Hashes for SudachiPy-0.3.6.tar.gz
Algorithm	Hash digest
SHA256	`5220f85607b46fbea23f1df14d058d33160eba71ce11045b0cbbe818d026f9e7`
MD5	`e0a44aee08abf4781b39319d5961d48b`
BLAKE2b-256	`1ac8aa3be6de5d4d6ba4329cde1c629c54a477697bae8deb660d6e475cae908a`

File details

Details for the file SudachiPy-0.3.6-py3-none-any.whl.

File metadata

Download URL: SudachiPy-0.3.6-py3-none-any.whl
Upload date: Jul 20, 2019
Size: 62.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.6.7

File hashes

Hashes for SudachiPy-0.3.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`115416ad8ce14c302d764c2e1777eec1cb14c2b3776ad2ec1c430ab2fa9ef114`
MD5	`368ce3e4efee6a5a90f0bee83ca66d86`
BLAKE2b-256	`f9a68eb0fdc3ebee66b9add4de41914793132d9cab455ebc5e43645e099d0c11`
