Python version of Sudachi, the Japanese Morphological Analyzer

SudachiPy

SudachiPy is a Python version of Sudachi, a Japanese morphological analyzer.

Sudachi and SudachiPy are developed at the WAP Tokushima Laboratory of AI and NLP, an institute under Works Applications that focuses on natural language processing (NLP).

Warning: SudachiPy is still under development, and some functions are not yet complete. Use it at your own risk.

Setup

SudachiPy requires Python 3.5+.

SudachiPy is not yet registered on PyPI, so you cannot install it by package name with pip at the moment. Install it from the GitHub repository instead:

$ pip install -e git+git://github.com/WorksApplications/SudachiPy@develop#egg=SudachiPy

The dictionary file is not included in the repository. You can get a prebuilt dictionary from the Releases page of WorksApplications/Sudachi. Download either sudachi-x.y.z-dictionary-core.zip or sudachi-x.y.z-dictionary-full.zip, unzip it, rename the dictionary file to system.dic, and place it under SudachiPy/resources/. Eventually, we would like to provide a way to fetch these resources from code, as NLTK (e.g., import nltk; nltk.download()) and spaCy (e.g., $ python -m spacy download en) do.
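
The manual dictionary steps above (unzip, rename, place under the resources directory) could be scripted roughly as follows. This is only a sketch: the function name install_system_dictionary is hypothetical, and it assumes the downloaded archive contains a single file ending in .dic, which is not confirmed here. Adjust the paths to your own checkout.

```python
import shutil
import zipfile
from pathlib import Path


def install_system_dictionary(zip_path, resources_dir):
    """Extract the first *.dic entry from zip_path and save it as system.dic.

    Hypothetical helper: not part of SudachiPy itself.
    """
    zip_path = Path(zip_path)
    resources_dir = Path(resources_dir)
    resources_dir.mkdir(parents=True, exist_ok=True)
    target = resources_dir / "system.dic"

    with zipfile.ZipFile(str(zip_path)) as archive:
        # Assumption: the archive holds one dictionary file ending in ".dic".
        dic_names = [n for n in archive.namelist() if n.endswith(".dic")]
        if not dic_names:
            raise FileNotFoundError("no .dic file found in " + str(zip_path))
        # Copy the entry's bytes straight to resources_dir/system.dic,
        # which performs the "unzip and rename" step in one go.
        with archive.open(dic_names[0]) as src, open(str(target), "wb") as dst:
            shutil.copyfileobj(src, dst)
    return target
```

For example, install_system_dictionary("sudachi-x.y.z-dictionary-core.zip", "SudachiPy/resources") would leave the dictionary at SudachiPy/resources/system.dic, with x.y.z standing for whichever release you downloaded.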

Usage

As a command

After installing SudachiPy, you can also use it from the terminal via the sudachipy command. sudachipy has three subcommands: tokenize (the default), build, and ubuild.

$ sudachipy tokenize -h
usage: sudachipy tokenize [-h] [-r file] [-m {A,B,C}] [-o file] [-a] [-d]
                          file [file ...]

Tokenize Text

positional arguments:
  file        text written in utf-8

optional arguments:
  -h, --help  show this help message and exit
  -r file     the setting file in JSON format
  -m {A,B,C}  the mode of splitting
  -o file     the output file
  -a          print all of the fields
  -d          print the debug information
$ sudachipy build -h
usage: sudachipy build [-h] [-o file] [-d string] -m file file [file ...]

Build Sudachi Dictionary

positional arguments:
  file        source files with CSV format (one of more)

optional arguments:
  -h, --help  show this help message and exit
  -o file     output file (default: system.dic)
  -d string   description comment to be embedded on dictionary

required named arguments:
  -m file     connection matrix file with MeCab's matrix.def format
$ sudachipy ubuild -h
usage: sudachipy ubuild [-h] [-d string] [-o file] [-s file] file [file ...]

Build User Dictionary

positional arguments:
  file        source files with CSV format (one or more)

optional arguments:
  -h, --help  show this help message and exit
  -d string   description comment to be embedded on dictionary
  -o file     output file (default: user.dic)
  -s file     system dictionary (default: ${SUDACHIPY}/resouces/system.dic)
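
If you want to drive the tokenize subcommand from a script, one option is to assemble the argument list from the flags shown in the help text above and pass it to subprocess. The helper names below (build_tokenize_command, run_tokenize) are hypothetical and only illustrate the idea; running the command assumes sudachipy is on your PATH.

```python
import subprocess


def build_tokenize_command(files, mode=None, output=None, settings=None,
                           all_fields=False, debug=False):
    """Return an argv list for `sudachipy tokenize` (hypothetical helper)."""
    if not files:
        raise ValueError("at least one input file is required")
    command = ["sudachipy", "tokenize"]
    if settings is not None:
        command += ["-r", str(settings)]   # setting file in JSON format
    if mode is not None:
        if mode not in ("A", "B", "C"):
            raise ValueError("mode must be 'A', 'B', or 'C'")
        command += ["-m", mode]            # mode of splitting
    if output is not None:
        command += ["-o", str(output)]     # output file
    if all_fields:
        command.append("-a")               # print all of the fields
    if debug:
        command.append("-d")               # print the debug information
    command += [str(f) for f in files]     # UTF-8 text files to tokenize
    return command


def run_tokenize(files, **options):
    """Run the command; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(build_tokenize_command(files, **options), check=True)
```

For instance, build_tokenize_command(["input.txt"], mode="C", output="out.txt") yields the same arguments as typing sudachipy tokenize -m C -o out.txt input.txt.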

As a Python package

Here is an example of usage:

from sudachipy import tokenizer
from sudachipy import dictionary


tokenizer_obj = dictionary.Dictionary().create()


# Multi-granular tokenization
# (the following results were obtained with `system_full.dic`;
# you may not be able to reproduce this particular example with `system_core.dic`)

mode = tokenizer.Tokenizer.SplitMode.C
[m.surface() for m in tokenizer_obj.tokenize("医薬品安全管理責任者", mode)]
# => ['医薬品安全管理責任者']

mode = tokenizer.Tokenizer.SplitMode.B
[m.surface() for m in tokenizer_obj.tokenize("医薬品安全管理責任者", mode)]
# => ['医薬品', '安全', '管理', '責任者']

mode = tokenizer.Tokenizer.SplitMode.A
[m.surface() for m in tokenizer_obj.tokenize("医薬品安全管理責任者", mode)]
# => ['医薬', '品', '安全', '管理', '責任', '者']


# Morpheme information

m = tokenizer_obj.tokenize("食べ", mode)[0]

m.surface() # => '食べ'
m.dictionary_form() # => '食べる'
m.reading_form() # => 'タベ'
m.part_of_speech() # => ['動詞', '一般', '*', '*', '下一段-バ行', '連用形-一般']


# Normalization

tokenizer_obj.tokenize("附属", mode)[0].normalized_form()
# => '付属'
tokenizer_obj.tokenize("SUMMER", mode)[0].normalized_form()
# => 'サマー'
tokenizer_obj.tokenize("シュミレーション", mode)[0].normalized_form()
# => 'シミュレーション'
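
To compare the split modes side by side, or to collect a morpheme's fields in one place, small wrappers like the following may be convenient. They are a sketch, not part of SudachiPy: the names surfaces_by_mode and describe are hypothetical, and they rely only on the tokenize(text, mode) call and the morpheme methods shown in the example above.

```python
def surfaces_by_mode(tokenizer_obj, text, modes):
    """Tokenize text once per mode and return {label: [surface, ...]}.

    `modes` maps a label of your choice to a split mode object,
    e.g. {"A": tokenizer.Tokenizer.SplitMode.A}.
    """
    return {
        label: [m.surface() for m in tokenizer_obj.tokenize(text, mode)]
        for label, mode in modes.items()
    }


def describe(morpheme):
    """Gather the morpheme accessors used in the example into one dict."""
    return {
        "surface": morpheme.surface(),
        "dictionary_form": morpheme.dictionary_form(),
        "reading_form": morpheme.reading_form(),
        "normalized_form": morpheme.normalized_form(),
        "part_of_speech": morpheme.part_of_speech(),
    }
```

With the objects from the example, you would pass tokenizer_obj together with a dict such as {"A": tokenizer.Tokenizer.SplitMode.A, "B": tokenizer.Tokenizer.SplitMode.B, "C": tokenizer.Tokenizer.SplitMode.C}; the actual surfaces depend on which dictionary is installed.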

For developers

Code format

You can run ./scripts/format.sh to check whether your code conforms to the style rules. flake8, flake8-import-order, and flake8-builtins are required; see requirements.txt.

Test

You can run ./scripts/test.sh to check that your change does not cause a regression.
