A unified language analyzer for Japanese

KWJA: Kyoto-Waseda Japanese Analyzer

[Paper] [Slides]

KWJA is a Japanese language analyzer based on pre-trained language models. KWJA performs many language analysis tasks, including:

  • Typo correction
  • Tokenization
  • Word normalization
  • Morphological analysis
  • Named entity recognition
  • Word feature tagging
  • Dependency parsing
  • Predicate-argument structure (PAS) analysis
  • Bridging reference resolution
  • Coreference resolution
  • Discourse relation analysis

Requirements

Getting Started

Install KWJA with pip:

$ pip install kwja

Perform language analysis with the kwja command:

# Analyze a text
$ kwja --text "KWJAは日本語の統合解析ツールです。汎用言語モデルを利用し、様々な言語解析を統一的な方法で解いています。"

# Analyze a text file and write the result to a file
$ kwja --file path/to/file.txt > path/to/analyzed.knp

The output is in the KNP format, like the following:

# S-ID:202210010000-0-0 kwja:1.0.2
* 2D
+ 5D <rel type="=" target="ツール" sid="202210011918-0-0" id="5"/><体言><NE:ARTIFACT:KWJA>
KWJA KWJA KWJA 名詞 6 固有名詞 3 * 0 * 0 <基本句-主辞>
は は は 助詞 9 副助詞 2 * 0 * 0 "代表表記:は/は" <代表表記:は/は>
* 2D
+ 2D <体言>
日本 にほん 日本 名詞 6 地名 4 * 0 * 0 "代表表記:日本/にほん 地名:国" <代表表記:日本/にほん><地名:国><基本句-主辞>
+ 4D <体言><係:ノ格>
語 ご 語 名詞 6 普通名詞 1 * 0 * 0 "代表表記:語/ご 漢字読み:音 カテゴリ:抽象物" <代表表記:語/ご><漢字読み:音><カテゴリ:抽象物><基本句-主辞>
の の の 助詞 9 接続助詞 3 * 0 * 0 "代表表記:の/の" <代表表記:の/の>
...
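
The `kwja --file` invocation above handles one file at a time. As a rough sketch of how it could be scripted over a directory, the following uses only the `--file` flag shown above and Python's standard library. The helper names (`kwja_command`, `output_path_for`, `analyze_directory`) and the `.txt` to `.knp` naming scheme are assumptions made for illustration, not part of KWJA.

```python
import subprocess
from pathlib import Path
from typing import List


def kwja_command(input_path: Path) -> List[str]:
    """Build the `kwja --file <path>` invocation (hypothetical helper)."""
    return ["kwja", "--file", str(input_path)]


def output_path_for(input_path: Path, output_dir: Path) -> Path:
    """Map e.g. docs/a.txt to <output_dir>/a.knp (naming scheme is an assumption)."""
    return output_dir / (input_path.stem + ".knp")


def analyze_directory(input_dir: Path, output_dir: Path) -> List[Path]:
    """Run kwja once per .txt file and save its standard output as a .knp file."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for input_path in sorted(input_dir.glob("*.txt")):
        result = subprocess.run(
            kwja_command(input_path),
            capture_output=True,
            text=True,
            encoding="utf-8",
            check=True,  # raise CalledProcessError if kwja exits with an error
        )
        target = output_path_for(input_path, output_dir)
        target.write_text(result.stdout, encoding="utf-8")
        written.append(target)
    return written
```

Calling `analyze_directory(Path("texts"), Path("analyzed"))` would then write one `.knp` file per input file. Each call starts a fresh `kwja` process, so for many small files this is likely to be slow.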

You can read a KNP-format file with rhoknp:

from rhoknp import Document

# Assumes the file was saved as UTF-8; adjust the encoding if yours differs
with open("analyzed.knp", encoding="utf-8") as f:
    parsed_document = Document.from_knp(f.read())

For more details about the KNP format, see Reference.
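
To make the line types in the sample output concrete, here is a minimal standard-library sketch that separates morpheme lines from the rest. It assumes that lines starting with `#` are comments, that lines of the form `* 2D` and `+ 5D` mark phrases and base phrases, and that a morpheme line starts with the space-separated fields surface, reading, lemma, POS, POS ID, sub-POS, and so on. The names `Morpheme`, `parse_morpheme_line`, and `read_morphemes` are hypothetical. For real use, rhoknp is the better choice, since it also handles features, dependencies, and edge cases this sketch ignores.

```python
import re
from typing import List, NamedTuple

# Matches phrase ("* 2D") and base-phrase ("+ 5D <...>") lines: marker, head index, type
DEP_LINE = re.compile(r"^[*+] -?\d+[A-Z]")


class Morpheme(NamedTuple):
    surface: str
    reading: str
    lemma: str
    pos: str
    subpos: str


def parse_morpheme_line(line: str) -> Morpheme:
    """Split a morpheme line into its leading fields (assumed field order)."""
    fields = line.split(" ", 11)  # first 11 fields, then the untouched remainder
    return Morpheme(fields[0], fields[1], fields[2], fields[3], fields[5])


def read_morphemes(knp_text: str) -> List[Morpheme]:
    """Collect morphemes, skipping comment, phrase, base-phrase, and EOS lines."""
    morphemes = []
    for line in knp_text.splitlines():
        # Simplification: a morpheme whose surface form is "#" would be skipped here
        if not line or line == "EOS" or line.startswith("#") or DEP_LINE.match(line):
            continue
        morphemes.append(parse_morpheme_line(line))
    return morphemes


# A few lines taken from the sample output above
SAMPLE = """\
# S-ID:202210010000-0-0 kwja:1.0.2
* 2D
+ 2D <体言>
日本 にほん 日本 名詞 6 地名 4 * 0 * 0 "代表表記:日本/にほん 地名:国" <代表表記:日本/にほん><地名:国><基本句-主辞>
+ 4D <体言><係:ノ格>
語 ご 語 名詞 6 普通名詞 1 * 0 * 0 "代表表記:語/ご 漢字読み:音 カテゴリ:抽象物" <代表表記:語/ご><漢字読み:音><カテゴリ:抽象物><基本句-主辞>
"""

for m in read_morphemes(SAMPLE):
    print(m.surface, m.reading, m.pos, m.subpos)
```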

Usage from Python

Make sure the kwja command is on your PATH:

$ which kwja
/path/to/kwja

Install rhoknp:

$ pip install rhoknp

Perform language analysis with a KWJA instance:

from rhoknp import KWJA
kwja = KWJA()
analyzed_document = kwja.apply(
    "KWJAは日本語の統合解析ツールです。汎用言語モデルを利用し、様々な言語解析を統一的な方法で解いています。"
)
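
As a hedged sketch of what can be done with the returned document, the following lists each morpheme with its reading, lemma, and part of speech. It assumes the `sentences`, `morphemes`, `text`, `reading`, `lemma`, and `pos` attributes of rhoknp's document objects; check the rhoknp documentation for the version you have installed. The `morpheme_table` and `main` functions are hypothetical helpers, not part of KWJA or rhoknp.

```python
def morpheme_table(document):
    """Return one (surface, reading, lemma, pos) tuple per morpheme in the document."""
    return [
        (m.text, m.reading, m.lemma, m.pos)
        for sentence in document.sentences
        for m in sentence.morphemes
    ]


def main():
    # Imported here so that morpheme_table can be reused without rhoknp installed
    from rhoknp import KWJA

    kwja = KWJA()
    analyzed_document = kwja.apply("KWJAは日本語の統合解析ツールです。")
    for row in morpheme_table(analyzed_document):
        print("\t".join(row))
```

Calling `main()` requires both rhoknp and the kwja command to be installed.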

Citation

@InProceedings{植田2022,
  author    = {植田 暢大 and 大村 和正 and 児玉 貴志 and 清丸 寛一 and 村脇 有吾 and 河原 大輔 and 黒橋 禎夫},
  title     = {KWJA:汎用言語モデルに基づく日本語解析器},
  booktitle = {第253回自然言語処理研究会},
  year      = {2022},
  address   = {京都},
}

Reference
