
A Japanese tokenizer based on recurrent neural networks




Nagisa is a Python module for Japanese word segmentation and POS-tagging. It is designed to be a simple and easy-to-use tool.

This tool has the following features.
- Based on recurrent neural networks.
- The word segmentation model uses character- and word-level features [池田+].
- The POS-tagging model uses tag dictionary information [Inoue+].

For more details, refer to the following links.
- The slides (in Japanese) are available here.
- The documentation is available here.


Python 2.7.x or 3.5+ is required. This tool uses DyNet (the Dynamic Neural Network Toolkit) to compute its neural networks. You can install nagisa with the following command.

pip install nagisa

If you use nagisa on Windows, please run it with Python 3.5+.
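
The version requirement above can be checked before installing. The following is a minimal sketch, not part of nagisa: `is_supported` is a hypothetical helper that only encodes the two rules stated in this section (2.7.x or 3.5+ in general, 3.5+ on Windows).

```python
import sys


def is_supported(version, on_windows=False):
    """Return True if a (major, minor, ...) Python version meets the stated requirement.

    Hypothetical helper for illustration only; it is not part of nagisa.
    """
    major, minor = version[0], version[1]
    if major == 3:
        return minor >= 5
    if major == 2:
        # Python 2.7.x is accepted, except on Windows where 3.5+ is required.
        return minor == 7 and not on_windows
    return False


if __name__ == "__main__":
    ok = is_supported(sys.version_info, on_windows=sys.platform.startswith("win"))
    print("supported" if ok else "unsupported")
```

Running the file directly reports whether the current interpreter falls inside the range described above.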


Basic usage.

import nagisa

# Sample of word segmentation and POS-tagging for Japanese
text = 'Pythonで簡単に使えるツールです'
words = nagisa.tagging(text)
print(words)
#=> Python/名詞 で/助詞 簡単/形状詞 に/助動詞 使える/動詞 ツール/名詞 です/助動詞

# Get a list of words
print(words.words)
#=> ['Python', 'で', '簡単', 'に', '使える', 'ツール', 'です']

# Get a list of POS-tags
print(words.postags)
#=> ['名詞', '助詞', '形状詞', '助動詞', '動詞', '名詞', '助動詞']

# The nagisa.wakati method is faster than the nagisa.tagging method.
words = nagisa.wakati(text)
print(words)
#=> ['Python', 'で', '簡単', 'に', '使える', 'ツール', 'です']
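
The outputs above are related: the printed tagging result is each word joined to its POS-tag with a slash. The sketch below illustrates that relationship using the sample output as literal data, so it runs without nagisa installed. `format_tagged` is a hypothetical helper, not a nagisa function.

```python
# Sample output copied from the example above (no nagisa call is made here).
words = ['Python', 'で', '簡単', 'に', '使える', 'ツール', 'です']
postags = ['名詞', '助詞', '形状詞', '助動詞', '動詞', '名詞', '助動詞']


def format_tagged(words, postags):
    """Join parallel word and tag lists into 'word/tag' tokens separated by spaces."""
    if len(words) != len(postags):
        raise ValueError("words and postags must have the same length")
    return ' '.join('{}/{}'.format(w, t) for w, t in zip(words, postags))


print(format_tagged(words, postags))
```

Keeping the two lists parallel makes it easy to zip them back together whenever per-token pairs are needed.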

Post processing functions.

# Extracting all nouns from a text
words = nagisa.extract(text, extract_postags=['名詞'])
print(words)
#=> Python/名詞 ツール/名詞

# Filtering specific POS-tags from a text
words = nagisa.filter(text, filter_postags=['助詞', '助動詞'])
print(words)
#=> Python/名詞 簡単/形状詞 使える/動詞 ツール/名詞

# A list of available POS-tags
print(nagisa.tagger.postags)
#=> ['補助記号', '名詞', ... , 'URL']

# A word can be forced to be recognized as a single word.
text = 'ニューラルネットワークを使ってます。'
print(nagisa.tagging(text))
#=> ニューラル/名詞 ネットワーク/名詞 を/助詞 使っ/動詞 て/助動詞 ます/助動詞 。/補助記号

# If a word is included in the single_word_list, it is recognized as a single word.
tagger_nn = nagisa.Tagger(single_word_list=['ニューラルネットワーク'])
print(tagger_nn.tagging(text))
#=> ニューラルネットワーク/名詞 を/助詞 使っ/動詞 て/助動詞 ます/助動詞 。/補助記号

# Nagisa is good at capturing URLs and kaomoji in an input text.
url = 'でコードを公開中(๑¯ω¯๑)'
words = nagisa.tagging(url)
print(words)
#=> で/助詞 コード/名詞 を/助詞 公開/名詞 中/接尾辞 (๑ ̄ω ̄๑)/補助記号
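
The effect of `nagisa.extract` and `nagisa.filter` can be pictured as keeping or dropping (word, tag) pairs by POS-tag. The following is a plain-Python sketch of that idea, not nagisa's implementation. The helper names `keep_postags` and `drop_postags` are hypothetical, and the data is copied from the sample output above so the code runs without nagisa.

```python
# Tagged sample copied from the basic usage example (no nagisa call is made here).
words = ['Python', 'で', '簡単', 'に', '使える', 'ツール', 'です']
postags = ['名詞', '助詞', '形状詞', '助動詞', '動詞', '名詞', '助動詞']
pairs = list(zip(words, postags))


def keep_postags(pairs, wanted):
    """Keep only the pairs whose tag is in `wanted` (the idea behind extract)."""
    wanted = set(wanted)
    return [(w, t) for w, t in pairs if t in wanted]


def drop_postags(pairs, unwanted):
    """Drop the pairs whose tag is in `unwanted` (the idea behind filter)."""
    unwanted = set(unwanted)
    return [(w, t) for w, t in pairs if t not in unwanted]


nouns = keep_postags(pairs, ['名詞'])
content = drop_postags(pairs, ['助詞', '助動詞'])
print(' '.join('{}/{}'.format(w, t) for w, t in nouns))
print(' '.join('{}/{}'.format(w, t) for w, t in content))
```

The two helpers are complements: one whitelists tags, the other blacklists them, which mirrors the `extract_postags` and `filter_postags` arguments shown above.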


Files for nagisa, version 0.1.1: nagisa-0.1.1.tar.gz (20.8 MB), source distribution.
