A Japanese tokenizer based on recurrent neural networks

--------------

|Build Status| |Documentation Status| |PyPI|

| Nagisa is a Python module for Japanese word segmentation and POS tagging.
| It is designed to be a simple, easy-to-use tool.

This tool has the following features.

- Based on recurrent neural networks.
- The word segmentation model uses character- and word-level features
  `[池田+] <http://www.anlp.jp/proceedings/annual_meeting/2017/pdf_dir/B6-2.pdf>`__.
- The POS-tagging model uses tag dictionary information
  `[Inoue+] <http://www.aclweb.org/anthology/K17-1042>`__.

For more details, refer to the following links.

- The slides (in Japanese) are available
  `here <https://drive.google.com/open?id=1AzR5wh5502u_OI_Jxwsq24t-er_rnJBP>`__.
- The documentation is available
  `here <https://nagisa.readthedocs.io/en/latest/?badge=latest>`__.

Installation
============

| Python 2.7.x or 3.5+ is required.
| This tool uses `DyNet <https://github.com/clab/dynet>`__ (the Dynamic
  Neural Network Toolkit) to run its neural network computations.
| You can install nagisa with the following command.

.. code:: bash

    pip install nagisa
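
Before installing, it can help to confirm that the interpreter meets the version requirement stated above. The following is a minimal sketch, not part of nagisa; the helper name ``is_supported_python`` is hypothetical, and the check simply encodes "2.7.x or 3.5+" as written in this section.

```python
import sys


def is_supported_python(version_info):
    """Return True if the version matches "2.7.x or 3.5+" (hypothetical helper)."""
    major, minor = version_info[0], version_info[1]
    # Accept exactly 2.7.x, or any 3.x release from 3.5 onward.
    return (major == 2 and minor == 7) or (major == 3 and minor >= 5)


# Check the interpreter that is running this script.
print(is_supported_python(sys.version_info))
```

If the check prints ``False``, installation may fail or the module may not behave as documented.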

Usage
=====

Basic usage.

.. code:: python

    import nagisa

    # Sample of word segmentation and POS-tagging for Japanese
    text = 'Pythonで簡単に使えるツールです'
    words = nagisa.tagging(text)
    print(words)
    #=> Python/名詞 で/助詞 簡単/形状詞 に/助動詞 使える/動詞 ツール/名詞 です/助動詞

    # Get a list of words
    print(words.words)
    #=> ['Python', 'で', '簡単', 'に', '使える', 'ツール', 'です']

    # Get a list of POS-tags
    print(words.postags)
    #=> ['名詞', '助詞', '形状詞', '助動詞', '動詞', '名詞', '助動詞']

    # The nagisa.wakati method is faster than the nagisa.tagging method.
    words = nagisa.wakati(text)
    print(words)
    #=> ['Python', 'で', '簡単', 'に', '使える', 'ツール', 'です']
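
The ``words`` and ``postags`` lists shown above are parallel, so they can be combined into (word, tag) pairs for downstream processing. The sketch below assumes Python 3 and uses the sample output from this section as plain lists, so it runs without nagisa installed; ``pair_tokens`` and ``format_tagged`` are hypothetical helpers, not part of the nagisa API.

```python
# Sample output copied from the basic usage example above.
words = ['Python', 'で', '簡単', 'に', '使える', 'ツール', 'です']
postags = ['名詞', '助詞', '形状詞', '助動詞', '動詞', '名詞', '助動詞']


def pair_tokens(words, postags):
    """Zip parallel word and POS-tag lists into (word, tag) tuples."""
    if len(words) != len(postags):
        raise ValueError('words and postags must have the same length')
    return list(zip(words, postags))


def format_tagged(pairs):
    """Render pairs in the word/tag form used in the examples above."""
    return ' '.join('{}/{}'.format(word, tag) for word, tag in pairs)


pairs = pair_tokens(words, postags)
print(format_tagged(pairs))
```

With real input, the two lists would come from ``nagisa.tagging(text).words`` and ``nagisa.tagging(text).postags``.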

Post processing functions.

.. code:: python

    import nagisa

    # Same sample text as in the basic usage example
    text = 'Pythonで簡単に使えるツールです'

    # Extracting all nouns from a text
    words = nagisa.extract(text, extract_postags=['名詞'])
    print(words)
    #=> Python/名詞 ツール/名詞

    # Filtering specific POS-tags from a text
    words = nagisa.filter(text, filter_postags=['助詞', '助動詞'])
    print(words)
    #=> Python/名詞 簡単/形状詞 使える/動詞 ツール/名詞

    # A list of available POS-tags
    print(nagisa.tagger.postags)
    #=> ['補助記号', '名詞', ... , 'URL']

    # A string can be forced to be recognized as a single word.
    # By default, the compound below is split into two words.
    text = 'ニューラルネットワークを使ってます。'
    print(nagisa.tagging(text))
    #=> ニューラル/名詞 ネットワーク/名詞 を/助詞 使っ/動詞 て/助動詞 ます/助動詞 。/補助記号

    # If a word is included in the single_word_list, it is recognized as a single word.
    tagger_nn = nagisa.Tagger(single_word_list=['ニューラルネットワーク'])
    print(tagger_nn.tagging(text))
    #=> ニューラルネットワーク/名詞 を/助詞 使っ/動詞 て/助動詞 ます/助動詞 。/補助記号

    # Nagisa is good at capturing URLs and kaomoji in an input text.
    url = 'https://github.com/taishi-i/nagisaでコードを公開中(๑¯ω¯๑)'
    words = nagisa.tagging(url)
    print(words)
    #=> https://github.com/taishi-i/nagisa/URL で/助詞 コード/名詞 を/助詞 公開/名詞 中/接尾辞 (๑ ̄ω ̄๑)/補助記号
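
To make the effect of ``extract_postags`` and ``filter_postags`` concrete, the following sketch reproduces the same selection logic on plain lists. It is an illustration only and is not nagisa's implementation; it assumes Python 3, and the helper names ``extract_by_postags`` and ``filter_by_postags`` are hypothetical. The input lists are the tagging output shown in the basic usage example.

```python
# Tagging output for 'Pythonで簡単に使えるツールです', copied from the examples above.
words = ['Python', 'で', '簡単', 'に', '使える', 'ツール', 'です']
postags = ['名詞', '助詞', '形状詞', '助動詞', '動詞', '名詞', '助動詞']


def extract_by_postags(words, postags, extract_postags):
    """Keep only the (word, tag) pairs whose tag is in extract_postags."""
    wanted = set(extract_postags)
    return [(w, t) for w, t in zip(words, postags) if t in wanted]


def filter_by_postags(words, postags, filter_postags):
    """Drop the (word, tag) pairs whose tag is in filter_postags."""
    unwanted = set(filter_postags)
    return [(w, t) for w, t in zip(words, postags) if t not in unwanted]


nouns = extract_by_postags(words, postags, ['名詞'])
content = filter_by_postags(words, postags, ['助詞', '助動詞'])
print(nouns)
print(content)
```

The two helpers are complements of each other: extraction keeps the listed tags, while filtering removes them.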

.. |Build Status| image:: https://travis-ci.org/taishi-i/nagisa.svg?branch=master
   :target: https://travis-ci.org/taishi-i/nagisa
.. |Documentation Status| image:: https://readthedocs.org/projects/nagisa/badge/?version=latest
   :target: https://nagisa.readthedocs.io/en/latest/?badge=latest
.. |PyPI| image:: https://img.shields.io/pypi/v/nagisa.svg
   :target: https://pypi.python.org/pypi/nagisa
