A Japanese tokenizer based on recurrent neural networks
--------------
|Build Status| |Documentation Status| |PyPI|

| Nagisa is a Python module for Japanese word segmentation and POS-tagging.
| It is designed to be a simple and easy-to-use tool.

This tool has the following features.

- Based on recurrent neural networks.
- The word segmentation model uses character- and word-level features
  `[池田+] <http://www.anlp.jp/proceedings/annual_meeting/2017/pdf_dir/B6-2.pdf>`__.
- The POS-tagging model uses tag dictionary information
  `[Inoue+] <http://www.aclweb.org/anthology/K17-1042>`__.

For more details refer to the following links.

- The slide in Japanese is available
  `here <https://drive.google.com/open?id=1AzR5wh5502u_OI_Jxwsq24t-er_rnJBP>`__.
- The documentation is available
  `here <https://nagisa.readthedocs.io/en/latest/?badge=latest>`__.

Installation
============

| Python 2.7.x or 3.5+ is required.
| This tool uses `DyNet <https://github.com/clab/dynet>`__ (the Dynamic
  Neural Network Toolkit) to compute neural networks.
| You can install nagisa with the following command.

.. code:: bash

   pip install nagisa

Usage
=====

Basic usage.

.. code:: python

   import nagisa

   # Sample of word segmentation and POS-tagging for Japanese
   text = 'Pythonで簡単に使えるツールです'
   words = nagisa.tagging(text)
   print(words)
   #=> Python/名詞 で/助詞 簡単/形状詞 に/助動詞 使える/動詞 ツール/名詞 です/助動詞

   # Get a list of words
   print(words.words)
   #=> ['Python', 'で', '簡単', 'に', '使える', 'ツール', 'です']

   # Get a list of POS-tags
   print(words.postags)
   #=> ['名詞', '助詞', '形状詞', '助動詞', '動詞', '名詞', '助動詞']

   # The nagisa.wakati method is faster than the nagisa.tagging method.
   words = nagisa.wakati(text)
   print(words)
   #=> ['Python', 'で', '簡単', 'に', '使える', 'ツール', 'です']

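
The two lists returned by ``words.words`` and ``words.postags`` line up
position by position, so they can be paired with ``zip``. The sketch below
is only an illustration: it hard-codes the lists from the sample output
instead of calling nagisa, and ``pair_tokens`` is a hypothetical helper,
not part of the nagisa API.

.. code:: python

   # Lists copied from the sample output above; in practice they would
   # come from nagisa.tagging(text).words and nagisa.tagging(text).postags.
   words = ['Python', 'で', '簡単', 'に', '使える', 'ツール', 'です']
   postags = ['名詞', '助詞', '形状詞', '助動詞', '動詞', '名詞', '助動詞']


   def pair_tokens(words, postags):
       """Pair each word with its POS-tag (hypothetical helper)."""
       if len(words) != len(postags):
           raise ValueError('words and postags must have the same length')
       return list(zip(words, postags))


   pairs = pair_tokens(words, postags)

   # Keep only the words tagged as nouns (名詞).
   nouns = [word for word, tag in pairs if tag == '名詞']
   print(nouns)
   #=> ['Python', 'ツール']

Working on ``(word, tag)`` pairs keeps each word and its tag together, which
makes later filtering or counting steps harder to get out of sync.
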

Post-processing functions.

.. code:: python

   # Continues from the basic usage example (uses nagisa and text).

   # Extracting all nouns from a text
   words = nagisa.extract(text, extract_postags=['名詞'])
   print(words)
   #=> Python/名詞 ツール/名詞

   # Filtering specific POS-tags from a text
   words = nagisa.filter(text, filter_postags=['助詞', '助動詞'])
   print(words)
   #=> Python/名詞 簡単/形状詞 使える/動詞 ツール/名詞

   # A list of available POS-tags
   print(nagisa.tagger.postags)
   #=> ['補助記号', '名詞', ... , 'URL']

   # By default, a compound word may be split into several words.
   text = 'ニューラルネットワークを使ってます。'
   print(nagisa.tagging(text))
   #=> ニューラル/名詞 ネットワーク/名詞 を/助詞 使っ/動詞 て/助動詞 ます/助動詞 。/補助記号

   # A word listed in single_word_list is always recognized as a single word.
   tagger_nn = nagisa.Tagger(single_word_list=['ニューラルネットワーク'])
   print(tagger_nn.tagging(text))
   #=> ニューラルネットワーク/名詞 を/助詞 使っ/動詞 て/助動詞 ます/助動詞 。/補助記号

   # Nagisa is good at capturing URLs and kaomoji in an input text.
   url = 'https://github.com/taishi-i/nagisaでコードを公開中(๑¯ω¯๑)'
   words = nagisa.tagging(url)
   print(words)
   #=> https://github.com/taishi-i/nagisa/URL で/助詞 コード/名詞 を/助詞 公開/名詞 中/接尾辞 (๑ ̄ω ̄๑)/補助記号

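
To make the difference between ``nagisa.extract`` and ``nagisa.filter``
concrete, the sketch below reproduces their effect on an already tagged
sentence in plain Python. It is not nagisa's implementation: ``tokens`` is
copied from the sample output in the basic usage example, and
``keep_postags``, ``drop_postags`` and ``render`` are hypothetical helpers
introduced only for illustration.

.. code:: python

   # (word, POS-tag) pairs copied from the tagging output shown earlier.
   tokens = [
       ('Python', '名詞'), ('で', '助詞'), ('簡単', '形状詞'), ('に', '助動詞'),
       ('使える', '動詞'), ('ツール', '名詞'), ('です', '助動詞'),
   ]


   def keep_postags(tokens, postags):
       """Keep only tokens whose tag is listed (the effect of extract)."""
       wanted = set(postags)
       return [(word, tag) for word, tag in tokens if tag in wanted]


   def drop_postags(tokens, postags):
       """Remove tokens whose tag is listed (the effect of filter)."""
       unwanted = set(postags)
       return [(word, tag) for word, tag in tokens if tag not in unwanted]


   def render(tokens):
       """Format tokens in the word/tag style used in this document."""
       return ' '.join('{}/{}'.format(word, tag) for word, tag in tokens)


   print(render(keep_postags(tokens, ['名詞'])))
   #=> Python/名詞 ツール/名詞
   print(render(drop_postags(tokens, ['助詞', '助動詞'])))
   #=> Python/名詞 簡単/形状詞 使える/動詞 ツール/名詞

In short, ``extract`` is an allow-list over POS-tags and ``filter`` is a
deny-list; both preserve the original word order.
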

.. |Build Status| image:: https://travis-ci.org/taishi-i/nagisa.svg?branch=master
   :target: https://travis-ci.org/taishi-i/nagisa
.. |Documentation Status| image:: https://readthedocs.org/projects/nagisa/badge/?version=latest
   :target: https://nagisa.readthedocs.io/en/latest/?badge=latest
.. |PyPI| image:: https://img.shields.io/pypi/v/nagisa.svg
   :target: https://pypi.python.org/pypi/nagisa