A Japanese tokenizer based on recurrent neural networks
|Codacy Badge| |Build Status| |Build status| |Coverage Status|
|Documentation Status| |PyPI| |PyPI - Downloads|
Nagisa is a Python module for Japanese word segmentation and POS-tagging. It
is designed to be a simple and easy-to-use tool.
This tool has the following features.

- Based on recurrent neural networks.
- The word segmentation model uses character- and word-level features
  `[池田+] <http://www.anlp.jp/proceedings/annual_meeting/2017/pdf_dir/B6-2.pdf>`__.
- The POS-tagging model uses tag dictionary information
  `[Inoue+] <http://www.aclweb.org/anthology/K17-1042>`__.

For more details refer to the following links.

- The article in Japanese is available
  `here <https://qiita.com/taishi-i/items/5b9275a606b392f7f58e>`__.
- The documentation is available
  `here <https://nagisa.readthedocs.io/en/latest/?badge=latest>`__.

Installation
============
Python 2.7.x or 3.5+ is required. This tool uses
`DyNet <https://github.com/clab/dynet>`__ (the Dynamic Neural Network
Toolkit) to compute neural networks. You can install nagisa with
the following command.

.. code:: bash

    pip install nagisa

Windows users should run it with 64-bit Python 3.6+.

Basic usage
===========
Sample of word segmentation and POS-tagging for Japanese.

.. code:: python

    import nagisa

    text = 'Pythonで簡単に使えるツールです'
    words = nagisa.tagging(text)
    print(words)
    #=> Python/名詞 で/助詞 簡単/形状詞 に/助動詞 使える/動詞 ツール/名詞 です/助動詞

    # Get a list of words
    print(words.words)
    #=> ['Python', 'で', '簡単', 'に', '使える', 'ツール', 'です']

    # Get a list of POS-tags
    print(words.postags)
    #=> ['名詞', '助詞', '形状詞', '助動詞', '動詞', '名詞', '助動詞']

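
The word list and the POS-tag list shown above are aligned by position, so they can be paired with ``zip``. The following is a minimal sketch that copies the two lists from the sample output instead of calling nagisa, so it runs without the library installed; the variable names ``pairs`` and ``line`` are illustrative only.

.. code:: python

    # Lists copied from the sample output above (no nagisa call is made here).
    words = ['Python', 'で', '簡単', 'に', '使える', 'ツール', 'です']
    postags = ['名詞', '助詞', '形状詞', '助動詞', '動詞', '名詞', '助動詞']

    # Pair each word with the tag at the same position.
    pairs = list(zip(words, postags))

    # Join the pairs into the same "word/tag" layout as the sample output.
    line = ' '.join('{}/{}'.format(w, t) for w, t in pairs)
    print(line)

In real use, the two lists would come from ``words.words`` and ``words.postags`` of a tagging result.
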
Post-processing functions
=========================
Filter and extract words by specific POS tags.

.. code:: python

    # Filter out the words with the specified POS tags.
    words = nagisa.filter(text, filter_postags=['助詞', '助動詞'])
    print(words)
    #=> Python/名詞 簡単/形状詞 使える/動詞 ツール/名詞

    # Extract only nouns.
    words = nagisa.extract(text, extract_postags=['名詞'])
    print(words)
    #=> Python/名詞 ツール/名詞

    # This is a list of available POS-tags in nagisa.
    print(nagisa.tagger.postags)
    #=> ['補助記号', '名詞', ... , 'URL']

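
To make the difference between the two functions concrete, the sketch below mimics their observable behaviour on plain lists: filtering drops the listed tags, while extracting keeps only the listed tags. It is an illustration based on the sample outputs above, not nagisa's implementation; ``filter_pairs`` and ``extract_pairs`` are hypothetical helper names.

.. code:: python

    def filter_pairs(pairs, filter_postags):
        """Drop every (word, tag) pair whose tag is in filter_postags."""
        banned = set(filter_postags)
        return [(w, t) for w, t in pairs if t not in banned]


    def extract_pairs(pairs, extract_postags):
        """Keep only the (word, tag) pairs whose tag is in extract_postags."""
        wanted = set(extract_postags)
        return [(w, t) for w, t in pairs if t in wanted]


    # (word, tag) pairs copied from the basic usage sample output.
    pairs = [('Python', '名詞'), ('で', '助詞'), ('簡単', '形状詞'),
             ('に', '助動詞'), ('使える', '動詞'), ('ツール', '名詞'),
             ('です', '助動詞')]

    kept = filter_pairs(pairs, ['助詞', '助動詞'])
    nouns = extract_pairs(pairs, ['名詞'])
    print(kept)
    print(nouns)
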
Add a user dictionary in an easy way.

.. code:: python

    # Default behaviour
    text = "3月に見た「3月のライオン」"
    print(nagisa.tagging(text))
    #=> 3/名詞 月/名詞 に/助詞 見/動詞 た/助動詞 「/補助記号 3/名詞 月/名詞 の/助詞 ライオン/名詞 」/補助記号

    # If a word ("3月のライオン") is included in the single_word_list, it is recognized as a single word.
    new_tagger = nagisa.Tagger(single_word_list=['3月のライオン'])
    print(new_tagger.tagging(text))
    #=> 3/名詞 月/名詞 に/助詞 見/動詞 た/助動詞 「/補助記号 3月のライオン/名詞 」/補助記号

Train a model
=============
Nagisa (v0.2.0+) provides a simple training method for a joint word
segmentation and sequence labeling (e.g., POS-tagging, NER) model.

The train/dev/test files are in TSV format. Each line holds a ``word`` and
its ``tag``, separated by a tab (``word\ttag``). Note that sentences are
separated by an ``EOS`` line. Refer to the `sample
datasets </nagisa/data/sample_datasets>`__ and the `Tutorial (Train a model
for Universal
Dependencies) <https://nagisa.readthedocs.io/en/latest/tutorial.html>`__.

::

    $ cat sample.train
    唯一	NOUN
    の	ADP
    趣味	NOUN
    は	ADP
    料理	NOUN
    EOS
    とても	ADV
    おいしかっ	ADJ
    た	AUX
    です	AUX
    。	PUNCT
    EOS
    ドル	NOUN
    は	ADP
    主要	ADJ
    通貨	NOUN
    EOS

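
Before training, it can help to check that a data file really follows this layout. The following is a minimal sketch of a reader for the ``word<TAB>tag`` / ``EOS`` format described above; ``read_sentences`` is a hypothetical helper, not part of nagisa. Skipping blank lines and accepting a final sentence without a closing ``EOS`` are assumptions made here for leniency, not documented nagisa behaviour.

.. code:: python

    import io


    def read_sentences(lines):
        """Parse word<TAB>tag lines into a list of sentences split on EOS."""
        sentences, current = [], []
        for lineno, raw in enumerate(lines, 1):
            line = raw.rstrip('\r\n')
            if not line:
                continue  # assumption: blank lines are ignored
            if line == 'EOS':
                if current:
                    sentences.append(current)
                current = []
                continue
            parts = line.split('\t')
            if len(parts) != 2 or not parts[0] or not parts[1]:
                raise ValueError(
                    'line {}: expected word<TAB>tag, got {!r}'.format(lineno, line))
            current.append((parts[0], parts[1]))
        if current:
            # assumption: a trailing sentence without EOS is still kept
            sentences.append(current)
        return sentences


    # A shortened in-memory copy of the sample data, with explicit tabs.
    sample = (u"唯一\tNOUN\nの\tADP\n趣味\tNOUN\nは\tADP\n料理\tNOUN\nEOS\n"
              u"ドル\tNOUN\nは\tADP\nEOS\n")
    sentences = read_sentences(io.StringIO(sample))
    print(len(sentences))
    print(sentences[1])

A line whose word and tag are separated by a space instead of a tab raises ``ValueError`` with the offending line number, which is a common problem when such files are edited by hand.
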

.. code:: python

    import nagisa

    # After training finishes, three model files (*.vocabs, *.params, *.hp) are saved.
    nagisa.fit(train_file="sample.train", dev_file="sample.dev", test_file="sample.test", model_name="sample")

    # Build the tagger by loading the trained model files.
    sample_tagger = nagisa.Tagger(vocabs='sample.vocabs', params='sample.params', hp='sample.hp')

    text = "福岡・博多の観光情報"
    words = sample_tagger.tagging(text)
    print(words)
    #=> 福岡/PROPN ・/SYM 博多/PROPN の/ADP 観光/NOUN 情報/NOUN

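
Since the tagger is built from three separate files, a quick existence check before loading can give a clearer error than a failed load. The sketch below uses only the standard library; ``missing_model_files`` is a hypothetical helper, and the assumption that the files sit in one directory and are named ``<model_name>.vocabs``, ``<model_name>.params`` and ``<model_name>.hp`` is taken from the example above.

.. code:: python

    import os


    def missing_model_files(model_name, directory='.'):
        """Return the expected model file names that are absent from directory."""
        expected = [model_name + ext for ext in ('.vocabs', '.params', '.hp')]
        return [name for name in expected
                if not os.path.isfile(os.path.join(directory, name))]


    missing = missing_model_files('sample')
    if missing:
        print('Missing model files: {}'.format(', '.join(missing)))
    else:
        print('All model files found.')
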

.. |Codacy Badge| image:: https://api.codacy.com/project/badge/Grade/769dd003c7184d4d81dad74fd8a322a1
   :target: https://app.codacy.com/app/taishi-i/nagisa?utm_source=github.com&utm_medium=referral&utm_content=taishi-i/nagisa&utm_campaign=Badge_Grade_Dashboard
.. |Build Status| image:: https://travis-ci.org/taishi-i/nagisa.svg?branch=master
   :target: https://travis-ci.org/taishi-i/nagisa
.. |Build status| image:: https://ci.appveyor.com/api/projects/status/6k35hmxl1juf1hqf?svg=true
   :target: https://ci.appveyor.com/project/taishi-i/nagisa
.. |Coverage Status| image:: https://coveralls.io/repos/github/taishi-i/nagisa/badge.svg?branch=master
   :target: https://coveralls.io/github/taishi-i/nagisa?branch=master
.. |Documentation Status| image:: https://readthedocs.org/projects/nagisa/badge/?version=latest
   :target: https://nagisa.readthedocs.io/en/latest/?badge=latest
.. |PyPI| image:: https://img.shields.io/pypi/v/nagisa.svg
   :target: https://pypi.python.org/pypi/nagisa
.. |PyPI - Downloads| image:: https://img.shields.io/pypi/dm/nagisa.svg
   :target: https://img.shields.io/pypi/dm/nagisa.svg