nagisa

A Japanese tokenizer based on recurrent neural networks


Project description

--------------

|Build Status| |Coverage Status| |Documentation Status| |PyPI|

Nagisa is a Python module for Japanese word segmentation and POS-tagging.
It is designed to be a simple and easy-to-use tool.

This tool has the following features.

- Based on recurrent neural networks.
- The word segmentation model uses character- and word-level features
  `[池田+] <http://www.anlp.jp/proceedings/annual_meeting/2017/pdf_dir/B6-2.pdf>`__.
- The POS-tagging model uses tag dictionary information
  `[Inoue+] <http://www.aclweb.org/anthology/K17-1042>`__.

For more details, refer to the following links.

- The article in Japanese is available
  `here <https://qiita.com/taishi-i/items/5b9275a606b392f7f58e>`__.
- The documentation is available
  `here <https://nagisa.readthedocs.io/en/latest/?badge=latest>`__.

Installation
============

Python 2.7.x or 3.5+ is required. This tool uses
`DyNet <https://github.com/clab/dynet>`__ (the Dynamic Neural Network
Toolkit) to compute neural networks. You can install nagisa with the
following command.

.. code:: bash

   pip install nagisa

If you use nagisa on Windows, please run it with Python 3.5+.

Basic usage
===========

Sample of word segmentation and POS-tagging for Japanese.

.. code:: python

   import nagisa

   text = 'Pythonで簡単に使えるツールです'
   words = nagisa.tagging(text)
   print(words)
   #=> Python/名詞 で/助詞 簡単/形状詞 に/助動詞 使える/動詞 ツール/名詞 です/助動詞

   # Get a list of words
   print(words.words)
   #=> ['Python', 'で', '簡単', 'に', '使える', 'ツール', 'です']

   # Get a list of POS-tags
   print(words.postags)
   #=> ['名詞', '助詞', '形状詞', '助動詞', '動詞', '名詞', '助動詞']
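
Because ``words.words`` and ``words.postags`` are parallel lists, they can be paired or counted with standard Python tools. The following is a minimal sketch that uses literal copies of the lists printed above, so it runs without nagisa installed; the variable names ``tokens``, ``pairs``, and ``tag_counts`` are illustrative and not part of nagisa.

```python
from collections import Counter

# Literal copies of the lists printed in the example above,
# standing in for words.words and words.postags.
tokens = ['Python', 'で', '簡単', 'に', '使える', 'ツール', 'です']
postags = ['名詞', '助詞', '形状詞', '助動詞', '動詞', '名詞', '助動詞']

# Pair each word with its POS tag.
pairs = list(zip(tokens, postags))
print(pairs[0])
#=> ('Python', '名詞')

# Count how often each POS tag occurs in the sentence.
tag_counts = Counter(postags)
print(tag_counts['名詞'])
#=> 2
```
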

Post-processing functions
=========================

Filter or extract words by specific POS tags.

.. code:: python

   # Filter out the words with the specified POS tags.
   words = nagisa.filter(text, filter_postags=['助詞', '助動詞'])
   print(words)
   #=> Python/名詞 簡単/形状詞 使える/動詞 ツール/名詞

   # Extract only nouns.
   words = nagisa.extract(text, extract_postags=['名詞'])
   print(words)
   #=> Python/名詞 ツール/名詞

   # This is a list of available POS-tags in nagisa.
   print(nagisa.tagger.postags)
   #=> ['補助記号', '名詞', ... , 'URL']
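
To make the difference between the two operations concrete, here is a minimal pure-Python sketch of the same idea applied to parallel lists of words and tags. It illustrates the behaviour shown in the outputs above and is not nagisa's implementation; the helper names ``filter_by_postags`` and ``extract_by_postags`` are hypothetical.

```python
def filter_by_postags(tokens, postags, filter_postags):
    """Drop every (word, tag) pair whose tag is listed in filter_postags."""
    return [(w, t) for w, t in zip(tokens, postags) if t not in filter_postags]


def extract_by_postags(tokens, postags, extract_postags):
    """Keep only the (word, tag) pairs whose tag is listed in extract_postags."""
    return [(w, t) for w, t in zip(tokens, postags) if t in extract_postags]


# Literal copies of the tagging result for 'Pythonで簡単に使えるツールです'.
tokens = ['Python', 'で', '簡単', 'に', '使える', 'ツール', 'です']
postags = ['名詞', '助詞', '形状詞', '助動詞', '動詞', '名詞', '助動詞']

filtered = filter_by_postags(tokens, postags, ['助詞', '助動詞'])
print(' '.join(f'{w}/{t}' for w, t in filtered))
#=> Python/名詞 簡単/形状詞 使える/動詞 ツール/名詞

nouns = extract_by_postags(tokens, postags, ['名詞'])
print(' '.join(f'{w}/{t}' for w, t in nouns))
#=> Python/名詞 ツール/名詞
```

Filtering removes the listed tags and keeps everything else, while extracting keeps only the listed tags.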

Add a user dictionary in an easy way.

.. code:: python

   # default
   text = "3月に見た「3月のライオン」"
   print(nagisa.tagging(text))
   #=> 3/名詞 月/名詞 に/助詞 見/動詞 た/助動詞 「/補助記号 3/名詞 月/名詞 の/助詞 ライオン/名詞 」/補助記号

   # If a word ("3月のライオン") is included in the single_word_list, it is recognized as a single word.
   new_tagger = nagisa.Tagger(single_word_list=['3月のライオン'])
   print(new_tagger.tagging(text))
   #=> 3/名詞 月/名詞 に/助詞 見/動詞 た/助動詞 「/補助記号 3月のライオン/名詞 」/補助記号

Train a model
=============

Nagisa (v0.2.0+) provides a simple training method for a joint word
segmentation and sequence labeling (e.g., POS-tagging, NER) model.

The train/dev/test files are in TSV format. Each line contains a ``word``
and its ``tag``, separated by a tab (``word\ttag``). Note that you must put
``EOS`` on its own line between sentences. Refer to the `sample
datasets </nagisa/data/sample_datasets>`__.

::

   $ cat sample.train
   唯一	NOUN
   の	ADP
   趣味	NOUN
   は	ADP
   料理	NOUN
   EOS
   とても	ADV
   おいしかっ	ADJ
   た	AUX
   です	AUX
   。	PUNCT
   EOS
   ドル	NOUN
   は	ADP
   主要	ADJ
   通貨	NOUN
   EOS
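
Before training, it can help to check that a file really follows this layout. The sketch below is a small standalone reader for the ``word<TAB>tag`` / ``EOS`` format described above. It does not use nagisa, and the function name ``read_tagged_sentences`` is hypothetical. It assumes that blank lines can be skipped and that a line consisting only of ``EOS`` always ends a sentence.

```python
import io


def read_tagged_sentences(lines):
    """Yield each sentence as a list of (word, tag) pairs.

    Assumes one "word<TAB>tag" pair per line and a line containing
    only "EOS" between sentences.
    """
    sentence = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip('\r\n')
        if not line:
            continue  # skip blank lines (an assumption of this sketch)
        if line == 'EOS':
            if sentence:
                yield sentence
            sentence = []
            continue
        fields = line.split('\t')
        if len(fields) != 2:
            raise ValueError(
                f'line {line_no}: expected "word<TAB>tag", got {line!r}')
        sentence.append((fields[0], fields[1]))
    if sentence:
        # Final sentence without a trailing EOS line.
        yield sentence


# The first two sentences of the sample file, held in memory.
sample = (
    '唯一\tNOUN\nの\tADP\n趣味\tNOUN\nは\tADP\n料理\tNOUN\nEOS\n'
    'とても\tADV\nおいしかっ\tADJ\nた\tAUX\nです\tAUX\n。\tPUNCT\nEOS\n'
)

sentences = list(read_tagged_sentences(io.StringIO(sample)))
print(len(sentences))
#=> 2
print(sentences[0][0])
#=> ('唯一', 'NOUN')
```

To check a real file, pass an open file object instead, for example ``open('sample.train', encoding='utf-8')``. A line that is not exactly two tab-separated fields raises ``ValueError`` with its line number.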

.. code:: python

   # After training finishes, the three model files (*.vocabs, *.params, *.hp) are saved.
   nagisa.fit(train_file="sample.train", dev_file="sample.dev", test_file="sample.test", model_name="sample")

   # Build the tagger by loading the trained model files.
   sample_tagger = nagisa.Tagger(vocabs='sample.vocabs', params='sample.params', hp='sample.hp')

   text = "福岡・博多の観光情報"
   words = sample_tagger.tagging(text)
   print(words)
   #=> 福岡/PROPN ・/SYM 博多/PROPN の/ADP 観光/NOUN 情報/NOUN

.. |Build Status| image:: https://travis-ci.org/taishi-i/nagisa.svg?branch=master
   :target: https://travis-ci.org/taishi-i/nagisa
.. |Coverage Status| image:: https://coveralls.io/repos/github/taishi-i/nagisa/badge.svg?branch=master
   :target: https://coveralls.io/github/taishi-i/nagisa?branch=master
.. |Documentation Status| image:: https://readthedocs.org/projects/nagisa/badge/?version=latest
   :target: https://nagisa.readthedocs.io/en/latest/?badge=latest
.. |PyPI| image:: https://img.shields.io/pypi/v/nagisa.svg
   :target: https://pypi.python.org/pypi/nagisa


Release history

========== ============ ============
Version    Released     Notes
========== ============ ============
0.2.11     Jan 28, 2024
0.2.11rc1  Jan 28, 2024 pre-release
0.2.10     Jan 27, 2024
0.2.9      Jul 30, 2023
0.2.8      Sep 9, 2022
0.2.7      Jul 6, 2020
0.2.6      Jun 11, 2020
0.2.5      Dec 31, 2019
0.2.4      Aug 5, 2019
0.2.3      May 19, 2019
0.2.2      May 3, 2019
0.2.1      Mar 3, 2019  this version
0.2.0      Jan 9, 2019
0.1.2      Dec 25, 2018
0.1.1      Sep 21, 2018
0.1.0      Sep 2, 2018
0.0.9      Jun 27, 2018
0.0.8      May 22, 2018
0.0.7      May 17, 2018
0.0.6      Mar 19, 2018
0.0.5      Feb 25, 2018
0.0.4      Feb 25, 2018
0.0.3      Feb 25, 2018
0.0.2      Feb 22, 2018
0.0.1      Feb 15, 2018
========== ============ ============

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

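Source Distribution

nagisa-0.2.1.tar.gz (20.9 MB), uploaded Mar 3, 2019.

Hashes for nagisa-0.2.1.tar.gz

=========== ====================================================================
Algorithm   Hash digest
=========== ====================================================================
SHA256      ``f570acc620d76bf6908d23eef96645e52c6d79b6187c52d1269cfa3b429e8a76``
MD5         ``9c4aaa04bfd402232baed1006b99af6b``
BLAKE2b-256 ``39d61a5cf5cf1abaa97101bcaff6e72d42a09de66bfbb4a37eae47adab634b80``
=========== ====================================================================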