Skip to main content

A Tasty Python Binding with MeCab (FFI-based, no SWIG or compiler necessary)

Project description

natto-py
========

What is natto-py?
-----------------
``natto-py`` combines the Python_ programming language with MeCab_, the part-of-speech
and morphological analyzer for the Japanese language.

You can learn more about `natto-py at Bitbucket`_.

Requirements
-------------
``natto-py`` requires the following:

- `MeCab 0.996`_ along with an appropriate dictionary (`mecab-ipadic`_ recommended)
- `cffi 0.8.6`_

``natto-py`` is compatible with the following Python versions:

- `Python 2.7.8`_
- `Python 3.2.5`_
- `Python 3.3.5`_
- `Python 3.4.2`_

Installation
------------
Install ``natto-py`` with the following command::

pip install natto-py

This will automatically install the ``cffi`` package, which ``natto-py`` uses to
bind to the ``mecab`` library.

Configuration
-------------
``natto-py`` should just work without any explicit configuration.

On \*nix and Mac OS, it queries ``mecab-config`` to discover the
path to the ``libmecab.so`` or ``libmecab.dylib``, respectively.

On Windows, it queries the Windows Registry to locate the MeCab
installation folder.

In order to convert character encodings to/from Unicode, ``natto-py``
will examine the charset of the ``mecab`` system dictionary.

Therefore, as long as the ``mecab`` (and ``mecab-config`` for \*nix
and Mac OS) executables are on your ``PATH``, ``natto-py`` should
not require any explicit configuration.

If ``natto-py``, for some reason, cannot locate the ``mecab`` library,
or if it cannot determine the correct charset used internally by
``mecab``, then you will need to set the ``MECAB_PATH`` and ``MECAB_CHARSET``
environment variables.

Set the ``MECAB_PATH`` environment variable to the exact name/path to your
``mecab`` library. Set the ``MECAB_CHARSET`` environment variable if you
compiled ``mecab`` and the related dictionary to use a non-default character
encoding.

e.g., for Mac OS X::

export MECAB_PATH=/usr/local/Cellar/mecab/0.996/lib/libmecab.dylib
export MECAB_CHARSET=utf8

e.g., for bash on UNIX/Linux::

export MECAB_PATH=/usr/local/lib/libmecab.so
export MECAB_CHARSET=euc-jp

e.g., on Windows::

set MECAB_PATH=C:\Program Files\MeCab\bin\libmecab.dll
set MECAB_CHARSET=shift-jis

e.g., from within a Python program::

import os

os.environ['MECAB_PATH']='/usr/local/lib/libmecab.so'
os.environ['MECAB_CHARSET']=utf-16

Usage
-----
Here's a very quick guide to using ``natto-py``.

Following `NLTK3`_ lead, ``natto-py`` requires all input
to be unicode, and always returns text as unicode.
On Python 2.7, you will have to decode/encode accordingly.

Instantiate a reference to the ``mecab`` library, and display some details::

from natto import MeCab

with MeCab() as nm:
print(nm)

# displays details about the MeCab instance
<natto.mecab.MeCab
lib="/usr/local/lib/libmecab.so",
tagger=<cdata 'mecab_t *' 0x000000000037AB40>,
options={},
dicts=[<natto.dictionary.DictionaryInfo
pointer=<cdata 'mecab_dictionary_info_t *' 0x00000000003AC530>,
type="0",
filename="/usr/local/lib/mecab/dic/ipadic/sys.dic",
charset="utf8">],
version="0.996">

Display details about the ``mecab`` system dictionary used::

sysdic = nm.dicts[0]
print(sysdic)

# displays the MeCab system dictionary info
<natto.dictionary.DictionaryInfo
pointer=<cdata 'mecab_dictionary_info_t *' 0x00000000003AC530>,
type=0,
filename="/usr/local/lib/mecab/dic/ipadic/sys.dic",
charset="utf8">


Parse Japanese text as a string, outputting to ``stdout``::

print(nm.parse('ピンチの時には必ずヒーローが現れる。'))

# MeCab's parsing as a string sent to stdout
ピンチ 名詞,一般,*,*,*,*,ピンチ,ピンチ,ピンチ
の 助詞,連体化,*,*,*,*,の,ノ,ノ
時 名詞,非自立,副詞可能,*,*,*,時,トキ,トキ
に 助詞,格助詞,一般,*,*,*,に,ニ,ニ
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
必ず 副詞,助詞類接続,*,*,*,*,必ず,カナラズ,カナラズ
ヒーロー 名詞,一般,*,*,*,*,ヒーロー,ヒーロー,ヒーロー
が 助詞,格助詞,一般,*,*,*,が,ガ,ガ
現れる 動詞,自立,*,*,一段,基本形,現れる,アラワレル,アラワレル
。 記号,句点,*,*,*,*,。,。,。
EOS

Next, try parsing the text with MeCab node parsing, using the more detailed
information related to each morpheme::

nodes = nm.parse('ピンチの時には必ずヒーローが現れる。', as_nodes=True)

for n in nodes:
... if not n.is_eos():
... print('%s\t%s' % (n.surface, n.cost))
...
ピンチ 3348
の 3722
時 5176
に 5083
は 5305
必ず 7525
ヒーロー 11363
が 10508
現れる 10841
。 7127

Learn More
----------
You can read more about ``natto-py`` on the `project Wiki`_.

Contributing to natto-py
------------------------
- Use mercurial_ and `check out the latest code at Bitbucket`_ to make sure the
feature hasn't been implemented or the bug hasn't been fixed yet.
- `Browse the issue tracker`_ to make sure someone already hasn't requested it
and/or contributed it.
- Fork the project.
- Start a feature/bugfix branch.
- Commit and push until you are happy with your contribution.
- Make sure to add tests for it. This is important so I don't break it in a
future version unintentionally. I use unittest_ as it is very natural
and easy-to-use.
- Please try not to mess with the ``setup.py``, ``CHANGELOG``, or version
files. If you must have your own version, that is fine, but please isolate
to its own commit so I can cherry-pick around it.

Changelog
---------
Please see the ``CHANGELOG`` for the release history.

Copyright
---------
Copyright |copy| 2014, Brooke M. Fujita. All rights reserved. Please see the
LICENSE file for further details.

.. _Python: http://www.python.org/
.. _MeCab: http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html
.. _mecab-ipadic: http://mecab.googlecode.com/files/mecab-ipadic-2.7.0-20070801.tar.gz
.. _natto-py at Bitbucket: https://bitbucket.org/buruzaemon/natto-py
.. _MeCab 0.996: http://code.google.com/p/mecab/downloads/list
.. _cffi 0.8.6: https://bitbucket.org/cffi/cffi
.. _Python 2.7.8: https://www.python.org/download/releases/2.7.8/
.. _Python 3.2.5: https://www.python.org/download/releases/3.2.5/
.. _Python 3.3.5: https://www.python.org/download/releases/3.3.5/
.. _Python 3.4.2: https://www.python.org/downloads/release/python-342/
.. _NTLK3: https://github.com/nltk/nltk/wiki/Porting-your-code-to-NLTK-3.0
.. _project Wiki: https://bitbucket.org/buruzaemon/natto-py/wiki/Home
.. _mercurial: http://mercurial.selenic.com/
.. _check out the latest code at Bitbucket: https://bitbucket.org/buruzaemon/natto-py/src
.. _Browse the issue tracker: https://bitbucket.org/buruzaemon/natto-py/issues?status=new&status=open
.. _unittest: http://pythontesting.net/framework/unittest/unittest-introduction/
.. |copy| unicode:: 0xA9 .. copyright sign

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

natto-py-0.0.6.tar.gz (23.5 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page