A Tasty Python Binding with MeCab
What is natto-py?
natto combines the Python programming language with MeCab, the part-of-speech and morphological analyzer for the Japanese language.
You can learn more about natto-py at Bitbucket.
Requirements
natto-py requires the following:
Installation
Install natto-py with the following command:
pip install natto-py
This will automatically install the cffi package, which natto-py uses to bind to the mecab library.
Configuration
Set the MECAB_PATH environment variable to the exact name or path of your mecab library.
e.g., for Mac OS X:
export MECAB_PATH=/usr/local/Cellar/mecab/0.996/lib/libmecab.dylib
e.g., for bash on UNIX/Linux:
export MECAB_PATH=/usr/local/lib/libmecab.so
e.g., on Windows:
set MECAB_PATH=C:\Program Files\MeCab\bin\libmecab.dll
e.g., from within a Python program:
import os
os.environ['MECAB_PATH'] = '/usr/local/lib/libmecab.so'
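Before importing natto, it can help to confirm that MECAB_PATH is actually set and points at a real file. The following is a minimal sketch written for Python 3, not part of natto-py itself; the helper name mecab_library_path is hypothetical. It assumes MECAB_PATH holds a full path, so a bare library name that the system loader would resolve on its own fails this check even though it may be a valid setting.

```python
import os


def mecab_library_path(environ=os.environ):
    """Return the MECAB_PATH value if it names an existing file, else None.

    Hypothetical helper for illustration; assumes a full path was given.
    """
    path = environ.get("MECAB_PATH")
    if path and os.path.isfile(path):
        return path
    return None


path = mecab_library_path()
if path is None:
    print("MECAB_PATH is unset or does not point to a file")
else:
    print("mecab library found at", path)
```

Passing the environment in as a parameter keeps the check easy to try against a plain dictionary without touching the real process environment.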
Usage
Here’s a very quick guide to using natto-py.
Instantiate a reference to the mecab library, and display some details:
>>> import natto
>>> nm = natto.MeCab()
>>> print nm
<natto.api.MeCab tagger="<cdata 'mecab_t *' 0x000000000037AB40>", options="{}", dicts=[<natto.api.DictionaryInfo pointer=<cdata 'mecab_dictionary_info_t *' 0x00000000003AC530>, type="0", filename="/usr/local/lib/mecab/dic/ipadic/sys.dic", charset="utf8">], version="0.996">
Display details about the mecab system dictionary used:
>>> sysdic = nm.dicts[0]
>>> print sysdic
<natto.api.DictionaryInfo pointer=<cdata 'mecab_dictionary_info_t *' 0x00000000003AC530>, type="0", filename="/usr/local/lib/mecab/dic/ipadic/sys.dic", charset="utf8">
>>> print sysdic.is_sysdic()
True
Parse Japanese text as a string, outputting to stdout:
>>> print nm.parse('ピンチの時には必ずヒーローが現れる。')
ピンチ	名詞,一般,*,*,*,*,ピンチ,ピンチ,ピンチ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
時	名詞,非自立,副詞可能,*,*,*,時,トキ,トキ
に	助詞,格助詞,一般,*,*,*,に,ニ,ニ
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
必ず	副詞,助詞類接続,*,*,*,*,必ず,カナラズ,カナラズ
ヒーロー	名詞,一般,*,*,*,*,ヒーロー,ヒーロー,ヒーロー
が	助詞,格助詞,一般,*,*,*,が,ガ,ガ
現れる	動詞,自立,*,*,一段,基本形,現れる,アラワレル,アラワレル
。	記号,句点,*,*,*,*,。,。,。
EOS
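In MeCab's default output format, each morpheme sits on its own line as a surface form, a tab, and a comma-separated feature string, and the result ends with an EOS line. If you only have that string, it can be split into fields as in the sketch below. This is an illustration written for Python 3 rather than a natto-py API: the function name split_mecab_output is hypothetical, and the sample string is a shortened copy of the output above so the block runs without MeCab installed.

```python
def split_mecab_output(text):
    """Split MeCab default-format output into (surface, features) pairs.

    Hypothetical helper; assumes the surface<TAB>features layout and skips EOS.
    """
    pairs = []
    for line in text.splitlines():
        if not line or line == "EOS":
            continue
        surface, _, features = line.partition("\t")
        pairs.append((surface, features.split(",")))
    return pairs


# Shortened sample of the parse output, with explicit tabs and newlines.
sample = (
    "ピンチ\t名詞,一般,*,*,*,*,ピンチ,ピンチ,ピンチ\n"
    "の\t助詞,連体化,*,*,*,*,の,ノ,ノ\n"
    "EOS"
)

for surface, features in split_mecab_output(sample):
    # features[0] is the top-level part of speech in the ipadic layout
    print(surface, features[0])
```

For anything beyond quick inspection, node parsing is usually the better route, since it exposes the per-morpheme fields directly instead of requiring string splitting.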
Next, try parsing the text with MeCab node parsing, using the more detailed information related to each morpheme:
>>> nodes = nm.parse('ピンチの時には必ずヒーローが現れる。', as_nodes=True)
>>> for n in nodes:
...     if not n.is_eos():
...         print "%s\t%s" % (n.surface, n.posid)
...
ピンチ	38
の	24
時	66
に	13
は	16
必ず	35
ヒーロー	38
が	13
現れる	31
。	7
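The node loop above can be wrapped in a small function that collects the results instead of printing them. The sketch below is written for Python 3 and relies only on the three node members used in this guide: surface, posid, and is_eos(). The FakeNode class is a hypothetical stand-in so the example runs without MeCab; with natto-py installed you would pass in the nodes returned by nm.parse(..., as_nodes=True) instead.

```python
def surface_posid_pairs(nodes):
    """Collect (surface, posid) for every node except the end-of-sentence node."""
    return [(n.surface, n.posid) for n in nodes if not n.is_eos()]


class FakeNode(object):
    """Hypothetical stand-in for a natto-py node, for illustration only."""

    def __init__(self, surface, posid, eos=False):
        self.surface = surface
        self.posid = posid
        self._eos = eos

    def is_eos(self):
        return self._eos


# Values copied from the session above; the last entry plays the EOS node.
nodes = [FakeNode("ピンチ", 38), FakeNode("の", 24), FakeNode("", 0, eos=True)]

for surface, posid in surface_posid_pairs(nodes):
    print("%s\t%s" % (surface, posid))
```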
Learn More
You can read more about natto-py on the project Wiki.
Contributing to natto-py
Use Mercurial and check out the latest code at Bitbucket to make sure the feature hasn’t been implemented or the bug hasn’t been fixed yet.
Browse the issue tracker to make sure no one has already requested or contributed it.
Fork the project.
Start a feature/bugfix branch.
Commit and push until you are happy with your contribution.
Make sure to add tests for it. This is important so I don’t break it unintentionally in a future version. I use unittest as it is very natural and easy to use.
Please try not to mess with the setup.py, CHANGELOG, or version files. If you must have your own version, that is fine, but please isolate it in its own commit so I can cherry-pick around it.
Changelog
Please see the CHANGELOG for the release history.
Copyright
Copyright © 2014, Brooke M. Fujita. All rights reserved. Please see the LICENSE file for further details.