tokenizers for Whoosh designed for Japanese language
About
Tokenizers for the Whoosh full-text search library, designed for the Japanese language. This package contains three tokenizers.
IgoTokenizer
requires igo-python (http://pypi.python.org/pypi/igo-python/) and its dictionary.
TinySegmenterTokenizer
requires TinySegmenter in Python (https://code.google.com/p/mhagiwara/source/browse/trunk/nltk/jpbook/tinysegmenter.py).
MeCabTokenizer
requires the MeCab Python binding (http://mecab.sourceforge.net/bindings.html).
How To Use
IgoTokenizer:
import igo.Tagger
import WhooshJapaneseTokenizer
from whoosh.fields import Schema, TEXT, ID

tk = WhooshJapaneseTokenizer.IgoTokenizer(igo.Tagger.Tagger('ipadic'))
scm = Schema(title=TEXT(stored=True, analyzer=tk),
             path=ID(unique=True, stored=True),
             content=TEXT(analyzer=tk))
TinySegmenterTokenizer:
import tinysegmenter
import WhooshJapaneseTokenizer
from whoosh.fields import Schema, TEXT, ID

tk = WhooshJapaneseTokenizer.TinySegmenterTokenizer(tinysegmenter.TinySegmenter())
scm = Schema(title=TEXT(stored=True, analyzer=tk),
             path=ID(unique=True, stored=True),
             content=TEXT(analyzer=tk))
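All three tokenizers wrap an external Japanese segmenter and expose it to Whoosh as an analyzer. Japanese text has no spaces between words, so the segmenter's job is to decide where one token ends and the next begins. As a rough, self-contained illustration of that step (a toy heuristic only, not the actual algorithm used by Igo, TinySegmenter, or MeCab), the sketch below splits a string whenever the character class changes:

```python
import unicodedata

def toy_segment(text):
    """Toy stand-in for a Japanese segmenter: emit a new token whenever
    the character class (kanji / hiragana / katakana / other) changes.
    Real segmenters use dictionaries or trained models instead."""
    def char_class(ch):
        name = unicodedata.name(ch, "")
        if "CJK UNIFIED" in name:
            return "kanji"
        if "HIRAGANA" in name:
            return "hiragana"
        if "KATAKANA" in name:
            return "katakana"
        return "other"

    tokens, current, prev = [], "", None
    for ch in text:
        cls = char_class(ch)
        if prev is not None and cls != prev:
            tokens.append(current)
            current = ""
        current += ch
        prev = cls
    if current:
        tokens.append(current)
    return tokens

print(toy_segment("私はPythonが好きです"))
# → ['私', 'は', 'Python', 'が', '好', 'きです']
```

A dictionary-based segmenter such as Igo or MeCab would instead produce linguistically correct words (e.g. keeping 好きです split into 好き and です), which is why these tokenizers delegate to those libraries rather than using simple heuristics.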
Changelog for Japanese Tokenizers for Whoosh
- 2011-02-19 – 0.1
first release.
- 2011-02-21 – 0.2
add TinySegmenterTokenizer
change module name
- 2011-02-24 – 0.3
add FeatureFilter
- 2011-02-27 – 0.4
add MeCabTokenizer
add a mode that does not pickle the igo tagger, to minimize index size
- 2011-04-17 – 0.5
correct char offsets