Tokenizers for Whoosh designed for the Japanese language
Project description
About
Tokenizers for the Whoosh full-text search library, designed for the Japanese language. This package contains three tokenizers.
IgoTokenizer
requires igo-python (http://pypi.python.org/pypi/igo-python/) and its dictionary.
TinySegmenterTokenizer
requires TinySegmenter in Python (https://code.google.com/p/mhagiwara/source/browse/trunk/nltk/jpbook/tinysegmenter.py).
MeCabTokenizer
requires the MeCab Python binding (http://mecab.sourceforge.net/bindings.html).
How To Use
IgoTokenizer:
    import igo.Tagger
    import whooshjp
    from whoosh.fields import Schema, TEXT, ID
    from whooshjp.IgoTokenizer import IgoTokenizer

    # 'ipadic' is the igo dictionary directory
    tk = IgoTokenizer(igo.Tagger.Tagger('ipadic'))
    scm = Schema(title=TEXT(stored=True, analyzer=tk),
                 path=ID(unique=True, stored=True),
                 content=TEXT(analyzer=tk))
TinySegmenterTokenizer:
    import tinysegmenter
    import whooshjp
    from whoosh.fields import Schema, TEXT, ID
    from whooshjp.TinySegmenterTokenizer import TinySegmenterTokenizer

    tk = TinySegmenterTokenizer(tinysegmenter.TinySegmenter())
    scm = Schema(title=TEXT(stored=True, analyzer=tk),
                 path=ID(unique=True, stored=True),
                 content=TEXT(analyzer=tk))
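Each tokenizer wraps an external segmenter and passes its output to Whoosh, which can use the character offsets of each token (for example, when highlighting hits). The following is a minimal sketch of how such offsets can be recovered from a list of surface strings, using only the standard library. The helper name `tokens_with_offsets` and its behaviour are assumptions made for illustration; this is not the code used inside whooshjp.

```python
def tokens_with_offsets(text, surfaces):
    """Yield (surface, startchar, endchar) for each segment found in text.

    Hypothetical helper: scans forward through `text` so that repeated
    words map to successive occurrences rather than always the first one.
    """
    cursor = 0
    for surface in surfaces:
        if not surface:
            continue  # ignore empty segments
        start = text.find(surface, cursor)
        if start < 0:
            # The segmenter altered or dropped characters; skip this segment.
            continue
        end = start + len(surface)
        yield surface, start, end
        cursor = end  # continue the search after this token


# Example with a hand-written segmentation (no segmenter is called here).
text = "私の名前は中野です"
surfaces = ["私", "の", "名前", "は", "中野", "です"]
result = list(tokens_with_offsets(text, surfaces))
```

Scanning from a moving cursor rather than from the start of the text keeps offsets correct when the same word appears more than once, and it tolerates segmenters that drop whitespace between tokens.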
Changelog for Japanese Tokenizers for Whoosh
- 2011-02-19 – 0.1
first release.
- 2011-02-21 – 0.2
add TinySegmenterTokenizer
change module name
- 2011-02-24 – 0.3
add FeatureFilter
- 2011-02-27 – 0.4
add MeCabTokenizer
add a mode that does not pickle the igo tagger, to minimize index size.
- 2011-04-17 – 0.5
correct char offsets
- 2011-04-17 – 0.6
correct char offsets(TinySegmenterTokenizer)
- 2012-04-14 – 0.7
rename package (WhooshJapaneseTokenizer to whooshjp)
no longer import submodules automatically
Python 3 compatibility (3.2, 3.3)
drop Python 2.5 support
Project details
Download files
Source Distribution
File details
Details for the file whoosh-igo-0.7.tar.gz.
File metadata
- Download URL: whoosh-igo-0.7.tar.gz
- Upload date:
- Size: 8.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | 9731410ae86c4980955b77be62d3a06592705a4fc89191136b2841d379597157
MD5 | 24f942dd1a5e59d72907893ad602f38c
BLAKE2b-256 | 0cd65c557c67716de1a6657da7e425cacf6c5abffb73ba8cb20609b21ee086c2
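A downloaded archive can be checked against the SHA256 digest listed above. The sketch below uses only `hashlib` from the standard library. The function names and the assumption that the archive is saved locally as `whoosh-igo-0.7.tar.gz` are illustrative.

```python
import hashlib

# SHA256 digest copied from the file hashes table for whoosh-igo-0.7.tar.gz.
EXPECTED_SHA256 = "9731410ae86c4980955b77be62d3a06592705a4fc89191136b2841d379597157"


def sha256_hexdigest(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest equals `expected`."""
    return sha256_hexdigest(path) == expected.lower()
```

Assuming the archive sits in the current directory, `matches_expected("whoosh-igo-0.7.tar.gz")` returns `True` only when the file is byte-for-byte the one described by the table.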