
Very compact Japanese tokenizer


“TinySegmenter in Python” is a Python port by Masato Hagiwara of TinySegmenter, which is an extremely compact Japanese tokenizer originally written in JavaScript by Mr. Taku Kudo.

The library was eventually packaged by Jehan. This resulted in a fork because Masato Hagiwara did not answer emails, so the packaging patches could not be committed upstream. It is a friendly fork, however, and Masato Hagiwara is welcome to take back maintenance of his project. For the time being, I (Jehan) have taken over maintenance, so please treat this new website as the official one and send any new patches there. I will follow up on patches and bug reports, but will probably not do active development. Anyone wishing to improve the library is welcome to participate and will gladly be given commit rights.

It works on Python 2.6 or later, including Python 3.

Authors

See the AUTHORS file for all authors and contributors.

Download and Installation

This library can be installed in the usual ways: with setup.py or as a pip package. See the INSTALL file in the package for more details.

If you simply want to download the source package, refer to the PyPI repository: http://pypi.python.org/pypi/tinysegmenter

The development version can be downloaded anonymously from the Git repository:

$ git clone git://git.tuxfamily.org/gitroot/tinysegmente/tinysegmenter.git

or browsed online at: http://git.tuxfamily.org/tinysegmente/tinysegmenter/

Usage

Example code for direct usage:

>>> import tinysegmenter
>>> segmenter = tinysegmenter.TinySegmenter()
>>> print(' | '.join(segmenter.tokenize(u"私の名前は中野です")))
私 | の | 名前 | は | 中野 | です
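Since tokenize() returns a list of strings, its output can be passed directly to ordinary Python tools. The following is a minimal sketch (written for Python 3) of counting token frequencies over several texts. The helper names token_frequencies and demo are hypothetical and not part of the library. The helper only assumes that it receives an object with a tokenize(text) method, and it drops whitespace-only tokens in case the tokenizer emits any.

```python
from collections import Counter


def token_frequencies(segmenter, texts):
    """Count tokens over several texts.

    `segmenter` is any object exposing tokenize(text) -> list of str,
    such as a tinysegmenter.TinySegmenter instance.
    """
    counts = Counter()
    for text in texts:
        # Skip empty or whitespace-only tokens, if the tokenizer returns any.
        counts.update(tok for tok in segmenter.tokenize(text) if tok.strip())
    return counts


def demo():
    """Example run; requires the tinysegmenter package to be installed."""
    import tinysegmenter

    segmenter = tinysegmenter.TinySegmenter()
    freqs = token_frequencies(segmenter, [u"私の名前は中野です"])
    for token, count in freqs.most_common():
        print(token, count)
```

Calling demo() prints each distinct token with its count, most frequent first.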

TinySegmenter’s interface is compatible with NLTK’s TokenizerI class, although the distribution does not directly depend on NLTK. Here is one way to use it as a tokenizer in NLTK (the order of the base classes matters):

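import nltk.tokenize.api
import tinysegmenter

# TinySegmenter is listed first so that its tokenize() takes precedence
# over the one declared by TokenizerI.
class myTinySegmenter(tinysegmenter.TinySegmenter, nltk.tokenize.api.TokenizerI):
    pass

segmenter = myTinySegmenter()
# This segmenter can be used anywhere an NLTK TokenizerI subclass is expected.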

For more about NLTK (Natural Language Toolkit module), see: http://nltk.org/api/nltk.tokenize.html#nltk.tokenize.api.TokenizerI
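To illustrate why the order of the base classes matters, here is a small sketch (Python 3) that uses stand-in classes instead of the real libraries, so it runs without NLTK or tinysegmenter installed. TokenizerInterface, CharSegmenter, GoodOrder, BadOrder and can_instantiate are all hypothetical names. The sketch assumes that the interface declares tokenize() as an abstract method, which is my understanding of recent NLTK versions. In that case, putting the interface first makes its abstract tokenize() shadow the concrete one.

```python
from abc import ABC, abstractmethod


class TokenizerInterface(ABC):
    """Stand-in for an abstract tokenizer interface (not NLTK's real class)."""

    @abstractmethod
    def tokenize(self, s):
        raise NotImplementedError

    def tokenize_sents(self, strings):
        # Convenience helper built on tokenize(), inherited by subclasses.
        return [self.tokenize(s) for s in strings]


class CharSegmenter(object):
    """Stand-in for a concrete segmenter: splits text into characters."""

    def tokenize(self, text):
        return list(text)


class GoodOrder(CharSegmenter, TokenizerInterface):
    """Concrete class first: its tokenize() is found first in the MRO."""


class BadOrder(TokenizerInterface, CharSegmenter):
    """Interface first: the abstract tokenize() shadows the concrete one."""


def can_instantiate(cls):
    """Return True if cls() succeeds, False if it is still abstract."""
    try:
        cls()
        return True
    except TypeError:
        return False


if __name__ == "__main__":
    print(can_instantiate(GoodOrder))
    print(can_instantiate(BadOrder))
    print(GoodOrder().tokenize_sents(["ab", "c"]))
```

In this sketch only GoodOrder can be instantiated, and it also gains the tokenize_sents() helper defined by the interface.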

Contact, Bugs and Contributing

All bug reports, patches, questions, etc. can be sent to tinysegmenter at zemarmot dot net.

License

This package is distributed under a New BSD License (see COPYING file).



