Very compact Japanese tokenizer

Project description

“TinySegmenter in Python” is a Python port of TinySegmenter, an extremely compact (23KB) Japanese tokenizer originally written in JavaScript by Mr. Taku Kudo. It works on Python 2.5 or above.

Authors

See the AUTHORS file for the original authors.

Installation

See the INSTALL file.

Usage

Example code for direct usage:

> import tinysegmenter
> segmenter = tinysegmenter.TinySegmenter()
> print(' | '.join(segmenter.tokenize(u"私の名前は中野です")))
私 | の | 名前 | は | 中野 | です

The interface of “TinySegmenter in Python” is compatible with NLTK’s TokenizerI, although the distribution file below does not depend on NLTK directly. To use it as a tokenizer in NLTK, you have to modify the first few lines of the code as shown below. This means you cannot use the PyPI version for now if you wish to do this; get the sources instead:

import nltk
import re
from nltk.tokenize.api import *

class TinySegmenter(TokenizerI):
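
NLTK's TokenizerI also describes a span-based view of tokenization (span_tokenize), in addition to tokenize. As a minimal sketch of what that view could look like for this tokenizer, the helper below recovers character offsets from a token list. The name token_spans is hypothetical and is not part of TinySegmenter or NLTK; the token list is copied from the usage example above rather than produced by a live call.

```python
# -*- coding: utf-8 -*-

def token_spans(text, tokens):
    """Return (start, end) character offsets for each token, in order.

    Hypothetical helper for illustration; assumes the tokens appear in
    the text in the same order the tokenizer emitted them.
    """
    spans = []
    cursor = 0
    for token in tokens:
        # Search from the end of the previous token so repeated tokens
        # map to successive positions; raises ValueError if not found.
        start = text.index(token, cursor)
        end = start + len(token)
        spans.append((start, end))
        cursor = end
    return spans


text = u"私の名前は中野です"
# Tokens as shown in the usage example output.
tokens = [u"私", u"の", u"名前", u"は", u"中野", u"です"]
spans = token_spans(text, tokens)
print(spans)
```

Because the search always resumes at the end of the previous token, the helper would also tolerate characters that a tokenizer skips, such as whitespace.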

Download files

Source Distribution

tinysegmenter-0.1.tar.gz (12.2 kB)

File hashes

Hashes for tinysegmenter-0.1.tar.gz
Algorithm    Hash digest
SHA256       2f19799e1cbd5877e7e101d74240eac21d4d224e5036fcfa58fc8e82ca642468
MD5          482525bea160b0b16571e5f6bcef4a9f
BLAKE2b-256  f0828a71b37e1c1f8c14e6bd95c2c49058bb05d35ca9c28e0efa2a9fb2c3039e
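
One way to check a downloaded copy against the SHA256 digest listed above is sketched below, using only Python's standard hashlib module. The function names sha256_of_file and matches_published_hash are hypothetical, and the path passed in is assumed to be wherever you saved the archive.

```python
import hashlib

# SHA256 digest published for tinysegmenter-0.1.tar.gz (copied from the list above).
EXPECTED_SHA256 = "2f19799e1cbd5877e7e101d74240eac21d4d224e5036fcfa58fc8e82ca642468"


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SHA256):
    """Return True if the file at `path` has the expected SHA256 digest."""
    return sha256_of_file(path) == expected.lower()
```

For example, matches_published_hash("tinysegmenter-0.1.tar.gz") would return True only for a file whose contents hash to the published digest.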
