
Compact Japanese segmenter

Project description

TiniestSegmenter

A port of TinySegmenter written in pure, safe Rust with no dependencies. Bindings are available for both Rust and Python.

TinySegmenter is an n-gram word tokenizer for Japanese text originally built by Taku Kudo (2008).

Usage

tiniestsegmenter can be installed from PyPI:

pip install tiniestsegmenter

import tiniestsegmenter

tokens = tiniestsegmenter.tokenize("ジャガイモが好きです。")
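
To see the segmenter in context, here is a small sketch that counts token frequencies over a couple of documents. It assumes tokenize returns a list of strings, as the example above suggests; the corpus contents are illustrative only.

from collections import Counter

import tiniestsegmenter

# Hypothetical mini-corpus; any iterable of Japanese strings works the same way.
corpus = ["ジャガイモが好きです。", "私はラーメンが好きです。"]

# Tokenize each document and tally the resulting tokens.
counts = Counter(token for doc in corpus for token in tiniestsegmenter.tokenize(doc))
print(counts.most_common(5))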

Because the tokenizer releases the GIL on the Rust side, multi-threading is also possible.

import functools
from concurrent.futures import ThreadPoolExecutor

import tiniestsegmenter

tokenizer = functools.partial(tiniestsegmenter.tokenize)

documents = ["ジャガイモが好きです。"] * 10_000
with ThreadPoolExecutor(4) as e:
    list(e.map(tokenizer, documents))
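
Since the GIL is released during tokenization, worker threads genuinely run in parallel and a process pool is usually unnecessary for this workload. A hedged sketch of the same pattern wrapped in a helper (the tokenize_corpus name and the max_workers default are illustrative, not part of the library):

from concurrent.futures import ThreadPoolExecutor

import tiniestsegmenter

def tokenize_corpus(documents, max_workers=4):
    # Threads suffice here: the Rust tokenizer releases the GIL while it runs,
    # so the workers segment documents concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(tiniestsegmenter.tokenize, documents))

tokens_per_document = tokenize_corpus(["ジャガイモが好きです。"] * 10_000)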

Download files

Source distribution:

tiniestsegmenter-0.1.0.tar.gz (322.7 kB)

Built distribution:

tiniestsegmenter-0.1.0-cp311-cp311-macosx_11_0_arm64.whl (234.6 kB), CPython 3.11, macOS 11.0+ ARM64
