Compact Japanese segmenter
TiniestSegmenter
A port of TinySegmenter written in pure, safe Rust with no dependencies. Bindings are available for both Rust and Python.
TinySegmenter is an n-gram word tokenizer for Japanese text originally built by Taku Kudo (2008).
Usage
`tiniestsegmenter` can be installed from PyPI:

```shell
pip install tiniestsegmenter
```

```python
import tiniestsegmenter

tokens = tiniestsegmenter.tokenize("ジャガイモが好きです。")
```
With the GIL released on the Rust side, multi-threading is also possible.
```python
import functools
from concurrent.futures import ThreadPoolExecutor

import tiniestsegmenter

tokenizer = functools.partial(tiniestsegmenter.tokenize)
documents = ["ジャガイモが好きです。"] * 10_000

with ThreadPoolExecutor(4) as e:
    list(e.map(tokenizer, documents))
```
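The thread-pool pattern above can be checked end to end. The sketch below substitutes a hypothetical stand-in tokenizer (a plain whitespace split) so it runs without `tiniestsegmenter` installed; the structure is otherwise the same, and it shows that `Executor.map` preserves input order, so threaded results line up with a sequential baseline.

```python
from concurrent.futures import ThreadPoolExecutor


def tokenize(text):
    """Hypothetical stand-in for tiniestsegmenter.tokenize."""
    return text.split()


documents = ["jagaimo ga suki desu"] * 1_000

# Sequential baseline.
expected = [tokenize(doc) for doc in documents]

# Threaded version: e.map yields results in input order.
with ThreadPoolExecutor(max_workers=4) as e:
    results = list(e.map(tokenize, documents))

assert results == expected
```

Threads only pay off here because the real tokenizer releases the GIL while it runs; for pure-Python work a `ProcessPoolExecutor` would be needed instead.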
Project details
Download files
Download the file for your platform.
Source Distribution

tiniestsegmenter-0.1.0.tar.gz (322.7 kB)

Built Distribution
Hashes for tiniestsegmenter-0.1.0-cp311-cp311-macosx_11_0_arm64.whl
Algorithm | Hash digest
---|---
SHA256 | 64ca46bd117fdf4f21ca5889d060fa50d7608b2b17cfbdf59b75c29f009ab6bc
MD5 | 38007fa16a2d9535d4106327153bbf75
BLAKE2b-256 | bb6abbb80e1ac603f400c171f43adbbb35ee0389bea96fdee502e8f6991d1f4a