Compact Japanese segmenter
TiniestSegmenter
A port of TinySegmenter written in pure, safe Rust with no dependencies. Bindings are available for both Rust and Python.
TinySegmenter is an n-gram word tokenizer for Japanese text originally built by Taku Kudo (2008).
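To give a rough feel for what an n-gram tokenizer of this kind does, here is a minimal toy sketch in Python. It scores each gap between two characters and inserts a word boundary when the score is positive. Everything in it is an assumption made for illustration: the character classes, `TOY_WEIGHTS`, `TOY_BIAS`, and the function names are hypothetical, and the weights are hand-picked rather than learned. The real TinySegmenter model combines many more character and character-type n-gram features, so its output will generally differ from this sketch.

```python
def char_type(ch: str) -> str:
    """Coarse character class (hypothetical labels chosen for this sketch)."""
    if "\u3041" <= ch <= "\u3096":
        return "hiragana"
    if "\u30a1" <= ch <= "\u30fa" or ch == "ー":
        return "katakana"
    if "\u4e00" <= ch <= "\u9fff":
        return "kanji"
    return "other"


# Hand-picked toy weights keyed by (previous type, current type).
# A real model would learn thousands of such weights from data.
TOY_WEIGHTS = {
    ("kanji", "hiragana"): -2,  # keep trailing kana attached to the kanji
}
TOY_BIAS = 0


def boundary_score(prev: str, ch: str) -> int:
    """Score the gap between two adjacent characters; > 0 means 'split here'."""
    pair = (char_type(prev), char_type(ch))
    if pair in TOY_WEIGHTS:
        return TOY_BIAS + TOY_WEIGHTS[pair]
    # Fallback rule: favour a boundary when the character class changes.
    return TOY_BIAS + (1 if pair[0] != pair[1] else -1)


def toy_segment(text: str) -> list[str]:
    """Split text wherever the boundary score is positive."""
    if not text:
        return []
    tokens = []
    current = text[0]
    for prev, ch in zip(text, text[1:]):
        if boundary_score(prev, ch) > 0:
            tokens.append(current)
            current = ch
        else:
            current += ch
    tokens.append(current)
    return tokens


print(toy_segment("ジャガイモが好きです。"))
```

With these toy weights the sentence splits into ジャガイモ, が, 好きです and 。, which shows the mechanism but is coarser than what a trained model produces.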
Usage
tiniestsegmenter can be installed from PyPI: pip install tiniestsegmenter
import tiniestsegmenter
tokens = tiniestsegmenter.tokenize("ジャガイモが好きです。")
Because the GIL is released on the Rust side, tokenization can also run across multiple threads.
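import functools
from concurrent.futures import ThreadPoolExecutor
import tiniestsegmenter
tokenizer = functools.partial(tiniestsegmenter.tokenize)
documents = ["ジャガイモが好きです。"] * 10_000
# Tokenize the documents on 4 worker threads; map() preserves input order.
with ThreadPoolExecutor(4) as e:
    list(e.map(tokenizer, documents))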
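For large corpora, submitting one task per document can add scheduling overhead. The sketch below groups documents into batches before handing them to the thread pool. It is only an illustration: `tokenize_batch`, `tokenize_all`, `workers`, and `batch_size` are hypothetical names, and the fallback tokenizer (one token per character) is a stand-in so the example runs even when the package is not installed.

```python
from concurrent.futures import ThreadPoolExecutor

try:
    import tiniestsegmenter
    tokenize = tiniestsegmenter.tokenize
except ImportError:
    # Stand-in used only when the package is missing: one token per character.
    def tokenize(text):
        return list(text)


def tokenize_batch(batch):
    """Tokenize every document in one batch on the calling thread."""
    return [tokenize(doc) for doc in batch]


def tokenize_all(documents, workers=4, batch_size=256):
    """Tokenize documents in batches across a thread pool, preserving order."""
    batches = [
        documents[i:i + batch_size]
        for i in range(0, len(documents), batch_size)
    ]
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields batch results in the order the batches were submitted.
        for batch_result in pool.map(tokenize_batch, batches):
            results.extend(batch_result)
    return results


documents = ["ジャガイモが好きです。"] * 1_000
tokenized = tokenize_all(documents)
print(len(tokenized))
```

Whether batching helps depends on document length and thread count, so it is worth measuring against the simpler one-task-per-document version.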
File details
Details for the file tiniestsegmenter-0.2.0.tar.gz.
File metadata
- Download URL: tiniestsegmenter-0.2.0.tar.gz
- Upload date:
- Size: 19.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.11.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | 5b92f904af3f550b0fd42a1433c3d45bbe4d606452b889ea1fd51d214b0725ab
MD5 | 297bba88ad6e735de19452a8a6656c86
BLAKE2b-256 | 5463dca4442bf2fd8071ae3907ed25288111d4613dc7e4a1362fef201aeb077e
File details
Details for the file tiniestsegmenter-0.2.0-cp311-cp311-macosx_11_0_arm64.whl.
File metadata
- Download URL: tiniestsegmenter-0.2.0-cp311-cp311-macosx_11_0_arm64.whl
- Upload date:
- Size: 234.7 kB
- Tags: CPython 3.11, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.11.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | 205f2de3472f024f6fc90742d4236dbbb5ffc53e2b3290aee0a38c0b3a29d2d9
MD5 | ec761cc3dbef375b68020273ac1af9d5
BLAKE2b-256 | 68df17671439664c9e99525354850730a23ae6401a03516bf6d551fa52a99d8c