Compact Japanese segmenter
Project description
TiniestSegmenter
A port of TinySegmenter written in pure, safe Rust with no dependencies. Bindings are available for both Rust and Python.
TinySegmenter is an n-gram word tokenizer for Japanese text originally built by Taku Kudo (2008).
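To give a feel for what an n-gram tokenizer of this kind does, here is a toy sketch in Python. It is not the TinySegmenter model or this package's implementation: the character classes, the weight tables, and the names char_type and toy_segment are all hypothetical and hand-picked for this one sentence. The idea it illustrates is scoring each gap between two characters from n-gram features (here, bigrams of characters and of coarse character types) and cutting where the score is positive.

```python
# Toy illustration only: hand-picked weights, NOT the trained TinySegmenter model.

def char_type(ch: str) -> str:
    """Map a character to a coarse class (labels are this sketch's own)."""
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:
        return "H"  # hiragana
    if 0x30A0 <= code <= 0x30FF:
        return "K"  # katakana
    if 0x4E00 <= code <= 0x9FFF:
        return "J"  # kanji (CJK unified ideographs)
    if ch.isdigit():
        return "N"  # digit
    if ch.isascii() and ch.isalpha():
        return "A"  # Latin letter
    return "O"      # anything else, e.g. punctuation

# Hypothetical feature weights; a real model learns thousands of these.
BIAS = -1.0
TYPE_BIGRAM_WEIGHTS = {("K", "H"): 2.0, ("H", "J"): 2.0, ("H", "O"): 2.0}
CHAR_BIGRAM_WEIGHTS = {("き", "で"): 2.0}

def toy_segment(text: str) -> list[str]:
    """Split text wherever the summed feature score of a gap is positive."""
    if not text:
        return []
    tokens = []
    current = text[0]
    for left, right in zip(text, text[1:]):
        score = BIAS
        score += TYPE_BIGRAM_WEIGHTS.get((char_type(left), char_type(right)), 0.0)
        score += CHAR_BIGRAM_WEIGHTS.get((left, right), 0.0)
        if score > 0:
            tokens.append(current)  # boundary: close the current token
            current = right
        else:
            current += right        # no boundary: extend the current token
    tokens.append(current)
    return tokens

print(toy_segment("ジャガイモが好きです。"))
# ['ジャガイモ', 'が', '好き', 'です', '。']
```

Because the weights were chosen by hand for this sentence, the output says nothing about how the real segmenter behaves on other input; it only shows the shape of the computation.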
Usage
tiniestsegmenter can be installed from PyPI: pip install tiniestsegmenter
import tiniestsegmenter
tokens = tiniestsegmenter.tokenize("ジャガイモが好きです。")
The Rust side releases the GIL, so documents can also be tokenized from multiple threads.
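import functools
from concurrent.futures import ThreadPoolExecutor

import tiniestsegmenter

tokenizer = functools.partial(tiniestsegmenter.tokenize)
documents = ["ジャガイモが好きです。"] * 10_000
# Tokenize the documents across 4 worker threads.
with ThreadPoolExecutor(4) as e:
    results = list(e.map(tokenizer, documents))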
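The same pattern can be wrapped in a small helper. The sketch below is only an illustration: tokenize_all is a hypothetical name, and the fallback tokenizer (which just splits text into characters) is a stand-in so the snippet runs even where the package is not installed. It relies on the fact that ThreadPoolExecutor.map returns results in the order of its inputs, so each result lines up with its source document.

```python
from concurrent.futures import ThreadPoolExecutor

try:
    from tiniestsegmenter import tokenize
except ImportError:
    # Stand-in so the sketch runs without the package installed.
    # It splits into single characters and is NOT a real segmenter.
    def tokenize(text):
        return list(text)

def tokenize_all(documents, workers=4):
    """Tokenize documents on a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(tokenize, documents))

documents = ["ジャガイモが好きです。", "TinySegmenter"] * 100
results = tokenize_all(documents)

# Each result sits at the same index as the document it came from.
for document, tokens in list(zip(documents, results))[:2]:
    print(document, "->", tokens)
```

Threads only help here because the tokenizer releases the GIL while it works; with the pure-Python stand-in the calls run one at a time.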