Compact Japanese segmenter
TiniestSegmenter
A port of TinySegmenter written in pure, safe Rust with no dependencies. Bindings are available for both Rust and Python.
TinySegmenter is an n-gram word tokenizer for Japanese text originally built by Taku Kudo (2008).
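To make "n-gram word tokenizer" a little more concrete, here is a toy sketch of the general idea: look at each gap between two characters, score it from features of the neighbouring characters, and start a new token when the score is positive. Everything in it is an assumption made for illustration: the character classes, the `BIGRAM_WEIGHTS` table, the `BIAS` value, and the function names are invented here and are not the model or the API that TinySegmenter or this port ships.

```python
# Toy boundary scorer over character-type bigrams (illustrative only).

def char_type(ch):
    """Map a character to a coarse class label (labels are made up for this sketch)."""
    if "\u3040" <= ch <= "\u309f":
        return "H"  # hiragana
    if "\u30a0" <= ch <= "\u30ff":
        return "K"  # katakana
    if "\u4e00" <= ch <= "\u9fff":
        return "J"  # kanji (CJK Unified Ideographs)
    if ch.isdigit():
        return "N"  # digit
    if ch.isascii() and ch.isalpha():
        return "A"  # ASCII letter
    return "O"      # everything else, including punctuation

# Hypothetical hand-picked weights; a real model learns these from data.
BIAS = -1
BIGRAM_WEIGHTS = {
    ("K", "H"): 3, ("H", "K"): 3,
    ("H", "J"): 3,
    ("K", "J"): 3, ("J", "K"): 3,
    ("H", "O"): 3, ("K", "O"): 3, ("J", "O"): 3,
}

def toy_segment(text):
    """Split text wherever the score of a character gap is positive."""
    if not text:
        return []
    tokens = []
    current = text[0]
    for prev, ch in zip(text, text[1:]):
        score = BIAS + BIGRAM_WEIGHTS.get((char_type(prev), char_type(ch)), 0)
        if score > 0:
            tokens.append(current)  # boundary: close the current token
            current = ch
        else:
            current += ch           # no boundary: extend the current token
    tokens.append(current)
    return tokens

print(toy_segment("ジャガイモが好きです。"))
# → ['ジャガイモ', 'が', '好きです', '。']
```

The toy keeps 好きです as one token because it only sees character classes two at a time; a trained model draws on many more character and character-type n-gram features, which is what lets it make finer distinctions.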
Usage
tiniestsegmenter can be installed from PyPI: pip install tiniestsegmenter
import tiniestsegmenter
tokens = tiniestsegmenter.tokenize("ジャガイモが好きです。")
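The tokens can be passed straight to standard-library tools. The sketch below counts token frequencies across a few documents. It assumes that `tokenize` returns an iterable of string-like tokens, which the text above does not state; `token_counts` is a hypothetical helper name, and the whitespace-splitting fallback is only there so the snippet runs where the package is not installed.

```python
from collections import Counter

try:
    from tiniestsegmenter import tokenize
except ImportError:
    # Stand-in so the sketch runs without the package; NOT a real segmenter.
    def tokenize(text):
        return text.split()


def token_counts(documents):
    """Count how often each token occurs across all documents."""
    counts = Counter()
    for doc in documents:
        # Assumption: tokenize() yields string-like tokens; str() normalises them.
        counts.update(str(tok) for tok in tokenize(doc))
    return counts


counts = token_counts(["ジャガイモが好きです。", "ジャガイモが嫌いです。"])
print(counts.most_common(5))
```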
Because the GIL is released on the Rust side, documents can also be tokenized in parallel across multiple threads.
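import functools
from concurrent.futures import ThreadPoolExecutor

import tiniestsegmenter

tokenizer = functools.partial(tiniestsegmenter.tokenize)
documents = ["ジャガイモが好きです。"] * 10_000
with ThreadPoolExecutor(4) as e:
    # Tokenize on 4 worker threads; map() yields results in input order.
    results = list(e.map(tokenizer, documents))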