Compact Japanese segmenter

Project description

TiniestSegmenter

A port of TinySegmenter written in pure, safe Rust with no dependencies. Bindings are available for both Rust and Python.

TinySegmenter is an n-gram word tokenizer for Japanese text originally built by Taku Kudo (2008).

Usage

tiniestsegmenter can be installed from PyPI: pip install tiniestsegmenter

import tiniestsegmenter

tokens = tiniestsegmenter.tokenize("ジャガイモが好きです。")
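A common next step after tokenizing is counting tokens across a corpus. The following is a minimal sketch, not part of the package: `token_frequencies` is a hypothetical helper, and it assumes the tokenizer passed in returns an iterable of strings, which the example above suggests but does not state.

```python
from collections import Counter
from typing import Callable, Iterable


def token_frequencies(
    tokenize: Callable[[str], Iterable[str]],
    documents: Iterable[str],
) -> Counter:
    """Count how often each token occurs across all documents.

    `tokenize` is any callable mapping a string to an iterable of
    string tokens (assumed here to match tiniestsegmenter.tokenize).
    """
    counts: Counter = Counter()
    for text in documents:
        # Counter.update adds one count per token yielded
        counts.update(tokenize(text))
    return counts
```

With the package installed, `token_frequencies(tiniestsegmenter.tokenize, documents).most_common(10)` should list the ten most frequent tokens, provided the assumption about the return type holds.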

Because the GIL is released on the Rust side, tokenization can also run across multiple threads.

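import functools
from concurrent.futures import ThreadPoolExecutor

import tiniestsegmenter

# partial() with no bound arguments simply wraps tokenize
tokenizer = functools.partial(tiniestsegmenter.tokenize)

documents = ["ジャガイモが好きです。"] * 10_000
with ThreadPoolExecutor(4) as e:
    # map() yields results in the same order as the input documents
    results = list(e.map(tokenizer, documents))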
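The threaded pattern above can be wrapped in a small reusable function. This is a sketch under stated assumptions: `tokenize_all` and its `max_workers` parameter are hypothetical names, not part of tiniestsegmenter, and the tokenizer is passed in as an argument so the helper works with any callable.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def tokenize_all(
    tokenize: Callable[[str], T],
    documents: Sequence[str],
    max_workers: int = 4,
) -> List[T]:
    """Apply `tokenize` to every document using a thread pool.

    Threads only help if `tokenize` releases the GIL while it works,
    as the text above says the Rust side does.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order in its results
        return list(pool.map(tokenize, documents))
```

With the package installed, the call would look like `tokenize_all(tiniestsegmenter.tokenize, documents)`. Whether four workers is a good default depends on the machine and document sizes, so it is worth measuring.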


Download files

Source Distribution

tiniestsegmenter-0.3.0.tar.gz (19.2 kB view details)

Uploaded Source

Built Distribution

tiniestsegmenter-0.3.0-cp310-cp310-macosx_11_0_arm64.whl (223.6 kB view details)

Uploaded CPython 3.10 macOS 11.0+ ARM64

File details

Details for the file tiniestsegmenter-0.3.0.tar.gz.

File metadata

  • Download URL: tiniestsegmenter-0.3.0.tar.gz
  • Size: 19.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.11.9

File hashes

Hashes for tiniestsegmenter-0.3.0.tar.gz
Algorithm Hash digest
SHA256 91c15a10de4d68256b733df48d559ba208e1067ded78f8314861b9d2d3bf2502
MD5 7b39fb0123a30509692952d84f9a2b1f
BLAKE2b-256 bfbf7aa3628a6fda12ee6be30e2b39e8706227d91c91d8348540c7e3ddb1d444

File details

Details for the file tiniestsegmenter-0.3.0-cp310-cp310-macosx_11_0_arm64.whl.

File hashes

Hashes for tiniestsegmenter-0.3.0-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 73259212190adf51a0e0c3731f3b992fbf1e6369730b4c3733d943dae710d5c2
MD5 790f3db999527d1f45bbe4f8d8fa3022
BLAKE2b-256 7483539f252a29fd352662438a6914820566e397bb70dd8e2df06efdbaf3ce8a
