Compact Japanese segmenter

Project description

TiniestSegmenter

A port of TinySegmenter written in pure, safe Rust with no dependencies. Bindings are available for both Rust and Python.

TinySegmenter is an n-gram word tokenizer for Japanese text originally built by Taku Kudo (2008).
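To illustrate the general idea behind this style of tokenizer, here is a minimal toy sketch. It is not TinySegmenter's model: the character classes are simplified, the weights in `TOY_WEIGHTS` and the `BIAS` value are hypothetical, and only a single character-type bigram feature is scored. TinySegmenter itself sums learned weights over many more n-gram features. The sketch scores each gap between two characters and places a word boundary wherever the score is positive.

```python
# Toy illustration only: hand-picked weights, NOT the learned TinySegmenter model.

BIAS = -1  # hypothetical: default to "no boundary" unless a feature says otherwise

# Hypothetical weights keyed by (type of left char, type of right char).
TOY_WEIGHTS = {
    ("K", "I"): 3,  # katakana followed by hiragana
    ("I", "H"): 3,  # hiragana followed by kanji
    ("I", "K"): 3,  # hiragana followed by katakana
    ("I", "O"): 3,  # hiragana followed by punctuation/other
}


def char_type(ch: str) -> str:
    """Map a character to a coarse class (simplified for this sketch)."""
    if "ぁ" <= ch <= "ん":
        return "I"  # hiragana
    if "ァ" <= ch <= "ヴ" or ch == "ー":
        return "K"  # katakana
    if "一" <= ch <= "龠":
        return "H"  # kanji
    return "O"      # everything else


def toy_segment(text: str) -> list[str]:
    """Split text wherever the bigram score at a gap is positive."""
    if not text:
        return []
    tokens = []
    current = text[0]
    for left, right in zip(text, text[1:]):
        score = BIAS + TOY_WEIGHTS.get((char_type(left), char_type(right)), 0)
        if score > 0:
            tokens.append(current)  # boundary: close the current token
            current = right
        else:
            current += right        # no boundary: extend the current token
    tokens.append(current)
    return tokens


print(toy_segment("ジャガイモが好きです。"))
# → ['ジャガイモ', 'が', '好きです', '。']
```

With only one feature the toy keeps 好きです as a single token; a real model would use its additional features to decide whether to split further.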

Usage

tiniestsegmenter can be installed from PyPI: pip install tiniestsegmenter

import tiniestsegmenter

tokens = tiniestsegmenter.tokenize("ジャガイモが好きです。")

Because the GIL is released on the Rust side, the tokenizer can also be called from multiple threads.

from concurrent.futures import ThreadPoolExecutor

import tiniestsegmenter

documents = ["ジャガイモが好きです。"] * 10_000

# Tokenize the documents in parallel across 4 worker threads.
with ThreadPoolExecutor(4) as executor:
    results = list(executor.map(tiniestsegmenter.tokenize, documents))
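As a small follow-up sketch, the threaded call can be wrapped in a helper and checked against a sequential run. The helper name `tokenize_all` and the `workers` parameter are hypothetical, not part of the package. The fallback `tokenize` is only a stand-in so the sketch runs without tiniestsegmenter installed; it does not reproduce the real tokenizer's output.

```python
from concurrent.futures import ThreadPoolExecutor

try:
    import tiniestsegmenter
    tokenize = tiniestsegmenter.tokenize
except ImportError:
    # Stand-in used only when the package is missing; NOT the real tokenizer.
    def tokenize(text):
        return list(text)


def tokenize_all(documents, workers=4):
    """Tokenize documents on a thread pool (hypothetical helper).

    Executor.map yields results in input order, so the output
    lines up with `documents` regardless of which thread finishes first.
    """
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(tokenize, documents))


documents = ["ジャガイモが好きです。", "東京に行きます。"] * 100
threaded = tokenize_all(documents)
sequential = [tokenize(doc) for doc in documents]
print(threaded == sequential)
```

Because the results come back in input order, they can be zipped directly with the original documents.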

Download files

Source Distribution

tiniestsegmenter-0.1.0.tar.gz (322.7 kB)

Uploaded Source

Built Distribution

tiniestsegmenter-0.1.0-cp311-cp311-macosx_11_0_arm64.whl (234.6 kB)

Uploaded CPython 3.11 macOS 11.0+ ARM64

File details

Details for the file tiniestsegmenter-0.1.0.tar.gz.

File metadata

  • Download URL: tiniestsegmenter-0.1.0.tar.gz
  • Upload date:
  • Size: 322.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.11.9

File hashes

Hashes for tiniestsegmenter-0.1.0.tar.gz
Algorithm Hash digest
SHA256 e28fa125bad2ab307219f0d266a534ce4e6998d342511d1d8504c487cf290193
MD5 c0854de3ed6dbf9eeaa503b8a25c22c0
BLAKE2b-256 2dbd9ce529ebe70ee92971f9637f2cf4d5a4f3db5415b943f088820ce6f7a9cb

File details

Details for the file tiniestsegmenter-0.1.0-cp311-cp311-macosx_11_0_arm64.whl.

File hashes

Hashes for tiniestsegmenter-0.1.0-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 64ca46bd117fdf4f21ca5889d060fa50d7608b2b17cfbdf59b75c29f009ab6bc
MD5 38007fa16a2d9535d4106327153bbf75
BLAKE2b-256 bb6abbb80e1ac603f400c171f43adbbb35ee0389bea96fdee502e8f6991d1f4a
