Compact Japanese segmenter

Project description

TiniestSegmenter

A port of TinySegmenter written in pure, safe Rust with no dependencies. Bindings are available for both Rust and Python.

TinySegmenter is an n-gram word tokenizer for Japanese text originally built by Taku Kudo (2008).

Usage

tiniestsegmenter can be installed from PyPI: pip install tiniestsegmenter

import tiniestsegmenter

tokens = tiniestsegmenter.tokenize("ジャガイモが好きです。")

Because the GIL is released on the Rust side, tokenization can also run across multiple threads.

from concurrent.futures import ThreadPoolExecutor

import tiniestsegmenter

documents = ["ジャガイモが好きです。"] * 10_000

# Tokenize the documents on four worker threads; map() keeps the input order.
with ThreadPoolExecutor(4) as e:
    tokens = list(e.map(tiniestsegmenter.tokenize, documents))
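
The threaded pattern above can be wrapped in a small reusable helper. The following is a minimal sketch, not part of the package: the name tokenize_many and its parameters are hypothetical, and the tokenizer is passed in as an argument so the helper works with tiniestsegmenter.tokenize or any other callable that takes a string.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def tokenize_many(
    tokenize: Callable[[str], T],
    documents: Iterable[str],
    workers: int = 4,
) -> List[T]:
    """Apply `tokenize` to every document on a thread pool.

    Hypothetical helper for illustration; `tokenize` is any callable
    that accepts a single string, e.g. tiniestsegmenter.tokenize.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # regardless of which worker thread finishes first.
        return list(pool.map(tokenize, documents))


# Example call (requires the package to be installed):
#   import tiniestsegmenter
#   results = tokenize_many(tiniestsegmenter.tokenize, documents)
```

Threads only give a speed-up here if the tokenizer releases the GIL while it runs, as described above; with a pure-Python callable the same helper still works but runs effectively serially.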

Download files

Source Distribution

tiniestsegmenter-0.2.0.tar.gz (19.3 kB view details)

Uploaded Source

Built Distribution

tiniestsegmenter-0.2.0-cp311-cp311-macosx_11_0_arm64.whl (234.7 kB view details)

Uploaded CPython 3.11 macOS 11.0+ ARM64

File details

Details for the file tiniestsegmenter-0.2.0.tar.gz.

File metadata

  • Download URL: tiniestsegmenter-0.2.0.tar.gz
  • Size: 19.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.11.9

File hashes

Hashes for tiniestsegmenter-0.2.0.tar.gz
Algorithm Hash digest
SHA256 5b92f904af3f550b0fd42a1433c3d45bbe4d606452b889ea1fd51d214b0725ab
MD5 297bba88ad6e735de19452a8a6656c86
BLAKE2b-256 5463dca4442bf2fd8071ae3907ed25288111d4613dc7e4a1362fef201aeb077e
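
A downloaded file can be checked against the published SHA256 digest with Python's standard hashlib module. This is a minimal sketch: the helper names sha256_of and matches_published are hypothetical, and the example call assumes the source distribution was saved to the current directory under its original file name.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Example call, using the SHA256 value listed above:
#   matches_published(
#       "tiniestsegmenter-0.2.0.tar.gz",
#       "5b92f904af3f550b0fd42a1433c3d45bbe4d606452b889ea1fd51d214b0725ab",
#   )
```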

File details

Details for the file tiniestsegmenter-0.2.0-cp311-cp311-macosx_11_0_arm64.whl.

File hashes

Hashes for tiniestsegmenter-0.2.0-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 205f2de3472f024f6fc90742d4236dbbb5ffc53e2b3290aee0a38c0b3a29d2d9
MD5 ec761cc3dbef375b68020273ac1af9d5
BLAKE2b-256 68df17671439664c9e99525354850730a23ae6401a03516bf6d551fa52a99d8c
