Ideographic Tokenizer with CHISE-IDS

IDSpiece

A tokenizer for 漢字/汉字 (Chinese characters) based on Ideographic Description Sequences (IDS) from CHISE-IDS. The sequences follow these conventions:

  • Only nine IDCs (U+2FF0, U+2FF1, U+2FF4 to U+2FFA) are used.
  • An IDC never occurs immediately after another IDC.
  • Immediately after an IDC, Kangxi Radicals and CJK Radicals Supplement (U+2E80 to U+2FD5) are preferred.
  • Otherwise, CJK Unified Ideographs and Extension A (U+3400 to U+9FFC) are preferred.
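
The conventions above can be checked mechanically. The following is a minimal sketch that uses only the code point ranges quoted in the list; the names `IDCS`, `classify`, and `idcs_never_adjacent` are hypothetical helpers for illustration and are not part of the idspiece package.

```python
# The nine IDCs named in the list above: U+2FF0, U+2FF1, U+2FF4 to U+2FFA.
IDCS = {chr(0x2FF0), chr(0x2FF1)} | {chr(cp) for cp in range(0x2FF4, 0x2FFB)}


def classify(ch):
    """Label one character using the ranges quoted in the list above."""
    cp = ord(ch)
    if ch in IDCS:
        return "idc"
    if 0x2E80 <= cp <= 0x2FD5:
        return "radical"      # Kangxi Radicals and CJK Radicals Supplement
    if 0x3400 <= cp <= 0x9FFC:
        return "ideograph"    # CJK Unified Ideographs and Extension A
    return "other"


def idcs_never_adjacent(tokens):
    """Return True if no IDC directly follows another IDC in the joined tokens."""
    ids = "".join(tokens)
    return not any(a in IDCS and b in IDCS for a, b in zip(ids, ids[1:]))
```

Running `idcs_never_adjacent` over a tokenizer's output is one way to confirm the second rule for a given text; characters that `classify` labels "other" fall outside the ranges the list mentions.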

Basic usage

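>>> from idspiece import idstable
>>> def tokenize(text):
...   tokens=[]
...   while text:
...     c=text[0]
...     if c in idstable:
...       # emit the IDC together with the first component as one token
...       tokens.append(idstable[c][0:2])
...       # put the remaining component back so it can be decomposed in turn
...       text=idstable[c][2]+text[1:]
...     else:
...       # no entry for this character: emit it unchanged
...       tokens.append(c)
...       text=text[1:]
...   return tokens
...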
>>> t=tokenize("羯諦羯諦波羅羯諦波羅僧羯諦菩提薩婆訶")
>>> print(t)
['⿰⽺', '⿱⽈', '⿹⼓', '亾', '⿰⾔', '帝', '⿰⽺', '⿱⽈', '⿹⼓', '亾', '⿰⾔', '帝', '⿰⺡', '皮', '⿱⺲', '⿰⽷', '隹', '⿰⽺', '⿱⽈', '⿹⼓', '亾', '⿰⾔', '帝', '⿰⺡', '皮', '⿱⺲', '⿰⽷', '隹', '⿰⺅', '曾', '⿰⽺', '⿱⽈', '⿹⼓', '亾', '⿰⾔', '帝', '⿱⺾', '⿱⽴', '口', '⿰⺘', '⿱⽇', '⿱⼀', '龰', '⿱⺾', '⿰⻖', '⿸产', '生', '⿱波', '女', '⿰⾔', '可']
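
To experiment with the same loop without installing the package, the sketch below substitutes a small toy table. `TOY_TABLE` is a hypothetical stand-in for `idstable`: its five entries are inferred from the output printed above, and it assumes each value is an IDC followed by exactly two components. The real table is far larger and may differ.

```python
# Hypothetical stand-in for idspiece.idstable, with entries inferred from the
# output shown above. Assumption: each value is IDC + two components.
TOY_TABLE = {
    "諦": "⿰⾔帝",
    "波": "⿰⺡皮",
    "僧": "⿰⺅曾",
    "婆": "⿱波女",
    "訶": "⿰⾔可",
}


def tokenize(text, table=TOY_TABLE):
    """Same decomposition as the usage example, using a stack instead of slicing."""
    tokens = []
    pending = list(reversed(text))  # next character to process is at the end
    while pending:
        c = pending.pop()
        entry = table.get(c)
        if entry is None:
            tokens.append(c)           # no entry: emit the character unchanged
        else:
            tokens.append(entry[:2])   # IDC plus first component
            pending.append(entry[2])   # second component is decomposed next
    return tokens


def to_ids(tokens):
    """Join tokens back into one Ideographic Description Sequence string."""
    return "".join(tokens)
```

Using a stack avoids rebuilding the string on every step, which matters only for long inputs. Characters missing from the toy table, such as 羅, pass through undecomposed here even though the real table decomposes them.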

Installation

pip3 install idspiece

Author

Koichi Yasuoka (安岡孝一)

Built Distribution

idspiece-0.6.5-py3-none-any.whl (932.7 kB, Python 3)

File metadata

  • Size: 932.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/52.0.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.2

File hashes

Hashes for idspiece-0.6.5-py3-none-any.whl

Algorithm    Hash digest
SHA256       4aec710919dc9b5aebee6f93402f00a9342191ba63c53308cd3e01b1ff0f5750
MD5          fae92097a49d6c2b8decfdaa7adcd914
BLAKE2b-256  10f7137cde082f5789de0251ab24910b5acdfa367facaf99844bbb6657d8630c
