Character-level tokenizer and morphological feature encoder for NLP pipelines (UniMorph, NFKC, padding, serialization)

chartoken-vp

Character-level tokenizer and morphological feature encoder for NLP pipelines.

Part of the MorphFormer project by Voluntas Progressus.

Installation

pip install chartoken-vp

Requires Python >= 3.14 and PyTorch >= 2.0.

Features

  • CharVocab — character-level tokenizer with NFKC normalization, SOS/EOS/PAD special tokens
  • FeatureVocab — morphological feature encoder for UniMorph tag sets with padding masks
  • Encode text to padded tensors or plain ID lists
  • Serialize/deserialize vocabularies via to_dict() / from_dict() for checkpoint compatibility
  • Dynamic vocabulary expansion from new texts
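The features above can be illustrated with a small stand-in class. This is a minimal sketch written with the standard library only, not the package's implementation: ToyCharVocab, its internal layout, the special-token strings, and the truncation behaviour are all assumptions made for illustration. It returns plain ID lists rather than tensors.

```python
import json
import unicodedata

# Assumed special-token strings; the package exports its own PAD, SOS and EOS.
PAD, SOS, EOS = "<pad>", "<sos>", "<eos>"


class ToyCharVocab:
    """Illustrative stand-in, not the chartoken CharVocab implementation."""

    def __init__(self):
        # Special tokens always occupy IDs 0, 1 and 2.
        self.itos = [PAD, SOS, EOS]
        self.stoi = {tok: i for i, tok in enumerate(self.itos)}

    def add_chars(self, chars):
        # Dynamic expansion: append unseen characters, keep existing IDs stable.
        for ch in chars:
            if ch not in self.stoi:
                self.stoi[ch] = len(self.itos)
                self.itos.append(ch)

    @classmethod
    def from_texts(cls, texts):
        vocab = cls()
        for text in texts:
            vocab.add_chars(unicodedata.normalize("NFKC", text))
        return vocab

    def encode(self, text, max_len):
        # Raises KeyError on characters that are not in the vocabulary.
        text = unicodedata.normalize("NFKC", text)
        ids = [self.stoi[SOS]] + [self.stoi[ch] for ch in text] + [self.stoi[EOS]]
        ids = ids[:max_len]  # simple truncation; may drop EOS on long inputs
        return ids + [self.stoi[PAD]] * (max_len - len(ids))

    def decode(self, ids):
        specials = {self.stoi[PAD], self.stoi[SOS], self.stoi[EOS]}
        return "".join(self.itos[i] for i in ids if i not in specials)

    def to_dict(self):
        return {"itos": list(self.itos)}

    @classmethod
    def from_dict(cls, data):
        vocab = cls()
        vocab.add_chars(data["itos"][3:])  # skip the three special tokens
        return vocab


vocab = ToyCharVocab.from_texts(["hello", "world"])
ids = vocab.encode("hello", max_len=8)

# Round-trip through JSON, as one might when saving alongside a checkpoint.
restored = ToyCharVocab.from_dict(json.loads(json.dumps(vocab.to_dict())))
restored.add_chars("!")  # new characters get fresh IDs; old IDs are unchanged
```

Appending new characters at the end, rather than re-sorting, is what keeps previously saved IDs valid after expansion.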

Quick Start

from chartoken import CharVocab, FeatureVocab

# Build vocab from texts
vocab = CharVocab.from_texts(["hello", "world"])
ids = vocab.encode("hello", max_len=32)
print(vocab.decode(ids.tolist()))  # "hello"

# Feature vocab for morphological tags
feat_vocab = FeatureVocab.from_tags([["V", "IND", "PRS"], ["N", "SG"]])
feat_ids, feat_mask = feat_vocab.encode(["V", "IND"], max_features=12)
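To show what an (ids, mask) pair like the one above typically contains, here is a minimal sketch using plain lists. The helper encode_features, the tag-to-ID mapping, the padding ID of 0, and the convention that True marks a real feature are all assumptions for illustration; the package's own FeatureVocab and FEATURE_PAD may differ.

```python
FEATURE_PAD = 0  # assumed padding ID; the package exports its own FEATURE_PAD


def encode_features(tags, tag_to_id, max_features):
    """Return (ids, mask) as plain lists, both padded to max_features."""
    ids = [tag_to_id[t] for t in tags][:max_features]
    n_real = len(ids)
    n_pad = max_features - n_real
    # Mask convention assumed here: True for real features, False for padding.
    return ids + [FEATURE_PAD] * n_pad, [True] * n_real + [False] * n_pad


# Hypothetical mapping; ID 0 is reserved for padding.
tag_to_id = {"V": 1, "IND": 2, "PRS": 3, "N": 4, "SG": 5}
toy_ids, toy_mask = encode_features(["V", "IND"], tag_to_id, max_features=4)
print(toy_ids)   # [1, 2, 0, 0]
print(toy_mask)  # [True, True, False, False]
```

A downstream model can use the mask to ignore padded positions, for example when pooling feature embeddings.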

API

Class / Constant    Description
CharVocab           Character vocabulary with encode/decode/from_texts/to_dict
FeatureVocab        UniMorph feature vocabulary with encode/to_dict
PAD, SOS, EOS       Special token strings
FEATURE_PAD         Padding ID for feature sequences
normalize_text      NFKC text normalization
SPECIAL_TOKENS      Set of all special tokens
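For readers unfamiliar with NFKC, the sketch below shows what that normalization form does, using Python's standard unicodedata module. It does not call normalize_text itself, which may apply further steps beyond NFKC; the helper name nfkc is hypothetical.

```python
import unicodedata


def nfkc(text):
    # NFKC applies compatibility decomposition followed by canonical composition.
    return unicodedata.normalize("NFKC", text)


print(nfkc("\ufb01le"))                       # ligature "fi" becomes "fi": file
print(nfkc("\uff28\uff45\uff4c\uff4c\uff4f"))  # fullwidth letters become ASCII: Hello
print(nfkc("x\u00b2"))                        # superscript two becomes a plain digit: x2
print(nfkc("e\u0301") == "\u00e9")            # combining accent is composed: True
```

Normalizing before building the vocabulary means visually equivalent inputs map to the same character IDs.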

License

MIT
