Character-level tokenizer and morphological feature encoder for NLP pipelines (UniMorph, NFKC, padding, serialization)
chartoken-vp
Character-level tokenizer and morphological feature encoder for NLP pipelines.
Part of the MorphFormer project by Voluntas Progressus.
Installation
```
pip install chartoken-vp
```
Requires Python >= 3.14 and PyTorch >= 2.0.
Features
- CharVocab — character-level tokenizer with NFKC normalization, SOS/EOS/PAD special tokens
- FeatureVocab — morphological feature encoder for UniMorph tag sets with padding masks
- Encode text to padded tensors or plain ID lists
- Serialize/deserialize vocabularies via `to_dict()`/`from_dict()` for checkpoint compatibility
- Dynamic vocabulary expansion from new texts
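To make the NFKC step concrete, here is a minimal sketch that uses only the standard library's `unicodedata` module. It is not the package's own `normalize_text`, which may do more than this. The helper name `nfkc` is hypothetical.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Apply Unicode NFKC: compatibility decomposition, then canonical composition."""
    return unicodedata.normalize("NFKC", text)


# Compatibility characters collapse to their plain equivalents,
# so visually similar inputs map to the same character sequence.
ligature = nfkc("\ufb01le")              # "fi" ligature followed by "le"
fullwidth = nfkc("\uff21\uff22\uff23")   # full-width Latin letters A, B, C
composed = nfkc("e\u0301")               # "e" plus a combining acute accent
```

Normalizing before building a character vocabulary keeps such variants from taking up separate IDs.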
Quick Start
```python
from chartoken import CharVocab, FeatureVocab

# Build vocab from texts
vocab = CharVocab.from_texts(["hello", "world"])
ids = vocab.encode("hello", max_len=32)
print(vocab.decode(ids.tolist()))  # "hello"

# Feature vocab for morphological tags
feat_vocab = FeatureVocab.from_tags([["V", "IND", "PRS"], ["N", "SG"]])
feat_ids, feat_mask = feat_vocab.encode(["V", "IND"], max_features=12)
```
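For readers who want to see what character-level encoding with SOS/EOS/PAD tokens involves, the following is a rough, dependency-free sketch. It is not the `CharVocab` implementation. The token strings, the ID ordering, the function names, and the choice to raise on overlong input are all assumptions made for illustration, and it returns plain lists rather than tensors.

```python
# Hypothetical special-token strings; the package's actual values may differ.
PAD, SOS, EOS = "<pad>", "<sos>", "<eos>"


def build_vocab(texts):
    """Assign IDs to the special tokens first, then to each distinct character."""
    symbols = [PAD, SOS, EOS] + sorted({ch for text in texts for ch in text})
    return {sym: i for i, sym in enumerate(symbols)}


def encode(text, stoi, max_len):
    """Wrap text in SOS/EOS, then right-pad with PAD up to max_len."""
    ids = [stoi[SOS]] + [stoi[ch] for ch in text] + [stoi[EOS]]
    if len(ids) > max_len:
        raise ValueError("text does not fit in max_len")
    return ids + [stoi[PAD]] * (max_len - len(ids))


def decode(ids, stoi):
    """Map IDs back to characters, dropping the special tokens."""
    itos = {i: sym for sym, i in stoi.items()}
    specials = {PAD, SOS, EOS}
    return "".join(itos[i] for i in ids if itos[i] not in specials)


stoi = build_vocab(["hello", "world"])
ids = encode("hello", stoi, max_len=10)
text = decode(ids, stoi)
```

In this sketch a character missing from the vocabulary raises `KeyError`. A real tokenizer would typically map it to an unknown token or expand the vocabulary instead.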
API
| Class / Constant | Description |
|---|---|
| `CharVocab` | Character vocabulary with `encode`/`decode`/`from_texts`/`to_dict` |
| `FeatureVocab` | UniMorph feature vocabulary with `encode`/`to_dict` |
| `PAD`, `SOS`, `EOS` | Special token strings |
| `FEATURE_PAD` | Padding ID for feature sequences |
| `normalize_text` | NFKC text normalization |
| `SPECIAL_TOKENS` | Set of all special tokens |
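The idea of padded feature IDs with a mask, and of a dictionary round trip for checkpoints, can be sketched without the package as follows. This is an illustration under stated assumptions, not the `FeatureVocab` implementation: the padding ID of 0, the mask polarity (`True` marks a real feature), the truncation behavior, and the function names are all hypothetical.

```python
import json

FEATURE_PAD = 0  # assumed padding ID; the package's actual value is not stated here


def build_feature_vocab(tag_sets):
    """Assign each distinct tag an ID, starting just above the padding ID."""
    tags = sorted({tag for tag_set in tag_sets for tag in tag_set})
    return {tag: i for i, tag in enumerate(tags, start=FEATURE_PAD + 1)}


def encode_features(tags, tag_to_id, max_features):
    """Return (ids, mask), both of length max_features."""
    ids = [tag_to_id[t] for t in tags][:max_features]  # truncate if too long
    n_pad = max_features - len(ids)
    mask = [True] * len(ids) + [False] * n_pad          # True = real feature
    return ids + [FEATURE_PAD] * n_pad, mask


tag_to_id = build_feature_vocab([["V", "IND", "PRS"], ["N", "SG"]])
feat_ids, feat_mask = encode_features(["V", "IND"], tag_to_id, max_features=4)

# A plain dict survives a JSON round trip, which is what makes a
# dict-based export convenient to store alongside a model checkpoint.
restored = json.loads(json.dumps(tag_to_id))
```

The mask lets downstream code ignore padded positions, for example when pooling feature embeddings.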
License
MIT
File details
Details for the file chartoken_vp-1.1.0.tar.gz.
File metadata
- Download URL: chartoken_vp-1.1.0.tar.gz
- Upload date:
- Size: 4.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | af84d55bbc2e99ec47cd959f1049122830e35688b8591750b934f0986938bdfe |
| MD5 | 5cb6a4ae6a9395b5fff8891244180ef3 |
| BLAKE2b-256 | 043e29a11dfe8e94692352abc0fbdf00853880cb165c44982a4730289c216094 |
File details
Details for the file chartoken_vp-1.1.0-py3-none-any.whl.
File metadata
- Download URL: chartoken_vp-1.1.0-py3-none-any.whl
- Upload date:
- Size: 5.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8f84231d0391782ce587b3c17dadd5245b5f19957f36cee8b5c01e5e6aa14e0a |
| MD5 | 8aca4f4131ddfba536fa1decaf3c1fdb |
| BLAKE2b-256 | 4793a54b5164b8cd470db8b0cb9e9a91654c6b34e2e039362790808b53b31cbd |