chartoken-vp

Character-level tokenizer and typed morphological feature vocabulary for multilingual NLP pipelines.
chartoken-vp is a small, typed package for character-level text vocabularies and morphological feature vocabularies.
PyPI package name:

```
pip install chartoken-vp
```

Import name:

```python
import chartoken
```
The library is intentionally narrow in scope. It does not try to be a full tokenizer framework. It gives you a stable, strictly typed foundation for:
- character vocabularies for sequence models
- UniMorph-style feature vocabularies
- deterministic serialization for checkpoints
- simple tensor conversion helpers for PyTorch code
Why this package exists
Morphological reinflection and other low-level text tasks often work better with characters than with subword tokenizers. In those pipelines you usually need two parallel vocabularies:
- one vocabulary for characters in source and target strings
- one vocabulary for morphological tags such as `PST`, `SG`, `NOM`, `V`, and so on
chartoken-vp keeps those concerns separate and explicit.
Main components
CharVocab
CharVocab builds a character inventory from raw texts and exposes:
`from_texts`, `encode`, `encode_ids`, `decode`, `to_dict`, `from_dict`
The vocabulary uses three built-in special tokens:
`PAD = 0`, `SOS = 1`, `EOS = 2`
All text is normalized with Unicode NFKC via normalize_text.
FeatureVocab
FeatureVocab builds a vocabulary over feature tags and exposes:
`from_tags`, `encode`, `encode_tensor`, `to_dict`, `from_dict`
Feature sequences are padded with `FEATURE_PAD = 0` and returned together with a float mask.
Installation
Requirements:
- Python >= 3.14
- PyTorch >= 2.0
Install from PyPI:
```
pip install chartoken-vp
```
Quick start
```python
from chartoken import CharVocab, FeatureVocab

texts = ["walk", "walked", "go", "went"]
tag_sets = [
    ["V", "PRS"],
    ["V", "PST"],
    ["V", "PRS"],
    ["V", "PST"],
]

char_vocab = CharVocab.from_texts(texts)
feature_vocab = FeatureVocab.from_tags(tag_sets)

token_ids = char_vocab.encode_ids("walk", max_len=12)
feature_ids, feature_mask = feature_vocab.encode(["V", "PST"], max_features=8)

print(token_ids)
print(feature_ids, feature_mask)
print(char_vocab.decode(token_ids))
```
Character vocabulary behavior
Encoding works as:
- normalize input text with NFKC
- prepend `<sos>`
- append `<eos>`
- truncate to `max_len`
- right-pad with `<pad>`
This makes the output predictable and checkpoint-friendly.
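The pipeline above can be mimicked in plain Python. This is a simplified stand-in for `CharVocab.encode_ids`, not the package's implementation; it uses the documented special-token ids (`PAD = 0`, `SOS = 1`, `EOS = 2`), and the `stoi` mapping is illustrative:

```python
import unicodedata

PAD, SOS, EOS = 0, 1, 2  # documented special-token ids

def encode_ids(text: str, stoi: dict[str, int], max_len: int) -> list[int]:
    """Sketch of the documented pipeline: normalize, frame, truncate, pad."""
    text = unicodedata.normalize("NFKC", text)
    ids = [SOS] + [stoi[ch] for ch in text] + [EOS]
    ids = ids[:max_len]                   # truncate to max_len
    ids += [PAD] * (max_len - len(ids))   # right-pad with <pad>
    return ids

stoi = {"l": 3, "e": 4, "m": 5, "a": 6}  # illustrative character-to-id map
print(encode_ids("lemma", stoi, max_len=10))
# [1, 3, 4, 5, 5, 6, 2, 0, 0, 0]
```

Because truncation happens after framing, a long input can lose its `<eos>`; fixed-length outputs are what keeps batches checkpoint-friendly.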
Example:
```python
from chartoken import CharVocab

vocab = CharVocab.from_texts(["lemma", "form"])
tensor = vocab.encode("lemma", max_len=10)
print(tensor.shape)
```
`encode` returns a `torch.Tensor`, while `encode_ids` returns `list[int]`. That split is useful when you want preprocessing logic without eagerly creating tensors.
Feature vocabulary behavior
Feature tags are treated as an unordered list supplied by the caller. The package:
- maps known tags to integer ids
- truncates to `max_features`
- pads the remainder with `FEATURE_PAD`
- returns a float mask aligned with the ids
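These steps can be sketched in plain Python as well. This is a simplified stand-in for `FeatureVocab.encode`, using the documented `FEATURE_PAD = 0` and Python lists in place of tensors; the `tag_to_id` mapping is illustrative:

```python
FEATURE_PAD = 0  # documented padding id for feature slots

def encode_features(
    tags: list[str], tag_to_id: dict[str, int], max_features: int
) -> tuple[list[int], list[float]]:
    """Sketch of the documented feature encoding: map, truncate, pad, mask."""
    ids = [tag_to_id[t] for t in tags][:max_features]
    mask = [1.0] * len(ids)               # 1.0 over real tags
    pad = max_features - len(ids)
    return ids + [FEATURE_PAD] * pad, mask + [0.0] * pad

tag_to_id = {"N": 1, "SG": 2, "PL": 3}   # illustrative tag-to-id map
ids, mask = encode_features(["N", "SG"], tag_to_id, max_features=6)
print(ids)   # [1, 2, 0, 0, 0, 0]
print(mask)  # [1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```

The mask lets downstream attention or pooling code ignore padded slots without inspecting the ids themselves.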
Example:
```python
from chartoken import FeatureVocab

vocab = FeatureVocab.from_tags([["N", "SG"], ["N", "PL"], ["V", "PST"]])
ids, mask = vocab.encode(["N", "SG"], max_features=6)
```
If you want tensors directly:
```python
ids_tensor, mask_tensor = vocab.encode_tensor(["N", "SG"], max_features=6)
```
Serialization
Both vocabularies are serializable to plain dictionaries and back:
```python
state = char_vocab.to_dict()
restored = CharVocab.from_dict(state)
```
This is useful for:
- checkpoint payloads
- experiment reproducibility
- packaging trained models
- keeping training and inference vocabularies aligned
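Because the state is a plain dict, it survives any serializer that handles dicts. A minimal sketch of the checkpoint pattern, where the `state` contents are illustrative rather than the actual `to_dict` schema:

```python
import json

# illustrative stand-in for char_vocab.to_dict(); the real schema may differ
state = {"itos": ["<pad>", "<sos>", "<eos>", "a", "b", "c"]}

# a plain dict survives a JSON round trip, so it embeds cleanly
# in checkpoint payloads alongside model weights
payload = json.dumps({"char_vocab": state})
restored = json.loads(payload)["char_vocab"]
assert restored == state
```

The same dict can equally be stored inside a `torch.save` checkpoint; nothing about the state ties it to one serialization format.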
Typing
This package ships `py.typed` and is meant to be consumed by pyright/Pylance-aware codebases.
Typed state objects:
`CharVocabState`, `FeatureVocabState`
Exported constants:
`PAD`, `SOS`, `EOS`, `FEATURE_PAD`, `SPECIAL_TOKENS`
Typical integration pattern
chartoken-vp is designed to sit underneath dataset and model packages.
A common stack looks like:
- read raw TSV rows
- build `CharVocab` from lemmas and surfaces
- build `FeatureVocab` from tag lists
- pre-encode examples into tensors
- save vocab state in checkpoints
- reuse the same states at inference time
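The front of that stack can be sketched end to end. The TSV column layout here (lemma, surface form, semicolon-joined tags) is an assumption for illustration, not a format the package mandates:

```python
import csv
import io

# hypothetical TSV layout: lemma <TAB> surface form <TAB> semicolon-joined tags
raw = "walk\twalked\tV;PST\ngo\twent\tV;PST\n"

lemmas, surfaces, tag_sets = [], [], []
for lemma, surface, tags in csv.reader(io.StringIO(raw), delimiter="\t"):
    lemmas.append(lemma)
    surfaces.append(surface)
    tag_sets.append(tags.split(";"))

# inputs for CharVocab.from_texts and FeatureVocab.from_tags
texts = lemmas + surfaces
print(texts)     # ['walk', 'go', 'walked', 'went']
print(tag_sets)  # [['V', 'PST'], ['V', 'PST']]
```

Building the character vocabulary from lemmas and surfaces together ensures both sides of each example encode with the same id space.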
What this package deliberately does not do
It does not include:
- BPE or sentencepiece tokenization
- dataset downloading
- batching or dataloaders
- model architectures
- training loops
That separation is intentional. chartoken-vp should stay easy to publish, easy to test, and easy to embed into larger systems.
Project details
File details
Details for the file chartoken_vp-2.1.0.tar.gz.
File metadata
- Download URL: chartoken_vp-2.1.0.tar.gz
- Upload date:
- Size: 5.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `c2560faa6870ae7c4df722c0feb4b39eeff362cea3192691281ffaaead01d843` |
| MD5 | `714a3f41ade0248d318292873d236ebc` |
| BLAKE2b-256 | `89db0ca268cb1dc3c50a41cf6bb0eb3eb144852e79cb758049dce72faf6dad6d` |
File details
Details for the file chartoken_vp-2.1.0-py3-none-any.whl.
File metadata
- Download URL: chartoken_vp-2.1.0-py3-none-any.whl
- Upload date:
- Size: 5.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `72a63467133c10af61b740881902df9b73f066e1d3f04b56febde779c0f93e90` |
| MD5 | `722b2d27248e480395cb1cbc22c1b1ec` |
| BLAKE2b-256 | `1b0d1170347a4d0b4208791895e294e58b5af516cd95132bd241e74b053e2093` |