Character-level tokenizer and typed morphological feature vocabulary for multilingual NLP pipelines.

These details have not been verified by PyPI

Project description

chartoken-vp

chartoken-vp is a small, typed package for character-level text vocabularies and morphological feature vocabularies.

PyPI package name:

pip install chartoken-vp

Import name:

import chartoken

The library is intentionally narrow in scope. It does not try to be a full tokenizer framework. It gives you a stable, strictly typed foundation for:

character vocabularies for sequence models
UniMorph-style feature vocabularies
deterministic serialization for checkpoints
simple tensor conversion helpers for PyTorch code

Why this package exists

Morphological reinflection and other low-level text tasks often work better with character-level input than with subword tokenization. In those pipelines you usually need two parallel vocabularies:

one vocabulary for characters in source and target strings
one vocabulary for morphological tags such as PST, SG, NOM, V, and so on

chartoken-vp keeps those concerns separate and explicit.

Main components

`CharVocab`

CharVocab builds a character inventory from raw texts and exposes:

from_texts
encode
encode_ids
decode
to_dict
from_dict

The vocabulary uses three built-in special tokens:

PAD = 0
SOS = 1
EOS = 2

All text is normalized with Unicode NFKC via normalize_text.
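
To make the effect of NFKC concrete, here is a minimal sketch using only the standard library. The helper name normalize_text_sketch is hypothetical and stands in for the package's normalize_text, whose exact implementation is not shown in this README; the sketch assumes it is plain NFKC with no extra steps.

```python
import unicodedata


def normalize_text_sketch(text: str) -> str:
    """Hypothetical stand-in for normalize_text: plain NFKC, nothing else."""
    return unicodedata.normalize("NFKC", text)


# Compatibility characters fold into their plain equivalents, and
# base + combining sequences are composed into single code points.
ligature = normalize_text_sketch("\ufb01n")  # U+FB01 ligature followed by "n"
fullwidth = normalize_text_sketch("\uff37\uff41\uff4c\uff4b")  # fullwidth "Walk"
composed = normalize_text_sketch("e\u0301")  # "e" + combining acute accent

print(ligature, fullwidth, composed, len(composed))
```

Because visually identical strings collapse to the same code points, the character inventory does not grow with encoding variants of the same text.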

`FeatureVocab`

FeatureVocab builds a vocabulary over feature tags and exposes:

from_tags
encode
encode_tensor
to_dict
from_dict

Feature sequences are padded with FEATURE_PAD = 0 and returned together with a float mask.

Installation

Requirements:

Python >=3.14
PyTorch >=2.0

Install from PyPI:

pip install chartoken-vp

Quick start

from chartoken import CharVocab, FeatureVocab

# Raw strings and their tag lists, aligned by position.
texts = ["walk", "walked", "go", "went"]
tag_sets = [
    ["V", "PRS"],
    ["V", "PST"],
    ["V", "PRS"],
    ["V", "PST"],
]

# Build both vocabularies from the data.
char_vocab = CharVocab.from_texts(texts)
feature_vocab = FeatureVocab.from_tags(tag_sets)

# encode_ids returns list[int]; FeatureVocab.encode returns padded ids plus a float mask.
token_ids = char_vocab.encode_ids("walk", max_len=12)
feature_ids, feature_mask = feature_vocab.encode(["V", "PST"], max_features=8)

print(token_ids)
print(feature_ids, feature_mask)
print(char_vocab.decode(token_ids))

Character vocabulary behavior

Encoding applies these steps in order:

1. normalize the input text with NFKC
2. prepend <sos>
3. append <eos>
4. truncate to max_len
5. right-pad with <pad>

This makes the output predictable and checkpoint-friendly.

Example:

from chartoken import CharVocab

vocab = CharVocab.from_texts(["lemma", "form"])
tensor = vocab.encode("lemma", max_len=10)
print(tensor.shape)

encode returns a torch.Tensor, while encode_ids returns list[int]. That split is useful when you want preprocessing logic without eagerly creating tensors.
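
The encoding steps listed above can be sketched in plain Python without the package. This is an illustration under assumptions, not the library's code: build_char_index and encode_ids_sketch are hypothetical names, assigning ids 3, 4, ... in sorted character order is a guess, and the sketch raises KeyError for characters outside the vocabulary because this README does not say how the package handles them.

```python
import unicodedata

PAD, SOS, EOS = 0, 1, 2  # special-token ids stated in this README


def build_char_index(texts: list[str]) -> dict[str, int]:
    """Assign ids 3, 4, ... to characters in sorted order (an assumption)."""
    chars = sorted({ch for t in texts for ch in unicodedata.normalize("NFKC", t)})
    return {ch: i + 3 for i, ch in enumerate(chars)}


def encode_ids_sketch(text: str, index: dict[str, int], max_len: int) -> list[int]:
    normalized = unicodedata.normalize("NFKC", text)        # normalize
    ids = [SOS] + [index[ch] for ch in normalized] + [EOS]  # add <sos> and <eos>
    ids = ids[:max_len]                                     # truncate to max_len
    return ids + [PAD] * (max_len - len(ids))               # right-pad with <pad>


index = build_char_index(["lemma", "form"])
padded = encode_ids_sketch("lemma", index, max_len=10)
clipped = encode_ids_sketch("lemma", index, max_len=4)

print(padded)
print(clipped)
```

Applying the steps literally in the listed order means a max_len shorter than the text plus two cuts off the <eos> id, as the second call shows. Whether the package keeps <eos> in that case is not stated here, so it is worth checking before relying on it.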

Feature vocabulary behavior

Feature tags are supplied by the caller as a list, and the package attaches no meaning to their order. The package:

maps known tags to integer ids
truncates to max_features
pads the remainder with FEATURE_PAD
returns a float mask aligned with the ids

Example:

from chartoken import FeatureVocab

vocab = FeatureVocab.from_tags([["N", "SG"], ["N", "PL"], ["V", "PST"]])
ids, mask = vocab.encode(["N", "SG"], max_features=6)

If you want tensors directly:

ids_tensor, mask_tensor = vocab.encode_tensor(["N", "SG"], max_features=6)
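
The four steps listed earlier in this section can be sketched in plain Python. The names build_tag_index and encode_features_sketch are hypothetical, and three details are assumptions rather than documented behavior: tag ids are assigned in sorted order starting at 1, tags missing from the vocabulary are silently dropped, and the mask holds 1.0 for real tags and 0.0 for padding.

```python
FEATURE_PAD = 0  # padding id stated in this README


def build_tag_index(tag_sets: list[list[str]]) -> dict[str, int]:
    """Assign ids 1, 2, ... to tags in sorted order (an assumption)."""
    inventory = sorted({tag for tag_list in tag_sets for tag in tag_list})
    return {tag: i + 1 for i, tag in enumerate(inventory)}


def encode_features_sketch(
    tags: list[str], index: dict[str, int], max_features: int
) -> tuple[list[int], list[float]]:
    known = [index[t] for t in tags if t in index]  # unknown tags dropped (assumption)
    ids = known[:max_features]                      # truncate to max_features
    pad = max_features - len(ids)
    mask = [1.0] * len(ids) + [0.0] * pad           # float mask aligned with ids
    return ids + [FEATURE_PAD] * pad, mask


tag_index = build_tag_index([["N", "SG"], ["N", "PL"], ["V", "PST"]])
sketch_ids, sketch_mask = encode_features_sketch(["N", "SG"], tag_index, max_features=6)

print(sketch_ids)
print(sketch_mask)
```

A mask of this shape lets a model sum or average tag embeddings while ignoring the padded positions.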

Serialization

Both vocabularies are serializable to plain dictionaries and back:

state = char_vocab.to_dict()
restored = CharVocab.from_dict(state)

This is useful for:

checkpoint payloads
experiment reproducibility
packaging trained models
keeping training and inference vocabularies aligned
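
As a rough illustration of why a plain-dictionary state is convenient, the sketch below round-trips a state through JSON with a deterministic encoding. The SketchVocabState layout and its itos field are hypothetical; the real CharVocabState and FeatureVocabState may use different keys, so treat this as a pattern rather than the package's format.

```python
import json
from typing import TypedDict


class SketchVocabState(TypedDict):
    """Hypothetical state layout; the package's typed states may differ."""

    itos: list[str]  # index-to-symbol table, position equals id


def dumps_state(state: SketchVocabState) -> str:
    # sort_keys and fixed separators keep the output identical across runs.
    return json.dumps(state, sort_keys=True, ensure_ascii=False, separators=(",", ":"))


state: SketchVocabState = {"itos": ["<pad>", "<sos>", "<eos>", "a", "e", "l", "m"]}
payload = dumps_state(state)
restored: SketchVocabState = json.loads(payload)

print(payload)
print(restored == state)
```

The same dictionary could equally be stored as one entry inside a model checkpoint, so that the ids seen at inference time match the ones used in training.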

Typing

This package ships py.typed and is meant to be consumed by pyright/Pylance-aware codebases.

Typed state objects:

CharVocabState
FeatureVocabState

Exported constants:

PAD
SOS
EOS
FEATURE_PAD
SPECIAL_TOKENS

Typical integration pattern

chartoken-vp is designed to sit underneath dataset and model packages.

A common stack looks like:

read raw TSV rows
build CharVocab from lemmas and surfaces
build FeatureVocab from tag lists
pre-encode examples into tensors
save vocab state in checkpoints
reuse the same states at inference time
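
The first three steps of that stack might look like the sketch below. It assumes a three-column TSV (lemma, surface form, semicolon-joined tags) modeled on UniMorph-style files; the column layout, the sample rows, and the helper name read_rows are illustrative assumptions, and the sketch stops before calling the package so it runs on its own.

```python
import csv
import io
from typing import TextIO

# Hypothetical sample data: lemma, surface form, semicolon-joined tags.
SAMPLE_TSV = "walk\twalked\tV;PST\ngo\twent\tV;PST\ngo\tgoes\tV;PRS;3;SG\n"


def read_rows(handle: TextIO) -> tuple[list[str], list[list[str]]]:
    """Collect strings for the character vocab and tag lists for the feature vocab."""
    texts: list[str] = []
    tag_sets: list[list[str]] = []
    # QUOTE_NONE keeps quote characters in the data as ordinary text.
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        lemma, surface, tags = row
        texts.extend([lemma, surface])
        tag_sets.append(tags.split(";"))
    return texts, tag_sets


texts, tag_sets = read_rows(io.StringIO(SAMPLE_TSV))

print(texts)
print(tag_sets)
```

The two returned lists have the shapes that CharVocab.from_texts and FeatureVocab.from_tags take in the Quick start section.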

What this package deliberately does not do

It does not include:

BPE or sentencepiece tokenization
dataset downloading
batching or dataloaders
model architectures
training loops

That separation is intentional. chartoken-vp should stay easy to publish, easy to test, and easy to embed into larger systems.

Project details

Release history

Version	Release date
3.0.0	Apr 20, 2026
2.2.0	Apr 2, 2026
2.1.5	Mar 30, 2026
2.1.4 (this version)	Mar 30, 2026
2.1.3	Mar 30, 2026
2.1.2	Mar 30, 2026
2.1.1	Mar 30, 2026
2.1.0	Mar 29, 2026
2.0.2	Mar 29, 2026
2.0.1	Mar 29, 2026
2.0.0	Mar 29, 2026
1.1.0	Mar 28, 2026
1.0.0	Mar 28, 2026

Download files

Source Distribution

chartoken_vp-2.1.4.tar.gz (5.3 kB view details)

Uploaded Mar 30, 2026 Source

Built Distribution

chartoken_vp-2.1.4-py3-none-any.whl (5.7 kB view details)

Uploaded Mar 30, 2026 Python 3

File details

Details for the file chartoken_vp-2.1.4.tar.gz.

File metadata

Download URL: chartoken_vp-2.1.4.tar.gz
Upload date: Mar 30, 2026
Size: 5.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for chartoken_vp-2.1.4.tar.gz
Algorithm	Hash digest
SHA256	`e3ed42bd57085f56e024a56d6e602916448d70e6b2c482cf27275ef441fb0c93`
MD5	`faaeec9bc8e1874f4c0f3f4265b87640`
BLAKE2b-256	`8b2b87b77a94e36717bc0cd511b070e0c5c3dba969a7850475274d9f83c58f6f`

File details

Details for the file chartoken_vp-2.1.4-py3-none-any.whl.

File metadata

Download URL: chartoken_vp-2.1.4-py3-none-any.whl
Upload date: Mar 30, 2026
Size: 5.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for chartoken_vp-2.1.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6324428ed198a14fe005bc50348b6cf3d39e0ddddeb059ca8b67b8eccd7f95d0`
MD5	`c7f7259c0cceab1a3e1123eff471c5de`
BLAKE2b-256	`c924e92c9feb28912d7969439fe6e924acafd6936b765fd99c13ba50c6c8947f`
