Character-level tokenizer and typed morphological feature vocabulary for multilingual NLP pipelines.

chartoken-vp

chartoken-vp is a small, typed package for character-level text vocabularies and morphological feature vocabularies.

PyPI package name:

pip install chartoken-vp

Import name:

import chartoken

The library is intentionally narrow in scope. It does not try to be a full tokenizer framework. It gives you a stable, strictly typed foundation for:

  • character vocabularies for sequence models
  • UniMorph-style feature vocabularies
  • deterministic serialization for checkpoints
  • simple tensor conversion helpers for PyTorch code

Why this package exists

Morphological reinflection and other low-level text tasks often work better with character-level input than with subword tokenization. In those pipelines you usually need two parallel vocabularies:

  • one vocabulary for characters in source and target strings
  • one vocabulary for morphological tags such as PST, SG, NOM, V, and so on

chartoken-vp keeps those concerns separate and explicit.

Main components

CharVocab

CharVocab builds a character inventory from raw texts and exposes:

  • from_texts
  • encode
  • encode_ids
  • decode
  • to_dict
  • from_dict

The vocabulary uses three built-in special tokens:

  • PAD = 0
  • SOS = 1
  • EOS = 2

All text is normalized with Unicode NFKC via normalize_text.
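
To make the effect of NFKC concrete, here is a minimal sketch that uses only the standard library's unicodedata module. The helper name nfkc is hypothetical and is not the package's normalize_text; it only illustrates what this normalization form does to compatibility characters and combining marks.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Return the NFKC-normalized form of text (illustrative helper)."""
    return unicodedata.normalize("NFKC", text)


samples = [
    "\ufb01ne",     # starts with the single-codepoint ligature U+FB01
    "\uff37alk",    # starts with the full-width letter U+FF37
    "cafe\u0301",   # "e" followed by a combining acute accent
]

for sample in samples:
    normalized = nfkc(sample)
    # Show the raw and normalized strings plus their lengths in code points.
    print(repr(sample), len(sample), "->", repr(normalized), len(normalized))
```

Strings that look identical on screen can map to different character ids unless they are normalized first, which is why normalization matters for a character vocabulary.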

FeatureVocab

FeatureVocab builds a vocabulary over feature tags and exposes:

  • from_tags
  • encode
  • encode_tensor
  • to_dict
  • from_dict

Feature sequences are padded with FEATURE_PAD = 0 and returned together with a float mask.

Installation

Requirements:

  • Python >=3.14
  • PyTorch >=2.0

Install from PyPI:

pip install chartoken-vp

Quick start

from chartoken import CharVocab, FeatureVocab

texts = ["walk", "walked", "go", "went"]
tag_sets = [
    ["V", "PRS"],
    ["V", "PST"],
    ["V", "PRS"],
    ["V", "PST"],
]

char_vocab = CharVocab.from_texts(texts)
feature_vocab = FeatureVocab.from_tags(tag_sets)

token_ids = char_vocab.encode_ids("walk", max_len=12)
feature_ids, feature_mask = feature_vocab.encode(["V", "PST"], max_features=8)

print(token_ids)
print(feature_ids, feature_mask)
print(char_vocab.decode(token_ids))

Character vocabulary behavior

Encoding works as:

  1. normalize input text with NFKC
  2. prepend <sos>
  3. append <eos>
  4. truncate to max_len
  5. right-pad with <pad>

This makes the output predictable and checkpoint-friendly.
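
The five steps can be sketched in plain Python without the package or PyTorch. This is an illustration under stated assumptions, not the package's implementation: the helper names, the sorted id assignment starting at 3, and the KeyError on unseen characters are all hypothetical choices made for the sketch.

```python
import unicodedata

PAD, SOS, EOS = 0, 1, 2  # special ids as listed in this README


def build_char_to_id(texts: list[str]) -> dict[str, int]:
    """Assign ids 3, 4, ... to characters in sorted order (assumed ordering)."""
    chars = sorted({ch for t in texts for ch in unicodedata.normalize("NFKC", t)})
    return {ch: index + 3 for index, ch in enumerate(chars)}


def encode_ids_sketch(text: str, char_to_id: dict[str, int], max_len: int) -> list[int]:
    """Follow the five listed steps literally, in order."""
    normalized = unicodedata.normalize("NFKC", text)   # 1. normalize
    ids = [SOS]                                        # 2. prepend <sos>
    ids += [char_to_id[ch] for ch in normalized]       # unseen chars raise KeyError here
    ids.append(EOS)                                    # 3. append <eos>
    ids = ids[:max_len]                                # 4. truncate
    ids += [PAD] * (max_len - len(ids))                # 5. right-pad
    return ids


table = build_char_to_id(["lemma", "form"])
print(encode_ids_sketch("lemma", table, max_len=10))
print(encode_ids_sketch("lemma", table, max_len=4))
```

Applied literally in this order, truncation can cut off the trailing <eos> when the input is longer than max_len - 2 characters, as the second call shows. Whether the package itself preserves <eos> in that case is not stated here, so it is worth checking before relying on it.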

Example:

from chartoken import CharVocab

vocab = CharVocab.from_texts(["lemma", "form"])
tensor = vocab.encode("lemma", max_len=10)
print(tensor.shape)

encode returns a torch.Tensor, while encode_ids returns list[int]. That split is useful when you want preprocessing logic without eagerly creating tensors.

Feature vocabulary behavior

Feature tags are supplied by the caller as a flat list and are treated as an unordered set of features: their order carries no meaning. The package:

  • maps known tags to integer ids
  • truncates to max_features
  • pads the remainder with FEATURE_PAD
  • returns a float mask aligned with the ids
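
The four behaviors above can be sketched in plain Python. This is a hedged illustration rather than the package's code: the helper names are hypothetical, tag ids are assumed to start at 1 in sorted order, and unknown tags are assumed to be skipped, which is only one possible reading of "maps known tags".

```python
FEATURE_PAD = 0  # padding id as listed in this README


def build_tag_to_id(tag_sets: list[list[str]]) -> dict[str, int]:
    """Assign ids 1, 2, ... to tags in sorted order (assumed ordering)."""
    tags = sorted({tag for tag_set in tag_sets for tag in tag_set})
    return {tag: index + 1 for index, tag in enumerate(tags)}


def encode_features_sketch(
    tags: list[str], tag_to_id: dict[str, int], max_features: int
) -> tuple[list[int], list[float]]:
    """Return (ids, mask), both of length max_features."""
    ids = [tag_to_id[tag] for tag in tags if tag in tag_to_id]  # skip unknown tags (assumption)
    ids = ids[:max_features]                                    # truncate
    pad = max_features - len(ids)
    mask = [1.0] * len(ids) + [0.0] * pad                       # 1.0 marks real tags
    return ids + [FEATURE_PAD] * pad, mask


tag_table = build_tag_to_id([["N", "SG"], ["N", "PL"], ["V", "PST"]])
ids, mask = encode_features_sketch(["N", "SG"], tag_table, max_features=6)
print(ids)
print(mask)
```

A float mask of this shape can be multiplied directly into embeddings or attention scores, so padded positions contribute nothing downstream.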

Example:

from chartoken import FeatureVocab

vocab = FeatureVocab.from_tags([["N", "SG"], ["N", "PL"], ["V", "PST"]])
ids, mask = vocab.encode(["N", "SG"], max_features=6)

If you want tensors directly:

ids_tensor, mask_tensor = vocab.encode_tensor(["N", "SG"], max_features=6)

Serialization

Both vocabularies are serializable to plain dictionaries and back:

state = char_vocab.to_dict()
restored = CharVocab.from_dict(state)

This is useful for:

  • checkpoint payloads
  • experiment reproducibility
  • packaging trained models
  • keeping training and inference vocabularies aligned
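
As a rough illustration of why a plain-dictionary state works well in checkpoints, the sketch below round-trips a character mapping through JSON. The state layout ("version" and "chars" keys) and the helper names are hypothetical; the real CharVocabState fields are not shown in this README.

```python
import json

FIRST_CHAR_ID = 3  # ids 0, 1, 2 are reserved for PAD, SOS, EOS


def to_state(char_to_id: dict[str, int]) -> dict[str, object]:
    """Store characters as a list ordered by id (hypothetical layout)."""
    ordered = sorted(char_to_id.items(), key=lambda item: item[1])
    return {"version": 1, "chars": [ch for ch, _ in ordered]}


def from_state(state: dict[str, object]) -> dict[str, int]:
    """Rebuild the mapping; list position determines the id."""
    chars = state["chars"]
    assert isinstance(chars, list)
    return {ch: index + FIRST_CHAR_ID for index, ch in enumerate(chars)}


mapping = {"a": 3, "e": 4, "l": 5, "m": 6}

# sort_keys makes the serialized text identical across runs.
payload = json.dumps(to_state(mapping), sort_keys=True, ensure_ascii=False)
restored = from_state(json.loads(payload))

print(payload)
print(restored == mapping)
```

Storing the characters as an ordered list rather than a char-to-id map keeps the ids implicit in the position, so a restored vocabulary cannot end up with gaps or duplicate ids.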

Typing

This package ships py.typed and is meant to be consumed by pyright/Pylance-aware codebases.

Typed state objects:

  • CharVocabState
  • FeatureVocabState

Exported constants:

  • PAD
  • SOS
  • EOS
  • FEATURE_PAD
  • SPECIAL_TOKENS

Typical integration pattern

chartoken-vp is designed to sit underneath dataset and model packages.

A common stack looks like:

  1. read raw TSV rows
  2. build CharVocab from lemmas and surfaces
  3. build FeatureVocab from tag lists
  4. pre-encode examples into tensors
  5. save vocab state in checkpoints
  6. reuse the same states at inference time
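
Steps 1 to 3 can be sketched with the standard library alone. The column layout (lemma, surface form, semicolon-separated tags) follows the usual UniMorph TSV convention and is an assumption about your data; the helper name read_rows and the sample rows are made up for illustration.

```python
import csv
import io

# Stand-in for a file on disk: lemma <TAB> surface <TAB> tags joined by ";"
RAW_TSV = "walk\twalked\tV;PST\ngo\twent\tV;PST\nwalk\twalks\tV;PRS;3;SG\n"


def read_rows(handle) -> list[tuple[str, str, list[str]]]:
    """Parse three-column TSV rows into (lemma, surface, tags)."""
    rows = []
    # QUOTE_NONE keeps quote characters in the text as literal characters.
    for lemma, surface, tags in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        rows.append((lemma, surface, tags.split(";")))
    return rows


rows = read_rows(io.StringIO(RAW_TSV))

# Inputs for the two vocabularies: every lemma and surface string, and every tag list.
char_texts = [text for lemma, surface, _ in rows for text in (lemma, surface)]
tag_sets = [tags for _, _, tags in rows]

print(char_texts)
print(tag_sets)
```

The two resulting lists have the same shape as texts and tag_sets in the Quick start, so they could be passed to CharVocab.from_texts and FeatureVocab.from_tags in the same way.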

What this package deliberately does not do

It does not include:

  • BPE or sentencepiece tokenization
  • dataset downloading
  • batching or dataloaders
  • model architectures
  • training loops

That separation is intentional. chartoken-vp should stay easy to publish, easy to test, and easy to embed into larger systems.

Source Distribution

chartoken_vp-3.0.0.tar.gz (5.3 kB)

Uploaded Source

Built Distribution

chartoken_vp-3.0.0-py3-none-any.whl (5.7 kB)

Uploaded Python 3

File details

Details for the file chartoken_vp-3.0.0.tar.gz.

File metadata

  • Download URL: chartoken_vp-3.0.0.tar.gz
  • Upload date:
  • Size: 5.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for chartoken_vp-3.0.0.tar.gz

  • SHA256: b4de29277099a7399d2bf5a5a555f91805a5fb7c63a4791b01aba8ae782eee27
  • MD5: 30c2c151613acd846cd200f25dbff81a
  • BLAKE2b-256: b1cf572bad26c286c5e9868797436a96f79c8ff7e7b10adc2788483c7233d9f6

File details

Details for the file chartoken_vp-3.0.0-py3-none-any.whl.

File metadata

  • Download URL: chartoken_vp-3.0.0-py3-none-any.whl
  • Upload date:
  • Size: 5.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for chartoken_vp-3.0.0-py3-none-any.whl

  • SHA256: 51484c0282a104e97f054bce3c8488c0b985a36c183f1b2b421b1aa94c596293
  • MD5: 2555bdb60dafe0571a06639d2991fc0d
  • BLAKE2b-256: 2eca7e7a942e2b12274c2c37fe01e2a98a78b2e92cf32a6d8e23a6805bc0797c
