10 projects
tktkt
Tokeniser toolkit: a collection of Pythonic subword tokenisers and text preprocessing tools.
lamoto
Language Modelling Tasks as Objects (LaMoTO) provides a framework for language model training (masked and causal, pretraining and finetuning) where the tasks, not just the models, are themselves classes.
sage_bauwenst
SaGe subword tokenizer - version 2.0
pickybpe_bauwenst
BPE modification that removes sparsely used intermediate tokens during vocabularisation.
archit
ArchIt: A framework for base-and-head language models, and toolkit for converting in-place modifications of PyTorch objects into class code.
cli_release-me
CLI tool for creating Git-tagged versions of Python packages where the version has to be specified exactly once.
supar_bauwenst
State-of-the-art parsers for natural language
modest_bauwenst
MoDeST: a Morphological Decomposition & Segmentation Trove.
bpe_knockout
Implementation of BPE-knockout, a morphologically informed post-processing step for BPE tokenisers.
fiject
Object-oriented, two-stage PDF figure generation library for Python.