Morphology-aware BPE tokenizer for Philippine languages (Tagalog)

Filipino Tokenizer

A morphology-aware BPE tokenizer for Philippine languages.

Existing subword tokenizers (SentencePiece, HuggingFace BPE) treat Filipino text as raw character sequences. They have no knowledge of Filipino morphology, so they routinely split words at linguistically meaningless points. A word like pinakamahusay ("the best") gets fragmented into arbitrary substrings instead of its actual morphemes: pinaka- + ma- + husay.

This project fixes that. It combines a rule-based morphological segmenter with a constrained BPE algorithm that never merges across morpheme boundaries. The result is a tokenizer whose tokens line up with actual morphemes, at the cost of a somewhat higher token count than unconstrained BPE.

Before and After

Consider the sentence: Kumain siya ng masarap na pagkain.

A generic BPE tokenizer might produce:

["Ku", "main", " siya", " ng", " mas", "ar", "ap", " na", " pag", "ka", "in", "."]

This tokenizer understands that kumain contains the infix -um- and root kain, and that pagkain is prefix pag- plus the same root kain:

["k", "um", "ain", " ", "siya", " ", "ng", " ", "ma", "sarap", " ", "na", " ", "pag", "kain", "."]

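In pagkain the root kain is preserved as a single token. In kumain the infix -um- sits inside the root, so kain surfaces as k + ain, but both splits still fall on morpheme boundaries. This gives downstream models a head start on understanding Filipino word formation.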

Installation

pip install filipino-tokenizer

Pre-built wheels are available for Linux, macOS, and Windows on Python 3.10–3.13 — no compiler or Rust toolchain required.

For HuggingFace Transformers integration:

pip install filipino-tokenizer[hf]

To install from source for development (requires Rust via rustup.rs):

git clone https://github.com/JpCurada/filipino-tokenizer.git
cd filipino-tokenizer
pip install -e .

Quick Start

Use the bundled pretrained model

A 32k-vocabulary model trained on Wikitext-TL-39 ships inside the package — no download needed.

from filipino_tokenizer.tagalog import TagalogTokenizer

tok = TagalogTokenizer()
tok.load_pretrained()

ids = tok.encode("Kumain siya ng pagkain.")
print(tok.decode(ids))    # kumain siya ng pagkain.
print(tok.tokenize("Kumain siya ng pagkain."))
# ['k', 'um', 'ain', ' ', 'siya', ' ', 'ng', ' ', 'pag', 'kain', '.']

HuggingFace integration

from filipino_tokenizer.tagalog import TagalogHFTokenizer

tok = TagalogHFTokenizer()   # loads bundled model
encoding = tok("Kumain siya ng pagkain.", return_tensors="pt")

Works directly with Trainer, TRL, Axolotl, LlamaFactory, and any other HuggingFace-based training pipeline.

Train a custom model

from filipino_tokenizer.tagalog import TagalogTokenizer

tok = TagalogTokenizer()
tok.train("corpus.txt", vocab_size=32000)

ids = tok.encode("Kumain siya ng pagkain.")
print(tok.decode(ids))   # kumain siya ng pagkain.

tok.save("my_tokenizer/")

tok2 = TagalogTokenizer()
tok2.load("my_tokenizer/")

How It Works

The tokenizer is a three-stage pipeline.

Stage 1: Affix Tables. Four JSON files in data/ define every known Filipino prefix, suffix, infix, and circumfix. Each entry is tagged by language (Tagalog, Cebuano, etc.), so the same data files support multiple Philippine languages. Prefixes are sorted longest-first for greedy matching.
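
As an illustration of how a language-tagged table with longest-first ordering could be consumed, here is a minimal sketch. The JSON schema (the "affix" and "language" keys), the sample entries, and the helper names load_prefixes and match_prefix are assumptions made for this example, not the package's actual file format or API.

```python
import json

# Hypothetical table contents; the real files in data/ may use a different schema.
PREFIX_TABLE_JSON = """
[
  {"affix": "pag", "language": "Tagalog"},
  {"affix": "pinaka", "language": "Tagalog"},
  {"affix": "ma", "language": "Tagalog"},
  {"affix": "nag", "language": "Tagalog"},
  {"affix": "gi", "language": "Cebuano"}
]
"""


def load_prefixes(raw_json, language):
    """Return the prefixes tagged with `language`, longest first."""
    entries = json.loads(raw_json)
    prefixes = [e["affix"] for e in entries if e["language"] == language]
    # Longest-first ordering lets a greedy matcher try "pinaka" before "pag".
    return sorted(prefixes, key=lambda p: (-len(p), p))


def match_prefix(word, prefixes):
    """Return the first (longest) prefix the word starts with, or None."""
    for prefix in prefixes:
        if word.startswith(prefix) and len(word) > len(prefix):
            return prefix
    return None


tagalog_prefixes = load_prefixes(PREFIX_TABLE_JSON, "Tagalog")
print(tagalog_prefixes)
print(match_prefix("pinakamahusay", tagalog_prefixes))
```

Because the list is sorted once at load time, the matcher itself stays a simple linear scan.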

Stage 2: Morphological Segmenter. The TagalogSegmenter decomposes a word into its constituent morphemes using a multi-pass algorithm:

  1. Check for frozen/lexicalized forms (e.g., pangalan is a word, not pang- + alan).
  2. Try circumfix detection (prefix + suffix pairs like ka- -han).
  3. Strip prefixes, longest match first, with recursion for stacked prefixes.
  4. Detect infixes (-um- and -in- after the first consonant).
  5. Strip suffixes, applying phonological rules (-an becomes -han after vowels).
  6. Validate every candidate root against a dictionary of 30,000+ Tagalog roots.

If no valid segmentation is found, the word is returned whole.
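
The passes above can be sketched in a few lines. This is a simplified toy, not the TagalogSegmenter implementation: it covers only frozen forms, recursive prefix stripping, infix detection, and root validation, and it skips circumfixes, suffixes, and phonological rules. The word lists and the function names segment and split_infix are made up for the example.

```python
# Toy data; the package uses its JSON affix tables and a far larger root dictionary.
ROOTS = {"kain", "husay", "sarap", "pangalan"}
FROZEN = {"pangalan"}                       # lexicalized forms kept whole
PREFIXES = ["pinaka", "pang", "pag", "ma"]  # already sorted longest first
INFIXES = ["um", "in"]
VOWELS = set("aeiou")


def split_infix(word):
    """Return [onset, infix, rest] if an infix follows the first consonant
    and removing it leaves a known root; otherwise return None."""
    if len(word) < 4 or word[0] in VOWELS:
        return None
    for infix in INFIXES:
        if word[1:1 + len(infix)] == infix:
            rest = word[1 + len(infix):]
            if word[0] + rest in ROOTS:
                return [word[0], infix, rest]
    return None


def segment(word):
    """Split a word into morphemes, or return it whole if nothing validates."""
    word = word.lower()
    if word in FROZEN or word in ROOTS:
        return [word]
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix):
            rest = segment(word[len(prefix):])
            # Accept the prefix only if the remainder is a known root
            # or was itself broken down into validated pieces.
            if len(rest) > 1 or rest[0] in ROOTS:
                return [prefix] + rest
    infixed = split_infix(word)
    if infixed:
        return infixed
    return [word]


for w in ["pinakamahusay", "kumain", "pagkain", "pangalan", "mabuti"]:
    print(w, "->", segment(w))
```

Here mabuti comes back whole only because buti is missing from the toy root set, which shows how much the result depends on dictionary coverage.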

Stage 3: Constrained BPE. The MorphAwareBPE class runs an optimized, incremental byte-pair encoding algorithm (using doubly-linked lists and max-heaps) with one critical constraint: it never merges a pair of symbols that would cross a morpheme boundary marker. Only merges that respect this constraint are learned at training time. At inference time, the greedy BPE encoder is implemented in Rust (_bpe_rust.CoreBPE via PyO3) for fast, allocation-efficient encoding.
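
To make the boundary constraint concrete, here is a deliberately naive training loop. It is not the MorphAwareBPE implementation: it recounts all pairs on every iteration instead of using linked lists and heaps, and it represents boundaries by keeping each morpheme in its own list rather than by a marker symbol. The function names count_pairs, apply_merge, and train are assumptions for this sketch.

```python
from collections import Counter


def count_pairs(corpus):
    """Count adjacent symbol pairs inside each morpheme only."""
    pairs = Counter()
    for word in corpus:
        for morpheme in word:
            # zip never pairs the last symbol of one morpheme with the
            # first symbol of the next, so boundaries are never crossed.
            for a, b in zip(morpheme, morpheme[1:]):
                pairs[(a, b)] += 1
    return pairs


def apply_merge(morpheme, pair):
    """Replace each occurrence of `pair` in one morpheme with the merged symbol."""
    merged, i = [], 0
    while i < len(morpheme):
        if i + 1 < len(morpheme) and (morpheme[i], morpheme[i + 1]) == pair:
            merged.append(morpheme[i] + morpheme[i + 1])
            i += 2
        else:
            merged.append(morpheme[i])
            i += 1
    return merged


def train(segmented_words, num_merges):
    """Learn up to `num_merges` merges from pre-segmented words."""
    corpus = [[list(m) for m in word] for word in segmented_words]
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(corpus)
        if not pairs:
            break  # every morpheme is already a single symbol
        # Most frequent pair; ties broken alphabetically for determinism.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        corpus = [[apply_merge(m, best) for m in word] for word in corpus]
    return merges, corpus


words = [["pag", "kain"], ["k", "um", "ain"], ["ma", "sarap"]]
merges, corpus = train(words, num_merges=20)
tokens = [[sym for morpheme in word for sym in morpheme] for word in corpus]
print(tokens)
```

Even with more merges than the toy corpus can use, no token ever spans pag and kain, because the pair (g, k) is never counted.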

Evaluation

We evaluated our TagalogTokenizer against standard industry tokenizers (GPT-4's cl100k_base and SentencePiece Unigram) on a 5,000-line corpus evaluation split.

=======================================================================
Metric                         | Ours       | GPT-4      | SPM       
-----------------------------------------------------------------------
Total Tokens                   | 645        | 516        | 318       
Tokens per Word (Fertility)    | 2.34       | 1.87       | 1.15      
Morpheme F1 Accuracy           | 64.5%      | 20.8%      | 12.0%     
=======================================================================
  • Morpheme F1 Accuracy: Our tokenizer scores about 3x higher than GPT-4 (64.5% vs. 20.8%) and about 5x higher than SentencePiece (64.5% vs. 12.0%), meaning its splits fall on actual linguistic boundaries far more often.
  • Fertility: Our tokenizer produces more tokens per word (2.34, versus 1.87 for GPT-4 and 1.15 for SentencePiece). This is the expected trade-off: because we strictly prevent merges across morpheme boundaries, frequent but morphologically distinct parts (like pag and kain) are kept separate, rather than being memorized as a single unbroken token (pagkain). The intent is to give models a compositional view of Filipino words.
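
The two metrics can be computed along the following lines. The exact definition of morpheme F1 behind the table is not given here, so this sketch assumes a boundary-level F1: each tokenization is reduced to its set of internal split offsets and compared against a gold segmentation. The function names and the tiny gold set are made up for the example, and the "generic" tokens are taken from the Before and After section with spaces dropped.

```python
def boundaries(tokens):
    """Character offsets of the internal split points of one tokenized word."""
    offsets, pos = set(), 0
    for tok in tokens[:-1]:
        pos += len(tok)
        offsets.add(pos)
    return offsets


def boundary_f1(predicted, gold):
    """F1 over split points, pooled across all words (assumed definition)."""
    tp = fp = fn = 0
    for pred_tokens, gold_tokens in zip(predicted, gold):
        p, g = boundaries(pred_tokens), boundaries(gold_tokens)
        tp += len(p & g)   # splits that match a gold boundary
        fp += len(p - g)   # splits at non-boundaries
        fn += len(g - p)   # gold boundaries that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def fertility(tokenized_words):
    """Average number of tokens per word."""
    return sum(len(tokens) for tokens in tokenized_words) / len(tokenized_words)


# Words: kumain, pagkain, masarap
gold = [["k", "um", "ain"], ["pag", "kain"], ["ma", "sarap"]]
generic = [["ku", "main"], ["pag", "ka", "in"], ["mas", "ar", "ap"]]

print(f"generic F1:        {boundary_f1(generic, gold):.3f}")
print(f"generic fertility: {fertility(generic):.2f}")
print(f"gold fertility:    {fertility(gold):.2f}")
```

On these three words the generic splits hit only one of the four gold boundaries, which is the kind of mismatch the F1 row in the table measures.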

Project Structure

filipino-tokenizer/
    src/
        lib.rs                  # Rust BPE backend (CoreBPE, PyO3 bindings)
    filipino_tokenizer/
        base.py                 # BaseAffixes, BaseRoots, BaseSegmenter, BaseTokenizer
        data/
            prefix_table.json       # Prefix definitions, multi-language
            suffix_table.json       # Suffix definitions
            infix_table.json        # Infix definitions
            circumfix_table.json    # Circumfix definitions
            tagalog_roots.json      # ~30k Tagalog root words
            bisaya_roots.json       # Bisaya root words
            pretrained/
                vocab.json          # Bundled 32k vocabulary (Wikitext-TL-39)
                merges.txt          # Bundled merge rules
        tagalog/
            __init__.py         # Package exports
            affixes.py          # TagalogAffixes (filters for language="Tagalog")
            roots.py            # TagalogRoots (loads tagalog_roots.json)
            phonology.py        # Nasal assimilation, suffix h-insertion
            segmenter.py        # TagalogSegmenter (multi-pass morpheme decomposition)
            bpe.py              # MorphAwareBPE (constrained BPE, delegates to Rust)
            tokenizer.py        # TagalogTokenizer (segmenter + BPE pipeline)
            hf_tokenizer.py     # TagalogHFTokenizer (PreTrainedTokenizer wrapper)
    tests/
        test_affixes.py         # Affix loading and filtering tests
        test_segmenter.py       # Morphological segmentation tests
        test_tokenizer.py       # Full pipeline tests (round-trip, consistency, efficiency)
        test_rust_backend.py    # Rust extension tests (encode/decode, morpheme boundaries)
    examples/
        training_tagalog_tokenizer.py   # End-to-end training example
    demo/
        demo_tagalog_tokenizer.ipynb    # Usage guide notebook
        tokenizer_comparisons.ipynb     # Benchmark vs GPT-4 and SentencePiece
        tokenizer_comparisons_fil.ipynb # Side-by-side comparison on Filipino sentences
        slm_tokenizer_comparison.ipynb  # SLM training metrics comparison
        slm_training_experiment.ipynb   # Full GPT-2 training experiment
    Cargo.toml                  # Rust crate configuration
    setup.py                    # setuptools-rust build hook
    pyproject.toml              # Package metadata and build system

Running Tests

# All tests
python -m unittest discover tests -v

# Individual test files
python -m unittest tests.test_affixes -v
python -m unittest tests.test_segmenter -v
python -m unittest tests.test_tokenizer -v
python -m unittest tests.test_rust_backend -v

# Rust unit tests (requires cargo)
cargo test

Adding a New Language

The architecture is designed to support multiple Philippine languages from the same data files. To add Bisaya, Ilokano, or another language:

  1. Add entries to the JSON affix tables in filipino_tokenizer/data/ with the appropriate language field.
  2. Add a root word list (e.g., filipino_tokenizer/data/bisaya_roots.json).
  3. Create filipino_tokenizer/<language>/affixes.py subclassing BaseAffixes with super().__init__(language="<Language>").
  4. Create a roots class subclassing BaseRoots.
  5. Implement a segmenter subclassing BaseSegmenter with language-specific phonological rules.
  6. Create a tokenizer class that wires the segmenter to MorphAwareBPE.

License

MIT License. See LICENSE for details.
