Morphology-aware BPE tokenizer for Philippine languages (Tagalog)

Project description

Filipino Tokenizer

A morphology-aware BPE tokenizer for Philippine languages.

Existing subword tokenizers (SentencePiece, HuggingFace BPE) treat Filipino text as raw character sequences. They have no knowledge of Filipino morphology, so they routinely split words at linguistically meaningless points. A word like pinakamahusay ("the best") gets fragmented into arbitrary substrings instead of its actual morphemes: pinaka- + ma- + husay.

This project fixes that. It combines a rule-based morphological segmenter with a constrained BPE algorithm that never merges across morpheme boundaries. The result is a tokenizer that produces fewer, more meaningful tokens for Filipino text.

Before and After

Consider the sentence: kumain ka na ba? ("Have you eaten?")

GPT-2 tokenizer — arbitrary statistical splits:

['k', 'um', 'ain', 'Ġka', 'Ġna', 'Ġba', '?']

Filipino Tokenizer — preserves the infix -um- and root kain:

['k', '▁', 'um', '▁', 'ain', ' ', 'ka', ' ', 'na', ' ', 'ba', '?']

The boundary marker ▁ (U+2581) separates morphemes within a word. The root kain ("eat") is preserved as a consistent unit across all inflected forms: kumain, pagkain, kainan, kinain.

Installation

pip install filipino-tokenizer

Pre-built wheels are available for Linux, macOS, and Windows on Python 3.10–3.13 — no compiler or Rust toolchain required.

For HuggingFace Transformers integration:

pip install filipino-tokenizer[hf]

To install from source for development (requires Rust via rustup.rs):

git clone https://github.com/JpCurada/filipino-tokenizer.git
cd filipino-tokenizer
pip install -e .

Quick Start

Use the bundled pretrained model

A 32k-vocabulary model trained on Wikitext-TL-39 ships inside the package — no download needed.

from filipino_tokenizer.tagalog import TagalogTokenizer

tok = TagalogTokenizer()
tok.load_pretrained()

ids = tok.encode("Kumain siya ng pagkain.")
print(tok.decode(ids))    # kumain siya ng pagkain.
print(tok.tokenize("Kumain siya ng pagkain."))
# ['k', '▁', 'um', '▁', 'ain', ' ', 'siya', ' ', 'ng', ' ', 'pag', 'kain', '.']

HuggingFace integration

from filipino_tokenizer.tagalog import TagalogHFTokenizer

tok = TagalogHFTokenizer()   # loads bundled model
encoding = tok("Kumain siya ng pagkain.", return_tensors="pt")

# Batch tokenisation with padding
enc = tok(
    ["Kumain siya ng pagkain.", "Nagluluto ang nanay."],
    truncation=True,
    max_length=128,
    padding="max_length",
    return_tensors=None,
)

Works directly with Trainer, TRL, Axolotl, LlamaFactory, and any other HuggingFace-based training pipeline.

Train a custom model

from filipino_tokenizer.tagalog import TagalogTokenizer

tok = TagalogTokenizer()
tok.train("corpus.txt", vocab_size=32000)

ids = tok.encode("Kumain siya ng pagkain.")
print(tok.decode(ids))   # kumain siya ng pagkain.

tok.save("my_tokenizer/")

tok2 = TagalogTokenizer()
tok2.load("my_tokenizer/")

How It Works

The tokenizer is a three-stage pipeline.

Stage 1: Affix Tables. Four JSON files in data/ define every known Filipino prefix, suffix, infix, and circumfix. Each entry is tagged by language (Tagalog, Cebuano, etc.), so the same data files support multiple Philippine languages. Prefixes are sorted longest-first for greedy matching.
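As a sketch of Stage 1's filtering and sorting, with an illustrative table shape (the field names and schema here are guesses, not the package's actual JSON format):

```python
import json

# Hypothetical miniature prefix table: each entry is tagged with the
# language it belongs to, so one file can serve several languages.
PREFIX_TABLE = json.loads("""
[
  {"form": "pinaka", "language": "Tagalog"},
  {"form": "pag",    "language": "Tagalog"},
  {"form": "ma",     "language": "Tagalog"},
  {"form": "naka",   "language": "Cebuano"}
]
""")

def prefixes_for(language):
    """Filter by language, then sort longest-first so greedy matching
    tries 'pinaka-' before shorter prefixes like 'pag-' or 'ma-'."""
    forms = [e["form"] for e in PREFIX_TABLE if e["language"] == language]
    return sorted(forms, key=len, reverse=True)

print(prefixes_for("Tagalog"))  # ['pinaka', 'pag', 'ma']
```

Longest-first order matters: matching "ma-" before "pinaka-" on pinakamahusay would strand an unanalyzable remainder.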

Stage 2: Morphological Segmenter. The TagalogSegmenter decomposes a word into its constituent morphemes using a multi-pass algorithm:

  1. Check for frozen/lexicalized forms (e.g., pangalan is a word, not pang- + alan).
  2. Try circumfix detection (prefix + suffix pairs like ka- -han).
  3. Strip prefixes, longest match first, with recursion for stacked prefixes.
  4. Detect infixes (-um- and -in- after the first consonant).
  5. Strip suffixes, applying phonological rules (-an becomes -han after vowels).
  6. Validate every candidate root against a dictionary of 30,000+ Tagalog roots.

If no valid segmentation is found, the word is returned whole.
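The passes above can be sketched in miniature. This toy covers only passes 3, 4, and 6, with a made-up root list; it is not the real TagalogSegmenter:

```python
# Toy illustration: strip prefixes longest-first (recursing for stacked
# prefixes), detect the -um-/-in- infix after the first consonant, and
# accept a candidate only if the remaining root is in the lexicon.
ROOTS = {"husay", "kain", "luto"}
PREFIXES = sorted(["pinaka", "pag", "ma", "nag"], key=len, reverse=True)
INFIXES = ("um", "in")

def segment(word):
    if word in ROOTS:
        return [word]
    # Pass 3: prefixes, longest match first, recursion handles stacks.
    for p in PREFIXES:
        if word.startswith(p):
            rest = segment(word[len(p):])
            if rest is not None:
                return [p] + rest
    # Pass 4: infix after the first consonant (C + um/in + rest).
    for inf in INFIXES:
        if len(word) > 3 and word[1:1 + len(inf)] == inf:
            root = word[0] + word[1 + len(inf):]
            if root in ROOTS:  # Pass 6: validate against the lexicon
                return [word[0], inf, word[1 + len(inf):]]
    return None  # caller falls back to returning the word whole

print(segment("pinakamahusay"))  # ['pinaka', 'ma', 'husay']
print(segment("kumain"))         # ['k', 'um', 'ain']
```

Note how the infix case mirrors the tokenizer output shown earlier: kumain decomposes into k + um + ain, with the root kain recoverable around the infix.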

Stage 3: Constrained BPE. The MorphAwareBPE class runs an optimized, incremental byte-pair encoding algorithm (using doubly-linked lists and max-heaps) with one critical constraint: it never merges a pair of symbols that would cross a morpheme boundary marker (▁, U+2581). The greedy BPE encoder is implemented in Rust (_bpe_rust.CoreBPE via PyO3) for fast, allocation-efficient inference.
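A minimal sketch of the boundary constraint, using a plain Counter rather than the optimized linked-list/heap machinery or the Rust backend:

```python
from collections import Counter

BOUNDARY = "\u2581"  # morpheme boundary marker

def count_mergeable_pairs(symbol_seqs):
    """Count adjacent symbol pairs, skipping any pair that touches the
    boundary marker, so no merge can ever cross a morpheme boundary."""
    pairs = Counter()
    for seq in symbol_seqs:
        for a, b in zip(seq, seq[1:]):
            if BOUNDARY in (a, b):
                continue
            pairs[(a, b)] += 1
    return pairs

def merge(seq, pair):
    """Apply one BPE merge to a symbol sequence."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

seqs = [list("k") + [BOUNDARY] + list("um") + [BOUNDARY] + list("ain")]
best = count_mergeable_pairs(seqs).most_common(1)[0][0]
print(best)  # ('u', 'm') — ('k', '▁') is never even a candidate
```

Because boundary-touching pairs are never counted, the learned merge table can only build subwords inside a single morpheme.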

Evaluation

Morpheme Boundary Accuracy

We evaluated against standard tokenizers on 200 gold-standard Filipino words spanning prefixed, infixed, suffixed, circumfixed, stacked, and unsegmentable categories.

=======================================================================
Metric                         | Ours       | GPT-4      | SPM
-----------------------------------------------------------------------
Morpheme F1 Accuracy           | 46.0%      | 20.8%      | 12.0%
=======================================================================

Our tokenizer is 2.2× more accurate than GPT-4 at placing splits at actual linguistic boundaries, and 3.8× more accurate than SentencePiece.
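A boundary F1 of this kind can be computed by comparing the split positions a tokenizer places inside a word against the gold morpheme boundaries. The function below is illustrative, not necessarily the exact evaluation script:

```python
def boundaries(pieces):
    """Character offsets where cuts fall inside the word."""
    cuts, pos = set(), 0
    for piece in pieces[:-1]:
        pos += len(piece)
        cuts.add(pos)
    return cuts

def boundary_f1(gold, pred):
    """F1 over split positions: precision/recall of predicted cuts
    against gold morpheme boundaries."""
    g, p = boundaries(gold), boundaries(pred)
    if not g and not p:
        return 1.0  # both agree the word is unsegmentable
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# Gold: pinaka + ma + husay; prediction finds one of the two cuts.
print(boundary_f1(["pinaka", "ma", "husay"], ["pinakama", "husay"]))
```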

Small Language Model Experiment

We trained identical GPT-2 mini (~25M params, 6 layers, 384-dim) models on 47,500 lines from Wikitext-TL-39 — same architecture, same data, same hyperparameters. The only difference was the tokenizer.

Results on 2,500 held-out Filipino sentences:

==================================================
Tokenizer                   Perplexity
--------------------------------------------------
Filipino Tokenizer               24.79
GPT-2 Tokenizer                 100.38
--------------------------------------------------
Winner: Filipino Tokenizer  (75.3% lower perplexity)
==================================================

Fertility comparison (2,000 validation lines):

Metric                            Filipino Tok       GPT-2 Tok
--------------------------------------------------------------
Fertility (tokens/word)                   2.53            2.05
Mean sequence length                      57.6            46.8
Context window utilization               22.5%           18.3%

The Filipino Tokenizer produces higher fertility (about 23% more tokens per word) because it enforces morpheme boundaries instead of greedily merging across them. The payoff is 75% lower perplexity: the model learns Filipino much more efficiently when every token is a meaningful linguistic unit.
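Fertility as reported above is tokens emitted per whitespace-delimited word; the sketch below uses a made-up character-level tokenizer purely to show the arithmetic:

```python
def fertility(texts, tokenize):
    """Average tokens per whitespace word over a corpus."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

texts = ["kumain siya ng pagkain"]
# Stand-in tokenizer that splits every word into characters.
per_char = lambda t: [c for c in t if c != " "]
print(round(fertility(texts, per_char), 2))  # 4.75
```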

Full experiment: Kaggle notebook

Project Structure

filipino-tokenizer/
    src/
        lib.rs                  # Rust BPE backend (CoreBPE, PyO3 bindings)
    filipino_tokenizer/
        base.py                 # BaseAffixes, BaseRoots, BaseSegmenter, BaseTokenizer
        data/
            prefix_table.json       # Prefix definitions, multi-language
            suffix_table.json       # Suffix definitions
            infix_table.json        # Infix definitions
            circumfix_table.json    # Circumfix definitions
            tagalog_roots.json      # ~30k Tagalog root words
            bisaya_roots.json       # Bisaya root words
            pretrained/
                vocab.json          # Bundled 32k vocabulary (Wikitext-TL-39)
                merges.txt          # Bundled merge rules
        tagalog/
            __init__.py         # Package exports
            affixes.py          # TagalogAffixes (filters for language="Tagalog")
            roots.py            # TagalogRoots (loads tagalog_roots.json)
            phonology.py        # Nasal assimilation, suffix h-insertion
            segmenter.py        # TagalogSegmenter (multi-pass morpheme decomposition)
            bpe.py              # MorphAwareBPE (constrained BPE, delegates to Rust)
            tokenizer.py        # TagalogTokenizer (segmenter + BPE pipeline)
            hf_tokenizer.py     # TagalogHFTokenizer (PreTrainedTokenizer wrapper)
    tests/
        test_affixes.py         # Affix loading and filtering tests
        test_segmenter.py       # Morphological segmentation tests
        test_tokenizer.py       # Full pipeline tests (round-trip, consistency, efficiency)
        test_rust_backend.py    # Rust extension tests (encode/decode, morpheme boundaries)
    examples/
        training_tagalog_tokenizer.py   # End-to-end training example
    demo/
        demo_tagalog_tokenizer.ipynb        # Usage guide notebook
        tokenizer_comparisons.ipynb         # Benchmark vs GPT-4 and SentencePiece
        filipino-tokenizer-experiment.ipynb # Full GPT-2 SLM training experiment
    Cargo.toml                  # Rust crate configuration
    pyproject.toml              # Package metadata and build system

Running Tests

# All tests
python -m unittest discover tests -v

# Individual test files
python -m unittest tests.test_affixes -v
python -m unittest tests.test_segmenter -v
python -m unittest tests.test_tokenizer -v
python -m unittest tests.test_rust_backend -v

# Rust unit tests (requires cargo)
cargo test

Adding a New Language

The architecture is designed to support multiple Philippine languages from the same data files. To add Bisaya, Ilokano, or another language:

  1. Add entries to the JSON affix tables in filipino_tokenizer/data/ with the appropriate language field.
  2. Add a root word list (e.g., filipino_tokenizer/data/bisaya_roots.json).
  3. Create filipino_tokenizer/<language>/affixes.py subclassing BaseAffixes with super().__init__(language="<Language>").
  4. Create a roots class subclassing BaseRoots.
  5. Implement a segmenter subclassing BaseSegmenter with language-specific phonological rules.
  6. Create a tokenizer class that wires the segmenter to MorphAwareBPE.
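The subclassing steps might look roughly like this. The base classes below are self-contained stand-ins (in the package they come from filipino_tokenizer.base), and the constructor signatures are assumptions rather than the actual API:

```python
# Stand-in bases, shaped after BaseAffixes / BaseRoots / BaseSegmenter.
class BaseAffixes:
    def __init__(self, language):
        self.language = language  # filters the shared JSON affix tables

class BaseRoots:
    def __init__(self, roots=()):
        self.roots = set(roots)   # would load e.g. bisaya_roots.json

class BaseSegmenter:
    def __init__(self, affixes, roots):
        self.affixes, self.roots = affixes, roots

    def segment(self, word):
        return [word]  # fallback: return the word whole

# Step 3: filter the shared tables for the new language.
class BisayaAffixes(BaseAffixes):
    def __init__(self):
        super().__init__(language="Bisaya")

# Step 5: override with language-specific phonological rules.
class BisayaSegmenter(BaseSegmenter):
    def segment(self, word):
        # Bisaya-specific passes would go here before falling back.
        return super().segment(word)

seg = BisayaSegmenter(BisayaAffixes(), BaseRoots({"balay"}))
print(seg.segment("balay"))  # ['balay']
```

Step 6 then wires such a segmenter into MorphAwareBPE exactly as TagalogTokenizer does for Tagalog.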

Contributing

Contributions are welcome. Areas where help is most needed:

  • Cebuano / Bisaya support — the affix tables already have Bisaya entries; the segmenter and phonology modules are missing.
  • Ilokano, Hiligaynon, Kapampangan — affix data and root dictionaries.
  • Segmenter accuracy — the gold-standard test set in demo/tokenizer_comparisons.ipynb is a good starting point for finding and fixing segmentation errors.
  • Documentation — tutorials, worked examples, and comparisons against newer tokenizers.

Please open an issue or pull request on GitHub. For questions, feel free to reach out via GitHub Issues.

License

MIT License. See LICENSE for details.

Download files

Download the file for your platform.

Source Distribution

filipino_tokenizer-0.4.1.tar.gz (3.0 MB, Source)

Built Distributions

filipino_tokenizer-0.4.1-cp313-cp313-win_amd64.whl (3.2 MB, CPython 3.13, Windows x86-64)
filipino_tokenizer-0.4.1-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB, CPython 3.13, manylinux glibc 2.17+ x86-64)
filipino_tokenizer-0.4.1-cp313-cp313-macosx_11_0_arm64.whl (3.3 MB, CPython 3.13, macOS 11.0+ ARM64)
filipino_tokenizer-0.4.1-cp312-cp312-win_amd64.whl (3.2 MB, CPython 3.12, Windows x86-64)
filipino_tokenizer-0.4.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB, CPython 3.12, manylinux glibc 2.17+ x86-64)
filipino_tokenizer-0.4.1-cp312-cp312-macosx_11_0_arm64.whl (3.3 MB, CPython 3.12, macOS 11.0+ ARM64)
filipino_tokenizer-0.4.1-cp311-cp311-win_amd64.whl (3.2 MB, CPython 3.11, Windows x86-64)
filipino_tokenizer-0.4.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB, CPython 3.11, manylinux glibc 2.17+ x86-64)
filipino_tokenizer-0.4.1-cp311-cp311-macosx_11_0_arm64.whl (3.3 MB, CPython 3.11, macOS 11.0+ ARM64)
filipino_tokenizer-0.4.1-cp310-cp310-win_amd64.whl (3.2 MB, CPython 3.10, Windows x86-64)
filipino_tokenizer-0.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB, CPython 3.10, manylinux glibc 2.17+ x86-64)
filipino_tokenizer-0.4.1-cp310-cp310-macosx_11_0_arm64.whl (3.3 MB, CPython 3.10, macOS 11.0+ ARM64)

File details

Details for the file filipino_tokenizer-0.4.1.tar.gz.

File metadata

  • Download URL: filipino_tokenizer-0.4.1.tar.gz
  • Upload date:
  • Size: 3.0 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for filipino_tokenizer-0.4.1.tar.gz
Algorithm Hash digest
SHA256 e9f8bb70cccbd8259317ac572e39b9a8e340163dc350455d301b1bb6efbf8e92
MD5 201d79c496b7b64a20a8983ebd0bdb4b
BLAKE2b-256 fcedba4b7ba43001101db7df7ac095bb75874a4465561b2a2f2139b03e3476fa

Provenance

The following attestation bundles were made for filipino_tokenizer-0.4.1.tar.gz:

Publisher: publish.yml on JpCurada/filipino-tokenizer

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details for built distributions

Each wheel was published with the same provenance as the source distribution above (publisher: publish.yml on JpCurada/filipino-tokenizer, via Trusted Publishing); attestation values reflect the state when the release was signed and may no longer be current. SHA256 digests:

==========================================================================================
Wheel                     SHA256
------------------------------------------------------------------------------------------
cp313 win_amd64           952dee1a3b6b0c7458cd419b735939da69aedfa8abf9b5e1ad50200ce24ec7c9
cp313 manylinux x86-64    c862900f57d548853bc92cb535738b75b4daa3de6f95d5f80617d6c161af975f
cp313 macOS ARM64         990f5ba2735fdb819e8ee814ae2138352c4db24f7c555b3243c4a474b7c86156
cp312 win_amd64           d81c985826f06454f37a3110f4f769af357422c832a581a8e2e4d8c8a0ae1d47
cp312 manylinux x86-64    38ab0cf0fba768a937681700fb2bd65f90c33c981798c02c8fd7b8b86ecda07c
cp312 macOS ARM64         c7486dc5019abcbb59a3546cb509efc90f5d767408c5826e19a3831ec3a7f151
cp311 win_amd64           3491412966b32d48fb12a4adb04795e750e0450c97c5f26ac19bbdbf3a8b9daf
cp311 manylinux x86-64    5db620c1a97d8610a511f090b2534f0e75d75f916fc52eeb6441dfd16ab920e2
cp311 macOS ARM64         29c0b36b68dacc055f5d1dfc26d48007fa7a8e4b3777092675e59698e33d9973
cp310 win_amd64           d5a8d27e77ce647285c5b0224963c212891ccf1fb7799d5209061d8b0b361b38
cp310 manylinux x86-64    66cd36d7f8d1daf73ad3dc34cb9b5471748c621fe4654d14641118edc3664b14
cp310 macOS ARM64         064c2d05c03b08426868b58f28f34e57be1de130eb11e680fefc8268e5af5fc9
==========================================================================================
