
Filipino Tokenizer

A morphology-aware BPE tokenizer for Philippine languages.

Existing subword tokenizers (SentencePiece, HuggingFace BPE) treat Filipino text as raw character sequences. They have no knowledge of Filipino morphology, so they routinely split words at linguistically meaningless points. A word like pinakamahusay ("the best") gets fragmented into arbitrary substrings instead of its actual morphemes: pinaka- + ma- + husay.

This project fixes that. It combines a rule-based morphological segmenter with a constrained BPE algorithm that never merges across morpheme boundaries. The result is a tokenizer that produces fewer, more meaningful tokens for Filipino text.

Before and After

Consider the sentence: Kumain siya ng masarap na pagkain.

A generic BPE tokenizer might produce:

["Ku", "main", " siya", " ng", " mas", "ar", "ap", " na", " pag", "ka", "in", "."]

This tokenizer understands that kumain contains the infix -um- and root kain, and that pagkain is prefix pag- plus the same root kain:

["k", "um", "ain", " ", "siya", " ", "ng", " ", "ma", "sarap", " ", "na", " ", "pag", "kain", "."]

The root kain is preserved as a single token and shared across both words. This gives downstream models a head start on understanding Filipino word formation.

Installation

The core library has no external dependencies (no HuggingFace or SentencePiece required); it runs entirely on the Python standard library.

pip install filipino-tokenizer

To install from source for development:

git clone https://github.com/JpCurada/filipino-tokenizer.git
cd filipino-tokenizer
pip install -e .[dev]

Quick Start

import os, tempfile
from filipino_tokenizer.tagalog import TagalogTokenizer

# Write a small training corpus
corpus_text = """
Kumain siya ng pagkain sa hapagkainan.
Maganda ang panahon ngayon kaya lumabas kami.
Nagluluto ang nanay ng masarap na adobo para sa pamilya.
"""
tmpdir = tempfile.mkdtemp()
corpus_path = os.path.join(tmpdir, "corpus.txt")
with open(corpus_path, "w", encoding="utf-8") as f:
    f.write(corpus_text)

# Train
tok = TagalogTokenizer()
tok.train(corpus_path, vocab_size=500)

# Encode and decode
ids = tok.encode("Kumain siya ng pagkain.")
text = tok.decode(ids)
print(text)  # kumain siya ng pagkain.

# Inspect subword tokens
tokens = tok.tokenize("Kumain siya ng pagkain.")
print(tokens)  # ['k', 'um', 'ain', ' ', 'siya', ' ', 'ng', ' ', 'pag', 'kain', '.']

# Save and reload
tok.save("my_tokenizer/")
tok2 = TagalogTokenizer()
tok2.load("my_tokenizer/")

How It Works

The tokenizer is a three-stage pipeline.

Stage 1: Affix Tables. Four JSON files in data/ define every known Filipino prefix, suffix, infix, and circumfix. Each entry is tagged by language (Tagalog, Cebuano, etc.), so the same data files support multiple Philippine languages. Prefixes are sorted longest-first for greedy matching.
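
To make the longest-first matching concrete, here is a small sketch in plain Python. The entry shape shown is hypothetical (the real prefix_table.json schema may differ); only the language tag and the sort order come from the description above.

# Hypothetical affix entries; the actual prefix_table.json schema may differ.
prefix_entries = [
    {"form": "pinaka", "language": "Tagalog"},
    {"form": "pag", "language": "Tagalog"},
    {"form": "ka", "language": "Tagalog"},
    {"form": "naga", "language": "Cebuano"},
]

# Filter by language (as TagalogAffixes does), then sort longest-first so
# greedy matching tries "pinaka" before "pag" or "ka".
tagalog_prefixes = sorted(
    (e["form"] for e in prefix_entries if e["language"] == "Tagalog"),
    key=len,
    reverse=True,
)
print(tagalog_prefixes)  # ['pinaka', 'pag', 'ka']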

Stage 2: Morphological Segmenter. The TagalogSegmenter decomposes a word into its constituent morphemes using a multi-pass algorithm:

  1. Check for frozen/lexicalized forms (e.g., pangalan is a word, not pang- + alan).
  2. Try circumfix detection (prefix + suffix pairs like ka- -han).
  3. Strip prefixes, longest match first, with recursion for stacked prefixes.
  4. Detect infixes (-um- and -in- after the first consonant).
  5. Strip suffixes, applying phonological rules (-an becomes -han after vowels).
  6. Validate every candidate root against a dictionary of 30,000+ Tagalog roots.

If no valid segmentation is found, the word is returned whole.
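
The core strip-and-validate loop can be sketched in a few lines. This is a simplification, not the library's code: the real TagalogSegmenter also handles circumfixes, infixes, and phonological rewrites, and validates against 30,000+ roots rather than three.

# Toy dictionary and affix lists; prefixes sorted longest-first.
ROOTS = {"kain", "sarap", "husay"}
PREFIXES = sorted(["pinaka", "pag", "ma", "ka"], key=len, reverse=True)
SUFFIXES = ["han", "an", "in"]

def segment(word):
    """Return a morpheme list, or None if no analysis survives root validation."""
    if word in ROOTS:
        return [word]
    for pre in PREFIXES:  # longest match first; recursion handles stacked prefixes
        if word.startswith(pre):
            rest = segment(word[len(pre):])
            if rest is not None:
                return [pre] + rest
    for suf in SUFFIXES:  # strip a suffix only if what remains is a known root
        if word.endswith(suf) and word[:-len(suf)] in ROOTS:
            return [word[:-len(suf)], suf]
    return None

word = "pinakamahusay"
print(segment(word) or [word])  # ['pinaka', 'ma', 'husay']; whole word on failure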

Stage 3: Constrained BPE. The MorphAwareBPE class runs an optimized, incremental byte-pair encoding algorithm (using doubly-linked lists and max-heaps) with one critical constraint: it never merges a pair of symbols that would cross a morpheme boundary marker. This means learned subword units always stay within a single morpheme. The approach follows the Constrained BPE (CBPE) method described by Tacorda et al.
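
The constraint itself is simple to state in code. The toy below uses a quadratic re-scan instead of the linked-list/heap machinery, and "|" as a stand-in boundary marker; it shows only the rule that pairs touching a boundary are never counted, so no merge can cross one.

from collections import Counter

BOUNDARY = "|"  # stand-in marker; the real internal marker may differ

def count_pairs(symbols):
    """Count adjacent pairs, skipping any pair that touches a boundary marker."""
    pairs = Counter()
    for a, b in zip(symbols, symbols[1:]):
        if BOUNDARY not in (a, b):
            pairs[(a, b)] += 1
    return pairs

def merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with its concatenation."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

# "pagkain" arrives pre-segmented as pag|kain; BPE can build "pag" and "kain"
# but can never produce the unbroken token "pagkain".
symbols = list("pag") + [BOUNDARY] + list("kain")
while (pairs := count_pairs(symbols)):
    symbols = merge(symbols, pairs.most_common(1)[0][0])
print(symbols)  # ['pag', '|', 'kain']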

Evaluation

We evaluated our TagalogTokenizer against standard industry tokenizers (GPT-4's cl100k_base and SentencePiece Unigram) on a 5,000-line evaluation split of the corpus.

=======================================================================
Metric                         | Ours       | GPT-4      | SPM       
-----------------------------------------------------------------------
Total Tokens                   | 645        | 516        | 318       
Tokens per Word (Fertility)    | 2.34       | 1.87       | 1.15      
Morpheme F1 Accuracy           | 64.5%      | 20.8%      | 12.0%     
=======================================================================
  • Morpheme F1 Accuracy: Our tokenizer's splits align with true morpheme boundaries roughly 3x as often as GPT-4's (64.5% vs 20.8%) and over 5x as often as SentencePiece's (12.0%).
  • Fertility: Our tokenizer produces more tokens per word (2.34). This is the expected trade-off: because merges are strictly forbidden across morpheme boundaries, frequent but morphologically distinct parts (like pag and kain) are kept separate rather than memorized as a single unbroken token (pagkain), so downstream models see the compositional structure of each word.
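
For reference, here is a sketch of how the two metrics can be computed. The exact definitions used for the table above are assumptions, not taken from the evaluation code.

# Fertility: average tokens per whitespace-delimited word.
def fertility(tokens, text):
    return len(tokens) / len(text.split())

# Morpheme F1: compare predicted intra-word split offsets against gold offsets.
def boundary_f1(pred, gold):
    """F1 over split-offset sets, e.g. {3} for pag|kain in 'pagkain'."""
    pred, gold = set(pred), set(gold)
    if not pred or not gold:
        return 0.0
    precision = len(pred & gold) / len(pred)
    recall = len(pred & gold) / len(gold)
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Gold analysis of "pagkain" splits after "pag" (offset 3). A morpheme-aware
# split pag|kain scores 1.0; a generic split pag|ka|in ({3, 5}) scores lower.
print(boundary_f1({3}, {3}))     # 1.0
print(boundary_f1({3, 5}, {3}))  # 0.666...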

Project Structure

filipino-tokenizer/
    filipino_tokenizer/
        base.py                 # BaseAffixes, BaseRoots, BaseSegmenter, BaseTokenizer
        data/
            prefix_table.json       # Prefix definitions, multi-language
            suffix_table.json       # Suffix definitions
            infix_table.json        # Infix definitions
            circumfix_table.json    # Circumfix definitions
            tagalog_roots.json      # ~30k Tagalog root words
            bisaya_roots.json       # Bisaya root words
        tagalog/
            __init__.py         # Package exports
            affixes.py          # TagalogAffixes (filters for language="Tagalog")
            roots.py            # TagalogRoots (loads tagalog_roots.json)
            phonology.py        # Nasal assimilation, suffix h-insertion
            segmenter.py        # TagalogSegmenter (multi-pass morpheme decomposition)
            bpe.py              # MorphAwareBPE (constrained BPE, no cross-boundary merges)
            tokenizer.py        # TagalogTokenizer (segmenter + BPE pipeline)
    tests/
        test_affixes.py         # Affix loading and filtering tests
        test_segmenter.py       # Morphological segmentation tests
        test_tokenizer.py       # Full pipeline tests (round-trip, consistency, efficiency)
    examples/
        training_tagalog_tokenizer.py   # End-to-end training example
    demo/
        demo_tagalog_tokenizer.ipynb    # Jupyter notebook demo

Running Tests

# All tests
python -m unittest discover tests -v

# Individual test files
python -m unittest tests.test_affixes -v
python -m unittest tests.test_segmenter -v
python -m unittest tests.test_tokenizer -v

Adding a New Language

The architecture is designed to support multiple Philippine languages from the same data files. To add Bisaya, Ilokano, or another language (see the sketch after this list):

  1. Add entries to the JSON affix tables in filipino_tokenizer/data/ with the appropriate language field.
  2. Add a root word list (e.g., filipino_tokenizer/data/bisaya_roots.json).
  3. Create filipino_tokenizer/<language>/affixes.py subclassing BaseAffixes with super().__init__(language="<Language>").
  4. Create a roots class subclassing BaseRoots.
  5. Implement a segmenter subclassing BaseSegmenter with language-specific phonological rules.
  6. Create a tokenizer class that wires the segmenter to MorphAwareBPE.
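
A minimal Ilokano skeleton might look like the following. Everything here beyond BaseAffixes(language=...), which step 3 describes, is an assumption about the base classes (constructor keywords, hook names), not the library's documented API.

from filipino_tokenizer.base import BaseAffixes, BaseRoots, BaseSegmenter

class IlokanoAffixes(BaseAffixes):
    def __init__(self):
        # Filters the shared JSON tables down to Ilokano entries (step 3).
        super().__init__(language="Ilokano")

class IlokanoRoots(BaseRoots):
    def __init__(self):
        # Hypothetical kwarg; points at the root list added in step 2.
        super().__init__(roots_file="ilokano_roots.json")

class IlokanoSegmenter(BaseSegmenter):
    def __init__(self):
        # Hypothetical wiring of affixes + roots (step 5).
        super().__init__(affixes=IlokanoAffixes(), roots=IlokanoRoots())

    def apply_phonology(self, word):
        # Ilokano-specific rewrites (assumed hook name) would go here.
        return word

# Step 6: a tokenizer class would then pair IlokanoSegmenter with MorphAwareBPE,
# mirroring TagalogTokenizer.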

References

Tacorda, A. J., Ignacio, M. J., Oco, N., & Roxas, R. E. (2017). Controlling Byte Pair Encoding for Neural Machine Translation. 2017 International Conference on Asian Language Processing (IALP).

License

MIT License. See LICENSE for details.
