Morphology-aware BPE tokenizer for Philippine languages (Tagalog)

Project description

Filipino Tokenizer

A morphology-aware BPE tokenizer for Philippine languages.

Existing subword tokenizers (SentencePiece, HuggingFace BPE) treat Filipino text as raw character sequences. They have no knowledge of Filipino morphology, so they routinely split words at linguistically meaningless points. A word like pinakamahusay ("the best") gets fragmented into arbitrary substrings instead of its actual morphemes: pinaka- + ma- + husay.

This project fixes that. It combines a rule-based morphological segmenter with a constrained BPE algorithm that never merges across morpheme boundaries. The result is a tokenizer that produces fewer, more meaningful tokens for Filipino text.

Before and After

Consider the sentence: kumain ka na ba? ("Have you eaten?")

GPT-2 tokenizer — arbitrary statistical splits:

['k', 'um', 'ain', 'Ġka', 'Ġna', 'Ġba', '?']

Filipino Tokenizer — preserves the infix -um- and root kain:

['k', '▁', 'um', '▁', 'ain', ' ', 'ka', ' ', 'na', ' ', 'ba', '?']

The boundary marker ▁ (U+2581) separates morphemes within a word. The root kain ("eat") is preserved as a consistent unit across all inflected forms: kumain, pagkain, kainan, kinain.

Installation

pip install filipino-tokenizer

Pre-built wheels are available for Linux, macOS, and Windows on Python 3.10–3.13 — no compiler or Rust toolchain required.

For HuggingFace Transformers integration:

pip install filipino-tokenizer[hf]

To install from source for development (requires Rust via rustup.rs):

git clone https://github.com/JpCurada/filipino-tokenizer.git
cd filipino-tokenizer
pip install -e .

Quick Start

Use the bundled pretrained model

A 32k-vocabulary model trained on Wikitext-TL-39 ships inside the package — no download needed.

from filipino_tokenizer.tagalog import TagalogTokenizer

tok = TagalogTokenizer()
tok.load_pretrained()

ids = tok.encode("Kumain siya ng pagkain.")
print(tok.decode(ids))    # kumain siya ng pagkain.
print(tok.tokenize("Kumain siya ng pagkain."))
# ['k', '▁', 'um', '▁', 'ain', ' ', 'siya', ' ', 'ng', ' ', 'pag', 'kain', '.']

HuggingFace integration

from filipino_tokenizer.tagalog import TagalogHFTokenizer

tok = TagalogHFTokenizer()   # loads bundled model
encoding = tok("Kumain siya ng pagkain.", return_tensors="pt")

# Batch tokenisation with padding
enc = tok(
    ["Kumain siya ng pagkain.", "Nagluluto ang nanay."],
    truncation=True,
    max_length=128,
    padding="max_length",
    return_tensors=None,
)

Works directly with Trainer, TRL, Axolotl, LlamaFactory, and any other HuggingFace-based training pipeline.

Train a custom model

from filipino_tokenizer.tagalog import TagalogTokenizer

tok = TagalogTokenizer()
tok.train("corpus.txt", vocab_size=32000)

ids = tok.encode("Kumain siya ng pagkain.")
print(tok.decode(ids))   # kumain siya ng pagkain.

tok.save("my_tokenizer/")

tok2 = TagalogTokenizer()
tok2.load("my_tokenizer/")

How It Works

The tokenizer is a three-stage pipeline.

Stage 1: Affix Tables. Four JSON files in data/ define every known Filipino prefix, suffix, infix, and circumfix. Each entry is tagged by language (Tagalog, Cebuano, etc.), so the same data files support multiple Philippine languages. Prefixes are sorted longest-first for greedy matching.
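As a sketch of Stage 1's filtering and sorting, with an illustrative table shape (the field names and schema here are guesses, not the package's actual JSON format):

```python
import json

# Hypothetical miniature prefix table: each entry is tagged with the
# language it belongs to, so one file can serve several languages.
PREFIX_TABLE = json.loads("""
[
  {"form": "pinaka", "language": "Tagalog"},
  {"form": "pag",    "language": "Tagalog"},
  {"form": "ma",     "language": "Tagalog"},
  {"form": "naka",   "language": "Cebuano"}
]
""")

def prefixes_for(language):
    """Filter by language, then sort longest-first so greedy matching
    tries 'pinaka-' before shorter prefixes like 'pag-' or 'ma-'."""
    forms = [e["form"] for e in PREFIX_TABLE if e["language"] == language]
    return sorted(forms, key=len, reverse=True)

print(prefixes_for("Tagalog"))  # ['pinaka', 'pag', 'ma']
```

Longest-first order matters: matching "ma-" before "pinaka-" on pinakamahusay would strand an unanalyzable remainder.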

Stage 2: Morphological Segmenter. The TagalogSegmenter decomposes a word into its constituent morphemes using a multi-pass algorithm:

  1. Check for frozen/lexicalized forms (e.g., pangalan is a word, not pang- + alan).
  2. Try circumfix detection (prefix + suffix pairs like ka- -han).
  3. Strip prefixes, longest match first, with recursion for stacked prefixes.
  4. Detect infixes (-um- and -in- after the first consonant).
  5. Strip suffixes, applying phonological rules (-an becomes -han after vowels).
  6. Validate every candidate root against a dictionary of 30,000+ Tagalog roots.

If no valid segmentation is found, the word is returned whole.
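The passes above can be sketched in miniature. This toy covers only passes 3, 4, and 6, with a made-up root list; it is not the real TagalogSegmenter:

```python
# Toy illustration: strip prefixes longest-first (recursing for stacked
# prefixes), detect the -um-/-in- infix after the first consonant, and
# accept a candidate only if the remaining root is in the lexicon.
ROOTS = {"husay", "kain", "luto"}
PREFIXES = sorted(["pinaka", "pag", "ma", "nag"], key=len, reverse=True)
INFIXES = ("um", "in")

def segment(word):
    if word in ROOTS:
        return [word]
    # Pass 3: prefixes, longest match first, recursion handles stacks.
    for p in PREFIXES:
        if word.startswith(p):
            rest = segment(word[len(p):])
            if rest is not None:
                return [p] + rest
    # Pass 4: infix after the first consonant (C + um/in + rest).
    for inf in INFIXES:
        if len(word) > 3 and word[1:1 + len(inf)] == inf:
            root = word[0] + word[1 + len(inf):]
            if root in ROOTS:  # Pass 6: validate against the lexicon
                return [word[0], inf, word[1 + len(inf):]]
    return None  # caller falls back to returning the word whole

print(segment("pinakamahusay"))  # ['pinaka', 'ma', 'husay']
print(segment("kumain"))         # ['k', 'um', 'ain']
```

Note how the infix case mirrors the tokenizer output shown earlier: kumain decomposes into k + um + ain, with the root kain recoverable around the infix.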

Stage 3: Constrained BPE. The MorphAwareBPE class runs an optimized, incremental byte-pair encoding algorithm (using doubly-linked lists and max-heaps) with one critical constraint: it never merges a pair of symbols that would cross a morpheme boundary marker (▁, U+2581). The greedy BPE encoder is implemented in Rust (_bpe_rust.CoreBPE via PyO3) for fast, allocation-efficient inference.
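A minimal sketch of the boundary constraint, using a plain Counter rather than the optimized linked-list/heap machinery or the Rust backend:

```python
from collections import Counter

BOUNDARY = "\u2581"  # morpheme boundary marker

def count_mergeable_pairs(symbol_seqs):
    """Count adjacent symbol pairs, skipping any pair that touches the
    boundary marker, so no merge can ever cross a morpheme boundary."""
    pairs = Counter()
    for seq in symbol_seqs:
        for a, b in zip(seq, seq[1:]):
            if BOUNDARY in (a, b):
                continue
            pairs[(a, b)] += 1
    return pairs

def merge(seq, pair):
    """Apply one BPE merge to a symbol sequence."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

seqs = [list("k") + [BOUNDARY] + list("um") + [BOUNDARY] + list("ain")]
best = count_mergeable_pairs(seqs).most_common(1)[0][0]
print(best)  # ('u', 'm') — ('k', '▁') is never even a candidate
```

Because boundary-touching pairs are never counted, the learned merge table can only build subwords inside a single morpheme.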

Evaluation

Morpheme Boundary Accuracy

We evaluated against standard tokenizers on 200 gold-standard Filipino words spanning prefixed, infixed, suffixed, circumfixed, stacked, and unsegmentable categories.

=======================================================================
Metric                         | Ours       | GPT-4      | SPM
-----------------------------------------------------------------------
Morpheme F1 Accuracy           | 46.0%      | 20.8%      | 12.0%
=======================================================================

Our tokenizer is 2.2× more accurate than GPT-4 at placing splits at actual linguistic boundaries, and 3.8× more accurate than SentencePiece.
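A boundary F1 of this kind can be computed by comparing the split positions a tokenizer places inside a word against the gold morpheme boundaries. The function below is illustrative, not necessarily the exact evaluation script:

```python
def boundaries(pieces):
    """Character offsets where cuts fall inside the word."""
    cuts, pos = set(), 0
    for piece in pieces[:-1]:
        pos += len(piece)
        cuts.add(pos)
    return cuts

def boundary_f1(gold, pred):
    """F1 over split positions: precision/recall of predicted cuts
    against gold morpheme boundaries."""
    g, p = boundaries(gold), boundaries(pred)
    if not g and not p:
        return 1.0  # both agree the word is unsegmentable
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# Gold: pinaka + ma + husay; prediction finds one of the two cuts.
print(boundary_f1(["pinaka", "ma", "husay"], ["pinakama", "husay"]))
```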

Small Language Model Experiment

We trained identical GPT-2 mini (~25M params, 6 layers, 384-dim) models on 47,500 lines from Wikitext-TL-39 — same architecture, same data, same hyperparameters. The only difference was the tokenizer.

Results on 2,500 held-out Filipino sentences:

==================================================
Tokenizer                   Perplexity
--------------------------------------------------
Filipino Tokenizer               24.79
GPT-2 Tokenizer                 100.38
--------------------------------------------------
Winner: Filipino Tokenizer  (75.3% lower perplexity)
==================================================

Fertility comparison (2,000 validation lines):

Metric                            Filipino Tok       GPT-2 Tok
--------------------------------------------------------------
Fertility (tokens/word)                   2.53            2.05
Mean sequence length                      57.6            46.8
Context window utilization               22.5%           18.3%

The Filipino Tokenizer produces higher fertility (about 23% more tokens per word) because it enforces morpheme boundaries instead of greedily merging across them. The payoff is 75% lower perplexity: the model learns Filipino much more efficiently when every token is a meaningful linguistic unit.
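Fertility as reported above is tokens emitted per whitespace-delimited word; the sketch below uses a made-up character-level tokenizer purely to show the arithmetic:

```python
def fertility(texts, tokenize):
    """Average tokens per whitespace word over a corpus."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

texts = ["kumain siya ng pagkain"]
# Stand-in tokenizer that splits every word into characters.
per_char = lambda t: [c for c in t if c != " "]
print(round(fertility(texts, per_char), 2))  # 4.75
```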

Full experiment: Kaggle notebook

Project Structure

filipino-tokenizer/
    src/
        lib.rs                  # Rust BPE backend (CoreBPE, PyO3 bindings)
    filipino_tokenizer/
        base.py                 # BaseAffixes, BaseRoots, BaseSegmenter, BaseTokenizer
        data/
            prefix_table.json       # Prefix definitions, multi-language
            suffix_table.json       # Suffix definitions
            infix_table.json        # Infix definitions
            circumfix_table.json    # Circumfix definitions
            tagalog_roots.json      # ~30k Tagalog root words
            bisaya_roots.json       # Bisaya root words
            pretrained/
                vocab.json          # Bundled 32k vocabulary (Wikitext-TL-39)
                merges.txt          # Bundled merge rules
        tagalog/
            __init__.py         # Package exports
            affixes.py          # TagalogAffixes (filters for language="Tagalog")
            roots.py            # TagalogRoots (loads tagalog_roots.json)
            phonology.py        # Nasal assimilation, suffix h-insertion
            segmenter.py        # TagalogSegmenter (multi-pass morpheme decomposition)
            bpe.py              # MorphAwareBPE (constrained BPE, delegates to Rust)
            tokenizer.py        # TagalogTokenizer (segmenter + BPE pipeline)
            hf_tokenizer.py     # TagalogHFTokenizer (PreTrainedTokenizer wrapper)
    tests/
        test_affixes.py         # Affix loading and filtering tests
        test_segmenter.py       # Morphological segmentation tests
        test_tokenizer.py       # Full pipeline tests (round-trip, consistency, efficiency)
        test_rust_backend.py    # Rust extension tests (encode/decode, morpheme boundaries)
    examples/
        training_tagalog_tokenizer.py   # End-to-end training example
    demo/
        demo_tagalog_tokenizer.ipynb        # Usage guide notebook
        tokenizer_comparisons.ipynb         # Benchmark vs GPT-4 and SentencePiece
        filipino-tokenizer-experiment.ipynb # Full GPT-2 SLM training experiment
    Cargo.toml                  # Rust crate configuration
    pyproject.toml              # Package metadata and build system

Running Tests

# All tests
python -m unittest discover tests -v

# Individual test files
python -m unittest tests.test_affixes -v
python -m unittest tests.test_segmenter -v
python -m unittest tests.test_tokenizer -v
python -m unittest tests.test_rust_backend -v

# Rust unit tests (requires cargo)
cargo test

Adding a New Language

The architecture is designed to support multiple Philippine languages from the same data files. To add Bisaya, Ilokano, or another language:

  1. Add entries to the JSON affix tables in filipino_tokenizer/data/ with the appropriate language field.
  2. Add a root word list (e.g., filipino_tokenizer/data/bisaya_roots.json).
  3. Create filipino_tokenizer/<language>/affixes.py subclassing BaseAffixes with super().__init__(language="<Language>").
  4. Create a roots class subclassing BaseRoots.
  5. Implement a segmenter subclassing BaseSegmenter with language-specific phonological rules.
  6. Create a tokenizer class that wires the segmenter to MorphAwareBPE.
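The subclassing steps might look roughly like this. The base classes below are self-contained stand-ins (in the package they come from filipino_tokenizer.base), and the constructor signatures are assumptions rather than the actual API:

```python
# Stand-in bases, shaped after BaseAffixes / BaseRoots / BaseSegmenter.
class BaseAffixes:
    def __init__(self, language):
        self.language = language  # filters the shared JSON affix tables

class BaseRoots:
    def __init__(self, roots=()):
        self.roots = set(roots)   # would load e.g. bisaya_roots.json

class BaseSegmenter:
    def __init__(self, affixes, roots):
        self.affixes, self.roots = affixes, roots

    def segment(self, word):
        return [word]  # fallback: return the word whole

# Step 3: filter the shared tables for the new language.
class BisayaAffixes(BaseAffixes):
    def __init__(self):
        super().__init__(language="Bisaya")

# Step 5: override with language-specific phonological rules.
class BisayaSegmenter(BaseSegmenter):
    def segment(self, word):
        # Bisaya-specific passes would go here before falling back.
        return super().segment(word)

seg = BisayaSegmenter(BisayaAffixes(), BaseRoots({"balay"}))
print(seg.segment("balay"))  # ['balay']
```

Step 6 then wires such a segmenter into MorphAwareBPE exactly as TagalogTokenizer does for Tagalog.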

Contributing

Contributions are welcome. Areas where help is most needed:

  • Cebuano / Bisaya support — the affix tables already have Bisaya entries; the segmenter and phonology modules are missing.
  • Ilokano, Hiligaynon, Kapampangan — affix data and root dictionaries.
  • Segmenter accuracy — the gold-standard test set in demo/tokenizer_comparisons.ipynb is a good starting point for finding and fixing segmentation errors.
  • Documentation — tutorials, worked examples, and comparisons against newer tokenizers.

Please open an issue or pull request on GitHub. For questions, feel free to reach out via GitHub Issues.

License

MIT License. See LICENSE for details.

Download files

Download the file for your platform.

Source Distribution

filipino_tokenizer-0.4.1.tar.gz (3.0 MB, Source)

Built Distributions

filipino_tokenizer-0.4.1-cp313-cp313-win_amd64.whl (3.2 MB, CPython 3.13, Windows x86-64)
filipino_tokenizer-0.4.1-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB, CPython 3.13, manylinux glibc 2.17+ x86-64)
filipino_tokenizer-0.4.1-cp313-cp313-macosx_11_0_arm64.whl (3.3 MB, CPython 3.13, macOS 11.0+ ARM64)
filipino_tokenizer-0.4.1-cp312-cp312-win_amd64.whl (3.2 MB, CPython 3.12, Windows x86-64)
filipino_tokenizer-0.4.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB, CPython 3.12, manylinux glibc 2.17+ x86-64)
filipino_tokenizer-0.4.1-cp312-cp312-macosx_11_0_arm64.whl (3.3 MB, CPython 3.12, macOS 11.0+ ARM64)
filipino_tokenizer-0.4.1-cp311-cp311-win_amd64.whl (3.2 MB, CPython 3.11, Windows x86-64)
filipino_tokenizer-0.4.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB, CPython 3.11, manylinux glibc 2.17+ x86-64)
filipino_tokenizer-0.4.1-cp311-cp311-macosx_11_0_arm64.whl (3.3 MB, CPython 3.11, macOS 11.0+ ARM64)
filipino_tokenizer-0.4.1-cp310-cp310-win_amd64.whl (3.2 MB, CPython 3.10, Windows x86-64)
filipino_tokenizer-0.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB, CPython 3.10, manylinux glibc 2.17+ x86-64)
filipino_tokenizer-0.4.1-cp310-cp310-macosx_11_0_arm64.whl (3.3 MB, CPython 3.10, macOS 11.0+ ARM64)

File details

Details for the file filipino_tokenizer-0.4.1.tar.gz.

File metadata

  • Download URL: filipino_tokenizer-0.4.1.tar.gz
  • Upload date:
  • Size: 3.0 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for filipino_tokenizer-0.4.1.tar.gz
Algorithm Hash digest
SHA256 e9f8bb70cccbd8259317ac572e39b9a8e340163dc350455d301b1bb6efbf8e92
MD5 201d79c496b7b64a20a8983ebd0bdb4b
BLAKE2b-256 fcedba4b7ba43001101db7df7ac095bb75874a4465561b2a2f2139b03e3476fa

Provenance

The following attestation bundles were made for filipino_tokenizer-0.4.1.tar.gz:

Publisher: publish.yml on JpCurada/filipino-tokenizer

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details for built distributions

Each wheel was published with the same provenance as the source distribution above (publisher: publish.yml on JpCurada/filipino-tokenizer, via Trusted Publishing); attestation values reflect the state when the release was signed and may no longer be current. SHA256 digests:

==========================================================================================
Wheel                     SHA256
------------------------------------------------------------------------------------------
cp313 win_amd64           952dee1a3b6b0c7458cd419b735939da69aedfa8abf9b5e1ad50200ce24ec7c9
cp313 manylinux x86-64    c862900f57d548853bc92cb535738b75b4daa3de6f95d5f80617d6c161af975f
cp313 macOS ARM64         990f5ba2735fdb819e8ee814ae2138352c4db24f7c555b3243c4a474b7c86156
cp312 win_amd64           d81c985826f06454f37a3110f4f769af357422c832a581a8e2e4d8c8a0ae1d47
cp312 manylinux x86-64    38ab0cf0fba768a937681700fb2bd65f90c33c981798c02c8fd7b8b86ecda07c
cp312 macOS ARM64         c7486dc5019abcbb59a3546cb509efc90f5d767408c5826e19a3831ec3a7f151
cp311 win_amd64           3491412966b32d48fb12a4adb04795e750e0450c97c5f26ac19bbdbf3a8b9daf
cp311 manylinux x86-64    5db620c1a97d8610a511f090b2534f0e75d75f916fc52eeb6441dfd16ab920e2
cp311 macOS ARM64         29c0b36b68dacc055f5d1dfc26d48007fa7a8e4b3777092675e59698e33d9973
cp310 win_amd64           d5a8d27e77ce647285c5b0224963c212891ccf1fb7799d5209061d8b0b361b38
cp310 manylinux x86-64    66cd36d7f8d1daf73ad3dc34cb9b5471748c621fe4654d14641118edc3664b14
cp310 macOS ARM64         064c2d05c03b08426868b58f28f34e57be1de130eb11e680fefc8268e5af5fc9
==========================================================================================
