Text segmentation into words for multiple languages.
Words Segmentation
This repository contains a pretokenizer that segments text into "words" for further processing.
We define three classes of tokens:
- C0 control tokens (always atomic)
- "Words": runs of non-space, non-control characters, plus an optional single trailing whitespace character
- Whitespace runs
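As a concrete illustration, the three token classes above can be expressed as a single regular expression. This is a hypothetical re-implementation (the `TOKEN_RE` pattern and `pretokenize` helper are sketches, not the library's actual code):

```python
import re

# Sketch of the default segmentation rule, in priority order:
#   1. a single C0 control character (always atomic)
#   2. a "word": non-space, non-control run + optional single trailing whitespace
#   3. a run of whitespace
TOKEN_RE = re.compile(
    r"[\x00-\x1f]"            # one C0 control character
    r"|[^\s\x00-\x1f]+\s?"    # a word + at most one trailing whitespace char
    r"|\s+"                   # a whitespace run
)

def pretokenize(text: str) -> list[str]:
    return TOKEN_RE.findall(text)

print(pretokenize("hello world!  ok"))
# ['hello ', 'world! ', ' ', 'ok']
```

Note how each word absorbs a single following space, and only the leftover whitespace becomes its own token.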
For any script where the default is not suitable, you can implement a custom pretokenizer.
Modify LANGUAGE_SPECS in languages.py to add a custom function for specific scripts.
For example:
```python
LANGUAGE_SPECS: Dict[str, LanguageSpec] = {
    "Chinese": {
        "scripts": ("Han",),
        "callback": segment_chinese,
    },
    "Japanese": {
        "scripts": ("Han", "Hiragana", "Katakana"),
        "callback": segment_japanese,
    },
}
```
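A callback is assumed here to be a plain function from a text span to a list of tokens. The `segment_chinese` below is a placeholder sketch (one character per token) standing in for the real jieba-backed implementation; the signature is an assumption, not documented API:

```python
# Hypothetical callback shape: text span in, list of "words" out.
def segment_chinese(text: str) -> list[str]:
    # A real implementation would call e.g. jieba.lcut(text);
    # as a placeholder, emit one character per token.
    return list(text)

spec = {"scripts": ("Han",), "callback": segment_chinese}
print(spec["callback"]("北京"))
# ['北', '京']
```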
Additionally, a max_bytes parameter splits long words into smaller chunks while preserving
Unicode grapheme boundaries.
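A rough sketch of what max_bytes splitting could look like. The library preserves full Unicode grapheme clusters; this simplified approximation (the name `split_max_bytes` is hypothetical) only guarantees that no UTF-8 codepoint is split and that combining marks and ZWJ sequences stay attached to the preceding character:

```python
import unicodedata

def split_max_bytes(token: str, max_bytes: int) -> list[str]:
    """Split a token into chunks of at most max_bytes UTF-8 bytes,
    keeping combining marks / ZWJ joins glued to what precedes them
    (a crude stand-in for true grapheme-cluster segmentation)."""
    chunks, current, size = [], "", 0
    for ch in token:
        n = len(ch.encode("utf-8"))
        # never break before a combining mark, a ZWJ, or right after a ZWJ
        glue = (unicodedata.combining(ch) != 0
                or ch == "\u200d"
                or current.endswith("\u200d"))
        if current and not glue and size + n > max_bytes:
            chunks.append(current)
            current, size = "", 0
        current += ch
        size += n
    if current:
        chunks.append(current)
    return chunks

print(split_max_bytes("天安门广场", 6))  # each Han character is 3 UTF-8 bytes
# ['天安', '门广', '场']
```

With max_bytes=6, two 3-byte Han characters fit per chunk, so a 5-character word splits into 2 + 2 + 1 characters.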
Usage
Install:
```bash
pip install words-segmentation
```
Pretokenize text using a Hugging Face Tokenizer-compatible implementation:
```python
from words_segmentation.tokenizer import WordsSegmentationTokenizer

pretokenizer = WordsSegmentationTokenizer(max_bytes=16)
tokens = pretokenizer.tokenize("hello world! 我爱北京天安门 👩👩👧👦")
# ['hello ', 'world! ', '我', '爱', '北京', '天安门', ' ', '👩👩👧👦']
```
Writing systems without word boundaries
Perhaps one day a universal pretokenizer will work for all languages. Until then, some writing systems need custom logic. We implement custom fallback pretokenizers for the following writing systems:
- Chinese characters - using jieba
- Japanese writing system - using fugashi
- Balinese script
- Burmese alphabet
- Chữ Hán
- Chữ Nôm
- Hanja
- Javanese script
- Khmer script
- Lao script
- ʼPhags-pa script
- Rasm
- Sawndip
- Scriptio continua
- S'gaw Karen alphabet
- Tai Tham script
- Thai script
- Tibetan script
- Vietnamese alphabet
- Western Pwo alphabet
Cite
If you use this code in your research, please consider citing the work:
```bibtex
@misc{moryossef2025words,
  title={Words Segmentation: A Word Level Pre-tokenizer for Languages of the World},
  author={Moryossef, Amit},
  howpublished={\url{https://github.com/sign/words-segmentation}},
  year={2025}
}
```