Text segmentation into words for multiple languages.
Project description
Words Segmentation
This repository contains a pretokenizer that segments text into "words" for further processing.
We define three classes of tokens:
- Control tokens (C0 controls; always atomic)
- "Words": runs of non-space, non-control characters, plus an optional single trailing whitespace character
- Whitespace runs
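As a rough illustration (a sketch, not the library's actual implementation), the default three-class segmentation can be expressed as a single regex alternation:

```python
import re

# Hypothetical sketch of the default segmentation:
# a C0 control character is atomic, a "word" is a run of
# non-space, non-control characters plus an optional single
# trailing whitespace character, and whitespace runs stay together.
TOKEN_RE = re.compile(
    r"[\x00-\x1f]"          # a single C0 control character
    r"|[^\s\x00-\x1f]+\s?"  # a word, with up to one trailing whitespace
    r"|\s+"                 # a run of whitespace
)

def segment(text: str) -> list[str]:
    """Split text into control tokens, words, and whitespace runs."""
    return TOKEN_RE.findall(text)

print(segment("hello world!"))  # ['hello ', 'world!']
```

Note how the single space after "hello" attaches to the word, mirroring the output style shown in the usage example below.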
For any script where the default is not suitable, you can implement a custom pretokenizer.
To do so, modify `LANGUAGE_SPECS` in `languages.py` to register a custom callback for specific scripts.
For example:
```python
LANGUAGE_SPECS: Dict[str, LanguageSpec] = {
    "Chinese": {
        "scripts": ("Han",),
        "callback": segment_chinese,
    },
    "Japanese": {
        "scripts": ("Han", "Hiragana", "Katakana"),
        "callback": segment_japanese,
    },
}
```
Additionally, a `max_bytes` parameter splits long words into smaller chunks while preserving Unicode grapheme boundaries.
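To illustrate the byte-limit splitting (a sketch, not the library's actual code), the chunking step can be written as packing grapheme clusters into byte-bounded chunks. Grapheme segmentation itself is assumed to come from elsewhere (e.g. the third-party `regex` module's `\X` pattern); here the clusters are passed in already split:

```python
def chunk_graphemes(graphemes, max_bytes):
    """Pack grapheme clusters into chunks of at most max_bytes UTF-8 bytes,
    never splitting inside a cluster."""
    chunks, current, size = [], [], 0
    for g in graphemes:
        n = len(g.encode("utf-8"))
        # Start a new chunk if adding this cluster would exceed the limit.
        if current and size + n > max_bytes:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(g)
        size += n
    if current:
        chunks.append("".join(current))
    return chunks

# The family emoji is a single grapheme cluster (25 UTF-8 bytes, counting
# the zero-width joiners), so it is kept whole even when max_bytes is 16.
print(chunk_graphemes(["hi", "👩‍👩‍👧‍👦"], max_bytes=16))
```

A cluster larger than `max_bytes` is emitted as its own chunk rather than being broken, which is what "preserving grapheme boundaries" requires.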
Usage
Install:
```shell
pip install words-segmentation
```
Pretokenize text using a Hugging Face `Tokenizer`-compatible implementation:
```python
from words_segmentation.tokenizer import WordsSegmentationTokenizer

pretokenizer = WordsSegmentationTokenizer(max_bytes=16)
tokens = pretokenizer.tokenize("hello world! 我爱北京天安门 👩👩👧👦")
# ['hello ', 'world! ', '我', '爱', '北京', '天安门', ' ', '👩👩👧👦']
```
Writing systems without word boundaries
Perhaps there will come a day when we have a universal pretokenizer that works for all languages. Until then, some writing systems need custom logic. We implement custom fallback pretokenizers for the following writing systems:
- Chinese characters - using jieba
- Japanese writing system - using fugashi
- Balinese script
- Burmese alphabet
- Chữ Hán
- Chữ Nôm
- Hanja
- Javanese script
- Khmer script
- Lao script
- ʼPhags-pa script
- Rasm
- Sawndip
- Scriptio continua
- S'gaw Karen alphabet
- Tai Tham script
- Thai script
- Tibetan script
- Vietnamese alphabet
- Western Pwo alphabet
Tokenization Parity
Foroutan and Meister et al. (2025) note that:
> In multilingual models, the same meaning can take far more tokens in some languages, penalizing users of underrepresented languages with worse performance and higher API costs.
Let's consider the same example, for whitespace pre-tokenization parity:
| Language | Text (Google Translate) | Bytes (UTF-8) | Tokens (GPT-4) | Words (Whitespace+) |
|---|---|---|---|---|
| English | Tours are cheaper for larger groups, so if you're by yourself or with just one friend, try to meet other people and form a group of four to six for a better per-person rate. | 173 | 40 | 34 |
| Italian | I tour sono più economici per i gruppi più numerosi, quindi se sei da solo o con un solo amico, prova a incontrare altre persone e a formare un gruppo da quattro a sei persone per ottenere una tariffa più conveniente a persona. | 230 | 58 | 43 |
| German | Touren sind für größere Gruppen günstiger. Wenn Sie also alleine oder mit nur einem Freund unterwegs sind, versuchen Sie, andere Leute kennenzulernen und eine Gruppe von vier bis sechs Personen zu bilden, um einen besseren Preis pro Person zu erhalten. | 256 | 64 | 40 |
| Chinese | 团体旅游价格更便宜,所以如果您独自一人或只有一个朋友,请尝试结识其他人并组成一个四到六人的团体,以获得更好的每人价格。 | 177 | 64 | 34 |
| Japanese | ツアーはグループが多ければ安くなるので、一人または友達とだけ参加する場合は、他の人と会って4人から6人のグループを作ると、一人当たりの料金が安くなります。 | 227 | 74 | 48 |
| Finnish | Retket ovat halvempia suuremmille ryhmille, joten jos olet yksin tai vain yhden ystävän kanssa, yritä tavata muita ihmisiä ja muodosta neljän tai kuuden hengen ryhmä saadaksesi paremman hinnan per henkilö. | 212 | 79 | 30 |
| Russian | Туры обходятся дешевле для больших групп, поэтому, если вы одни или с одним другом, постарайтесь познакомиться с другими людьми и сформировать группу из четырех-шести человек, чтобы получить более выгодную цену на человека. | 409 | 100 | 32 |
| Arabic | تكون الجولات أرخص بالنسبة للمجموعات الكبيرة، لذلك إذا كنت بمفردك أو مع صديق واحد فقط، فحاول مقابلة أشخاص آخرين وتشكيل مجموعة مكونة من أربعة إلى ستة أشخاص للحصول على سعر أفضل للشخص الواحد. | 341 | 140 | 33 |
| Hebrew | סיורים זולים יותר לקבוצות גדולות יותר, כך שאם אתם לבד או עם חבר אחד בלבד, נסו לפגוש אנשים אחרים וליצור קבוצה של ארבעה עד שישה אנשים לקבלת מחיר טוב יותר לאדם. | 281 | 151 | 31 |
| Greek | Οι εκδρομές είναι φθηνότερες για μεγαλύτερες ομάδες, οπότε αν είστε μόνοι σας ή με έναν μόνο φίλο, προσπαθήστε να γνωρίσετε άλλα άτομα και να σχηματίσετε μια ομάδα τεσσάρων έως έξι ατόμων για καλύτερη τιμή ανά άτομο. | 394 | 193 | 36 |
| Tamil | பெரிய குழுக்களுக்கு சுற்றுலாக்கள் மலிவானவை, எனவே நீங்கள் தனியாகவோ அல்லது ஒரு நண்பருடனோ இருந்தால், மற்றவர்களைச் சந்தித்து நான்கு முதல் ஆறு பேர் கொண்ட குழுவை உருவாக்கி, ஒரு நபருக்கு சிறந்த விலையைப் பெற முயற்சிக்கவும். | 587 | 293 | 26 |
| Kannada | ದೊಡ್ಡ ಗುಂಪುಗಳಿಗೆ ಪ್ರವಾಸಗಳು ಅಗ್ಗವಾಗಿರುತ್ತವೆ, ಆದ್ದರಿಂದ ನೀವು ಒಬ್ಬಂಟಿಯಾಗಿ ಅಥವಾ ಒಬ್ಬ ಸ್ನೇಹಿತನೊಂದಿಗೆ ಇದ್ದರೆ, ಇತರ ಜನರನ್ನು ಭೇಟಿ ಮಾಡಲು ಪ್ರಯತ್ನಿಸಿ ಮತ್ತು ಪ್ರತಿ ವ್ಯಕ್ತಿಗೆ ಉತ್ತಮ ದರಕ್ಕಾಗಿ ನಾಲ್ಕರಿಂದ ಆರು ಜನರ ಗುಂಪನ್ನು ರಚಿಸಿ. | 565 | 361 | 26 |
| Shan | ၶၢဝ်းတၢင်း တႃႇၸုမ်းယႂ်ႇၼၼ်ႉ ၵႃႈၶၼ်မၼ်း ထုၵ်ႇလိူဝ်လႄႈ သင်ဝႃႈ ၸဝ်ႈၵဝ်ႇ ယူႇႁင်းၵူၺ်း ဢမ်ႇၼၼ် မီးဢူၺ်းၵေႃႉ ၵေႃႉလဵဝ်ၵွႆးၼႆၸိုင် ၶတ်းၸႂ် ႁူပ်ႉထူပ်း ၵူၼ်းတၢင်ႇၵေႃႉသေ ႁဵတ်းၸုမ်း 4 ၵေႃႉ တေႃႇထိုင် 6 ၵေႃႉ ႁႂ်ႈလႆႈ ၵႃႈၶၼ် ၼိုင်ႈၵေႃႉ ဢၼ်လီလိူဝ်ၼၼ်ႉယဝ်ႉ။ | 669 | 531 | 23 |
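The byte and whitespace-word columns of the table are straightforward to reproduce (token counts would additionally require a tokenizer such as OpenAI's `tiktoken`):

```python
def byte_and_word_counts(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, whitespace-separated word count)."""
    return len(text.encode("utf-8")), len(text.split())

english = (
    "Tours are cheaper for larger groups, so if you're by yourself or with "
    "just one friend, try to meet other people and form a group of four to "
    "six for a better per-person rate."
)
print(byte_and_word_counts(english))  # (173, 34), matching the table
```

For English the byte count equals the character count, since the sentence is pure ASCII; for the other languages the gap between the two columns reflects UTF-8's multi-byte encoding.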
Bytes Efficiency
English really is the most efficient language in terms of byte count, which is not surprising given its Latin alphabet without diacritics or ligatures (one byte per character in UTF-8). Other languages that use the Latin alphabet are also relatively efficient (e.g. Italian, German, Finnish), but their use of diacritics and ligatures increases the byte count.
Languages that use non-Latin scripts (e.g. Arabic, Hebrew, Shan) have much higher byte counts, since UTF-8 needs multiple bytes per character: Hebrew and Arabic use two bytes per character, while Shan uses three, not counting ligatures.
Tokenization Efficiency (GPT-4)
English is also the most efficient language in terms of token count, which is not surprising given that the tokenizer was trained primarily on English text. Other languages that use the Latin alphabet are also relatively efficient, but the moment we move to non-Latin scripts, the token count increases significantly (up to 13x for Shan).
Words Efficiency
Assuming whitespace tokenization as a proxy for words, we see that English is not the most efficient language. This makes sense from a language-efficiency perspective: there is no computational bias toward English. Languages range between 23 and 48 words for the same sentence, with English near the middle at 34.
Cite
If you use this code in your research, please consider citing the work:
```bibtex
@misc{moryossef2025words,
  title={Words Segmentation: A Word Level Pre-tokenizer for Languages of the World},
  author={Moryossef, Amit},
  howpublished={\url{https://github.com/sign/words-segmentation}},
  year={2025}
}
```
File details
Details for the file words_segmentation-0.0.2.tar.gz.
File metadata
- Download URL: words_segmentation-0.0.2.tar.gz
- Upload date:
- Size: 20.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 75c63b0cfcc6713728df3572f3c3ac80f3866c07cdbc9dd3ed2ecdc9d092b08b |
| MD5 | 4841a8ce89a28707494dcfdc4320998d |
| BLAKE2b-256 | d41c0c346fdd979fc3cbedc57ae0641334a03ed1301790cf8d5475d17ff454d6 |
Provenance
The following attestation bundles were made for words_segmentation-0.0.2.tar.gz:
- Publisher: release.yaml on sign/words-segmentation
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: words_segmentation-0.0.2.tar.gz
- Subject digest: 75c63b0cfcc6713728df3572f3c3ac80f3866c07cdbc9dd3ed2ecdc9d092b08b
- Sigstore transparency entry: 605348004
- Permalink: sign/words-segmentation@1b2f8d52ee3d30e6a124b796987cb1455a8f2dc1
- Branch / Tag: refs/tags/v0.0.2
- Owner: https://github.com/sign
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yaml@1b2f8d52ee3d30e6a124b796987cb1455a8f2dc1
- Trigger Event: release
File details
Details for the file words_segmentation-0.0.2-py3-none-any.whl.
File metadata
- Download URL: words_segmentation-0.0.2-py3-none-any.whl
- Upload date:
- Size: 14.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 93fef72d01a616baac7fd27acc66659a3343ec0885b281a9bfb2ebc5c4c93889 |
| MD5 | b23ea98b89094aafadc546f0cf777660 |
| BLAKE2b-256 | 554b24446732bea6095045a13b3ff39abd10c75c1e86b1769b38b4a71138ced0 |
Provenance
The following attestation bundles were made for words_segmentation-0.0.2-py3-none-any.whl:
- Publisher: release.yaml on sign/words-segmentation
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: words_segmentation-0.0.2-py3-none-any.whl
- Subject digest: 93fef72d01a616baac7fd27acc66659a3343ec0885b281a9bfb2ebc5c4c93889
- Sigstore transparency entry: 605348008
- Permalink: sign/words-segmentation@1b2f8d52ee3d30e6a124b796987cb1455a8f2dc1
- Branch / Tag: refs/tags/v0.0.2
- Owner: https://github.com/sign
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yaml@1b2f8d52ee3d30e6a124b796987cb1455a8f2dc1
- Trigger Event: release