Text segmentation into words for multiple languages.
Project description
Words Segmentation
This repository contains a pretokenizer that segments text into "words" for further processing.
We define three classes of tokens:
- Control tokens (C0 controls; always atomic)
- "Words": runs of non-space, non-control characters, plus an optional single trailing whitespace character
- Whitespace runs
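As a rough illustration (a sketch, not the library's actual implementation), the default three-class segmentation can be expressed as a single regex alternation:

```python
import re

# Hypothetical sketch of the default segmentation:
# a C0 control character is atomic, a "word" is a run of
# non-space, non-control characters plus an optional single
# trailing whitespace character, and whitespace runs stay together.
TOKEN_RE = re.compile(
    r"[\x00-\x1f]"          # a single C0 control character
    r"|[^\s\x00-\x1f]+\s?"  # a word, with up to one trailing whitespace
    r"|\s+"                 # a run of whitespace
)

def segment(text: str) -> list[str]:
    """Split text into control tokens, words, and whitespace runs."""
    return TOKEN_RE.findall(text)

print(segment("hello world!"))  # ['hello ', 'world!']
```

Note how the single space after "hello" attaches to the word, mirroring the output style shown in the usage example below.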
For any script where the default is not suitable, you can implement a custom pretokenizer.
To do so, modify `LANGUAGE_SPECS` in `languages.py` to register a custom callback for specific scripts.
For example:
```python
LANGUAGE_SPECS: Dict[str, LanguageSpec] = {
    "Chinese": {
        "scripts": ("Han",),
        "callback": segment_chinese,
    },
    "Japanese": {
        "scripts": ("Han", "Hiragana", "Katakana"),
        "callback": segment_japanese,
    },
}
```
Additionally, a `max_bytes` parameter splits long words into smaller chunks while preserving Unicode grapheme boundaries.
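To illustrate the byte-limit splitting (a sketch, not the library's actual code), the chunking step can be written as packing grapheme clusters into byte-bounded chunks. Grapheme segmentation itself is assumed to come from elsewhere (e.g. the third-party `regex` module's `\X` pattern); here the clusters are passed in already split:

```python
def chunk_graphemes(graphemes, max_bytes):
    """Pack grapheme clusters into chunks of at most max_bytes UTF-8 bytes,
    never splitting inside a cluster."""
    chunks, current, size = [], [], 0
    for g in graphemes:
        n = len(g.encode("utf-8"))
        # Start a new chunk if adding this cluster would exceed the limit.
        if current and size + n > max_bytes:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(g)
        size += n
    if current:
        chunks.append("".join(current))
    return chunks

# The family emoji is a single grapheme cluster (25 UTF-8 bytes, counting
# the zero-width joiners), so it is kept whole even when max_bytes is 16.
print(chunk_graphemes(["hi", "👩‍👩‍👧‍👦"], max_bytes=16))
```

A cluster larger than `max_bytes` is emitted as its own chunk rather than being broken, which is what "preserving grapheme boundaries" requires.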
Usage
Install:
```shell
pip install words-segmentation
```
Pretokenize text using a Hugging Face `Tokenizer`-compatible implementation:
```python
from words_segmentation.tokenizer import WordsSegmentationTokenizer

pretokenizer = WordsSegmentationTokenizer(max_bytes=16)
tokens = pretokenizer.tokenize("hello world! 我爱北京天安门 👩👩👧👦")
# ['hello ', 'world! ', '我', '爱', '北京', '天安门', ' ', '👩👩👧👦']
```
Writing systems without word boundaries
Perhaps there will come a day when we have a universal pretokenizer that works for all languages. Until then, some writing systems need custom logic. We implement custom fallback pretokenizers for the following writing systems:
- Chinese characters - using jieba
- Japanese writing system - using fugashi
- Balinese script
- Burmese alphabet
- Chữ Hán
- Chữ Nôm
- Hanja
- Javanese script
- Khmer script
- Lao script
- ʼPhags-pa script
- Rasm
- Sawndip
- Scriptio continua
- S'gaw Karen alphabet
- Tai Tham script
- Thai script
- Tibetan script
- Vietnamese alphabet
- Western Pwo alphabet
Tokenization Parity
Foroutan and Meister et al. (2025) note that:
> In multilingual models, the same meaning can take far more tokens in some languages, penalizing users of underrepresented languages with worse performance and higher API costs.
Let's consider the same example, for whitespace pre-tokenization parity:
| Language | Text (Google Translate) | Bytes (UTF-8) | Tokens (GPT-4) | Words (Whitespace+) |
|---|---|---|---|---|
| English | Tours are cheaper for larger groups, so if you're by yourself or with just one friend, try to meet other people and form a group of four to six for a better per-person rate. | 173 | 40 | 34 |
| Italian | I tour sono più economici per i gruppi più numerosi, quindi se sei da solo o con un solo amico, prova a incontrare altre persone e a formare un gruppo da quattro a sei persone per ottenere una tariffa più conveniente a persona. | 230 | 58 | 43 |
| German | Touren sind für größere Gruppen günstiger. Wenn Sie also alleine oder mit nur einem Freund unterwegs sind, versuchen Sie, andere Leute kennenzulernen und eine Gruppe von vier bis sechs Personen zu bilden, um einen besseren Preis pro Person zu erhalten. | 256 | 64 | 40 |
| Chinese | 团体旅游价格更便宜,所以如果您独自一人或只有一个朋友,请尝试结识其他人并组成一个四到六人的团体,以获得更好的每人价格。 | 177 | 64 | 34 |
| Japanese | ツアーはグループが多ければ安くなるので、一人または友達とだけ参加する場合は、他の人と会って4人から6人のグループを作ると、一人当たりの料金が安くなります。 | 227 | 74 | 48 |
| Finnish | Retket ovat halvempia suuremmille ryhmille, joten jos olet yksin tai vain yhden ystävän kanssa, yritä tavata muita ihmisiä ja muodosta neljän tai kuuden hengen ryhmä saadaksesi paremman hinnan per henkilö. | 212 | 79 | 30 |
| Russian | Туры обходятся дешевле для больших групп, поэтому, если вы одни или с одним другом, постарайтесь познакомиться с другими людьми и сформировать группу из четырех-шести человек, чтобы получить более выгодную цену на человека. | 409 | 100 | 32 |
| Arabic | تكون الجولات أرخص بالنسبة للمجموعات الكبيرة، لذلك إذا كنت بمفردك أو مع صديق واحد فقط، فحاول مقابلة أشخاص آخرين وتشكيل مجموعة مكونة من أربعة إلى ستة أشخاص للحصول على سعر أفضل للشخص الواحد. | 341 | 140 | 33 |
| Hebrew | סיורים זולים יותר לקבוצות גדולות יותר, כך שאם אתם לבד או עם חבר אחד בלבד, נסו לפגוש אנשים אחרים וליצור קבוצה של ארבעה עד שישה אנשים לקבלת מחיר טוב יותר לאדם. | 281 | 151 | 31 |
| Greek | Οι εκδρομές είναι φθηνότερες για μεγαλύτερες ομάδες, οπότε αν είστε μόνοι σας ή με έναν μόνο φίλο, προσπαθήστε να γνωρίσετε άλλα άτομα και να σχηματίσετε μια ομάδα τεσσάρων έως έξι ατόμων για καλύτερη τιμή ανά άτομο. | 394 | 193 | 36 |
| Tamil | பெரிய குழுக்களுக்கு சுற்றுலாக்கள் மலிவானவை, எனவே நீங்கள் தனியாகவோ அல்லது ஒரு நண்பருடனோ இருந்தால், மற்றவர்களைச் சந்தித்து நான்கு முதல் ஆறு பேர் கொண்ட குழுவை உருவாக்கி, ஒரு நபருக்கு சிறந்த விலையைப் பெற முயற்சிக்கவும். | 587 | 293 | 26 |
| Kannada | ದೊಡ್ಡ ಗುಂಪುಗಳಿಗೆ ಪ್ರವಾಸಗಳು ಅಗ್ಗವಾಗಿರುತ್ತವೆ, ಆದ್ದರಿಂದ ನೀವು ಒಬ್ಬಂಟಿಯಾಗಿ ಅಥವಾ ಒಬ್ಬ ಸ್ನೇಹಿತನೊಂದಿಗೆ ಇದ್ದರೆ, ಇತರ ಜನರನ್ನು ಭೇಟಿ ಮಾಡಲು ಪ್ರಯತ್ನಿಸಿ ಮತ್ತು ಪ್ರತಿ ವ್ಯಕ್ತಿಗೆ ಉತ್ತಮ ದರಕ್ಕಾಗಿ ನಾಲ್ಕರಿಂದ ಆರು ಜನರ ಗುಂಪನ್ನು ರಚಿಸಿ. | 565 | 361 | 26 |
| Shan | ၶၢဝ်းတၢင်း တႃႇၸုမ်းယႂ်ႇၼၼ်ႉ ၵႃႈၶၼ်မၼ်း ထုၵ်ႇလိူဝ်လႄႈ သင်ဝႃႈ ၸဝ်ႈၵဝ်ႇ ယူႇႁင်းၵူၺ်း ဢမ်ႇၼၼ် မီးဢူၺ်းၵေႃႉ ၵေႃႉလဵဝ်ၵွႆးၼႆၸိုင် ၶတ်းၸႂ် ႁူပ်ႉထူပ်း ၵူၼ်းတၢင်ႇၵေႃႉသေ ႁဵတ်းၸုမ်း 4 ၵေႃႉ တေႃႇထိုင် 6 ၵေႃႉ ႁႂ်ႈလႆႈ ၵႃႈၶၼ် ၼိုင်ႈၵေႃႉ ဢၼ်လီလိူဝ်ၼၼ်ႉယဝ်ႉ။ | 669 | 531 | 23 |
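The byte and whitespace-word columns of the table are straightforward to reproduce (token counts would additionally require a tokenizer such as OpenAI's `tiktoken`):

```python
def byte_and_word_counts(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, whitespace-separated word count)."""
    return len(text.encode("utf-8")), len(text.split())

english = (
    "Tours are cheaper for larger groups, so if you're by yourself or with "
    "just one friend, try to meet other people and form a group of four to "
    "six for a better per-person rate."
)
print(byte_and_word_counts(english))  # (173, 34), matching the table
```

For English the byte count equals the character count, since the sentence is pure ASCII; for the other languages the gap between the two columns reflects UTF-8's multi-byte encoding.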
Bytes Efficiency
English really is the most efficient language in terms of byte count, which is not surprising given its Latin alphabet without diacritics or ligatures (one byte per character in UTF-8). Other languages that use the Latin alphabet are also relatively efficient (e.g. Italian, German, Finnish), but their use of diacritics and ligatures increases the byte count.
Languages that use non-Latin scripts (e.g. Arabic, Hebrew, Shan) have much higher byte counts, since UTF-8 needs multiple bytes per character: Hebrew and Arabic use two bytes per character, while Shan uses three, not counting ligatures.
Tokenization Efficiency (GPT-4)
English is also the most efficient language in terms of token count, which is not surprising given that the tokenizer was trained primarily on English text. Other languages that use the Latin alphabet are also relatively efficient, but the moment we move to non-Latin scripts, the token count increases significantly (up to 13x for Shan).
Words Efficiency
Assuming whitespace tokenization as a proxy for words, we see that English is not the most efficient language. This makes sense from a language-efficiency perspective: there is no computational bias toward English. Languages range between 23 and 48 words for the same sentence, with English near the middle at 34.
Cite
If you use this code in your research, please consider citing the work:
```bibtex
@misc{moryossef2025words,
  title={Words Segmentation: A Word Level Pre-tokenizer for Languages of the World},
  author={Moryossef, Amit},
  howpublished={\url{https://github.com/sign/words-segmentation}},
  year={2025}
}
```
File details
Details for the file words_segmentation-0.0.2.tar.gz.
File metadata
- Download URL: words_segmentation-0.0.2.tar.gz
- Upload date:
- Size: 20.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 75c63b0cfcc6713728df3572f3c3ac80f3866c07cdbc9dd3ed2ecdc9d092b08b |
| MD5 | 4841a8ce89a28707494dcfdc4320998d |
| BLAKE2b-256 | d41c0c346fdd979fc3cbedc57ae0641334a03ed1301790cf8d5475d17ff454d6 |
Provenance
The following attestation bundles were made for words_segmentation-0.0.2.tar.gz:
- Publisher: release.yaml on sign/words-segmentation
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: words_segmentation-0.0.2.tar.gz
- Subject digest: 75c63b0cfcc6713728df3572f3c3ac80f3866c07cdbc9dd3ed2ecdc9d092b08b
- Sigstore transparency entry: 605348004
- Permalink: sign/words-segmentation@1b2f8d52ee3d30e6a124b796987cb1455a8f2dc1
- Branch / Tag: refs/tags/v0.0.2
- Owner: https://github.com/sign
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yaml@1b2f8d52ee3d30e6a124b796987cb1455a8f2dc1
- Trigger Event: release
File details
Details for the file words_segmentation-0.0.2-py3-none-any.whl.
File metadata
- Download URL: words_segmentation-0.0.2-py3-none-any.whl
- Upload date:
- Size: 14.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 93fef72d01a616baac7fd27acc66659a3343ec0885b281a9bfb2ebc5c4c93889 |
| MD5 | b23ea98b89094aafadc546f0cf777660 |
| BLAKE2b-256 | 554b24446732bea6095045a13b3ff39abd10c75c1e86b1769b38b4a71138ced0 |
Provenance
The following attestation bundles were made for words_segmentation-0.0.2-py3-none-any.whl:
- Publisher: release.yaml on sign/words-segmentation
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: words_segmentation-0.0.2-py3-none-any.whl
- Subject digest: 93fef72d01a616baac7fd27acc66659a3343ec0885b281a9bfb2ebc5c4c93889
- Sigstore transparency entry: 605348008
- Permalink: sign/words-segmentation@1b2f8d52ee3d30e6a124b796987cb1455a8f2dc1
- Branch / Tag: refs/tags/v0.0.2
- Owner: https://github.com/sign
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yaml@1b2f8d52ee3d30e6a124b796987cb1455a8f2dc1
- Trigger Event: release