
Text segmentation into words for multiple languages.

Project description

Words Segmentation

This repository contains a pretokenizer that segments text into "words" for further processing.

We define three classes of tokens:

  1. C0 Control tokens (always atomic)
  2. "Words" = runs of non-space, non-control + optional single trailing whitespace
  3. Whitespace runs
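As a rough illustration (not the library's actual implementation), the three token classes above can be expressed as a single regular expression. The pattern below is an assumption reconstructed from the description, so treat it as a sketch:

```python
import re

# Sketch of the three token classes described above:
#   1. a single C0 control character (always atomic)
#   2. a "word": a run of non-space, non-control characters,
#      plus an optional single trailing whitespace character
#   3. a run of whitespace
TOKEN_RE = re.compile(
    r"[\x00-\x1F]"            # 1. C0 control, one per token
    r"|[^\s\x00-\x1F]+\s?"    # 2. word + optional trailing whitespace
    r"|\s+"                   # 3. whitespace run
)

def pretokenize(text: str) -> list[str]:
    """Split text into control / word / whitespace tokens."""
    return TOKEN_RE.findall(text)

print(pretokenize("hello  world!"))  # → ['hello ', ' ', 'world!']
```

Note how a word absorbs at most one trailing space; any extra whitespace becomes its own run, matching the `' '` token in the usage example below.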

For any script where the default is not suitable, you can implement a custom pretokenizer. Modify LANGUAGE_SPECS in languages.py to add a custom function for specific scripts.

For example:

LANGUAGE_SPECS: Dict[str, LanguageSpec] = {
    "Chinese": {
        "scripts": ("Han",),
        "callback": segment_chinese,
    },
    "Japanese": {
        "scripts": ("Han", "Hiragana", "Katakana"),
        "callback": segment_japanese,
    },
}
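The exact callback contract is not shown here, but a plausible sketch is a function that receives a run of text in the matched scripts and returns its word tokens. The naive version below emits one token per character and is purely hypothetical; judging by the usage example, the real `segment_chinese` groups multi-character words such as "北京" (presumably via a dictionary-based segmenter):

```python
# Hypothetical callback with the shape LANGUAGE_SPECS appears to expect:
# take a text span in the matched scripts, return a list of word tokens.
# This naive fallback treats every character as its own word.
def segment_chinese(text: str) -> list[str]:
    return list(text)

print(segment_chinese("我爱北京"))  # → ['我', '爱', '北', '京']
```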

Then, with a max_bytes parameter, we split long words into smaller chunks while preserving Unicode grapheme boundaries.
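A simplified sketch of the `max_bytes` splitting step (again, not the library's code): the library preserves full grapheme-cluster boundaries, while for brevity this version only guarantees it never splits inside a single code point's UTF-8 encoding:

```python
# Split a long word into chunks of at most max_bytes UTF-8 bytes,
# breaking only at code-point boundaries (the real implementation
# additionally respects grapheme-cluster boundaries, e.g. for emoji).
def split_long_word(word: str, max_bytes: int) -> list[str]:
    chunks: list[str] = []
    current, current_bytes = "", 0
    for ch in word:
        ch_bytes = len(ch.encode("utf-8"))
        # Flush the current chunk if adding this character would overflow it.
        if current and current_bytes + ch_bytes > max_bytes:
            chunks.append(current)
            current, current_bytes = "", 0
        current += ch
        current_bytes += ch_bytes
    if current:
        chunks.append(current)
    return chunks

print(split_long_word("internationalization", max_bytes=8))
# → ['internat', 'ionaliza', 'tion']
```

With multi-byte scripts the byte budget fills faster: each CJK character costs three UTF-8 bytes, so `max_bytes=7` yields two-character chunks.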

Usage

Install:

pip install words-segmentation

Pretokenize text using the Hugging Face-compatible tokenizer implementation:

from words_segmentation.tokenizer import WordsSegmentationTokenizer

pretokenizer = WordsSegmentationTokenizer(max_bytes=16)
tokens = pretokenizer.tokenize("hello world! 我爱北京天安门 👩‍👩‍👧‍👦")
# ['hello ', 'world! ', '我', '爱', '北京', '天安门', ' ', '👩‍👩‍👧‍👦']

Writing systems without word boundaries

Perhaps there will come a day when we have a universal pretokenizer that works for all languages. Until then, some writing systems need custom logic, and we implement custom fallback pretokenizers for them (see LANGUAGE_SPECS in languages.py).

Cite

If you use this code in your research, please consider citing the work:

@misc{moryossef2025words,
  title={Words Segmentation: A Word Level Pre-tokenizer for Languages of the World},
  author={Moryossef, Amit},
  howpublished={\url{https://github.com/sign/words-segmentation}},
  year={2025}
}

Download files

Download the file for your platform.

Source Distribution

words_segmentation-0.0.1.tar.gz (11.9 kB)

Uploaded Source

Built Distribution


words_segmentation-0.0.1-py3-none-any.whl (10.3 kB)

Uploaded Python 3

File details

Details for the file words_segmentation-0.0.1.tar.gz.

File metadata

  • Download URL: words_segmentation-0.0.1.tar.gz
  • Upload date:
  • Size: 11.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for words_segmentation-0.0.1.tar.gz
Algorithm Hash digest
SHA256 ce3e5033322b0951d4b714f5b862b3e6469829f438d4a66126a0ed48e50b0b91
MD5 8f8e7c259fe45023ad5284376cca2ea3
BLAKE2b-256 efc4517be058f38ab00d06ef9d51e32fa72a38d095b7159335a9350121bb7814


Provenance

The following attestation bundles were made for words_segmentation-0.0.1.tar.gz:

Publisher: release.yaml on sign/words-segmentation

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file words_segmentation-0.0.1-py3-none-any.whl.

File metadata

File hashes

Hashes for words_segmentation-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 e2a3c43eaa9d7ba6603dcc49c4c830f21d380fc6921df14e6df2289c4968b445
MD5 24efe1d65d8af7802acefe84e54007bc
BLAKE2b-256 e7a93090ca892e59cbcea21922bc4f99d37bc47500f9d6db2575e0b168cf28c5


Provenance

The following attestation bundles were made for words_segmentation-0.0.1-py3-none-any.whl:

Publisher: release.yaml on sign/words-segmentation

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
