
Text segmentation into words for multiple languages.

Project description

Words Segmentation

This repository contains a pretokenizer that segments text into "words" for further processing.

We define three classes of tokens:

  1. C0 Control tokens (always atomic)
  2. "Words" = runs of non-space, non-control + optional single trailing whitespace
  3. Whitespace runs
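As a rough illustration (not the library's actual implementation), the three token classes above can be expressed as a single regular expression. The pattern below is an assumption reconstructed from the description, so treat it as a sketch:

```python
import re

# Sketch of the three token classes described above:
#   1. a single C0 control character (always atomic)
#   2. a "word": a run of non-space, non-control characters,
#      plus an optional single trailing whitespace character
#   3. a run of whitespace
TOKEN_RE = re.compile(
    r"[\x00-\x1F]"            # 1. C0 control, one per token
    r"|[^\s\x00-\x1F]+\s?"    # 2. word + optional trailing whitespace
    r"|\s+"                   # 3. whitespace run
)

def pretokenize(text: str) -> list[str]:
    """Split text into control / word / whitespace tokens."""
    return TOKEN_RE.findall(text)

print(pretokenize("hello  world!"))  # → ['hello ', ' ', 'world!']
```

Note how a word absorbs at most one trailing space; any extra whitespace becomes its own run, matching the `' '` token in the usage example below.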

For any script where the default is not suitable, you can implement a custom pretokenizer. Modify LANGUAGE_SPECS in languages.py to add a custom function for specific scripts.

For example:

LANGUAGE_SPECS: Dict[str, LanguageSpec] = {
    "Chinese": {
        "scripts": ("Han",),
        "callback": segment_chinese,
    },
    "Japanese": {
        "scripts": ("Han", "Hiragana", "Katakana"),
        "callback": segment_japanese,
    },
}
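The exact callback contract is not shown here, but a plausible sketch is a function that receives a run of text in the matched scripts and returns its word tokens. The naive version below emits one token per character and is purely hypothetical; judging by the usage example, the real `segment_chinese` groups multi-character words such as "北京" (presumably via a dictionary-based segmenter):

```python
# Hypothetical callback with the shape LANGUAGE_SPECS appears to expect:
# take a text span in the matched scripts, return a list of word tokens.
# This naive fallback treats every character as its own word.
def segment_chinese(text: str) -> list[str]:
    return list(text)

print(segment_chinese("我爱北京"))  # → ['我', '爱', '北', '京']
```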

Then, with a max_bytes parameter, we split long words into smaller chunks while preserving Unicode grapheme boundaries.
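A simplified sketch of the `max_bytes` splitting step (again, not the library's code): the library preserves full grapheme-cluster boundaries, while for brevity this version only guarantees it never splits inside a single code point's UTF-8 encoding:

```python
# Split a long word into chunks of at most max_bytes UTF-8 bytes,
# breaking only at code-point boundaries (the real implementation
# additionally respects grapheme-cluster boundaries, e.g. for emoji).
def split_long_word(word: str, max_bytes: int) -> list[str]:
    chunks: list[str] = []
    current, current_bytes = "", 0
    for ch in word:
        ch_bytes = len(ch.encode("utf-8"))
        # Flush the current chunk if adding this character would overflow it.
        if current and current_bytes + ch_bytes > max_bytes:
            chunks.append(current)
            current, current_bytes = "", 0
        current += ch
        current_bytes += ch_bytes
    if current:
        chunks.append(current)
    return chunks

print(split_long_word("internationalization", max_bytes=8))
# → ['internat', 'ionaliza', 'tion']
```

With multi-byte scripts the byte budget fills faster: each CJK character costs three UTF-8 bytes, so `max_bytes=7` yields two-character chunks.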

Usage

Install:

pip install words-segmentation

Pretokenize text using the Hugging Face-compatible tokenizer implementation:

from words_segmentation.tokenizer import WordsSegmentationTokenizer

pretokenizer = WordsSegmentationTokenizer(max_bytes=16)
tokens = pretokenizer.tokenize("hello world! 我爱北京天安门 👩‍👩‍👧‍👦")
# ['hello ', 'world! ', '我', '爱', '北京', '天安门', ' ', '👩‍👩‍👧‍👦']

Writing systems without word boundaries

Perhaps there will come a day when we have a universal pretokenizer that works for all languages. Until then, some writing systems need custom logic, and we implement custom fallback pretokenizers for them (see LANGUAGE_SPECS in languages.py).

Cite

If you use this code in your research, please consider citing the work:

@misc{moryossef2025words,
  title={Words Segmentation: A Word Level Pre-tokenizer for Languages of the World},
  author={Moryossef, Amit},
  howpublished={\url{https://github.com/sign/words-segmentation}},
  year={2025}
}

Download files

Download the file for your platform.

Source Distribution

words_segmentation-0.0.1.tar.gz (11.9 kB)

Uploaded Source

Built Distribution


words_segmentation-0.0.1-py3-none-any.whl (10.3 kB)

Uploaded Python 3

File details

Details for the file words_segmentation-0.0.1.tar.gz.

File metadata

  • Download URL: words_segmentation-0.0.1.tar.gz
  • Upload date:
  • Size: 11.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for words_segmentation-0.0.1.tar.gz
Algorithm Hash digest
SHA256 ce3e5033322b0951d4b714f5b862b3e6469829f438d4a66126a0ed48e50b0b91
MD5 8f8e7c259fe45023ad5284376cca2ea3
BLAKE2b-256 efc4517be058f38ab00d06ef9d51e32fa72a38d095b7159335a9350121bb7814


Provenance

The following attestation bundles were made for words_segmentation-0.0.1.tar.gz:

Publisher: release.yaml on sign/words-segmentation

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file words_segmentation-0.0.1-py3-none-any.whl.

File metadata

File hashes

Hashes for words_segmentation-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 e2a3c43eaa9d7ba6603dcc49c4c830f21d380fc6921df14e6df2289c4968b445
MD5 24efe1d65d8af7802acefe84e54007bc
BLAKE2b-256 e7a93090ca892e59cbcea21922bc4f99d37bc47500f9d6db2575e0b168cf28c5


Provenance

The following attestation bundles were made for words_segmentation-0.0.1-py3-none-any.whl:

Publisher: release.yaml on sign/words-segmentation

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
