botok·PyPI

Tibetan Word Tokenizer

Botok – Python Tibetan Tokenizer

Description • Key Features • Installation • Basic Usage • Advanced Usage • Documentation • Development • Contributing • Acknowledgements

Description

Botok is a Python library for tokenizing Tibetan text. It segments text into words and can attach optional attributes to each token, such as its lemma, part-of-speech (POS) tag, and clean form. It accepts both strings and files, supports custom dialects, and offers several tokenization modes (word, chunk, and space-based) for Tibetan Natural Language Processing (NLP).

Key Features

Word Segmentation: Accurate word segmentation with support for affixed particles
Multiple Tokenization Modes:
- Word tokenization
- Chunk tokenization (groups of meaningful characters)
- Space-based tokenization
Rich Token Attributes:
- Lemmatization
- POS tagging
- Clean form generation
Custom Dialect Support: Use pre-configured dialects or create your own
File Processing: Process both strings and files with automatic output generation
Robust Handling: Manages complex cases like double tseks and spaces within words

Installation

Requirements

Python 3.6 or higher
pip package manager

Basic Installation

pip install botok

Development Installation

git clone https://github.com/OpenPecha/botok.git
cd botok
pip install -e .

Basic Usage

Simple Word Tokenization

from botok import WordTokenizer
from botok.config import Config
from pathlib import Path

# Initialize tokenizer with default configuration
config = Config(dialect_name="general", base_path=Path.home())
wt = WordTokenizer(config=config)

# Tokenize text
text = "བཀྲ་ཤིས་བདེ་ལེགས་ཞུས་རྒྱུ་ཡིན་ སེམས་པ་སྐྱིད་པོ་འདུག།"
tokens = wt.tokenize(text, split_affixes=False)

# Print each token
for token in tokens:
    print(token)

File Processing

from botok import Text
from pathlib import Path

# Process a file
input_file = Path("input.txt")
t = Text(input_file)
# tokenize_chunks_plaintext is accessed as an attribute (no parentheses);
# for file input it creates input_pybo.txt with the tokenized output
t.tokenize_chunks_plaintext
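
The example above states that the output is written to input_pybo.txt. As a minimal sketch of that naming convention, assuming the output file sits next to the input and only gains a `_pybo` suffix before the extension, the hypothetical helper below (it is not part of botok) derives the path where you would expect to find the result:

```python
from pathlib import Path


def pybo_output_path(input_file: Path) -> Path:
    """Guess the output path for an input file (hypothetical helper).

    Assumption: the output keeps the input's directory and extension
    and appends "_pybo" to the file stem, as in input.txt -> input_pybo.txt.
    """
    return input_file.with_name(f"{input_file.stem}_pybo{input_file.suffix}")


if __name__ == "__main__":
    print(pybo_output_path(Path("input.txt")))
```

This can be handy in scripts that need to read the tokenized file back in after processing.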

Advanced Usage

Custom Dialect Configuration

from botok import WordTokenizer
from botok.config import Config
from pathlib import Path

# Configure custom dialect
config = Config(
    dialect_name="custom",
    base_path=Path.home() / "my_dialects"
)

# Initialize tokenizer with custom config
wt = WordTokenizer(config=config)

# Process text with custom settings
text = "བཀྲ་ཤིས་བདེ་ལེགས།"
tokens = wt.tokenize(
    text,
    split_affixes=True,
    pos_tagging=True,
    lemmatize=True
)

Different Tokenization Modes

from botok import Text

text = """ལེ གས། བཀྲ་ཤིས་མཐའི་ ༆ ཤི་བཀྲ་ཤིས་"""
t = Text(text)

# 1. Word tokenization
words = t.tokenize_words_raw_text

# 2. Chunk tokenization (groups of meaningful characters)
chunks = t.tokenize_chunks_plaintext

# 3. Space-based tokenization
spaces = t.tokenize_on_spaces
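
To make the difference between the modes more concrete, here is a plain-Python sketch that does not use botok and is not its algorithm. It only contrasts splitting on whitespace with splitting on the tsek (U+0F0B), the mark that separates Tibetan syllables. The helper names `split_on_spaces` and `split_on_tsek` are hypothetical.

```python
from typing import List

TSEK = "\u0f0b"  # Tibetan intersyllabic mark (tsek)


def split_on_spaces(text: str) -> List[str]:
    """Split on runs of whitespace only; tseks stay inside the pieces."""
    return text.split()


def split_on_tsek(text: str) -> List[str]:
    """Split at every tsek and drop empty or whitespace-only pieces."""
    return [piece for piece in text.split(TSEK) if piece.strip()]


if __name__ == "__main__":
    sample = "ལེ གས། བཀྲ་ཤིས་མཐའི་ ༆ ཤི་བཀྲ་ཤིས་"
    print(split_on_spaces(sample))
    print(split_on_tsek(sample))
```

Note that the sample contains a space inside a word, so naive whitespace splitting cuts that word in two. Cases like this are why a dedicated word tokenizer is needed.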

Documentation

For comprehensive documentation, visit:

ReadTheDocs - Full API documentation
Wiki - Guides and tutorials
Examples - Code examples

Development

Building from Source

rm -rf dist/
python setup.py clean sdist

Publishing to PyPI

Automated Publishing with Semantic Versioning

The repository uses GitHub Actions to bump the version and publish to PyPI automatically whenever changes are pushed to the master branch. The workflow derives the next semantic version from commit messages, so use the following commit message formats:

fix: your message - For bug fixes (triggers PATCH version bump)
feat: your message - For new features (triggers MINOR version bump)
Add BREAKING CHANGE: description in the commit body for breaking changes (triggers MAJOR version bump)

Examples:

# This will trigger a PATCH version bump (e.g., 0.8.12 → 0.8.13)
fix: improve test coverage to 90% and fix Python 3.12 compatibility

# This will trigger a MINOR version bump (e.g., 0.8.12 → 0.9.0)
feat: add new sentence tokenization mode for complex Tibetan sentences

# This will trigger a MAJOR version bump (e.g., 0.8.12 → 1.0.0)
feat: refactor token attributes structure

BREAKING CHANGE: Token.attributes now uses a dictionary format instead of properties, requiring changes to code that accesses token attributes directly

When you push to the master branch, the CI workflow will:
- Run all tests across multiple Python versions
- Analyze commit messages to determine the next version number
- Update version numbers in the code
- Create a new release on GitHub
- Publish the package to PyPI
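
The mapping from commit messages to version bumps described above can be sketched in a few lines of Python. This is an illustration of the rules as stated here, not the code the CI workflow actually runs; `bump_kind` and `bump_version` are hypothetical names.

```python
from typing import Optional


def bump_kind(message: str) -> Optional[str]:
    """Return 'major', 'minor', 'patch', or None for a commit message."""
    header, _, body = message.partition("\n")
    if "BREAKING CHANGE:" in body:
        return "major"  # breaking change declared in the commit body
    if header.startswith("feat:"):
        return "minor"  # new feature
    if header.startswith("fix:"):
        return "patch"  # bug fix
    return None  # no release-relevant prefix


def bump_version(version: str, kind: Optional[str]) -> str:
    """Apply a bump kind to a MAJOR.MINOR.PATCH version string."""
    major, minor, patch = (int(part) for part in version.split("."))
    if kind == "major":
        return f"{major + 1}.0.0"
    if kind == "minor":
        return f"{major}.{minor + 1}.0"
    if kind == "patch":
        return f"{major}.{minor}.{patch + 1}"
    return version


if __name__ == "__main__":
    message = "feat: add new sentence tokenization mode"
    print(bump_version("0.8.12", bump_kind(message)))
```

A breaking change takes precedence over the `feat:` or `fix:` prefix, which matches the MAJOR example above where the header starts with `feat:`.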

Manual Publishing

To publish manually, first build the distribution as described under Building from Source, then upload it:

twine upload dist/*

Running Tests

pytest tests/

Contributing

We welcome contributions! Here's how you can help:

Fork the repository
Create your feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

Please ensure your PR adheres to:

Code style guidelines
Test coverage requirements
Documentation standards

Acknowledgements

botok is an open source library for Tibetan NLP. We are grateful to our sponsors and contributors:

Contributors

Drupchen - Core development
Élie Roux - Architecture and development
Ngawang Trinley - Project management
Mikko Kotila - Development
Thubten Rinzin - Testing and documentation
Tenzin - Development
Joyce Mackzenzie - Logo design

Release history

Version	Release date
0.9.0 (this version)	Mar 9, 2025
0.8.12	May 17, 2023
0.8.11	May 11, 2023
0.8.10	Apr 5, 2022
0.8.8	Oct 12, 2021
0.8.7	Jun 21, 2021
0.8.6	May 20, 2021
0.8.5	Apr 15, 2021
0.8.4	Apr 14, 2021
0.8.3	Mar 29, 2021
0.8.2	Mar 22, 2021
0.8.1	Jul 28, 2020
0.7.5	Dec 30, 2019
0.7.4	Dec 15, 2019
0.7.3	Dec 12, 2019
0.7.2	Dec 12, 2019
0.7.1	Dec 11, 2019
0.7.0	Dec 10, 2019
0.6.18	Nov 21, 2019
0.6.17	Nov 7, 2019
0.6.16	Nov 7, 2019
0.6.15	Nov 6, 2019
0.6.14	Nov 5, 2019
0.6.13	Nov 1, 2019
0.6.12	Oct 7, 2019
0.6.11	Oct 4, 2019
0.6.10	Sep 12, 2019
0.6.9	Sep 1, 2019

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

botok-0.9.0.tar.gz (68.9 kB)

Uploaded Mar 9, 2025 Source

Built Distribution

botok-0.9.0-py3-none-any.whl (79.9 kB)

Uploaded Mar 9, 2025 Python 3

File details

Details for the file botok-0.9.0.tar.gz.

File metadata

Download URL: botok-0.9.0.tar.gz
Upload date: Mar 9, 2025
Size: 68.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.8.0 pkginfo/1.12.1.2 readme-renderer/44.0 requests/2.32.3 requests-toolbelt/1.0.0 urllib3/2.3.0 tqdm/4.67.1 importlib-metadata/8.6.1 keyring/25.6.0 rfc3986/2.0.0 colorama/0.4.6 CPython/3.10.16

File hashes

Hashes for botok-0.9.0.tar.gz
Algorithm	Hash digest
SHA256	`afd10d38af6c45c74b0bcb4e428b9a26915d4ca2062becf36429e36b44616ad4`
MD5	`a66f123bcff7d16f4c5043d84e870027`
BLAKE2b-256	`3a94debb1619b0129a224d91edb62146589ee3b3de31a570dfa81b320999d116`


File details

Details for the file botok-0.9.0-py3-none-any.whl.

File metadata

Download URL: botok-0.9.0-py3-none-any.whl
Upload date: Mar 9, 2025
Size: 79.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.8.0 pkginfo/1.12.1.2 readme-renderer/44.0 requests/2.32.3 requests-toolbelt/1.0.0 urllib3/2.3.0 tqdm/4.67.1 importlib-metadata/8.6.1 keyring/25.6.0 rfc3986/2.0.0 colorama/0.4.6 CPython/3.10.16

File hashes

Hashes for botok-0.9.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ef23f16abdeda1a4e34289cafd08da21d0e4d0b1cdd41571e286ad059021196e`
MD5	`4b77ebcb0c8b81546e99dfe9f95e0781`
BLAKE2b-256	`af55110acad86e1faa0cc7ed83806599a5a4f75de05250f34f3801891285c02d`
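
The published digests can be used to check a downloaded file before installing it. The following is a minimal sketch using only the standard library; the helper names are hypothetical, the SHA256 values are copied from the tables above, and the files are assumed to be in the current directory.

```python
import hashlib
from pathlib import Path

# SHA256 digests copied from the hash tables above.
PUBLISHED_SHA256 = {
    "botok-0.9.0.tar.gz": "afd10d38af6c45c74b0bcb4e428b9a26915d4ca2062becf36429e36b44616ad4",
    "botok-0.9.0-py3-none-any.whl": "ef23f16abdeda1a4e34289cafd08da21d0e4d0b1cdd41571e286ad059021196e",
}


def sha256_hex(path: Path, chunk_size: int = 65536) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with an expected hex string (case-insensitive)."""
    return sha256_hex(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    for name, expected in PUBLISHED_SHA256.items():
        file_path = Path(name)
        if file_path.exists():
            status = "OK" if matches_published_hash(file_path, expected) else "MISMATCH"
            print(name, status)
```

Reading in chunks keeps memory use constant regardless of file size, which matters little for files this small but makes the helper reusable.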

