Sinhala NLP Toolkit
Sinlib
A Python toolkit for Sinhala natural language processing — phonological tokenization, spell checking, and text preprocessing.
Note: The Romanizer and Transliterator modules are temporarily unavailable due to a known bug and will be restored in a future release.
Installation
pip install sinlib
Quick Start
Tokenization
from sinlib import Tokenizer
tokenizer = Tokenizer.from_pretrained("Ransaka/sinlib")
# Split into phonological units (base consonant + diacritics)
tokens = tokenizer.tokenize("ආයුබෝවන්")
# ['ආ', 'යු', 'බෝ', 'ව', 'න්']
# Encode to integer IDs
encoding = tokenizer("ආයුබෝවන්")
encoding.input_ids # [4, 23, 18, 7, 12]
encoding.attention_mask # [1, 1, 1, 1, 1]
# Batch encode with padding
batch = tokenizer(["ආයුබෝවන්", "සිංහල"], padding=True)
batch.input_ids # [[4, 23, 18, 7, 12], [9, 31, 6, 0, 0]]
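To make the padded batch above easier to read, here is a minimal sketch of right-padding and the attention mask that goes with it. `pad_batch` is a hypothetical helper written for illustration, not part of sinlib, and the padding ID of 0 is an assumption inferred from the trailing zeros in the output above.

```python
from typing import Dict, List


def pad_batch(sequences: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Right-pad ID sequences to the longest length and build attention masks.

    Hypothetical helper for illustration; pad_id=0 is an assumption.
    """
    max_len = max((len(seq) for seq in sequences), default=0)
    # Append pad_id until every sequence has max_len entries.
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks a padding position.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[4, 23, 18, 7, 12], [9, 31, 6]])
print(batch["input_ids"])       # [[4, 23, 18, 7, 12], [9, 31, 6, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

A downstream model would typically use the mask to ignore the padded positions.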
Spell Checking
from sinlib import TypoDetector
detector = TypoDetector.from_pretrained("Ransaka/sinlib")
# Auto-correct a sentence
detector("අපකරියට ගිය")
# 'අපකීර්තියට ගිය'
# Get correction suggestions
detector.suggest_correction("අඩිරාජ")
# ['අධිරාජ']
Preprocessing
from sinlib import preprocessing
# Remove noise and normalise text
clean = preprocessing.process_text("Hello, මේ සිංහල වාක්යකි.")
# Compute Sinhala character ratio
ratio = preprocessing.get_sinhala_character_ratio(["මෙය සිංහල වාක්යක්"])
# [0.9]
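One plausible way to define a Sinhala character ratio is the share of characters that fall in the Unicode Sinhala block (U+0D80 to U+0DFF). The sketch below follows that definition. `sinhala_ratio` and `sinhala_ratios` are hypothetical names, and counting whitespace and punctuation in the denominator is an assumption; sinlib's own implementation may differ.

```python
from typing import List

# Unicode Sinhala block boundaries.
SINHALA_START, SINHALA_END = 0x0D80, 0x0DFF


def sinhala_ratio(text: str) -> float:
    """Fraction of characters in `text` that lie in the Sinhala block.

    Assumption: every character, including spaces, counts in the denominator.
    """
    if not text:
        return 0.0
    sinhala = sum(1 for ch in text if SINHALA_START <= ord(ch) <= SINHALA_END)
    return sinhala / len(text)


def sinhala_ratios(texts: List[str]) -> List[float]:
    """Apply sinhala_ratio to each text, rounded to two decimals."""
    return [round(sinhala_ratio(t), 2) for t in texts]


print(sinhala_ratios(["Hello", "සිංහල", "ab සි"]))  # [0.0, 1.0, 0.4]
```

A ratio like this can serve as a cheap filter for dropping mostly non-Sinhala lines from a corpus.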
Why phonological tokenization?
Sinhala script combines a base consonant with one or more diacritics (vowel signs or the al-lakuna vowel-suppressing mark) into a single phonetic unit. Splitting text into individual Unicode code points breaks these units apart, producing incorrect representations for downstream tasks such as ASR and TTS.
"ආයුබෝවන්"
Sinlib → ['ආ', 'යු', 'බෝ', 'ව', 'න්'] ✓ phonological units
Unicode → ['ආ', 'ය', 'ු', 'බ', 'ෝ', 'ව', 'න', '්'] ✗ raw code points
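The grouping shown above can be approximated with the standard library alone by attaching each combining mark to the character before it. This is a minimal sketch of the idea, not sinlib's tokenizer: `group_into_units` is a hypothetical name, and the sketch does not handle conjuncts formed with a zero-width joiner.

```python
import unicodedata
from typing import List


def group_into_units(text: str) -> List[str]:
    """Group each base character with the combining marks that follow it.

    Illustrative only; conjuncts joined with U+200D are not handled.
    """
    units: List[str] = []
    for ch in text:
        # Unicode categories Mn, Mc and Me are combining marks, which
        # covers Sinhala vowel signs and the al-lakuna.
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch
        else:
            units.append(ch)
    return units


word = "ආයුබෝවන්"
print(list(word))              # raw code points
print(group_into_units(word))  # base character + diacritics
```

For this greeting the second call yields the same five units as the Sinlib line in the comparison, while `list(word)` yields the eight raw code points.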
The vocabulary and model weights are fetched automatically from Ransaka/sinlib on the Hugging Face Hub on first use, so no manual setup is required.
Documentation
Full documentation is available at sinlib.readthedocs.io.
Contributing
Contributions are welcome. Please open an issue or submit a pull request on GitHub.
- Fork the repository
- Create a feature branch (git checkout -b feature/my-feature)
- Commit your changes (git commit -m 'Add my feature')
- Push to the branch (git push origin feature/my-feature)
- Open a Pull Request
License
MIT License — see the LICENSE file for details.