Sinhala NLP Toolkit

Sinlib

A Python toolkit for Sinhala natural language processing — phonological tokenization, spell checking, and text preprocessing.

Note: The Romanizer and Transliterator modules are temporarily unavailable due to a known bug and will be restored in a future release.

Installation

pip install sinlib

Quick Start

Tokenization

from sinlib import Tokenizer

tokenizer = Tokenizer.from_pretrained("Ransaka/sinlib")

# Split into phonological units (base consonant + diacritics)
tokens = tokenizer.tokenize("ආයුබෝවන්")
# ['ආ', 'යු', 'බෝ', 'ව', 'න්']

# Encode to integer IDs
encoding = tokenizer("ආයුබෝවන්")
encoding.input_ids       # [4, 23, 18, 7, 12]
encoding.attention_mask  # [1, 1, 1, 1, 1]

# Batch encode with padding
batch = tokenizer(["ආයුබෝවන්", "සිංහල"], padding=True)
batch.input_ids  # [[4, 23, 18, 7, 12], [9, 31, 6, 0, 0]]
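
The padded batch above can be read as follows: shorter sequences are right-padded to the length of the longest one, and the attention mask marks real tokens with 1 and padding with 0. The sketch below reproduces that behaviour in plain Python. The helper name `pad_batch` and the `pad_id=0` default are assumptions inferred from the example output, not part of the sinlib API.

```python
from typing import List, Tuple


def pad_batch(
    sequences: List[List[int]], pad_id: int = 0
) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad ID sequences to a common length and build attention masks.

    Hypothetical helper for illustration only; pad_id=0 is an assumption.
    """
    max_len = max((len(seq) for seq in sequences), default=0)
    # Append pad_id until every sequence has max_len entries.
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return input_ids, attention_mask


ids, mask = pad_batch([[4, 23, 18, 7, 12], [9, 31, 6]])
print(ids)   # [[4, 23, 18, 7, 12], [9, 31, 6, 0, 0]]
print(mask)  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

Models that consume the batch can use the mask to ignore the padded positions.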

Spell Checking

from sinlib import TypoDetector

detector = TypoDetector.from_pretrained("Ransaka/sinlib")

# Auto-correct a sentence
detector("අපකරියට ගිය")
# 'අපකීර්තියට ගිය'

# Get correction suggestions
detector.suggest_correction("අඩිරාජ")
# ['අධිරාජ']

Preprocessing

from sinlib import preprocessing

# Remove noise and normalise text
clean = preprocessing.process_text("Hello, මේ සිංහල වාක්‍යකි.")

# Compute Sinhala character ratio
ratio = preprocessing.get_sinhala_character_ratio(["මෙය සිංහල වාක්‍යක්"])
# [0.9]
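
To make the idea of a character ratio concrete, here is a minimal sketch that counts how many non-whitespace characters fall inside the Unicode Sinhala block (U+0D80 to U+0DFF). The function names and the decision to ignore whitespace are assumptions made for illustration; sinlib's own definition may differ, so the numbers need not match `get_sinhala_character_ratio`.

```python
from typing import List

# Unicode Sinhala block boundaries.
SINHALA_START, SINHALA_END = 0x0D80, 0x0DFF


def sinhala_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that lie in the Sinhala block.

    Hypothetical helper; not the sinlib implementation.
    """
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    sinhala = sum(1 for ch in chars if SINHALA_START <= ord(ch) <= SINHALA_END)
    return sinhala / len(chars)


def sinhala_ratios(texts: List[str]) -> List[float]:
    """Apply sinhala_ratio to each string in a list."""
    return [sinhala_ratio(t) for t in texts]


print(sinhala_ratios(["abc", "සිංහල", "ab සි"]))  # [0.0, 1.0, 0.5]
```

A ratio like this is a common way to filter mixed-language corpora: keep only the lines above a chosen threshold.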

Why phonological tokenization?

Sinhala script combines a base consonant with one or more vowel diacritics into a single phonetic unit. Splitting text into individual Unicode code points breaks these units apart, which produces incorrect representations for downstream tasks such as ASR and TTS.

"ආයුබෝවන්"

Sinlib  →  ['ආ', 'යු', 'බෝ', 'ව', 'න්']   ✓ phonological units
Unicode →  ['ආ', 'ය', 'ු', 'බ', 'ෝ', 'ව', 'න', '්']   ✗ raw code points
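
The difference between the two outputs above can be illustrated with the standard library alone. The sketch below attaches every Unicode combining mark (general category starting with `M`, which covers Sinhala vowel signs and the hal kirima) to the character before it. It is a simplified approximation for illustration, not sinlib's algorithm: it does not handle conjuncts joined with a zero-width joiner, and `split_clusters` is a hypothetical name.

```python
import unicodedata
from typing import List


def split_clusters(text: str) -> List[str]:
    """Group each base character with the combining marks that follow it.

    Simplified sketch; not the sinlib tokenizer.
    """
    clusters: List[str] = []
    for ch in text:
        # Categories Mn/Mc/Me are combining marks: attach to the previous unit.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


word = "ආයුබෝවන්"
print(list(word))            # raw code points, diacritics separated
print(split_clusters(word))  # ['ආ', 'යු', 'බෝ', 'ව', 'න්']
```

For real workloads, prefer `Tokenizer.tokenize`, which also maps the units to vocabulary IDs.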

The vocabulary and model weights are downloaded automatically from Ransaka/sinlib on the Hugging Face Hub on first use, so no manual setup is required.

Documentation

Full documentation is available at sinlib.readthedocs.io.

Contributing

Contributions are welcome. Please open an issue or submit a pull request on GitHub.

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/my-feature)
  3. Commit your changes (git commit -m 'Add my feature')
  4. Push to the branch (git push origin feature/my-feature)
  5. Open a Pull Request

License

MIT License — see the LICENSE file for details.
