Skip to main content

Sinhala NLP Toolkit

Project description

SINLIB

Sinlib Logo

PyPI version Python Versions License: MIT

A Python library for Sinhala text processing and analysis

Note: The Romanizer and Transliterator modules are temporarily unavailable due to a potential bug. We are working to resolve this and will restore them in a future update.

Overview

Sinlib is a specialized Python library designed for processing and analyzing Sinhala text. It provides tools for tokenization, preprocessing, and romanization to facilitate natural language processing tasks for the Sinhala language.

Features

  • Tokenizer: Tokenization for Sinhala text
  • Preprocessor: Text preprocessing utilities including Sinhala character ratio analysis
  • Romanizer: Convert Sinhala text to Roman characters

Installation

Install the latest stable version from PyPI:

pip install sinlib

Usage Examples

Tokenizer

Split Sinhala text into meaningful tokens:

from sinlib import Tokenizer

# Sample Sinhala text
corpus = """මේ අතර, පෙබරවාරි මාසයේ පළමු දින 08 තුළ පමණක් විදෙස් සංචාරකයන් 60,122 දෙනෙකු මෙරටට පැමිණ තිබේ.
ඒ අනුව මේ වසරේ ගත වූ කාලය තුළ සංචාරකයන් 268‍,375 දෙනෙකු දිවයිනට පැමිණ ඇති බව සංචාරක සංවර්ධන අධිකාරිය සඳහන් කරයි.
ඉන් වැඩි ම සංචාරකයන් පිරිසක් ඉන්දියාවෙන් පැමිණ ඇති අතර, එම සංඛ්‍යාව 42,768කි.
ඊට අමතර ව රුසියාවෙන් සංචාරකයන් 39,914ක්, බ්‍රිතාන්‍යයෙන් 22,278ක් සහ ජර්මනියෙන් සංචාරකයන් 18,016 දෙනෙකු පැමිණ ඇති බව වාර්තා වේ."""

# Initialize and train the tokenizer
tokenizer = Tokenizer()
tokenizer.train([corpus])

# Encode text into tokens
encoding = tokenizer("මේ අතර, පෙබරවාරි මාසයේ පළමු")

# List tokens
tokens = [tokenizer.token_id_to_token_map[id] for id in encoding]
print(tokens)
# Output: ['මේ', ' ', 'අ', 'ත', 'ර', ',', ' ', 'පෙ', 'බ', 'ර', 'වා', 'රි', ' ', 'මා', 'ස', 'යේ', ' ', 'ප', 'ළ', 'මු']

Preprocessor

Analyze Sinhala character ratio in text:

from sinlib.preprocessing import get_sinhala_character_ratio

# Sample sentences with varying Sinhala content
sentences = [
    'මෙය සිංහල වාක්‍යක්',                                  # Full Sinhala
    'මෙය සිංහල වාක්‍යක් සමග english character කීපයක්',     # Mixed Sinhala and English
    'This is a complete English sentence'                   # Full English
]

# Calculate Sinhala character ratio for each sentence
ratios = get_sinhala_character_ratio(sentences)
print(ratios)
# Output: [0.9, 0.46875, 0.0]

Spell Checker (beta)

Detect typos and get spelling suggestions for Sinhala words using n gram models:

from sinlib.spellcheck import TypoDetector

# Initialize the typo detector
typo_detector = TypoDetector()

# Check spelling of a word
result = typo_detector("අඩිරාජයාගේ")
print(result) # ['අධිරාජයාගේ', 'අධිරාජ්\u200dයයාගේ', 'අධිරාජයා']
# Output: Either the word itself if correct, or a list of suggestions if it's a potential typo

Romanizer

Convert Sinhala text to Roman characters:

from sinlib import Romanizer

# Sample texts with Sinhala content
texts = [
    "hello, මේ මාසයේ ගත වූ දින 15ක කාලය තුළ කොළඹ නගරය ආශ්‍රිත ව",
    "මෑතකාලීන ව රට මුහුණ දුන් අභියෝගාත්මකම ආර්ථික කාරණාව ණය ප්‍රතිව්‍යුගතකරණය බව"
]

# Initialize the romanizer
romanizer = Romanizer(char_mapper_fp=None, tokenizer_vocab_path=None)

# Romanize the texts
romanized_texts = romanizer(texts)
print(romanized_texts)
# Output:
# ['hello, me masaye gatha wu dina 15ka kalaya thula kolaba nagaraya ashritha wa',
#  'methakaleena wa rata muhuna dun abhiyogathmakama arthika karanawa naya prathiwyugathakaranaya bawa']

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgements

  • Thanks to all contributors who have helped with the development of Sinlib
  • Special thanks to the Sinhala NLP community for their support and feedback

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sinlib-0.1.12.tar.gz (4.4 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

sinlib-0.1.12-py3-none-any.whl (4.2 MB view details)

Uploaded Python 3

File details

Details for the file sinlib-0.1.12.tar.gz.

File metadata

  • Download URL: sinlib-0.1.12.tar.gz
  • Upload date:
  • Size: 4.4 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for sinlib-0.1.12.tar.gz
Algorithm Hash digest
SHA256 31d26c140624a6fcab05647a460c84874cd13a3d6d6c5c29a8001ab39106d89e
MD5 7157e82e96d4af2fc74b92a106dd9ef6
BLAKE2b-256 2927d311cb965379eaa9948a9484a27c00d4214a0f12dfbff142ede1630b3717

See more details on using hashes here.

File details

Details for the file sinlib-0.1.12-py3-none-any.whl.

File metadata

  • Download URL: sinlib-0.1.12-py3-none-any.whl
  • Upload date:
  • Size: 4.2 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for sinlib-0.1.12-py3-none-any.whl
Algorithm Hash digest
SHA256 a47cfccfead7530bcc48d7ea3ae1d7c63b52097175d01526f5d3c2b1cbd85c33
MD5 8dff6ec72e108e189fa6cab7abd85658
BLAKE2b-256 5cb06fe62d7cab8889cf1db5a71be1117e8c773d915f3cf414859f7bc11539f4

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page