Sinhala NLP Toolkit
Project description
SINLIB
Overview
Sinlib is a specialized Python library designed for processing and analyzing Sinhala text. It provides tools for tokenization, preprocessing, and romanization to facilitate natural language processing tasks for the Sinhala language.
Features
- Tokenizer: Tokenization for Sinhala text
- Preprocessor: Text preprocessing utilities including Sinhala character ratio analysis
- Romanizer: Convert Sinhala text to Roman characters
Installation
Install the latest stable version from PyPI:
pip install sinlib
Usage Examples
Tokenizer
Split Sinhala text into meaningful tokens:
from sinlib import Tokenizer
# Sample Sinhala text
corpus = """මේ අතර, පෙබරවාරි මාසයේ පළමු දින 08 තුළ පමණක් විදෙස් සංචාරකයන් 60,122 දෙනෙකු මෙරටට පැමිණ තිබේ.
ඒ අනුව මේ වසරේ ගත වූ කාලය තුළ සංචාරකයන් 268,375 දෙනෙකු දිවයිනට පැමිණ ඇති බව සංචාරක සංවර්ධන අධිකාරිය සඳහන් කරයි.
ඉන් වැඩි ම සංචාරකයන් පිරිසක් ඉන්දියාවෙන් පැමිණ ඇති අතර, එම සංඛ්යාව 42,768කි.
ඊට අමතර ව රුසියාවෙන් සංචාරකයන් 39,914ක්, බ්රිතාන්යයෙන් 22,278ක් සහ ජර්මනියෙන් සංචාරකයන් 18,016 දෙනෙකු පැමිණ ඇති බව වාර්තා වේ."""
# Initialize and train the tokenizer
tokenizer = Tokenizer()
tokenizer.train([corpus])
# Encode text into tokens
encoding = tokenizer("මේ අතර, පෙබරවාරි මාසයේ පළමු")
# List tokens
tokens = [tokenizer.token_id_to_token_map[id] for id in encoding]
print(tokens)
# Output: ['මේ', ' ', 'අ', 'ත', 'ර', ',', ' ', 'පෙ', 'බ', 'ර', 'වා', 'රි', ' ', 'මා', 'ස', 'යේ', ' ', 'ප', 'ළ', 'මු']
Preprocessor
Analyze Sinhala character ratio in text:
from sinlib.preprocessing import get_sinhala_character_ratio
# Sample sentences with varying Sinhala content
sentences = [
'මෙය සිංහල වාක්යක්', # Full Sinhala
'මෙය සිංහල වාක්යක් සමග english character කීපයක්', # Mixed Sinhala and English
'This is a complete English sentence' # Full English
]
# Calculate Sinhala character ratio for each sentence
ratios = get_sinhala_character_ratio(sentences)
print(ratios)
# Output: [0.9, 0.46875, 0.0]
Spell Checker (beta)
Detect typos and get spelling suggestions for Sinhala words using n gram models:
from sinlib.spellcheck import TypoDetector
# Initialize the typo detector
typo_detector = TypoDetector()
# Check spelling of a word
result = typo_detector.check_spelling("අඩිරාජයාගේ")
print(result) # ['අධිරාජයාගේ', 'අධිරාජ්\u200dයයාගේ', 'අධිරාජයා']
# Output: Either the word itself if correct, or a list of suggestions if it's a potential typo
Romanizer
Convert Sinhala text to Roman characters:
from sinlib import Romanizer
# Sample texts with Sinhala content
texts = [
"hello, මේ මාසයේ ගත වූ දින 15ක කාලය තුළ කොළඹ නගරය ආශ්රිත ව",
"මෑතකාලීන ව රට මුහුණ දුන් අභියෝගාත්මකම ආර්ථික කාරණාව ණය ප්රතිව්යුගතකරණය බව"
]
# Initialize the romanizer
romanizer = Romanizer(char_mapper_fp=None, tokenizer_vocab_path=None)
# Romanize the texts
romanized_texts = romanizer(texts)
print(romanized_texts)
# Output:
# ['hello, me masaye gatha wu dina 15ka kalaya thula kolaba nagaraya ashritha wa',
# 'methakaleena wa rata muhuna dun abhiyogathmakama arthika karanawa naya prathiwyugathakaranaya bawa']
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (
git checkout -b feature/amazing-feature
) - Commit your changes (
git commit -m 'Add some amazing feature'
) - Push to the branch (
git push origin feature/amazing-feature
) - Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgements
- Thanks to all contributors who have helped with the development of Sinlib
- Special thanks to the Sinhala NLP community for their support and feedback
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file sinlib-0.1.9.3.tar.gz
.
File metadata
- Download URL: sinlib-0.1.9.3.tar.gz
- Upload date:
- Size: 4.4 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.22
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 |
07517e612395090054d6f01ac119fef65aeb8ec4e8111add5067dcb149d56b5b
|
|
MD5 |
b6d370d35ac75a6ab03b7a10514233cb
|
|
BLAKE2b-256 |
b39b41147aded4ab8cb51c6ea425ad48e80aacf5bb5f3bf402de27fd6fb08e82
|
File details
Details for the file sinlib-0.1.9.3-py3-none-any.whl
.
File metadata
- Download URL: sinlib-0.1.9.3-py3-none-any.whl
- Upload date:
- Size: 4.2 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.22
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 |
aabc2d011cf85c74de6d64f1355e9bc77d9273adc3e9601181e75783bc0218ca
|
|
MD5 |
9b1c60162b13bba90b66c9305c5b5a9a
|
|
BLAKE2b-256 |
e6588226bb077f87ffd275c02315bfa21f22452b6c31972be256acb2382ab2ce
|