
Sinhala NLP Toolkit


SINLIB



A Python library for Sinhala text processing and analysis

Overview

Sinlib is a Python library for processing and analyzing Sinhala text. It provides tools for tokenization, preprocessing, spell checking (beta), and romanization to support natural language processing tasks for the Sinhala language.

Features

  • Tokenizer: Tokenization for Sinhala text
  • Preprocessor: Text preprocessing utilities, including Sinhala character ratio analysis
  • Spell Checker (beta): Typo detection and spelling suggestions for Sinhala words
  • Romanizer: Convert Sinhala text to Roman characters

Installation

Install the latest stable version from PyPI:

pip install sinlib
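
To confirm that the install succeeded, the installed version can be read from the package metadata. The snippet below is a minimal sketch that uses only the Python standard library (Python 3.8+); the helper name `installed_version` is hypothetical and not part of Sinlib.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing.

    Hypothetical helper for illustration; it is not part of Sinlib.
    """
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints the installed Sinlib version, or None if it is not installed
print(installed_version("sinlib"))
```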

Usage Examples

Tokenizer

Split Sinhala text into meaningful tokens:

from sinlib import Tokenizer

# Sample Sinhala text
corpus = """මේ අතර, පෙබරවාරි මාසයේ පළමු දින 08 තුළ පමණක් විදෙස් සංචාරකයන් 60,122 දෙනෙකු මෙරටට පැමිණ තිබේ.
ඒ අනුව මේ වසරේ ගත වූ කාලය තුළ සංචාරකයන් 268‍,375 දෙනෙකු දිවයිනට පැමිණ ඇති බව සංචාරක සංවර්ධන අධිකාරිය සඳහන් කරයි.
ඉන් වැඩි ම සංචාරකයන් පිරිසක් ඉන්දියාවෙන් පැමිණ ඇති අතර, එම සංඛ්‍යාව 42,768කි.
ඊට අමතර ව රුසියාවෙන් සංචාරකයන් 39,914ක්, බ්‍රිතාන්‍යයෙන් 22,278ක් සහ ජර්මනියෙන් සංචාරකයන් 18,016 දෙනෙකු පැමිණ ඇති බව වාර්තා වේ."""

# Initialize and train the tokenizer
tokenizer = Tokenizer()
tokenizer.train([corpus])

# Encode text into tokens
encoding = tokenizer("මේ අතර, පෙබරවාරි මාසයේ පළමු")

# List tokens
tokens = [tokenizer.token_id_to_token_map[token_id] for token_id in encoding]
print(tokens)
# Output: ['මේ', ' ', 'අ', 'ත', 'ර', ',', ' ', 'පෙ', 'බ', 'ර', 'වා', 'රි', ' ', 'මා', 'ස', 'යේ', ' ', 'ප', 'ළ', 'මු']
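
The tokens in the output above each pair a base letter with the vowel signs that follow it. As a rough illustration of that kind of segmentation, here is a minimal sketch that groups Unicode combining marks with the preceding character, using only the standard library. It is an assumption-laden approximation, not Sinlib's implementation, and the function name `split_clusters` is hypothetical.

```python
import unicodedata

ZWJ = "\u200d"  # zero-width joiner, used in Sinhala conjunct forms


def split_clusters(text):
    """Split text into pieces of one base character plus its trailing marks.

    Hypothetical helper for illustration; it is not Sinlib's tokenizer.
    """
    clusters = []
    join_next = False  # True right after a ZWJ, so the next character is attached
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if clusters and (is_mark or ch == ZWJ or join_next):
            clusters[-1] += ch
        else:
            clusters.append(ch)
        join_next = ch == ZWJ
    return clusters


sample = "මේ අතර, පෙබරවාරි මාසයේ පළමු"
clusters = split_clusters(sample)
print(clusters)
```

On this sample the sketch yields 20 pieces, the same count as the tokenizer output shown above. A trained tokenizer also assigns integer IDs and depends on its training corpus, which this sketch does not attempt.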

Preprocessor

Analyze Sinhala character ratio in text:

from sinlib.preprocessing import get_sinhala_character_ratio

# Sample sentences with varying Sinhala content
sentences = [
    'මෙය සිංහල වාක්‍යක්',                                  # Full Sinhala
    'මෙය සිංහල වාක්‍යක් සමග english character කීපයක්',     # Mixed Sinhala and English
    'This is a complete English sentence'                   # Full English
]

# Calculate Sinhala character ratio for each sentence
ratios = get_sinhala_character_ratio(sentences)
print(ratios)
# Output: [0.9, 0.46875, 0.0]
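
A common use of such a ratio is filtering a mixed-language corpus. The sketch below shows the idea in plain Python, assuming a simple definition: the share of code points that fall in the Unicode Sinhala block (U+0D80 to U+0DFF). Both function names are hypothetical, and because spaces, punctuation, and joiners count toward the total here, the values will not necessarily match those returned by `get_sinhala_character_ratio`.

```python
def sinhala_codepoint_ratio(text):
    """Share of code points in the Unicode Sinhala block (U+0D80 to U+0DFF).

    Hypothetical helper; a simplified stand-in, not Sinlib's calculation.
    """
    if not text:
        return 0.0
    sinhala = sum(1 for ch in text if "\u0d80" <= ch <= "\u0dff")
    return sinhala / len(text)


def keep_mostly_sinhala(sentences, threshold=0.5):
    """Keep sentences whose ratio is at least the (assumed) threshold."""
    return [s for s in sentences if sinhala_codepoint_ratio(s) >= threshold]


sentences = [
    "මෙය සිංහල වාක්‍යක්",
    "This is a complete English sentence",
]
print(keep_mostly_sinhala(sentences))
```

The same filter can be built on the ratios returned by the library by zipping them with the input sentences; the 0.5 threshold is only an example value.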

Spell Checker (beta)

Detect typos and get spelling suggestions for Sinhala words using n-gram models:

from sinlib.spellcheck import TypoDetector

# Initialize the typo detector
typo_detector = TypoDetector()

# Check spelling of a word
result = typo_detector.check_spelling("අඩිරාජයාගේ")
print(result)
# Example output: ['අධිරාජයාගේ', 'අධිරාජ්\u200dයයාගේ', 'අධිරාජයා']
# The result is the word itself if it is spelled correctly, or a list of suggestions if it may be a typo
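
Because the result can be either a single word or a list, calling code may want to normalise it before use. The following is a minimal sketch of one way to do that; `suggestions_for` is a hypothetical helper, and the sample values are copied from the example above rather than produced by calling the library.

```python
def suggestions_for(word, result):
    """Turn a check_spelling-style result into a list of suggestions.

    Hypothetical helper. Assumes the result is either a string (the word
    itself when it is correct) or a list of suggested spellings.
    """
    if isinstance(result, str):
        return [] if result == word else [result]
    return [s for s in result if s != word]


# Values taken from the README example, not from a live library call
word = "අඩිරාජයාගේ"
result = ["අධිරාජයාගේ", "අධිරාජ්\u200dයයාගේ", "අධිරාජයා"]

suggestions = suggestions_for(word, result)
if suggestions:
    print(f"Possible typo; first suggestion: {suggestions[0]}")
else:
    print("No issues found")
```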

Romanizer

Convert Sinhala text to Roman characters:

from sinlib import Romanizer

# Sample texts with Sinhala content
texts = [
    "hello, මේ මාසයේ ගත වූ දින 15ක කාලය තුළ කොළඹ නගරය ආශ්‍රිත ව",
    "මෑතකාලීන ව රට මුහුණ දුන් අභියෝගාත්මකම ආර්ථික කාරණාව ණය ප්‍රතිව්‍යුගතකරණය බව"
]

# Initialize the romanizer
romanizer = Romanizer(char_mapper_fp=None, tokenizer_vocab_path=None)

# Romanize the texts
romanized_texts = romanizer(texts)
print(romanized_texts)
# Output:
# ['hello, me masaye gatha wu dina 15ka kalaya thula kolaba nagaraya ashritha wa',
#  'methakaleena wa rata muhuna dun abhiyogathmakama arthika karanawa naya prathiwyugathakaranaya bawa']
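
To give a feel for what character-level romanization involves, here is a toy sketch under stated assumptions: a consonant carries an inherent "a" unless a vowel sign or the al-lakuna (virama) follows it. The tiny mapping tables and the `romanize` function are hypothetical and cover only a handful of letters; Sinlib's `Romanizer` uses its own character mapper and tokenizer vocabulary, which this does not reproduce.

```python
import unicodedata

# Tiny illustrative tables (assumptions, not Sinlib's mapping file)
CONSONANTS = {
    "\u0dbb": "r",  # rayanna
    "\u0da7": "t",  # ttayanna
    "\u0daf": "d",  # dayanna
    "\u0db1": "n",  # nayanna
    "\u0db8": "m",  # mayanna
    "\u0dc3": "s",  # sayanna
    "\u0dba": "y",  # yayanna
}
VOWEL_SIGNS = {
    "\u0dcf": "a",  # aela-pilla
    "\u0dd2": "i",  # is-pilla
    "\u0dd4": "u",  # paa-pilla
    "\u0dd9": "e",  # kombuva
    "\u0dda": "e",  # diga kombuva
}
VIRAMA = "\u0dca"  # al-lakuna: suppresses the inherent vowel


def romanize(text):
    """Romanize the few letters covered by the toy tables; pass others through."""
    text = unicodedata.normalize("NFC", text)
    out = []
    for i, ch in enumerate(text):
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            nxt = text[i + 1] if i + 1 < len(text) else ""
            # Add the inherent vowel unless a vowel sign or virama follows
            if nxt not in VOWEL_SIGNS and nxt != VIRAMA:
                out.append("a")
        elif ch in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[ch])
        elif ch == VIRAMA:
            continue
        else:
            out.append(ch)
    return "".join(out)


print(romanize("\u0dbb\u0da7"))        # the word for "country"
print(romanize("\u0daf\u0dd2\u0db1"))  # the word for "days"
```

For the words it covers, the toy mapping produces "rata" and "dina", in line with the library output above. Anything outside the tables, including Latin text, digits, and unmapped Sinhala letters, is passed through unchanged.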

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgements

  • Thanks to all contributors who have helped with the development of Sinlib
  • Special thanks to the Sinhala NLP community for their support and feedback
