A language processing tool for Sinhalese (සිංහල)

Update 2020.08.16: Integrated a part-of-speech tagger and a stemmer.

Update 2019.07.21: This tool no longer requires Java to run the Sinhala tokenizer. All Java code has been ported to Python for convenience.

How to get started

Steps:

  1. Download stat.split.pickle to the resources folder
  2. Import the required tools from the sinling module in your project (you may have to add this project's path to your Python module search path, for example via the PYTHONPATH environment variable)
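
If the import in step 2 fails because Python cannot find the module, one option is to extend the module search path at runtime. The following is a minimal sketch, not part of sinling itself: `add_project_to_path` is a hypothetical helper, and `/path/to/sinling` is a placeholder for wherever you keep the project.

```python
import sys
from pathlib import Path


def add_project_to_path(project_dir: str) -> str:
    """Prepend project_dir to sys.path if it is not already there.

    Returns the resolved absolute path that was added.
    """
    resolved = str(Path(project_dir).resolve())
    if resolved not in sys.path:
        sys.path.insert(0, resolved)
    return resolved


# Placeholder location of the project checkout; replace with your own path.
project_path = add_project_to_path("/path/to/sinling")
```

After this call, `from sinling import SinhalaTokenizer` should resolve as long as the placeholder points at the directory that contains the `sinling` package.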

How to use

Sinhala Tokenizer

from sinling import SinhalaTokenizer

tokenizer = SinhalaTokenizer()

sentence = '...'  # your sentence

tokenizer.tokenize(sentence)

Sinhala Stemmer (Experimental)

from sinling import SinhalaStemmer

stemmer = SinhalaStemmer()

word = '...'  # your word

stemmer.stem(word)

Please cite sinhala-stemmer if you are using this implementation.

Part-of-Speech Tagger

from sinling import SinhalaTokenizer, POSTagger

tokenizer = SinhalaTokenizer()

document = '...'  # may contain multiple sentences

tokenized_sentences = [tokenizer.tokenize(f'{ss}.') for ss in tokenizer.split_sentences(document)]

tagger = POSTagger()

pos_tags = tagger.predict(tokenized_sentences)
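
The tagger output can be summarised with a few lines of standard-library code. The sketch below assumes that `pos_tags` is a list of sentences, each a list of `(token, tag)` pairs. That shape is an assumption, not something stated above, so inspect the actual return value first. The helper `tag_frequencies` and the sample tokens and tag names are invented for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Assumed shape of one tagged sentence: [(token, tag), ...]
TaggedSentence = List[Tuple[str, str]]


def tag_frequencies(tagged_sentences: Iterable[TaggedSentence]) -> Counter:
    """Count how often each tag occurs across all sentences."""
    return Counter(tag for sentence in tagged_sentences for _, tag in sentence)


# Placeholder data standing in for tagger output; tokens and tags are invented.
sample = [[("w1", "NOUN"), ("w2", "VERB")], [("w3", "NOUN")]]
counts = tag_frequencies(sample)
```

With real output you would call `tag_frequencies(pos_tags)` instead of passing the placeholder list.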

Word Joiner (Morphological Joiner)

from sinling import preprocess, word_joiner

w1 = preprocess('මුනි')
w2 = preprocess('උතුමා')
results = word_joiner.join(w1, w2)
# Returns a list of possible results after applying join rules ['මුනිතුමා', ...]
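
Because the joiner returns several candidates, the caller usually has to pick one. A minimal sketch of one possible strategy follows: prefer the first candidate found in a known vocabulary, otherwise fall back to the first candidate. `pick_candidate` is a hypothetical helper, not part of sinling, and the second candidate string is a placeholder.

```python
from typing import Iterable, List, Optional


def pick_candidate(candidates: List[str], vocabulary: Iterable[str]) -> Optional[str]:
    """Return the first candidate present in vocabulary.

    Falls back to the first candidate, or None if there are no candidates.
    """
    vocab = set(vocabulary)
    for candidate in candidates:
        if candidate in vocab:
            return candidate
    return candidates[0] if candidates else None


# 'මුනිතුමා' comes from the example above; the other entry is a placeholder.
candidates = ['<other-candidate>', 'මුනිතුමා']
best = pick_candidate(candidates, vocabulary={'මුනිතුමා'})
```

In practice `candidates` would be the `results` list from `word_joiner.join`, and the vocabulary would come from your own word list or corpus.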

Word Splitter (Morphological Splitter, corpus-based, experimental)

from sinling import word_splitter

word = '...'
results = word_splitter.split(word)
# Returns a dict containing debug information, base word and affix
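
To read the split out of the returned dict, something like the sketch below may help. The key names `'base'`, `'affix'` and `'debug'` are assumptions based on the description above, so confirm them against the dict you actually get back. `base_and_affix` is a hypothetical helper, and the sample values are placeholders.

```python
from typing import Any, Dict, Optional, Tuple


def base_and_affix(result: Dict[str, Any]) -> Tuple[Optional[str], Optional[str]]:
    """Extract (base, affix) from a splitter result.

    Missing keys yield None instead of raising KeyError.
    """
    return result.get("base"), result.get("affix")


# Placeholder result with assumed key names; the values are not real output.
sample_result = {"debug": [], "base": "<base>", "affix": "<affix>"}
base, affix = base_and_affix(sample_result)
```

Using `dict.get` keeps the helper tolerant if the splitter omits a key for words it cannot split.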

Visit here to see some sample splits.

Contributions

  • Contact wayasas.13@cse.mrt.ac.lk if you would like to contribute to this project.

License

Apache License, Version 2.0, January 2004 (http://www.apache.org/licenses/)
