Sinhala Language Tool Kit
Project description
SLTK: A Comprehensive Tokenizer for Sinhala Language
Welcome to the GitHub repository for SLTK, a tokenizer built for Sinhala Natural Language Processing (NLP) tasks. SLTK tokenizes text using Grapheme Pair Encoding. The first version of SLTK was based on our own research; the current implementation is inspired by the research paper by Velayuthan et al. (2024).
Installation
To install SLTK, run the following command:
pip install sltkpy
Usage
You can train the tokenizer on a custom dataset to create your own vocabulary and then use it to tokenize your text data. First, import SLTK:
from sltkpy import GPETokenizer
Now initialize the tokenizer:
tokenizer = GPETokenizer()
Train new vocab
To train a new vocab, pass a corpus to the train method. Optionally, set the maximum size of the vocab with vocab_size and the minimum frequency a pair needs to qualify as a vocab entry with min_freq.
vocab = tokenizer.train(corpus=corpus, vocab_size=3000)
Note: The default value of min_freq is 3.
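For intuition about how vocab_size and min_freq shape training, here is a minimal sketch of a generic pair-merging loop over grapheme clusters. It is not SLTK's actual implementation: the function names split_graphemes and train_gpe are hypothetical, the grapheme splitting is a rough approximation (a base character plus trailing marks, with a zero-width joiner gluing the next character on), and the token-to-id output format is an assumption.

```python
import unicodedata
from collections import Counter

ZWJ = "\u200d"  # zero-width joiner, used in Sinhala conjuncts


def split_graphemes(word):
    """Rough grapheme split: base character plus trailing marks; ZWJ glues the next character on."""
    clusters = []
    for ch in word:
        is_mark = unicodedata.category(ch) in ("Mn", "Mc", "Me")
        joins = ch == ZWJ or (clusters and clusters[-1].endswith(ZWJ))
        if clusters and (is_mark or joins):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def train_gpe(corpus, vocab_size=3000, min_freq=3):
    """Merge the most frequent adjacent pair until vocab_size or min_freq stops training."""
    # Count each whitespace-separated word as a tuple of grapheme clusters.
    words = Counter(tuple(split_graphemes(w)) for line in corpus for w in line.split())
    vocab = {g for word in words for g in word}
    while len(vocab) < vocab_size:
        pairs = Counter()
        for word, count in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        (a, b), freq = pairs.most_common(1)[0]
        if freq < min_freq:
            break  # no remaining pair is frequent enough to become a vocab entry
        merged_words = Counter()
        for word, count in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged_words[tuple(out)] += count
        words = merged_words
        vocab.add(a + b)
    # Assumed output format: token -> integer id.
    return {token: idx for idx, token in enumerate(sorted(vocab))}


sketch_vocab = train_gpe(["ab ab ab cd"], vocab_size=10, min_freq=3)
```

In this toy run the pair (a, b) occurs three times and is merged, while (c, d) occurs once and falls below min_freq, so training stops before vocab_size is reached.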
Once training finishes, the method returns the vocab as a dictionary. You can save it as a JSON file for future use.
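One way to save the returned dictionary is with Python's standard json module, sketched below. The helper name save_vocab, the file name my_vocab.json, and the stand-in dictionary are illustrative only; the real vocab's keys and values may look different, and whether load_vocab expects exactly this layout is an assumption.

```python
import json


def save_vocab(vocab, path):
    """Write a vocab dictionary to a UTF-8 JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Sinhala text readable instead of \u escapes.
        json.dump(vocab, f, ensure_ascii=False, indent=2)


# Stand-in for the dictionary returned by tokenizer.train(...).
example_vocab = {"ලං": 0, "කා": 1, "ව": 2}
save_vocab(example_vocab, "my_vocab.json")
```

Opening the file with an explicit UTF-8 encoding avoids platform-dependent defaults when the vocab contains Sinhala characters.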
Load vocab
There are two ways to load a vocab into the tokenizer: you can use your own vocab, or you can load the pre-trained vocab bundled with the SLTK library. The pre-trained vocab was trained on the Wikipedia Sinhala dataset from Hugging Face Datasets.
- Load pre-trained vocab:
tokenizer.pre_load()
- Load your own trained vocab:
tokenizer.load_vocab('<path_to_your_vocab>.json')
Tokenize text
Once you have loaded a vocab using either method above, you can tokenize your text as follows:
tokens = tokenizer.tokenize('ශ්රී ලංකාව සිලෝන් ලෙස ද හැඳින් වේ.')
Encode tokens
To encode tokens, use the following method:
encoded_tokens = tokenizer.encode(tokens)
Decode tokens
To decode tokens, use the following method:
decoded_text = tokenizer.decode(encoded_tokens)
File details
Details for the file sltkpy-0.1.1.tar.gz.
File metadata
- Download URL: sltkpy-0.1.1.tar.gz
- Upload date:
- Size: 101.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 38bf218bc4ca5cda58a61d8492561bc0f5bbb9a8928c9c1b8e38c6592aa09eb9 |
| MD5 | f27ab7f7e46da0d11052fdf2fb8983c2 |
| BLAKE2b-256 | a0c00bf49bd9117e9d6336087b19fe10620188a921ec946a82a7965a0daa29a9 |
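To check a downloaded file against a published SHA256 digest, something along these lines should work using only the standard hashlib module. The helper names sha256_of and matches_published_digest are hypothetical, and the commented call assumes the archive sits in the current directory.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Example call, assuming the source distribution was downloaded locally:
# matches_published_digest(
#     "sltkpy-0.1.1.tar.gz",
#     "38bf218bc4ca5cda58a61d8492561bc0f5bbb9a8928c9c1b8e38c6592aa09eb9",
# )
```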
File details
Details for the file sltkpy-0.1.1-py3-none-any.whl.
File metadata
- Download URL: sltkpy-0.1.1-py3-none-any.whl
- Upload date:
- Size: 192.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1844ca24b8ec735892167f84a2f48ef036061dee675710b14d7087ba95e7f042 |
| MD5 | 4b790300c00f0a031f8bcfcd5e5a67b9 |
| BLAKE2b-256 | 951ddac4cd1910f5faa7eaee2d3f6297449d903c0181af49b7b11dfd48b7212d |