
Sinhala Language Tool Kit


SLTK: A Comprehensive Tokenizer for Sinhala Language

Welcome to the GitHub repository for SLTK, a tokenizer designed to support Sinhala Natural Language Processing (NLP) tasks. SLTK tokenizes text using Grapheme Pair Encoding. The first version of SLTK was based on our own research; the current implementation is inspired by the research paper by Velayuthan et al. (2024).
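To give a feel for why graphemes matter in Sinhala, here is a minimal sketch of splitting text into grapheme-like units before any pair merging happens. It is not SLTK's implementation: `split_graphemes` is a hypothetical helper, and the rule it uses (attach combining marks and zero-width-joiner conjuncts to the preceding base character) is a simplification of full Unicode grapheme segmentation.

```python
import unicodedata

ZWJ = "\u200d"  # zero-width joiner, used in Sinhala conjuncts


def split_graphemes(text):
    """Approximate grapheme clusters (simplified rule, not SLTK's own code).

    A character is attached to the previous cluster when it is a combining
    mark, when it is a ZWJ, or when the previous cluster ends with a ZWJ.
    """
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        joins_previous = bool(clusters) and (
            is_mark or ch == ZWJ or clusters[-1].endswith(ZWJ)
        )
        if joins_previous:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# "ශ්‍රී ලංකාව" written with escapes so the ZWJ is explicit
sample = "\u0dc1\u0dca\u200d\u0dbb\u0dd3 \u0dbd\u0d82\u0d9a\u0dcf\u0dc0"
clusters = split_graphemes(sample)
print(len(sample), "code points ->", len(clusters), "clusters")
print(clusters)
```

The first word consists of five code points but is read as a single written unit, which is why a tokenizer that starts from graphemes rather than bytes or code points never splits it in the middle.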

Installation

To install SLTK, run the following command:

pip install sltkpy

Usage

You can train the tokenizer on a custom dataset to create your own vocabulary, then use it to tokenize your text data. First, import SLTK:

from sltkpy import GPETokenizer

Now initialize the tokenizer:

tokenizer = GPETokenizer()

Train new vocab

To train a new vocab, pass your corpus to the train method. You can also set the maximum vocab size with vocab_size and the minimum frequency a pair must reach to be added to the vocab with min_freq.

vocab = tokenizer.train(corpus=corpus, vocab_size=3000)

Note: The default value of min_freq is 3.
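To illustrate how vocab_size and min_freq typically interact in pair-merge training, here is a toy loop. It is a sketch only, not SLTK's training code: `train_toy_vocab` and `merge_pair` are hypothetical names, the input is assumed to be already split into units, and the token-to-ID dictionary it returns is an assumed shape, not necessarily what `tokenizer.train` produces.

```python
from collections import Counter


def merge_pair(units, pair):
    """Replace each adjacent occurrence of `pair` with one merged unit."""
    merged, i = [], 0
    while i < len(units):
        if i + 1 < len(units) and (units[i], units[i + 1]) == pair:
            merged.append(units[i] + units[i + 1])
            i += 2
        else:
            merged.append(units[i])
            i += 1
    return merged


def train_toy_vocab(sequences, vocab_size, min_freq=3):
    """Toy pair-merge training over lists of units (illustration only)."""
    sequences = [list(seq) for seq in sequences]
    vocab = {}
    # Seed the vocab with every individual unit.
    for seq in sequences:
        for unit in seq:
            vocab.setdefault(unit, len(vocab))
    # Merge the most frequent adjacent pair until a limit is reached.
    while len(vocab) < vocab_size:
        pairs = Counter()
        for seq in sequences:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < min_freq:
            break  # no remaining pair is frequent enough
        vocab.setdefault(pair[0] + pair[1], len(vocab))
        sequences = [merge_pair(seq, pair) for seq in sequences]
    return vocab


# Plain letters stand in for graphemes to keep the example readable.
toy_sequences = [list("abab"), list("abc"), list("ab")]
toy_vocab = train_toy_vocab(toy_sequences, vocab_size=10, min_freq=3)
print(toy_vocab)
```

In this toy run, training stops well before vocab_size is reached because only one pair occurs at least min_freq times. Lowering min_freq admits rarer merges; lowering vocab_size cuts training short.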

Once training is finished, the method returns the vocab as a dictionary. You can save it as a JSON file for future use.
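One way to save the returned dictionary is with Python's standard json module. The sketch below uses a small stand-in dictionary, since the exact shape of the vocab is not documented here; it assumes string keys and JSON-serializable values. `save_vocab` is a hypothetical helper, not part of SLTK.

```python
import json
import os
import tempfile


def save_vocab(vocab, path):
    """Write a vocab dictionary to a UTF-8 JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Sinhala text readable in the file.
        json.dump(vocab, f, ensure_ascii=False, indent=2)


# Stand-in for the dictionary returned by tokenizer.train(...)
sample_vocab = {"\u0dbd\u0d82": 0, "\u0d9a\u0dcf": 1}

with tempfile.TemporaryDirectory() as tmp:
    vocab_path = os.path.join(tmp, "vocab.json")
    save_vocab(sample_vocab, vocab_path)
    # Read the file back to confirm the round trip.
    with open(vocab_path, encoding="utf-8") as f:
        restored = json.load(f)

print(restored == sample_vocab)
```

In real use, write to a permanent path instead of a temporary directory so the file can later be passed to load_vocab.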

Load vocab

There are two ways to load a vocab into the tokenizer: use your own vocab, or load the pre-trained vocab that ships with the SLTK library. The pre-trained vocab was trained on the Sinhala Wikipedia dataset from Hugging Face Datasets.

  1. Load pre-trained vocab:
tokenizer.pre_load()
  2. Load your own trained vocab:
tokenizer.load_vocab('<path_to_your_vocab>.json')

Tokenize text

Once you have loaded a vocab using either method above, you can tokenize your text as follows:

tokens = tokenizer.tokenize('ශ්‍රී ලංකාව සිලෝන් ලෙස ද හැඳින් වේ.')

Encode tokens

To encode tokens, use the following method:

encoded_tokens = tokenizer.encode(tokens)

Decode tokens

To decode tokens, use the following method:

decoded_text = tokenizer.decode(encoded_tokens)
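The README does not say what encode returns. Tokenizers commonly map each token to an integer ID from the vocab and reverse that mapping on decode, and the sketch below shows that idea only. `toy_encode`, `toy_decode`, the toy vocab, and the `unk_id` fallback are all hypothetical and may not match SLTK's behaviour.

```python
def toy_encode(tokens, vocab, unk_id=-1):
    """Map each token to its ID; unknown tokens get `unk_id` (assumed)."""
    return [vocab.get(token, unk_id) for token in tokens]


def toy_decode(ids, vocab):
    """Map IDs back to tokens and join them; unknown IDs are skipped."""
    id_to_token = {idx: token for token, idx in vocab.items()}
    return "".join(id_to_token.get(idx, "") for idx in ids)


# Hypothetical token-to-ID vocab; "ලං", "කා", "ව" written with escapes.
toy_vocab = {"\u0dbd\u0d82": 0, "\u0d9a\u0dcf": 1, "\u0dc0": 2}
toy_tokens = ["\u0dbd\u0d82", "\u0d9a\u0dcf", "\u0dc0"]

toy_ids = toy_encode(toy_tokens, toy_vocab)
toy_text = toy_decode(toy_ids, toy_vocab)
print(toy_ids, toy_text)
```

When every token is in the vocab, decoding the encoded IDs reproduces the original text; tokens outside the vocab are lost in this toy version.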
