Nepali Tokenizers

This package provides access to pre-trained WordPiece and SentencePiece (Unigram) tokenizers for the Nepali language, trained using HuggingFace's tokenizers library. It is a small, focused Python package tailored specifically to Nepali, with a default set of configurations for the normalizer, pre-tokenizer, post-processor, and decoder.

It supports further customization by exposing HuggingFace's Tokenizer pipeline directly, allowing users to adapt the tokenizers to their requirements.

Installation

You can install nepalitokenizers using pip:

pip install nepalitokenizers

Usage

After installing the package, you can use the tokenizers in your Python code:

WordPiece Tokenizer

from nepalitokenizers import WordPiece

text = "हाम्रा सबै क्रियाकलापहरु भोलिवादी छन् । मेरो पानीजहाज वाम माछाले भरिपूर्ण छ । इन्जिनियरहरुले गएको हप्ता राजधानीमा त्यस्तै बहस गरे ।"

tokenizer_wp = WordPiece()

tokens = tokenizer_wp.encode(text)
print(tokens.ids)
print(tokens.tokens)

print(tokenizer_wp.decode(tokens.ids))

Output

[1, 11366, 8625, 14157, 8423, 13344, 9143, 8425, 1496, 9505, 22406, 11693, 12679, 8340, 27445, 1430, 1496, 13890, 9008, 9605, 13591, 14547, 9957, 12507, 8700, 1496, 2]
['[CLS]', 'हाम्रा', 'सबै', 'क्रियाकलाप', '##हरु', 'भोलि', '##वादी', 'छन्', '।', 'मेरो', 'पानीजहाज', 'वाम', 'माछा', '##ले', 'भरिपूर्ण', 'छ', '।', 'इन्जिनियर', '##हरुले', 'गएको', 'हप्ता', 'राजधानीमा', 'त्यस्तै', 'बहस', 'गरे', '।', '[SEP]']
हाम्रा सबै क्रियाकलापहरु भोलिवादी छन् । मेरो पानीजहाज वाम माछाले भरिपूर्ण छ । इन्जिनियरहरुले गएको हप्ता राजधानीमा त्यस्तै बहस गरे ।
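
Since the wrapper exposes the underlying HuggingFace Tokenizer, batch encoding should work as well. A minimal sketch using Tokenizer's standard encode_batch method (the sentences are taken from the example above):

# encode several sentences at once; encode_batch returns one
# Encoding object per input sentence
texts = [
    "हाम्रा सबै क्रियाकलापहरु भोलिवादी छन् ।",
    "मेरो पानीजहाज वाम माछाले भरिपूर्ण छ ।",
]
encodings = tokenizer_wp.encode_batch(texts)
for encoding in encodings:
    print(encoding.tokens)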

SentencePiece (Unigram) Tokenizer

from nepalitokenizers import SentencePiece

text = "कोभिड महामारीको पिडाबाट मुक्त नहुँदै मानव समाजलाई यतिबेला युद्धको विध्वंसकारी क्षतिको चिन्ताले चिन्तित बनाएको छ ।"

tokenizer_sp = SentencePiece()

tokens = tokenizer_sp.encode(text)
print(tokens.ids)
print(tokens.tokens)

print(tokenizer_sp.decode(tokens.ids))

Output

[7, 9, 3241, 483, 12081, 9, 11079, 23, 2567, 11254, 1002, 789, 20, 3334, 2161, 9, 23517, 2711, 1115, 9, 1718, 12, 5941, 781, 19, 8, 1, 0]
['▁', 'को', 'भि', 'ड', '▁महामारी', 'को', '▁पिडा', 'बाट', '▁मुक्त', '▁नहुँदै', '▁मानव', '▁समाज', 'लाई', '▁यतिबेला', '▁युद्ध', 'को', '▁विध्वंस', 'कारी', '▁क्षति', 'को', '▁चिन्ता', 'ले', '▁चिन्तित', '▁बनाएको', '▁छ', '▁।', '<sep>', '<cls>']
कोभिड महामारीको पिडाबाट मुक्त नहुँदै मानव समाजलाई यतिबेला युद्धको विध्वंसकारी क्षतिको चिन्ताले चिन्तित बनाएको छ ।
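
The Tokenizer API also accepts sentence pairs, which is useful for tasks such as question answering or inference. A minimal sketch (assuming the default post-processor defines a pair template; the sentences are reused from elsewhere in this README):

# encode a sentence pair; type_ids mark which sentence each token belongs to
pair = tokenizer_sp.encode(
    "हाम्रो मातृभूमि नेपाल हो",
    "कोभिड महामारीको पिडाबाट मुक्त नहुँदै मानव समाजलाई यतिबेला युद्धको विध्वंसकारी क्षतिको चिन्ताले चिन्तित बनाएको छ ।",
)
print(pair.tokens)
print(pair.type_ids)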

Configuration & Customization

Each tokenizer class ships with a sensible default configuration for the normalizer, pre-tokenizer, post-processor, and decoder. For details, see the training files in the train/ directory.

The package supports further customization by providing direct access to HuggingFace's tokenizer pipeline. You can therefore treat nepalitokenizers's tokenizer instances as HuggingFace Tokenizer objects. For example:

from nepalitokenizers import WordPiece

# importing from the HuggingFace tokenizers package
from tokenizers.processors import TemplateProcessing

text = "हाम्रो मातृभूमि नेपाल हो"

tokenizer_wp = WordPiece()

# encode using the default post-processor
tokens = tokenizer_wp.encode(text)
print(tokens.tokens)

# change the post-processor so that no special tokens are added;
# tokenizer_wp can be treated as a HuggingFace Tokenizer object
tokenizer_wp.post_processor = TemplateProcessing()

tokens = tokenizer_wp.encode(text)
print(tokens.tokens)

Output

['[CLS]', 'हाम्रो', 'मातृ', '##भूमि', 'नेपाल', 'हो', '[SEP]']
['हाम्रो', 'मातृ', '##भूमि', 'नेपाल', 'हो']

To learn more about further customizations that can be performed, visit HuggingFace's Tokenizer Documentation.
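
For example, you can rebuild a BERT-style single-sentence template with explicit special tokens, or enable padding and truncation through the Tokenizer's standard methods. A minimal sketch (assuming the vocabulary defines a [PAD] token; ids are looked up with token_to_id rather than hard-coded):

from nepalitokenizers import WordPiece
from tokenizers.processors import TemplateProcessing

tokenizer = WordPiece()

# restore a single-sentence template with explicit special tokens
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

# truncate long inputs and pad every sequence up to 128 tokens
tokenizer.enable_truncation(max_length=128)
tokenizer.enable_padding(
    pad_id=tokenizer.token_to_id("[PAD]"),  # assumes a [PAD] token exists
    pad_token="[PAD]",
    length=128,
)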

Note: The delegation to HuggingFace's Tokenizer pipeline is implemented with the following generic wrapper class, because tokenizers.Tokenizer cannot be subclassed directly. It is a useful trick I use for solving similar issues:

class Delegate:
    """
    A generic wrapper class that delegates attribute and method calls
    to the specified self.delegate instance.
    """

    @property
    def _items(self):
        return dir(self.delegate)

    def __getattr__(self, name):
        # only called when normal attribute lookup fails;
        # forward the lookup to the delegate
        if name in self._items:
            return getattr(self.delegate, name)
        raise AttributeError(
            f"'{self.__class__.__name__}' object has no attribute '{name}'")

    def __setattr__(self, name, value):
        # keep 'delegate' itself (and any new attributes) on the wrapper;
        # forward assignments to attributes the delegate already has
        if name == "delegate" or name not in self._items:
            super().__setattr__(name, value)
        else:
            setattr(self.delegate, name, value)

    def __dir__(self):
        return dir(type(self)) + list(self.__dict__.keys()) + self._items
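
A tokenizer class then only needs to store the wrapped object in self.delegate. A minimal sketch of how this might look (the file name is hypothetical; the package's actual constructors may differ):

from tokenizers import Tokenizer

class WordPiece(Delegate):
    def __init__(self):
        # load the bundled pre-trained tokenizer
        # (the file name here is illustrative)
        self.delegate = Tokenizer.from_file("wordpiece_nepali.json")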

Training

The Python files used to train the tokenizers are available in the train/ directory. You can also use them to train your own tokenizers on a custom text corpus (a minimal training sketch follows the dataset list below).

These tokenizers were trained on two datasets:

1. The Nepali Subset of the OSCAR dataset

You can download it using the following code:

import datasets
from tqdm.auto import tqdm
import os

dataset = datasets.load_dataset(
    'oscar', 'unshuffled_deduplicated_ne',
    split='train'
)

os.makedirs('data', exist_ok=True)

batch = []
counter = 0

for sample in tqdm(dataset):
    batch.append(sample['text'].replace('\n', ' '))

    if len(batch) == 10_000:
        with open(f'data/ne_{counter}.txt', 'w', encoding='utf-8') as f:
            f.write('\n'.join(batch))
        batch = []
        counter += 1

# write the remaining samples that did not fill a final batch
if batch:
    with open(f'data/ne_{counter}.txt', 'w', encoding='utf-8') as f:
        f.write('\n'.join(batch))

2. A Large Scale Nepali Text Corpus by Rabindra Lamsal (2020)

To download the dataset, follow the instructions provided at this link: A Large Scale Nepali Text Corpus.
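
With the text files in place, training follows the standard HuggingFace tokenizers recipe. A minimal sketch for a WordPiece tokenizer (the vocabulary size, special tokens, and normalizer are illustrative; the actual scripts in the train/ directory may use different settings):

from glob import glob
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# a WordPiece model with a simple normalization and pre-tokenization pipeline
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.NFKC()
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.WordPieceTrainer(
    vocab_size=30_000,  # illustrative; choose a size that suits your corpus
    special_tokens=["[PAD]", "[CLS]", "[SEP]", "[UNK]", "[MASK]"],
)

files = sorted(glob('data/ne_*.txt'))
tokenizer.train(files, trainer)
tokenizer.save("nepali_wordpiece.json")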

License

This package is licensed under the Apache 2.0 License, which is consistent with the license used by HuggingFace's tokenizers library. Please see the LICENSE file for more details.
