Pre-trained Tokenizers for the Nepali language with an interface to HuggingFace's tokenizers library for customizability.

Nepali Tokenizers

This package provides access to pre-trained WordPiece and SentencePiece (Unigram) tokenizers for the Nepali language, trained using HuggingFace's tokenizers library. It is a small Python package tailored specifically to Nepali, with a default set of configurations for the normalizer, pre-tokenizer, post-processor, and decoder.

Further customization is delegated to HuggingFace's Tokenizer pipeline through a thin interface, so users can adapt the tokenizers to their own requirements.

Installation

You can install nepalitokenizers using pip:

pip install nepalitokenizers

Usage

After installing the package, you can use the tokenizers in your Python code:

WordPiece Tokenizer

from nepalitokenizers import WordPiece

text = "हाम्रा सबै क्रियाकलापहरु भोलिवादी छन् । मेरो पानीजहाज वाम माछाले भरिपूर्ण छ । इन्जिनियरहरुले गएको हप्ता राजधानीमा त्यस्तै बहस गरे ।"

tokenizer_wp = WordPiece()

tokens = tokenizer_wp.encode(text)
print(tokens.ids)
print(tokens.tokens)

print(tokenizer_wp.decode(tokens.ids))

Output

[1, 11366, 8625, 14157, 8423, 13344, 9143, 8425, 1496, 9505, 22406, 11693, 12679, 8340, 27445, 1430, 1496, 13890, 9008, 9605, 13591, 14547, 9957, 12507, 8700, 1496, 2]
['[CLS]', 'हाम्रा', 'सबै', 'क्रियाकलाप', '##हरु', 'भोलि', '##वादी', 'छन्', '।', 'मेरो', 'पानीजहाज', 'वाम', 'माछा', '##ले', 'भरिपूर्ण', 'छ', '।', 'इन्जिनियर', '##हरुले', 'गएको', 'हप्ता', 'राजधानीमा', 'त्यस्तै', 'बहस', 'गरे', '।', '[SEP]']
हाम्रा सबै क्रियाकलापहरु भोलिवादी छन् । मेरो पानीजहाज वाम माछाले भरिपूर्ण छ । इन्जिनियरहरुले गएको हप्ता राजधानीमा त्यस्तै बहस गरे ।
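
The ## prefix in the token list marks a piece that continues the previous word. As a rough illustration of that convention only (the package's real decoder is configured in its training files and may differ), the hypothetical helper join_wordpieces below rebuilds text from a slice of the tokens printed above by dropping the special tokens and gluing ## pieces onto the preceding token.

```python
def join_wordpieces(tokens, special=("[CLS]", "[SEP]"), prefix="##"):
    """Rebuild text from WordPiece tokens.

    Illustrative helper only; it is not part of nepalitokenizers.
    """
    words = []
    for token in tokens:
        if token in special:
            continue  # special tokens carry no text
        if token.startswith(prefix) and words:
            # continuation piece: extend the previous word
            words[-1] += token[len(prefix):]
        else:
            # start of a new word
            words.append(token)
    return " ".join(words)


# a slice of the token list printed above
tokens = ['[CLS]', 'हाम्रा', 'सबै', 'क्रियाकलाप', '##हरु', 'भोलि', '##वादी', 'छन्', '।', '[SEP]']
print(join_wordpieces(tokens))
```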

SentencePiece (Unigram) Tokenizer

from nepalitokenizers import SentencePiece

text = "कोभिड महामारीको पिडाबाट मुक्त नहुँदै मानव समाजलाई यतिबेला युद्धको विध्वंसकारी क्षतिको चिन्ताले चिन्तित बनाएको छ ।"

tokenizer_sp = SentencePiece()

tokens = tokenizer_sp.encode(text)
print(tokens.ids)
print(tokens.tokens)

print(tokenizer_sp.decode(tokens.ids))

Output

[7, 9, 3241, 483, 12081, 9, 11079, 23, 2567, 11254, 1002, 789, 20, 3334, 2161, 9, 23517, 2711, 1115, 9, 1718, 12, 5941, 781, 19, 8, 1, 0]
['▁', 'को', 'भि', 'ड', '▁महामारी', 'को', '▁पिडा', 'बाट', '▁मुक्त', '▁नहुँदै', '▁मानव', '▁समाज', 'लाई', '▁यतिबेला', '▁युद्ध', 'को', '▁विध्वंस', 'कारी', '▁क्षति', 'को', '▁चिन्ता', 'ले', '▁चिन्तित', '▁बनाएको', '▁छ', '▁।', '<sep>', '<cls>']
कोभिड महामारीको पिडाबाट मुक्त नहुँदै मानव समाजलाई यतिबेला युद्धको विध्वंसकारी क्षतिको चिन्ताले चिन्तित बनाएको छ ।
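
In the Unigram output, the ▁ character (U+2581) marks the start of a word instead of a ## continuation prefix. The sketch below shows that convention with a hypothetical helper, join_metaspace, applied to a slice of the tokens printed above; it is an assumption about how such tokens map back to text, not the package's own decoder.

```python
def join_metaspace(tokens, special=("<sep>", "<cls>"), marker="\u2581"):
    """Rebuild text from Unigram tokens.

    Illustrative helper only; it is not part of nepalitokenizers.
    """
    pieces = [t for t in tokens if t not in special]
    # the marker stands for a word boundary, so turn it back into a space
    return "".join(pieces).replace(marker, " ").strip()


# a slice of the token list printed above
tokens = ['▁', 'को', 'भि', 'ड', '▁महामारी', 'को', '▁पिडा', 'बाट', '▁मुक्त', '<sep>', '<cls>']
print(join_metaspace(tokens))
```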

Configuration & Customization

Each tokenizer class has a default set of configurations for the normalizer, pre-tokenizer, post-processor, and decoder. For details, see the training files in the train/ directory.

The package delegates further customization by providing direct access to HuggingFace's tokenizer pipeline. You can therefore treat the tokenizer instances from nepalitokenizers as HuggingFace Tokenizer objects. For example:

from nepalitokenizers import WordPiece

# importing from the HuggingFace tokenizers package
from tokenizers.processors import TemplateProcessing

text = "हाम्रो मातृभूमि नेपाल हो"

tokenizer_wp = WordPiece()

# using default post processor
tokens = tokenizer_wp.encode(text)
print(tokens.tokens)

# change the post processor to not add any special tokens
# treat tokenizer_wp as HuggingFace's Tokenizer object
tokenizer_wp.post_processor = TemplateProcessing()

tokens = tokenizer_wp.encode(text)
print(tokens.tokens)

Output

['[CLS]', 'हाम्रो', 'मातृ', '##भूमि', 'नेपाल', 'हो', '[SEP]']
['हाम्रो', 'मातृ', '##भूमि', 'नेपाल', 'हो']

To learn more about further customizations that can be performed, visit HuggingFace's Tokenizer Documentation.

Note: The delegation to HuggingFace's Tokenizer pipeline is implemented with the following generic wrapper class, because tokenizers.Tokenizer is not an acceptable base type for inheritance. It is a trick I find useful for similar issues:

class Delegate:
    """
    A generic wrapper class that delegates attributes and method calls
    to the specified self.delegate instance.
    """

    @property
    def _items(self):
        return dir(self.delegate)

    def __getattr__(self, name):
        if name in self._items:
            return getattr(self.delegate, name)
        raise AttributeError(
            f"'{self.__class__.__name__}' object has no attribute '{name}'")

    def __setattr__(self, name, value):
        if name == "delegate" or name not in self._items:
            super().__setattr__(name, value)
        else:
            setattr(self.delegate, name, value)

    def __dir__(self):
        return dir(type(self)) + list(self.__dict__.keys()) + self._items
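
To see how the wrapper behaves without installing anything, here is a small sketch that repeats a compact copy of the class and wraps a plain Python object. Engine and Wrapper are hypothetical stand-ins invented for this illustration; they are not the package's actual classes.

```python
class Delegate:
    """Compact copy of the wrapper above, repeated so this sketch runs on its own."""

    @property
    def _items(self):
        return dir(self.delegate)

    def __getattr__(self, name):
        # only called when normal attribute lookup on the wrapper fails
        if name in self._items:
            return getattr(self.delegate, name)
        raise AttributeError(
            f"'{self.__class__.__name__}' object has no attribute '{name}'")

    def __setattr__(self, name, value):
        if name == "delegate" or name not in self._items:
            super().__setattr__(name, value)
        else:
            setattr(self.delegate, name, value)


class Engine:
    """Hypothetical stand-in for an object that cannot be subclassed."""

    def __init__(self):
        self.mode = "default"

    def run(self, text):
        return f"{self.mode}:{text}"


class Wrapper(Delegate):
    """Hypothetical wrapper; 'delegate' must be assigned before anything else."""

    def __init__(self):
        self.delegate = Engine()
        self.label = "wrapper-only attribute"


w = Wrapper()
print(w.run("x"))       # the call is forwarded to Engine.run
w.mode = "fast"         # stored on the wrapped Engine, not on the wrapper
print(w.delegate.mode)
print(w.label)          # stored on the wrapper itself
```

Attributes that exist on the wrapped object are read from and written to it, while anything else stays on the wrapper. Note that this pattern relies on self.delegate being set first in __init__.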

Training

The Python files used to train the tokenizers are available in the train/ directory. You can also use these files to train your own tokenizers on a custom text corpus.

These tokenizers were trained on two datasets:

1. The Nepali Subset of the OSCAR dataset

You can download it using the following code:

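import os

import datasets
from tqdm.auto import tqdm

dataset = datasets.load_dataset(
    'oscar', 'unshuffled_deduplicated_ne',
    split='train'
)

# create the output directory; do not fail if it already exists
os.makedirs('data', exist_ok=True)

batch = []
counter = 0

for sample in tqdm(dataset):
    # keep each document on a single line
    sample = sample['text'].replace('\n', ' ')
    batch.append(sample)

    # write every 10,000 documents to a new file
    if len(batch) == 10_000:
        with open(f'data/ne_{counter}.txt', 'w', encoding='utf-8') as f:
            f.write('\n'.join(batch))
        batch = []
        counter += 1

# write the remaining documents that did not fill a whole batch
if batch:
    with open(f'data/ne_{counter}.txt', 'w', encoding='utf-8') as f:
        f.write('\n'.join(batch))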

2. A Large Scale Nepali Text Corpus by Rabindra Lamsal (2020)

To download the dataset, follow the instructions provided in this link: A Large Scale Nepali Text Corpus.
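
For either corpus, it may be convenient to split the text into fixed-size files before training. The sketch below is one way to do that with the standard library only. The helper write_batches and its parameters are hypothetical names, and the default batch size of 10,000 simply mirrors the OSCAR script above. Unlike a loop that writes only full batches, it also writes the final partial batch.

```python
import os
import tempfile
from itertools import islice


def write_batches(lines, out_dir, batch_size=10_000, prefix="ne"):
    """Write an iterable of documents to numbered files, one document per line.

    Hypothetical helper; the name and parameters are not part of the package.
    Returns the list of file paths written.
    """
    os.makedirs(out_dir, exist_ok=True)
    iterator = iter(lines)
    paths = []
    while True:
        # take up to batch_size documents; the last batch may be shorter
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        path = os.path.join(out_dir, f"{prefix}_{len(paths)}.txt")
        with open(path, "w", encoding="utf-8") as f:
            # flatten internal newlines so each document stays on one line
            f.write("\n".join(doc.replace("\n", " ") for doc in batch))
        paths.append(path)
    return paths


# small demonstration in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    files = write_batches(["a\nb", "c", "d"], tmp, batch_size=2)
    print([os.path.basename(p) for p in files])
```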

License

This package is licensed under the Apache 2.0 License, which is consistent with the license used by HuggingFace's tokenizers library. Please see the LICENSE file for more details.
