
Nepali Language Processing Toolkit


NPLTK

Nepali Language Processing Toolkit (NPLTK) is a lightweight and modular NLP library designed specifically for the Nepali language. It provides tools for tokenization, normalization, lemmatization, stop-word removal, POS tagging, and Named Entity Recognition (NER).


Why NPLTK?

Most NLP libraries are designed primarily for English and do not handle Nepali morphology, suffixes, and tokenization well.

NPLTK is built specifically for Nepali and provides:

  • Hybrid tokenizer combining rule-based logic and SentencePiece
  • Hybrid lemmatization using dictionary + rules
  • Lightweight POS and NER models
  • Fully self-contained package with bundled resources
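As a concrete illustration of the gap: English-oriented sentence splitters key on `.`, `?`, and `!` and miss the Devanagari danda (।) that ends Nepali sentences. A minimal sketch of danda-aware splitting (purely illustrative, not NPLTK's implementation):

```python
import re

def naive_sentence_split(text: str) -> list[str]:
    # Split after the danda (।), the Nepali sentence terminator
    # that English-centric splitters typically ignore.
    parts = re.split(r"(?<=।)\s*", text.strip())
    return [p for p in parts if p]

print(naive_sentence_split("नेपाल सुन्दर देश हो। यहाँ हिमाल छन्।"))
# → ['नेपाल सुन्दर देश हो।', 'यहाँ हिमाल छन्।']
```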

Installation

pip install npltk

For testing from TestPyPI (dependencies may need to resolve from the main index, hence `--extra-index-url`):

pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ npltk

Minimal Example

from npltk import create_tokenizer

tokens = create_tokenizer().tokenize("नेपाल सुन्दर देश हो।")
print([t.text for t in tokens])

Tokenizer

NPLTK provides a tokenizer factory through create_tokenizer(...).

create_tokenizer(
    mode="hybrid",
    split_into_sentences=True,
    keep_punct=True,
    model_path=None,
    subword=True,
    preprocess=None,
    fallback_to_rule=True,
)

Main arguments

  • mode: "hybrid" or "rule"

    • "hybrid" uses rule-based tokenization together with SentencePiece
    • "rule" uses only rule-based tokenization
  • split_into_sentences: whether sentence splitting is enabled internally

  • keep_punct: whether punctuation tokens are kept in output

  • model_path: optional custom SentencePiece model path

  • subword: enables SentencePiece-based subword support in hybrid mode

  • preprocess: optional preprocessing function applied before tokenization

  • fallback_to_rule: if hybrid loading fails, automatically use rule mode
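To make the rule-based mode concrete, here is an illustrative sketch of the idea behind it: punctuation-aware splitting that keeps the danda (।) as its own token, with a `keep_punct`-style switch. NPLTK's actual rules are more elaborate; this is only a conceptual sketch:

```python
import re

# Illustrative rule-based tokenizer: word runs vs. punctuation (including ।).
TOKEN_RE = re.compile(r"[^\s।,!?.]+|[।,!?.]")

def rule_tokenize(text: str, keep_punct: bool = True) -> list[str]:
    tokens = TOKEN_RE.findall(text)
    if not keep_punct:
        tokens = [t for t in tokens if t not in "।,!?."]
    return tokens

print(rule_tokenize("नेपाल सुन्दर देश हो।"))
# → ['नेपाल', 'सुन्दर', 'देश', 'हो', '।']
```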

Tokenizer Example

from npltk import create_tokenizer

tokenizer = create_tokenizer(
    mode="hybrid",
    keep_punct=True,
    fallback_to_rule=True,
)

tokens = tokenizer.tokenize("नेपाल एक सुन्दर देश हो।")
print([t.text for t in tokens])

Sentence Tokenization Example

from npltk import create_tokenizer

tokenizer = create_tokenizer(mode="hybrid")
sentences = tokenizer.tokenize_sentences("नेपाल सुन्दर देश हो। यहाँ हिमाल छन्।")

for sent in sentences:
    print([t.text for t in sent.tokens])

Detokenization Example

from npltk import create_tokenizer

tokenizer = create_tokenizer(mode="hybrid")
tokens = tokenizer.tokenize("नेपाल सुन्दर देश हो।")
text = tokenizer.detokenize(tokens)

print(text)
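Conceptually, detokenization joins tokens with spaces and reattaches punctuation to the preceding word. A naive sketch over plain strings (NPLTK's `detokenize` operates on its own Token objects):

```python
def naive_detokenize(tokens: list[str]) -> str:
    out = []
    for tok in tokens:
        if tok in "।,!?." and out:
            out[-1] += tok  # attach punctuation to the previous word
        else:
            out.append(tok)
    return " ".join(out)

print(naive_detokenize(["नेपाल", "सुन्दर", "देश", "हो", "।"]))
# → 'नेपाल सुन्दर देश हो।'
```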

Separate Examples for Each Component

1. Normalizer

from npltk.normalizer import build_normalizer

result = build_normalizer().normalize("  नेपाल।।  ")
print(result.text)
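The kind of cleanup a Nepali normalizer performs can be sketched as: canonical Unicode (NFC) normalization, collapsing repeated dandas, and trimming/collapsing whitespace. This is an illustration of the idea, not `build_normalizer`'s actual rule set:

```python
import re
import unicodedata

def naive_normalize(text: str) -> str:
    text = unicodedata.normalize("NFC", text)  # canonical Unicode form
    text = re.sub(r"।{2,}", "।", text)         # collapse repeated dandas
    text = re.sub(r"\s+", " ", text).strip()   # collapse and trim whitespace
    return text

print(naive_normalize("  नेपाल।।  "))
# → 'नेपाल।'
```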

2. Tokenizer

from npltk import create_tokenizer

tokenizer = create_tokenizer(mode="hybrid")
tokens = tokenizer.tokenize("नेपालको प्रधानमन्त्री काठमाडौं गए।")
print([t.text for t in tokens])

3. Lemmatizer

from npltk import Lemmatizer

lemmatizer = Lemmatizer()
print(lemmatizer.lemmatize("गयो"))
print(lemmatizer.lemmatize("घरहरूमा"))
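The dictionary + rules idea can be sketched as: look the word up in a lemma dictionary first (for irregular forms), then fall back to stripping common suffixes such as case markers (मा, को) and the plural हरू. The tiny dictionary and suffix list below are illustrative stand-ins, not NPLTK's bundled resources:

```python
# Illustrative hybrid lemmatizer: dictionary lookup, then suffix stripping.
LEMMA_DICT = {"गयो": "जानु"}               # irregular forms need a dictionary
SUFFIXES = ["हरूमा", "हरू", "मा", "को"]    # longest-first suffix rules

def naive_lemmatize(word: str) -> str:
    if word in LEMMA_DICT:
        return LEMMA_DICT[word]
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word

print(naive_lemmatize("गयो"))      # dictionary hit
print(naive_lemmatize("घरहरूमा"))  # suffix rule strips हरूमा
```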

4. Stop Word Removal

from npltk import create_tokenizer
from npltk.stop_word.remover import StopWordRemover

tokens = create_tokenizer().tokenize("नेपाल सुन्दर देश हो र यहाँ हिमाल छन् ।")
filtered, info = StopWordRemover().remove(tokens)

print([t.text for t in filtered])
print(info)
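Stop-word removal itself is a set-membership filter. A sketch over plain strings with a tiny illustrative stop list (NPLTK ships its own curated list, works on Token objects, and the shape of its `info` result may differ):

```python
# Illustrative stop-word removal; the stop list here is a made-up sample.
STOP_WORDS = {"हो", "र", "छन्", "छ", "यहाँ"}

def naive_remove_stopwords(tokens: list[str]) -> tuple[list[str], dict]:
    kept = [t for t in tokens if t not in STOP_WORDS]
    info = {"removed": len(tokens) - len(kept), "total": len(tokens)}
    return kept, info

kept, info = naive_remove_stopwords(
    ["नेपाल", "सुन्दर", "देश", "हो", "र", "यहाँ", "हिमाल", "छन्"]
)
print(kept)  # → ['नेपाल', 'सुन्दर', 'देश', 'हिमाल']
print(info)  # → {'removed': 4, 'total': 8}
```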

5. POS Tagger

from npltk import create_tokenizer, POSTagger

tokens = [t.text for t in create_tokenizer().tokenize("नेपालको प्रधानमन्त्री काठमाडौं गए।")]
tagger = POSTagger()

print(tagger.tag_with_tokens(tokens))

6. NER Tagger

from npltk import NERTagger

tagger = NERTagger(tokenizer_mode="hybrid")
print(tagger.extract("शेरबहादुर देउवा काठमाडौं पुगे।"))
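The entity spans an NER tagger returns are typically produced by decoding token-level BIO tags into groups. That decoding step can be sketched independently of any model; the `B-PER`/`B-LOC` labels below are assumed for illustration and may not match NPLTK's tag set:

```python
def bio_to_entities(tokens: list[str], tags: list[str]) -> list[tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, label) spans."""
    entities, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close the previous entity
                entities.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)  # continue the open entity
        else:
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities

tokens = ["शेरबहादुर", "देउवा", "काठमाडौं", "पुगे", "।"]
tags = ["B-PER", "I-PER", "B-LOC", "O", "O"]
print(bio_to_entities(tokens, tags))
# → [('शेरबहादुर देउवा', 'PER'), ('काठमाडौं', 'LOC')]
```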

Full Workflow Pipeline Example

from pprint import pprint

from npltk import create_tokenizer, Lemmatizer, POSTagger, NERTagger
from npltk.normalizer import build_normalizer
from npltk.stop_word.remover import StopWordRemover

text = "  शेरबहादुर देउवा काठमाडौं पुगे र नेपालको बारेमा बोले।  "

# 1. Normalize
normalizer = build_normalizer()
norm_result = normalizer.normalize(text)
normalized_text = norm_result.text
print("Normalized:", normalized_text)

# 2. Tokenize
tokenizer = create_tokenizer(mode="hybrid", fallback_to_rule=True)
tokens = tokenizer.tokenize(normalized_text)
token_texts = [t.text for t in tokens]
print("Tokens:", token_texts)

# 3. Remove stop words
filtered_tokens, info = StopWordRemover().remove(tokens)
filtered_texts = [t.text for t in filtered_tokens]
print("Filtered Tokens:", filtered_texts)
print("Stopword Info:", info)

# 4. Lemmatize
lemmatizer = Lemmatizer()
lemmas = [lemmatizer.lemmatize(token) for token in filtered_texts]
print("Lemmas:", lemmas)

# 5. POS tagging
pos_tagger = POSTagger()
pos_pairs = pos_tagger.tag_with_tokens(token_texts)
print("POS Tags:", pos_pairs)

# 6. NER
ner_tagger = NERTagger(tokenizer_mode="hybrid")
ner_result = ner_tagger.predict(normalized_text)

print("NER Token-Tag Pairs:")
for token, tag in zip(ner_result["tokens"], ner_result["tags"]):
    print(f"{token:12} {tag}")

print("Entities:")
pprint(ner_result["entities"], width=100)

Features

  • Nepali normalizer
  • Hybrid tokenizer (rule-based + SentencePiece)
  • Lemmatizer
  • Stop-word removal
  • POS tagging
  • Named Entity Recognition (NER)

Models

NPLTK includes bundled trained models for:

  • POS Tagger
  • NER Tagger

These work out of the box after installation.


Suggested Workflow

  1. Normalize text
  2. Tokenize text
  3. Optionally remove stop words
  4. Lemmatize tokens
  5. Run POS tagging
  6. Run NER extraction

Contributors

  • Anurag Sharma
  • Anita Budha Magar
  • Apeksha Parajuli
  • Apeksha Katwal

Supervisor:

  • Pukar Karki

Institute of Engineering, Purwanchal Campus


License

MIT License
