Nepali Language Processing Toolkit
NPLTK
Nepali Language Processing Toolkit (NPLTK) is a lightweight and modular NLP library designed specifically for the Nepali language. It provides tools for tokenization, normalization, lemmatization, stop-word removal, POS tagging, and Named Entity Recognition (NER).
Why NPLTK?
Most NLP libraries are designed primarily for English and do not handle Nepali morphology, suffixes, and tokenization well.
NPLTK is built specifically for Nepali and provides:
- Hybrid tokenizer combining rule-based logic and SentencePiece
- Hybrid lemmatization using dictionary + rules
- Lightweight POS and NER models
- Fully self-contained package with bundled resources
Installation
pip install npltk
For testing from TestPyPI:
pip install -i https://test.pypi.org/simple/ npltk
Minimal Example
from npltk import create_tokenizer
tokens = create_tokenizer().tokenize("नेपाल सुन्दर देश हो।")
print([t.text for t in tokens])
Tokenizer
NPLTK provides a tokenizer factory through create_tokenizer(...).
create_tokenizer(
    mode="hybrid",
    split_into_sentences=True,
    keep_punct=True,
    model_path=None,
    subword=True,
    preprocess=None,
    fallback_to_rule=True,
)
Main arguments
- mode: "hybrid" or "rule". "hybrid" uses rule-based tokenization together with SentencePiece; "rule" uses only rule-based tokenization (see the example after this list)
- split_into_sentences: whether sentence splitting is enabled internally
- keep_punct: whether punctuation tokens are kept in output
- model_path: optional custom SentencePiece model path
- subword: enables SentencePiece-based subword support in hybrid mode
- preprocess: optional preprocessing function applied before tokenization
- fallback_to_rule: if hybrid loading fails, automatically use rule mode
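For example, a minimal sketch of a rule-only tokenizer that drops punctuation, using only the arguments documented above:
from npltk import create_tokenizer
# Rule-only tokenization; punctuation tokens are dropped from the output
tokenizer = create_tokenizer(mode="rule", keep_punct=False)
tokens = tokenizer.tokenize("नेपाल सुन्दर देश हो।")
print([t.text for t in tokens])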
Tokenizer Example
from npltk import create_tokenizer
tokenizer = create_tokenizer(
    mode="hybrid",
    keep_punct=True,
    fallback_to_rule=True,
)
tokens = tokenizer.tokenize("नेपाल एक सुन्दर देश हो।")
print([t.text for t in tokens])
Sentence Tokenization Example
from npltk import create_tokenizer
tokenizer = create_tokenizer(mode="hybrid")
sentences = tokenizer.tokenize_sentences("नेपाल सुन्दर देश हो। यहाँ हिमाल छन्।")
for sent in sentences:
    print([t.text for t in sent.tokens])
Detokenization Example
from npltk import create_tokenizer
tokenizer = create_tokenizer(mode="hybrid")
tokens = tokenizer.tokenize("नेपाल सुन्दर देश हो।")
text = tokenizer.detokenize(tokens)
print(text)
Separate Examples for Each Component
1. Normalizer
from npltk.normalizer import build_normalizer
result = build_normalizer().normalize(" नेपाल।। ")
print(result.text)
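Because create_tokenizer accepts a preprocess callable, the normalizer can be wired directly into tokenization. A minimal sketch, assuming preprocess receives and returns a plain string:
from npltk import create_tokenizer
from npltk.normalizer import build_normalizer
normalizer = build_normalizer()
# Assumed hookup: run normalization as the tokenizer's preprocessing step
tokenizer = create_tokenizer(preprocess=lambda text: normalizer.normalize(text).text)
print([t.text for t in tokenizer.tokenize(" नेपाल सुन्दर देश हो। ")])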
2. Tokenizer
from npltk import create_tokenizer
tokenizer = create_tokenizer(mode="hybrid")
tokens = tokenizer.tokenize("नेपालको प्रधानमन्त्री काठमाडौं गए।")
print([t.text for t in tokens])
3. Lemmatizer
from npltk import Lemmatizer
lemmatizer = Lemmatizer()
print(lemmatizer.lemmatize("गयो"))
print(lemmatizer.lemmatize("घरहरूमा"))
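Lemmatization composes naturally with the tokenizer. A minimal sketch that lemmatizes every token of a sentence (token texts are passed as plain strings, as in the pipeline example below):
from npltk import create_tokenizer, Lemmatizer
lemmatizer = Lemmatizer()
tokens = create_tokenizer().tokenize("घरहरूमा मानिसहरू बस्छन्।")
# lemmatize() takes a surface form as a plain string
print([lemmatizer.lemmatize(t.text) for t in tokens])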
4. Stop Word Removal
from npltk import create_tokenizer
from npltk.stop_word.remover import StopWordRemover
tokens = create_tokenizer().tokenize("नेपाल सुन्दर देश हो र यहाँ हिमाल छन् ।")
filtered, info = StopWordRemover().remove(tokens)
print([t.text for t in filtered])
print(info)
5. POS Tagger
from npltk import create_tokenizer, POSTagger
tokens = [t.text for t in create_tokenizer().tokenize("नेपालको प्रधानमन्त्री काठमाडौं गए।")]
tagger = POSTagger()
print(tagger.tag_with_tokens(tokens))
6. NER Tagger
from npltk import NERTagger
tagger = NERTagger(tokenizer_mode="hybrid")
print(tagger.extract("शेरबहादुर देउवा काठमाडौं पुगे।"))
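Here extract() returns the recognized entities directly; the predict() method used in the pipeline below additionally returns per-token tags alongside the extracted entities.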
Full Workflow Pipeline Example
from pprint import pprint
from npltk import create_tokenizer, Lemmatizer, POSTagger, NERTagger
from npltk.normalizer import build_normalizer
from npltk.stop_word.remover import StopWordRemover
text = " शेरबहादुर देउवा काठमाडौं पुगे र नेपालको बारेमा बोले। "
# 1. Normalize
normalizer = build_normalizer()
norm_result = normalizer.normalize(text)
normalized_text = norm_result.text
print("Normalized:", normalized_text)
# 2. Tokenize
tokenizer = create_tokenizer(mode="hybrid", fallback_to_rule=True)
tokens = tokenizer.tokenize(normalized_text)
token_texts = [t.text for t in tokens]
print("Tokens:", token_texts)
# 3. Remove stop words
filtered_tokens, info = StopWordRemover().remove(tokens)
filtered_texts = [t.text for t in filtered_tokens]
print("Filtered Tokens:", filtered_texts)
print("Stopword Info:", info)
# 4. Lemmatize
lemmatizer = Lemmatizer()
lemmas = [lemmatizer.lemmatize(token) for token in filtered_texts]
print("Lemmas:", lemmas)
# 5. POS tagging
pos_tagger = POSTagger()
pos_pairs = pos_tagger.tag_with_tokens(token_texts)
print("POS Tags:", pos_pairs)
# 6. NER
ner_tagger = NERTagger(tokenizer_mode="hybrid")
ner_result = ner_tagger.predict(normalized_text)
print("NER Token-Tag Pairs:")
for token, tag in zip(ner_result["tokens"], ner_result["tags"]):
    print(f"{token:12} {tag}")
print("Entities:")
pprint(ner_result["entities"], width=100)
Features
- Nepali normalizer
- Hybrid tokenizer (rule-based + SentencePiece)
- Lemmatizer
- Stop-word removal
- POS tagging
- Named Entity Recognition (NER)
Models
NPLTK includes bundled trained models for:
- POS Tagger
- NER Tagger
These work out of the box after installation.
Suggested Workflow
- Normalize text
- Tokenize text
- Optionally remove stop words
- Lemmatize tokens
- Run POS tagging
- Run NER extraction
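As a minimal sketch, these steps can be wrapped in a single helper; the analyze name and the returned dictionary shape are illustrative, not part of the library:
from npltk import create_tokenizer, Lemmatizer, POSTagger, NERTagger
from npltk.normalizer import build_normalizer
from npltk.stop_word.remover import StopWordRemover

def analyze(text):
    # Illustrative helper chaining the suggested workflow steps
    normalized = build_normalizer().normalize(text).text
    tokens = create_tokenizer(mode="hybrid", fallback_to_rule=True).tokenize(normalized)
    filtered, _info = StopWordRemover().remove(tokens)
    lemmatizer = Lemmatizer()
    return {
        "tokens": [t.text for t in tokens],
        "lemmas": [lemmatizer.lemmatize(t.text) for t in filtered],
        "pos": POSTagger().tag_with_tokens([t.text for t in tokens]),
        "entities": NERTagger(tokenizer_mode="hybrid").extract(normalized),
    }

print(analyze("शेरबहादुर देउवा काठमाडौं पुगे।"))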
Contributors
- Anurag Sharma
- Anita Budha Magar
- Apeksha Parajuli
- Apeksha Katwal
Supervisor:
- Pukar Karki
Institute of Engineering, Purwanchal Campus
License
MIT License
File details
Details for the file npltk-0.3.2.tar.gz.
File metadata
- Download URL: npltk-0.3.2.tar.gz
- Upload date:
- Size: 51.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9d1d8fc8b509a50f941280b044a875b8b77cec83d7b62ba11accacbd2fcc1d42 |
| MD5 | b8ccd1e63000f6ae08f02b8ed113c305 |
| BLAKE2b-256 | 52b52e61317aa47b7082b9621d033db76bb446b8d15a28a6cf12145384544b2c |
File details
Details for the file npltk-0.3.2-py3-none-any.whl.
File metadata
- Download URL: npltk-0.3.2-py3-none-any.whl
- Upload date:
- Size: 51.1 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1167d9bcf7d7cf440c4088e40eceaa486ff38a2babde9a91f74c6d2ad0c39429 |
| MD5 | a0a34735480749146f3350c09353c13c |
| BLAKE2b-256 | f2ebe1497d2f11c244ff4240245c0415fa9242d3a8821be8bcc1e7ced97af86d |