Nepali Language Processing Toolkit
NPLTK
Nepali Language Processing Toolkit (NPLTK) is a lightweight and modular NLP library designed specifically for the Nepali language. It provides tools for tokenization, normalization, lemmatization, stop-word removal, POS tagging, and Named Entity Recognition (NER).
Why NPLTK?
Most NLP libraries are designed primarily for English and do not handle Nepali morphology, suffixes, and tokenization well.
NPLTK is built specifically for Nepali and provides:
- Hybrid tokenizer combining rule-based logic and SentencePiece
- Hybrid lemmatization using dictionary + rules
- Lightweight POS and NER models
- Fully self-contained package with bundled resources
Installation
pip install npltk
For testing from TestPyPI:
pip install -i https://test.pypi.org/simple/ npltk
Minimal Example
from npltk import create_tokenizer
tokens = create_tokenizer().tokenize("नेपाल सुन्दर देश हो।")
print([t.text for t in tokens])
Tokenizer
NPLTK provides a tokenizer factory through create_tokenizer(...).
create_tokenizer(
    mode="hybrid",
    split_into_sentences=True,
    keep_punct=True,
    model_path=None,
    subword=True,
    preprocess=None,
    fallback_to_rule=True,
)
Main arguments
- mode: "hybrid" or "rule". "hybrid" uses rule-based tokenization together with SentencePiece; "rule" uses only rule-based tokenization (see the example after this list)
- split_into_sentences: whether sentence splitting is enabled internally
- keep_punct: whether punctuation tokens are kept in output
- model_path: optional custom SentencePiece model path
- subword: enables SentencePiece-based subword support in hybrid mode
- preprocess: optional preprocessing function applied before tokenization
- fallback_to_rule: if hybrid loading fails, automatically use rule mode
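For example, a minimal sketch of a rule-only tokenizer that drops punctuation, using only the arguments documented above:
from npltk import create_tokenizer
# Rule-only tokenization; punctuation tokens are dropped from the output
tokenizer = create_tokenizer(mode="rule", keep_punct=False)
tokens = tokenizer.tokenize("नेपाल सुन्दर देश हो।")
print([t.text for t in tokens])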
Tokenizer Example
from npltk import create_tokenizer
tokenizer = create_tokenizer(
    mode="hybrid",
    keep_punct=True,
    fallback_to_rule=True,
)
tokens = tokenizer.tokenize("नेपाल एक सुन्दर देश हो।")
print([t.text for t in tokens])
Sentence Tokenization Example
from npltk import create_tokenizer
tokenizer = create_tokenizer(mode="hybrid")
sentences = tokenizer.tokenize_sentences("नेपाल सुन्दर देश हो। यहाँ हिमाल छन्।")
for sent in sentences:
    print([t.text for t in sent.tokens])
Detokenization Example
from npltk import create_tokenizer
tokenizer = create_tokenizer(mode="hybrid")
tokens = tokenizer.tokenize("नेपाल सुन्दर देश हो।")
text = tokenizer.detokenize(tokens)
print(text)
Separate Examples for Each Component
1. Normalizer
from npltk.normalizer import build_normalizer
result = build_normalizer().normalize(" नेपाल।। ")
print(result.text)
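Because create_tokenizer accepts a preprocess callable, the normalizer can be wired directly into tokenization. A minimal sketch, assuming preprocess receives and returns a plain string:
from npltk import create_tokenizer
from npltk.normalizer import build_normalizer
normalizer = build_normalizer()
# Assumed hookup: run normalization as the tokenizer's preprocessing step
tokenizer = create_tokenizer(preprocess=lambda text: normalizer.normalize(text).text)
print([t.text for t in tokenizer.tokenize(" नेपाल सुन्दर देश हो। ")])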
2. Tokenizer
from npltk import create_tokenizer
tokenizer = create_tokenizer(mode="hybrid")
tokens = tokenizer.tokenize("नेपालको प्रधानमन्त्री काठमाडौं गए।")
print([t.text for t in tokens])
3. Lemmatizer
from npltk import Lemmatizer
lemmatizer = Lemmatizer()
print(lemmatizer.lemmatize("गयो"))
print(lemmatizer.lemmatize("घरहरूमा"))
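Lemmatization composes naturally with the tokenizer. A minimal sketch that lemmatizes every token of a sentence (token texts are passed as plain strings, as in the pipeline example below):
from npltk import create_tokenizer, Lemmatizer
lemmatizer = Lemmatizer()
tokens = create_tokenizer().tokenize("घरहरूमा मानिसहरू बस्छन्।")
# lemmatize() takes a surface form as a plain string
print([lemmatizer.lemmatize(t.text) for t in tokens])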
4. Stop Word Removal
from npltk import create_tokenizer
from npltk.stop_word.remover import StopWordRemover
tokens = create_tokenizer().tokenize("नेपाल सुन्दर देश हो र यहाँ हिमाल छन् ।")
filtered, info = StopWordRemover().remove(tokens)
print([t.text for t in filtered])
print(info)
5. POS Tagger
from npltk import create_tokenizer, POSTagger
tokens = [t.text for t in create_tokenizer().tokenize("नेपालको प्रधानमन्त्री काठमाडौं गए।")]
tagger = POSTagger()
print(tagger.tag_with_tokens(tokens))
6. NER Tagger
from npltk import NERTagger
tagger = NERTagger(tokenizer_mode="hybrid")
print(tagger.extract("शेरबहादुर देउवा काठमाडौं पुगे।"))
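Here extract() returns the recognized entities directly; the predict() method used in the pipeline below additionally returns per-token tags alongside the extracted entities.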
Full Workflow Pipeline Example
from pprint import pprint
from npltk import create_tokenizer, Lemmatizer, POSTagger, NERTagger
from npltk.normalizer import build_normalizer
from npltk.stop_word.remover import StopWordRemover
text = " शेरबहादुर देउवा काठमाडौं पुगे र नेपालको बारेमा बोले। "
# 1. Normalize
normalizer = build_normalizer()
norm_result = normalizer.normalize(text)
normalized_text = norm_result.text
print("Normalized:", normalized_text)
# 2. Tokenize
tokenizer = create_tokenizer(mode="hybrid", fallback_to_rule=True)
tokens = tokenizer.tokenize(normalized_text)
token_texts = [t.text for t in tokens]
print("Tokens:", token_texts)
# 3. Remove stop words
filtered_tokens, info = StopWordRemover().remove(tokens)
filtered_texts = [t.text for t in filtered_tokens]
print("Filtered Tokens:", filtered_texts)
print("Stopword Info:", info)
# 4. Lemmatize
lemmatizer = Lemmatizer()
lemmas = [lemmatizer.lemmatize(token) for token in filtered_texts]
print("Lemmas:", lemmas)
# 5. POS tagging
pos_tagger = POSTagger()
pos_pairs = pos_tagger.tag_with_tokens(token_texts)
print("POS Tags:", pos_pairs)
# 6. NER
ner_tagger = NERTagger(tokenizer_mode="hybrid")
ner_result = ner_tagger.predict(normalized_text)
print("NER Token-Tag Pairs:")
for token, tag in zip(ner_result["tokens"], ner_result["tags"]):
    print(f"{token:12} {tag}")
print("Entities:")
pprint(ner_result["entities"], width=100)
Features
- Nepali normalizer
- Hybrid tokenizer (rule-based + SentencePiece)
- Lemmatizer
- Stop-word removal
- POS tagging
- Named Entity Recognition (NER)
Models
NPLTK includes bundled trained models for:
- POS Tagger
- NER Tagger
These work out of the box after installation.
Suggested Workflow
- Normalize text
- Tokenize text
- Optionally remove stop words
- Lemmatize tokens
- Run POS tagging
- Run NER extraction
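As a minimal sketch, these steps can be wrapped in a single helper; the analyze name and the returned dictionary shape are illustrative, not part of the library:
from npltk import create_tokenizer, Lemmatizer, POSTagger, NERTagger
from npltk.normalizer import build_normalizer
from npltk.stop_word.remover import StopWordRemover

def analyze(text):
    # Illustrative helper chaining the suggested workflow steps
    normalized = build_normalizer().normalize(text).text
    tokens = create_tokenizer(mode="hybrid", fallback_to_rule=True).tokenize(normalized)
    filtered, _info = StopWordRemover().remove(tokens)
    lemmatizer = Lemmatizer()
    return {
        "tokens": [t.text for t in tokens],
        "lemmas": [lemmatizer.lemmatize(t.text) for t in filtered],
        "pos": POSTagger().tag_with_tokens([t.text for t in tokens]),
        "entities": NERTagger(tokenizer_mode="hybrid").extract(normalized),
    }

print(analyze("शेरबहादुर देउवा काठमाडौं पुगे।"))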
Contributors
- Anurag Sharma
- Anita Budha Magar
- Apeksha Parajuli
- Apeksha Katwal
Supervisor:
- Pukar Karki
Institute of Engineering, Purwanchal Campus
License
MIT License
File details
Details for the file npltk-0.3.2.tar.gz.
File metadata
- Download URL: npltk-0.3.2.tar.gz
- Upload date:
- Size: 51.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9d1d8fc8b509a50f941280b044a875b8b77cec83d7b62ba11accacbd2fcc1d42 |
| MD5 | b8ccd1e63000f6ae08f02b8ed113c305 |
| BLAKE2b-256 | 52b52e61317aa47b7082b9621d033db76bb446b8d15a28a6cf12145384544b2c |
File details
Details for the file npltk-0.3.2-py3-none-any.whl.
File metadata
- Download URL: npltk-0.3.2-py3-none-any.whl
- Upload date:
- Size: 51.1 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1167d9bcf7d7cf440c4088e40eceaa486ff38a2babde9a91f74c6d2ad0c39429 |
| MD5 | a0a34735480749146f3350c09353c13c |
| BLAKE2b-256 | f2ebe1497d2f11c244ff4240245c0415fa9242d3a8821be8bcc1e7ced97af86d |