Pashto Natural Language Processing Toolkit
NLPashto – NLP Toolkit for Pashto
NLPashto is a Python suite for Pashto Natural Language Processing, initiated at Shanghai Jiao Tong University.
Prerequisites
To use NLPashto you will need:
- Python 3.8+
Installing NLPashto
NLPashto can be installed from PyPI with the following command:
pip install nlpashto
Downloading Models
Call the download() function and pass the model name as its argument:
import nlpashto
nlpashto.download('space_correct')
Valid model names: 'space_correct', 'pos_tag', 'word_segment', 'pold', 'snd'
If no model name is specified, all available models are downloaded.
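A typo in a model name is easier to catch before any download starts. The sketch below checks names against the list of valid model names given above and then calls a download function once per name. `check_model_names` and `download_models` are hypothetical helpers, not part of NLPashto; the download function is passed in as an argument, so the only thing assumed about the library is the `nlpashto.download(name)` call shown above.

```python
# Model names copied from the "Valid model names" line above.
VALID_MODELS = ('space_correct', 'pos_tag', 'word_segment', 'pold', 'snd')


def check_model_names(names):
    """Return the names as a list, or raise ValueError on an unknown one."""
    unknown = [name for name in names if name not in VALID_MODELS]
    if unknown:
        raise ValueError(f"Unknown model name(s): {unknown}")
    return list(names)


def download_models(names, download):
    """Validate all names first, then call download(name) for each."""
    for name in check_model_names(names):
        download(name)

# Usage, assuming nlpashto is installed:
#   import nlpashto
#   download_models(['word_segment', 'pos_tag'], nlpashto.download)
```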
Basic Usage
Space Correction
The space correction module corrects space-omission and space-insertion errors: it removes extra spaces from the text and inserts spaces where they are missing. It is a beta version and is recommended only when the input text is extremely noisy.
from nlpashto import space_correct
noisy_text = 'ه م د ا ر ن ګ ه ت ی ر ه ش پ ه ا وورځپههیوادکېدکروناویروسلهامله۵تنهمړهشوي'
corrected = space_correct(noisy_text)
print(corrected)
Output:: همدارنګه تیره شپه او ورځ په هیواد کې د کرونا ویروس له امله ۵ تنه مړه شوي
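Because space correction is intended for extremely noisy input, it can help to gate the call so that clean text passes through untouched. The following is a rough sketch of such a gate and is not part of NLPashto: `looks_noisy`, `maybe_correct`, and both thresholds are illustrative assumptions that would need tuning on real data.

```python
def looks_noisy(text, max_token_len=15, single_char_ratio=0.5):
    """Heuristic: mostly one-character tokens, or at least one very long token."""
    tokens = text.split()
    if not tokens:
        return False
    singles = sum(1 for tok in tokens if len(tok) == 1)
    has_long_token = any(len(tok) > max_token_len for tok in tokens)
    return has_long_token or singles / len(tokens) > single_char_ratio


def maybe_correct(text, corrector):
    """Apply corrector(text) only when the text looks noisy."""
    return corrector(text) if looks_noisy(text) else text

# Usage, assuming nlpashto and the 'space_correct' model are installed:
#   from nlpashto import space_correct
#   cleaned = maybe_correct(noisy_text, space_correct)
```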
Word Segmentation
from nlpashto import word_segment
text = 'همدارنګه تیره شپه او ورځ په هیواد کې د کرونا ویروس له امله ۵ تنه مړه شوي'
segmented_text = word_segment(text)
print(segmented_text)
Output:: ['همدارنګه', 'تیره', 'شپه', 'او', 'ورځ', 'په', 'هیواد', 'کې', 'د', 'کرونا ویروس', 'له امله', '۵', 'تنه', 'مړه', 'شوي']
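Note that the segmenter can return multi-word tokens such as 'کرونا ویروس', so its result differs from a plain split on spaces. The snippet below uses the output shown above as literal data (it does not call NLPashto) and picks out the tokens that contain a space.

```python
# Output of word_segment(text), copied from the example above.
segmented_text = ['همدارنګه', 'تیره', 'شپه', 'او', 'ورځ', 'په', 'هیواد', 'کې',
                  'د', 'کرونا ویروس', 'له امله', '۵', 'تنه', 'مړه', 'شوي']

# Tokens that the segmenter kept together across a space.
multiword = [tok for tok in segmented_text if ' ' in tok]
print(multiword)

# Compare the number of segments with the number of whitespace-separated words.
whitespace_words = ' '.join(segmented_text).split()
print(len(segmented_text), len(whitespace_words))
```

For this sentence the segmenter yields 15 segments, two of them multi-word, where whitespace splitting would give 17 words.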
Part-of-speech (POS) Tagging
For further details on the POS tagger and the corpus used for training, see our paper, "The Pashto Corpus and Machine Learning Model for Automatic POS Tagging".
from nlpashto import pos_tag, word_segment
text = 'همدارنګه تیره شپه او ورځ په هیواد کې د کرونا ویروس له امله ۵ تنه مړه شوي'
segmented_text = word_segment(text)
tagged = pos_tag(segmented_text)
print(tagged)
Output:: [['همدارنګه', 'RB'], ['تیره', 'JJ'], ['شپه', 'NNF'], ['او', 'CC'], ['ورځ', 'NNM'], ['په', 'IN'], ['هیواد', 'NNM'], ['کې', 'PT'], ['د', 'IN'], ['کرونا ویروس', 'NNP'], ['له امله', 'RB'], ['۵', 'NB'], ['تنه', 'NNS'], ['مړه', 'JJ'], ['شوي', 'VBDX']]
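The tagger returns a list of [token, tag] pairs, which is easy to post-process. The sketch below works on the output shown above as literal data (no NLPashto call), counts how often each tag occurs, and collects noun-like tokens. Treating every tag that starts with 'NN' as a noun is an assumption based on the tag names alone; the paper's tagset definition should be checked before relying on it.

```python
from collections import Counter

# Output of pos_tag(segmented_text), copied from the example above.
tagged = [['همدارنګه', 'RB'], ['تیره', 'JJ'], ['شپه', 'NNF'], ['او', 'CC'],
          ['ورځ', 'NNM'], ['په', 'IN'], ['هیواد', 'NNM'], ['کې', 'PT'],
          ['د', 'IN'], ['کرونا ویروس', 'NNP'], ['له امله', 'RB'], ['۵', 'NB'],
          ['تنه', 'NNS'], ['مړه', 'JJ'], ['شوي', 'VBDX']]

# Frequency of each tag in the sentence.
tag_counts = Counter(tag for _, tag in tagged)

# Assumption: tags beginning with 'NN' (NNF, NNM, NNP, NNS) mark nouns.
nouns = [token for token, tag in tagged if tag.startswith('NN')]

print(tag_counts)
print(nouns)
```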
Offensive Language Detection
A fine-tuned BERT model for detecting toxic language in Pashto text. In the examples below it returns 1 for the offensive input and 0 for the normal one.
from nlpashto import pold
offensive_text = 'مړه یو کس وی صرف ځان شرموی او یو ستا غوندے جاهل وی چې قوم او ملت شرموی'
pold(offensive_text)
Output:: 1
normal_text = 'تاسو رښتیا وایئ خور 🙏'
pold(normal_text)
Output:: 0
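Since the examples above show an output of 1 for the offensive text and 0 for the normal one, the model lends itself to filtering a batch of comments. A minimal sketch follows; `filter_offensive` is a hypothetical helper, and the classifier is passed in as an argument so the code only assumes a callable that maps a string to 0 or 1, as the pold examples suggest.

```python
def filter_offensive(texts, classifier):
    """Split texts into (clean, flagged) using a classifier that returns 0 or 1."""
    clean, flagged = [], []
    for text in texts:
        if classifier(text) == 1:
            flagged.append(text)
        else:
            clean.append(text)
    return clean, flagged

# Usage, assuming nlpashto and the 'pold' model are installed:
#   from nlpashto import pold
#   clean, flagged = filter_offensive([offensive_text, normal_text], pold)
```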
Spammy Names Detection
A Naive Bayes classifier that predicts whether a string of characters is a valid name. It can be used to identify spammy profile names on social media.
from nlpashto import snd
not_a_name = 'مسافر لالی'
snd(not_a_name)
Output:: 0.2
valid_name = 'شاهد افريدی'
snd(valid_name)
Output:: 1.0
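The outputs above (0.2 and 1.0) look like scores between 0 and 1, so a caller will usually need a cut-off to turn them into a decision. The sketch below flags names whose score falls below a threshold. `flag_spammy_names` and the 0.5 default are illustrative assumptions, not values recommended by NLPashto; the scorer is passed in so that only a string-to-score callable like snd is assumed.

```python
def flag_spammy_names(names, scorer, threshold=0.5):
    """Return the names whose validity score is below the threshold."""
    return [name for name in names if scorer(name) < threshold]

# Usage, assuming nlpashto and the 'snd' model are installed:
#   from nlpashto import snd
#   suspicious = flag_spammy_names([not_a_name, valid_name], snd)
```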
Hashes for nlpashto-0.0.12-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 47fbf3432c489e469918843c4023092867013d018377e666f330cacf545d08e0
MD5 | 494214a50f535b258f983a8fd4f9444c
BLAKE2b-256 | f9080ebf2c4385d904d6d159c82c16caa3ee88bf8a7c9f4827ddaafc806e88a1