Pashto Natural Language Processing Toolkit
NLPashto – NLP Toolkit for Pashto
NLPashto is a Python suite for Pashto Natural Language Processing, initiated at Shanghai Jiao Tong University.
Prerequisites
To use NLPashto you will need:
- Python 3.8+
Installing NLPashto
NLPashto can be installed from PyPI with this command:
pip install nlpashto
Downloading Models
Call the download() function and pass the model name as an argument.
import nlpashto
nlpashto.download('space_correct')
Valid model names: 'space_correct', 'pos_tag', 'word_segment', 'pold', 'snd'
If no model name is specified, all available models are downloaded.
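A misspelled model name is easy to miss, so it can help to check names against the list above before downloading. The following is only a minimal sketch: `download_models` and its `downloader` parameter are hypothetical helpers, not part of NLPashto, and the example uses a stand-in downloader so that it runs without network access. In practice one would presumably pass `nlpashto.download` as the `downloader`.

```python
from typing import Callable, Iterable, List

# Model names listed in this README.
VALID_MODELS = ("space_correct", "pos_tag", "word_segment", "pold", "snd")


def download_models(names: Iterable[str], downloader: Callable[[str], object]) -> List[str]:
    """Validate every name, then pass each one to `downloader` (hypothetical helper)."""
    requested = list(names)
    unknown = [name for name in requested if name not in VALID_MODELS]
    if unknown:
        raise ValueError(f"Unknown model name(s): {unknown}")
    for name in requested:
        downloader(name)
    return requested


# Stand-in downloader that only records the requested names.
fetched = []
download_models(["pos_tag", "word_segment"], fetched.append)
print(fetched)
```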
Basic Usage
Space Correction
The space correction module corrects space-omission and space-insertion errors: it removes extra spaces from the text and inserts spaces where they are missing. It is a beta version and is recommended only when the input text is extremely noisy.
from nlpashto import space_correct
noisy_text = 'ه م د ا ر ن ګ ه ت ی ر ه ش پ ه ا وورځپههیوادکېدکروناویروسلهامله۵تنهمړهشوي'
corrected = space_correct(noisy_text)
print(corrected)
Output:: همدارنګه تیره شپه او ورځ په هیواد کې د کرونا ویروس له امله ۵ تنه مړه شوي
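Because space correction is recommended only for extremely noisy input, one may want a rough check before calling it. The sketch below is an assumption, not something NLPashto provides: `single_char_ratio`, `looks_noisy`, and the 0.5 threshold are hypothetical, and the heuristic simply measures how many whitespace-separated tokens are a single character long.

```python
def single_char_ratio(text: str) -> float:
    """Fraction of whitespace-separated tokens that are one character long."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if len(token) == 1) / len(tokens)


def looks_noisy(text: str, threshold: float = 0.5) -> bool:
    """Hypothetical gate: treat text as noisy when most tokens are single characters."""
    return single_char_ratio(text) > threshold


noisy_text = 'ه م د ا ر ن ګ ه ت ی ر ه ش پ ه ا وورځپههیوادکېدکروناویروسلهامله۵تنهمړهشوي'
clean_text = 'همدارنګه تیره شپه او ورځ په هیواد کې د کرونا ویروس له امله ۵ تنه مړه شوي'

print(looks_noisy(noisy_text), looks_noisy(clean_text))
```

A check like this only catches the space-insertion pattern shown in the example; text with purely missing spaces would need a different signal.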
Word Segmentation
from nlpashto import word_segment
text = 'همدارنګه تیره شپه او ورځ په هیواد کې د کرونا ویروس له امله ۵ تنه مړه شوي'
segmented_text = word_segment(text)
print(segmented_text)
Output:: ['همدارنګه', 'تیره', 'شپه', 'او', 'ورځ', 'په', 'هیواد', 'کې', 'د', 'کرونا ویروس', 'له امله', '۵', 'تنه', 'مړه', 'شوي']
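Note that the segmenter output above keeps some multi-word expressions together as a single token. As a small illustration, the following snippet copies that output list verbatim and picks out the tokens that contain an internal space; it does not call NLPashto itself.

```python
# Output of word_segment(text), copied from the example above.
segmented_text = ['همدارنګه', 'تیره', 'شپه', 'او', 'ورځ', 'په', 'هیواد', 'کې', 'د',
                  'کرونا ویروس', 'له امله', '۵', 'تنه', 'مړه', 'شوي']

# Tokens that the segmenter kept together despite containing a space.
multiword_tokens = [token for token in segmented_text if ' ' in token]
print(multiword_tokens)
```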
Part-of-speech (POS) Tagging
For further details about the POS tagger and the corpus used for training, please see our paper, "The Pashto Corpus and Machine Learning Model for Automatic POS Tagging".
from nlpashto import pos_tag, word_segment
text = 'همدارنګه تیره شپه او ورځ په هیواد کې د کرونا ویروس له امله ۵ تنه مړه شوي'
segmented_text = word_segment(text)
tagged = pos_tag(segmented_text)
print(tagged)
Output:: [['همدارنګه', 'RB'], ['تیره', 'JJ'], ['شپه', 'NNF'], ['او', 'CC'], ['ورځ', 'NNM'], ['په', 'IN'], ['هیواد', 'NNM'], ['کې', 'PT'], ['د', 'IN'], ['کرونا ویروس', 'NNP'], ['له امله', 'RB'], ['۵', 'NB'], ['تنه', 'NNS'], ['مړه', 'JJ'], ['شوي', 'VBDX']]
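The tagger returns a list of [token, tag] pairs, which is straightforward to post-process with the standard library. The sketch below copies the output above and counts tag frequencies, then collects tokens whose tag starts with "NN". Treating the "NN" prefix as marking nouns is an assumption based on the tags shown here; the tag set itself is defined in the paper.

```python
from collections import Counter

# Output of pos_tag(segmented_text), copied from the example above.
tagged = [['همدارنګه', 'RB'], ['تیره', 'JJ'], ['شپه', 'NNF'], ['او', 'CC'],
          ['ورځ', 'NNM'], ['په', 'IN'], ['هیواد', 'NNM'], ['کې', 'PT'],
          ['د', 'IN'], ['کرونا ویروس', 'NNP'], ['له امله', 'RB'], ['۵', 'NB'],
          ['تنه', 'NNS'], ['مړه', 'JJ'], ['شوي', 'VBDX']]

# How often each tag occurs in the sentence.
tag_counts = Counter(tag for _, tag in tagged)

# Tokens whose tag starts with "NN" (assumed here to be the noun tags).
nouns = [token for token, tag in tagged if tag.startswith('NN')]

print(tag_counts.most_common(3))
print(nouns)
```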
Offensive Language Detection
A fine-tuned BERT model for toxicity detection in Pashto text
from nlpashto import pold
offensive_text = 'مړه یو کس وی صرف ځان شرموی او یو ستا غوندے جاهل وی چې قوم او ملت شرموی'
pold(offensive_text)
Output:: 1
normal_text = 'تاسو رښتیا وایئ خور 🙏'
pold(normal_text)
Output:: 0
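Since the detector returns 1 for offensive text and 0 otherwise, it can act as a filter over a batch of comments. The following is a sketch under stated assumptions: `keep_inoffensive` and `fake_classifier` are hypothetical names, and the stand-in classifier exists only so the example runs without the BERT model. With NLPashto installed, one would presumably pass `pold` as the `classify` argument.

```python
from typing import Callable, Iterable, List


def keep_inoffensive(texts: Iterable[str], classify: Callable[[str], int]) -> List[str]:
    """Keep only the texts that `classify` labels 0 (not offensive)."""
    return [text for text in texts if classify(text) == 0]


def fake_classifier(text: str) -> int:
    """Hypothetical stand-in: flags any text containing the placeholder word BAD."""
    return 1 if 'BAD' in text else 0


comments = ['hello', 'BAD comment', 'thanks']
clean_comments = keep_inoffensive(comments, fake_classifier)
print(clean_comments)
```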
Spammy Names Detection
A Naive Bayes classifier that predicts whether a string of characters is a valid name. It can be used to identify spammy profile names on social media.
from nlpashto import snd
not_a_name = 'مسافر لالی'
snd(not_a_name)
Output:: 0.2
valid_name = 'شاهد افريدی'
snd(valid_name)
Output:: 1.0
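The two outputs above suggest that the function returns a score where higher values mean a more plausible name. Turning that score into a yes/no decision needs a cut-off. The sketch below assumes scores lie between 0 and 1 and uses an arbitrary threshold of 0.5; both the `is_spammy_name` helper and the threshold are hypothetical and should be tuned on real data.

```python
def is_spammy_name(score: float, threshold: float = 0.5) -> bool:
    """Flag a name as spammy when its validity score falls below `threshold`."""
    return score < threshold


# Scores copied from the two examples above.
scores = {'not_a_name': 0.2, 'valid_name': 1.0}
flags = {label: is_spammy_name(score) for label, score in scores.items()}
print(flags)
```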
Hashes for nlpashto-0.0.14-py3-none-any.whl
Algorithm | Hash digest |
---|---|
SHA256 | 2ef8733c232cb79bbfea6b5bb8953dbc37dde8ef8b8a62c442eb3e6383660a2a |
MD5 | d9bc61141d51812cba6cae004cbf840f |
BLAKE2b-256 | 156a5fe8bf776b4de2528b2541ff0bdcb25dd5e7de129753596877af8c9c6ac4 |