Pashto Natural Language Processing Toolkit
NLPashto – NLP Toolkit for Pashto
NLPashto is a Python suite for Pashto Natural Language Processing, initiated at Shanghai Jiao Tong University.
Prerequisites
To use NLPashto you will need:
- Python 3.8+
Installing NLPashto
NLPashto can be installed from PyPI with the following command:
pip install nlpashto
Downloading Models
Call the download() function and pass the model name as an argument:
import nlpashto
nlpashto.download('space_correct')
Valid model names: 'space_correct', 'pos_tag', 'word_segment', 'pold', 'snd'
If no model name is specified, all available models are downloaded.
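As a minimal sketch of downloading several models in one go, the helper below checks the requested names against the list above before downloading anything. The helper name `download_models` and its `downloader` parameter are hypothetical and not part of NLPashto; the only assumption about the library is that `nlpashto.download` accepts a single model name, as shown above.

```python
# Valid model names, copied from the list above.
VALID_MODELS = ('space_correct', 'pos_tag', 'word_segment', 'pold', 'snd')


def download_models(names, downloader):
    """Validate model names, then call downloader(name) for each one.

    `downloader` is any callable that takes a model name, for example
    nlpashto.download. This helper is illustrative, not part of NLPashto.
    """
    unknown = [name for name in names if name not in VALID_MODELS]
    if unknown:
        # Fail before any download starts if a name is misspelled.
        raise ValueError(f"Unknown model name(s): {unknown}")
    for name in names:
        downloader(name)
    return list(names)
```

With NLPashto installed, `download_models(['word_segment', 'pos_tag'], nlpashto.download)` would request only the two models used in the POS tagging example.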
Basic Usage
Space Correction
The space correction module fixes space-omission and space-insertion errors: it removes extra spaces from the text and inserts spaces where necessary. It is a beta version and is only recommended when the input text is extremely noisy.
from nlpashto import space_correct
noisy_text = 'ه م د ا ر ن ګ ه ت ی ر ه ش پ ه ا وورځپههیوادکېدکروناویروسلهامله۵تنهمړهشوي'
corrected = space_correct(noisy_text)
print(corrected)
Output:: همدارنګه تیره شپه او ورځ په هیواد کې د کرونا ویروس له امله ۵ تنه مړه شوي
Word Segmentation
from nlpashto import word_segment
text = 'همدارنګه تیره شپه او ورځ په هیواد کې د کرونا ویروس له امله ۵ تنه مړه شوي'
segmented_text = word_segment(text)
print(segmented_text)
Output:: ['همدارنګه', 'تیره', 'شپه', 'او', 'ورځ', 'په', 'هیواد', 'کې', 'د', 'کرونا ویروس', 'له امله', '۵', 'تنه', 'مړه', 'شوي']
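The output above shows that the segmenter can keep multi-word units such as 'کرونا ویروس' together as a single token, which plain whitespace splitting would not do. The short sketch below uses only the sample output from this section and does not call NLPashto; the variable names are illustrative.

```python
# Sample output of word_segment(), copied from the example above.
segmented_text = ['همدارنګه', 'تیره', 'شپه', 'او', 'ورځ', 'په', 'هیواد', 'کې', 'د',
                  'کرونا ویروس', 'له امله', '۵', 'تنه', 'مړه', 'شوي']

# Tokens containing a space are multi-word units kept together by the segmenter.
multiword_tokens = [token for token in segmented_text if ' ' in token]

# Naive whitespace splitting of the same sentence breaks those units apart.
whitespace_tokens = ' '.join(segmented_text).split()

print(multiword_tokens)
print(len(segmented_text), len(whitespace_tokens))
```

For this sentence the segmenter returns 15 tokens, while whitespace splitting yields 17.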
Part-of-speech (POS) Tagging
For further details about the POS tagger and the corpus used for training, please see our paper, "The Pashto Corpus and Machine Learning Model for Automatic POS Tagging".
# word_segment is imported as well because the text is segmented before tagging
from nlpashto import pos_tag, word_segment
text = 'همدارنګه تیره شپه او ورځ په هیواد کې د کرونا ویروس له امله ۵ تنه مړه شوي'
segmented_text = word_segment(text)
tagged = pos_tag(segmented_text)
print(tagged)
Output:: [['همدارنګه', 'RB'], ['تیره', 'JJ'], ['شپه', 'NNF'], ['او', 'CC'], ['ورځ', 'NNM'], ['په', 'IN'], ['هیواد', 'NNM'], ['کې', 'PT'], ['د', 'IN'], ['کرونا ویروس', 'NNP'], ['له امله', 'RB'], ['۵', 'NB'], ['تنه', 'NNS'], ['مړه', 'JJ'], ['شوي', 'VBDX']]
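The tagger returns a list of [token, tag] pairs, which is easy to post-process. As a small sketch that relies only on the sample output above and does not call NLPashto, the tokens can be grouped by tag; the variable name `tokens_by_tag` is illustrative.

```python
from collections import defaultdict

# Sample output of pos_tag(), copied from the example above.
tagged = [['همدارنګه', 'RB'], ['تیره', 'JJ'], ['شپه', 'NNF'], ['او', 'CC'],
          ['ورځ', 'NNM'], ['په', 'IN'], ['هیواد', 'NNM'], ['کې', 'PT'],
          ['د', 'IN'], ['کرونا ویروس', 'NNP'], ['له امله', 'RB'], ['۵', 'NB'],
          ['تنه', 'NNS'], ['مړه', 'JJ'], ['شوي', 'VBDX']]

# Map each tag to the list of tokens that received it, in sentence order.
tokens_by_tag = defaultdict(list)
for token, tag in tagged:
    tokens_by_tag[tag].append(token)

print(tokens_by_tag['NNM'])
print(tokens_by_tag['JJ'])
```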
Offensive Language Detection
A fine-tuned BERT model for toxicity detection in Pashto text
from nlpashto import pold
offensive_text = 'مړه یو کس وی صرف ځان شرموی او یو ستا غوندے جاهل وی چې قوم او ملت شرموی'
pold(offensive_text)
Output:: 1
normal_text = 'تاسو رښتیا وایئ خور 🙏'
pold(normal_text)
Output:: 0
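To screen a batch of texts, the classifier can be applied to each one in turn. The sketch below assumes, based on the examples in this section, that the classifier takes a single string and returns 1 for offensive text and 0 otherwise. The helper `split_by_label` and its `classifier` parameter are hypothetical and not part of NLPashto.

```python
def split_by_label(texts, classifier):
    """Split texts into (offensive, normal) lists using a 0/1 classifier.

    `classifier` is any callable returning 1 for offensive text and 0
    otherwise, for example nlpashto.pold (an assumption based on the
    examples above).
    """
    offensive, normal = [], []
    for text in texts:
        if classifier(text) == 1:
            offensive.append(text)
        else:
            normal.append(text)
    return offensive, normal
```

With NLPashto installed and the model downloaded, a call might look like `offensive, normal = split_by_label(comments, pold)`, where `comments` is a list of strings.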
Spammy Names Detection
A Naive Bayes classifier that predicts whether a string of characters is a valid name. It can be used to identify spammy profile names on social media.
from nlpashto import snd
not_a_name = 'مسافر لالی'
snd(not_a_name)
Output:: 0.2
valid_name = 'شاهد افريدی'
snd(valid_name)
Output:: 1.0
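The outputs above (0.2 and 1.0) suggest that the function returns a score where higher values mean a more plausible name. Assuming that interpretation holds and the score lies between 0 and 1, a cutoff can turn the score into a yes/no decision. The helper `flag_spammy_names`, its `scorer` parameter, and the default threshold of 0.5 are all hypothetical choices for illustration, not part of NLPashto.

```python
def flag_spammy_names(names, scorer, threshold=0.5):
    """Return the names whose validity score falls below `threshold`.

    `scorer` is any callable mapping a name to a score, for example
    nlpashto.snd (assumed here to return higher values for valid names).
    The 0.5 default is an arbitrary cutoff and should be tuned on real data.
    """
    return [name for name in names if scorer(name) < threshold]
```

With the two scores shown above, `'مسافر لالی'` (0.2) would be flagged and `'شاهد افريدی'` (1.0) would not.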