Pashto Natural Language Processing Toolkit
NLPashto – NLP Toolkit for Pashto
NLPashto is a Python suite for Pashto Natural Language Processing, initiated at Shanghai Jiao Tong University. A sample of the Pashto corpus used to train some of the NLPashto models is also available.
Prerequisites
To use NLPashto you will need:
- Python 3.8+
Installing NLPashto
NLPashto can be installed from PyPI using this command:
pip install nlpashto
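To confirm that the install succeeded, one option is to query the package metadata through the standard library. This is a minimal sketch rather than part of NLPashto itself; the helper name `installed_version` is hypothetical, and it relies only on `importlib.metadata`, which ships with Python 3.8+.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Hypothetical helper applied to the distribution name used in the pip command.
found = installed_version("nlpashto")
print("nlpashto version:", found if found else "not installed")
```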
Using NLPashto
Sentence Tokenizer
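from nlpashto import sentence_tokenizer
# content can be any Pashto string; the sample sentence from this page is used here
content = 'همدارنګه تیره شپه او ورځ په هیواد کې د کرونا ویروس له امله ۵ تنه مړه شوي'
sentences_list = sentence_tokenizer(content)
print(sentences_list)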
Word Tokenizer
from nlpashto import word_tokenizer
text = 'همدارنګه تیره شپه او ورځ په هیواد کې د کرونا ویروس له امله ۵ تنه مړه شوي'
tokenized = word_tokenizer(text)
print(tokenized)
['همدارنګه', 'تیره', 'شپه', 'او', 'ورځ', 'په', 'هیواد', 'کې', 'د', 'کرونا ویروس', 'له امله', '۵', 'تنه', 'مړه', 'شوي']
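The output above differs from a plain whitespace split because some tokens, such as 'کرونا ویروس' and 'له امله', span more than one space-separated word. The sketch below illustrates this using only the documented output copied from this page as literal data; it does not call NLPashto, and the variable names are illustrative.

```python
# Input text and tokenizer output, copied verbatim from the example above.
text = 'همدارنګه تیره شپه او ورځ په هیواد کې د کرونا ویروس له امله ۵ تنه مړه شوي'
documented_tokens = ['همدارنګه', 'تیره', 'شپه', 'او', 'ورځ', 'په', 'هیواد', 'کې', 'د',
                     'کرونا ویروس', 'له امله', '۵', 'تنه', 'مړه', 'شوي']

# Naive baseline: split on whitespace only.
naive_tokens = text.split()

# Tokens that the word tokenizer kept together even though they contain a space.
multiword_tokens = [token for token in documented_tokens if ' ' in token]

print("whitespace split:", len(naive_tokens), "tokens")
print("documented output:", len(documented_tokens), "tokens")
print("multi-word tokens:", multiword_tokens)
```

Code that consumes the tokenizer output should therefore not assume that a token never contains a space.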
Whitespace Tokenizer (Proofing)
The Whitespace Tokenizer can be used as a proofing tool to correct space-omission and space-insertion errors: it removes extra spaces from the text and inserts spaces where they are missing. It is a beta version and is recommended only when the input text is extremely noisy.
from nlpashto import tokenizer
noisy_text = 'ه م د ا ر ن ګ ه ت ی ر ه ش پ ه ا وورځپههیوادکېدکروناویروسلهامله۵تنهمړهشوي'
corrected = tokenizer(noisy_text)
print(corrected)
همدارنګه تیره شپه او ورځ په هیواد کې د کرونا ویروس له امله ۵ تنه مړه شوي
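For comparison, a plain regular-expression cleanup handles only the simplest case, runs of repeated spaces. The sketch below is a baseline for illustration, not NLPashto's method; the function name `collapse_spaces` is hypothetical. It shows that text whose letters are spaced out one by one passes through unchanged, which is the kind of noise the Whitespace Tokenizer is meant for.

```python
import re


def collapse_spaces(text: str) -> str:
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


# Repeated spaces are repaired by the baseline.
extra_spaces = 'همدارنګه   تیره  شپه '
print(collapse_spaces(extra_spaces))

# Letters separated by single spaces are left as they are: the baseline
# cannot tell which spaces are word boundaries and which are errors.
spaced_out = 'ه م د ا ر ن ګ ه'
print(collapse_spaces(spaced_out) == spaced_out)
```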
POS Tagging
from nlpashto import word_tokenizer, pos_tagger
text = 'همدارنګه تیره شپه او ورځ په هیواد کې د کرونا ویروس له امله ۵ تنه مړه شوي'
tokenized = word_tokenizer(text)
tagged = pos_tagger(tokenized)
print(tagged)
[['همدارنګه', 'RB'], ['تیره', 'JJ'], ['شپه', 'NNF'], ['او', 'CC'], ['ورځ', 'NNM'], ['په', 'IN'], ['هیواد', 'NNM'], ['کې', 'PT'], ['د', 'IN'], ['کرونا ویروس', 'NNP'], ['له امله', 'RB'], ['۵', 'NB'], ['تنه', 'NNS'], ['مړه', 'JJ'], ['شوي', 'VBDX']]
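The tagger returns a list of [token, tag] pairs, which is straightforward to post-process. The sketch below works on the documented output copied from above as literal data and does not call NLPashto. It counts tag frequencies and picks out nouns, under the assumption (not stated in this README) that every tag beginning with "NN" marks a noun.

```python
from collections import Counter

# Tagger output copied verbatim from the example above.
tagged = [['همدارنګه', 'RB'], ['تیره', 'JJ'], ['شپه', 'NNF'], ['او', 'CC'],
          ['ورځ', 'NNM'], ['په', 'IN'], ['هیواد', 'NNM'], ['کې', 'PT'],
          ['د', 'IN'], ['کرونا ویروس', 'NNP'], ['له امله', 'RB'], ['۵', 'NB'],
          ['تنه', 'NNS'], ['مړه', 'JJ'], ['شوي', 'VBDX']]

# Frequency of each tag in the sentence.
tag_counts = Counter(tag for _, tag in tagged)

# Assumption: tags starting with "NN" (NNF, NNM, NNP, NNS) denote nouns.
nouns = [token for token, tag in tagged if tag.startswith('NN')]

print(tag_counts.most_common())
print(nouns)
```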
Offensive Comments Detection
Coming soon…