Skip to main content

Pashto Natural Language Processing Toolkit

Project description

NLPashto – NLP Toolkit for Pashto

GitHub GitHub contributors code size

NLPashto is a Python suite for Pashto Natural Language Processing, initiated at Shanghai Jiao Tong University. A sample of the Pashto Corpus is available here that is used to train some of the models in NLPashto.

Prerequisites

To use NLPashto you will need:

  • Python 3.8+

Installing NLPashto

NLPashto can be installed from PyPi using this command

pip install nlpashto

Using NLPashto

Sentence Tokenizer

from nlpashto import sentence_tokenizer
sentences_list = sentence_tokenizer(content)
tagged = pos_tagger(tokenized)

Word Tokenizer

from nlpashto import word_tokenizer

text = 'همدارنګه تیره شپه او ورځ په هیواد کې د کرونا ویروس له امله ۵ تنه مړه شوي'
tokenized = word_tokenizer(text)
print(tokenized)
['همدارنګه', 'تیره', 'شپه', 'او', 'ورځ', 'په', 'هیواد', 'کې', 'د', 'کرونا ویروس', 'له امله', '۵', 'تنه', 'مړه', 'شوي']

Whitespace Tokenizer (Proofing)

Whitespace Tokenizer can be used as a proofing tool to remove the space-omission and space-insertion errors. It will remove extra spaces from the text and will insert space where necessary. It’s a beta version and only recommended if the input text is extremely noisy.

from nlpashto import tokenizer

noisy_text = 'ه  م  د  ا  ر  ن  ګ ه ت ی ر ه ش پ ه ا وورځپههیوادکېدکروناویروسلهامله۵تنهمړهشوي'
corrected = tokenizer(noisy_text)
print(corrected)
همدارنګه تیره شپه او ورځ په هیواد کې د کرونا ویروس له امله ۵ تنه مړه شوي

POS Tagging

from nlpashto import pos_tagger

text = 'همدارنګه تیره شپه او ورځ په هیواد کې د کرونا ویروس له امله ۵ تنه مړه شوي'
tokenized = word_tokenizer(text)
tagged = pos_tagger(tokenized)
print(tagged) 
[['همدارنګه', 'RB'], ['تیره', 'JJ'], ['شپه', 'NNF'], ['او', 'CC'], ['ورځ', 'NNM'], ['په', 'IN'], ['هیواد', 'NNM'], ['کې', 'PT'], ['د', 'IN'], ['کرونا ویروس', 'NNP'], ['له امله', 'RB'], ['۵', 'NB'], ['تنه', 'NNS'], ['مړه', 'JJ'], ['شوي', 'VBDX']]

Offensive Comments Detection

Coming soon…

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

nlpashto-0.0.10-py3-none-any.whl (5.5 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page