Pashto Natural Language Processing Toolkit

NLPashto – NLP Toolkit for Pashto

NLPashto is a Python suite for Pashto Natural Language Processing, initiated at Shanghai Jiao Tong University.

Prerequisites

To use NLPashto you will need:

  • Python 3.8+

Installing NLPashto

NLPashto can be installed from PyPI with this command:

pip install nlpashto

Downloading Models

Call the download() function and pass the model name as an argument.

import nlpashto

nlpashto.download('space_correct')

Valid model names: 'space_correct', 'pos_tag', 'word_segment', 'pold', 'snd'

If no model name is specified, all available models are downloaded.
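
As a minimal sketch, the helper below checks requested names against the list above before downloading anything. `check_model_names` and `download_models` are hypothetical helpers, not part of NLPashto; the sketch only assumes that `nlpashto.download(name)` behaves as described above. The import is deferred so the validation step can be run without the package installed.

```python
# Model names copied from the list of valid names above.
VALID_MODELS = ('space_correct', 'pos_tag', 'word_segment', 'pold', 'snd')


def check_model_names(names):
    """Return the names as a list, or raise ValueError listing unknown ones."""
    unknown = [name for name in names if name not in VALID_MODELS]
    if unknown:
        raise ValueError(f"Unknown model name(s): {unknown}")
    return list(names)


def download_models(names):
    """Hypothetical helper: download each requested model one at a time."""
    checked = check_model_names(names)
    import nlpashto  # deferred import, so validation works without the package
    for name in checked:
        nlpashto.download(name)
```

Calling `download_models(['word_segment', 'pos_tag'])` would then fetch only the two models needed for POS tagging, and a misspelled name fails before any download starts.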

Basic Usage

Space Correction

The space correction module fixes space-omission and space-insertion errors: it removes extra spaces from the text and inserts spaces where they are missing. It is a beta version and is recommended only when the input text is extremely noisy.

from nlpashto import space_correct

noisy_text = 'ه  م  د  ا  ر  ن  ګ ه ت ی ر ه ش پ ه ا وورځپههیوادکېدکروناویروسلهامله۵تنهمړهشوي'
corrected = space_correct(noisy_text)
print(corrected)
Output:: همدارنګه تیره شپه او ورځ په هیواد کې د کرونا ویروس له امله ۵ تنه مړه شوي

Word Segmentation

from nlpashto import word_segment

text = 'همدارنګه تیره شپه او ورځ په هیواد کې د کرونا ویروس له امله ۵ تنه مړه شوي'
segmented_text = word_segment(text)
print(segmented_text)

Output:: ['همدارنګه', 'تیره', 'شپه', 'او', 'ورځ', 'په', 'هیواد', 'کې', 'د', 'کرونا ویروس', 'له امله', '۵', 'تنه', 'مړه', 'شوي']
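
The output above keeps some multi-word units, such as 'کرونا ویروس', as single tokens. The short sketch below works on a copy of that output (so it runs without the model) and shows how the result differs from naive whitespace splitting. The variable names `multiword` and `naive_count` are illustrative only.

```python
# Segmenter output copied from the example above.
segmented_text = ['همدارنګه', 'تیره', 'شپه', 'او', 'ورځ', 'په', 'هیواد', 'کې',
                  'د', 'کرونا ویروس', 'له امله', '۵', 'تنه', 'مړه', 'شوي']

# Tokens that contain a space are multi-word units kept together by the segmenter.
multiword = [token for token in segmented_text if ' ' in token]

# Splitting the same sentence on whitespace would break those units apart.
naive_count = len(' '.join(segmented_text).split())

print(len(segmented_text))  # 15
print(naive_count)          # 17
print(multiword)
```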

Part-of-speech (POS) Tagging

For further details about the POS tagger and the corpus used for training, please see our paper "The Pashto Corpus and Machine Learning Model for Automatic POS Tagging".

from nlpashto import pos_tag, word_segment

text = 'همدارنګه تیره شپه او ورځ په هیواد کې د کرونا ویروس له امله ۵ تنه مړه شوي'
# The tagger takes the segmented token list, not the raw string
segmented_text = word_segment(text)
tagged = pos_tag(segmented_text)
print(tagged)

Output:: [['همدارنګه', 'RB'], ['تیره', 'JJ'], ['شپه', 'NNF'], ['او', 'CC'], ['ورځ', 'NNM'], ['په', 'IN'], ['هیواد', 'NNM'], ['کې', 'PT'], ['د', 'IN'], ['کرونا ویروس', 'NNP'], ['له امله', 'RB'], ['۵', 'NB'], ['تنه', 'NNS'], ['مړه', 'JJ'], ['شوي', 'VBDX']]
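
The tagger output is a list of [word, tag] pairs, so it can be processed with ordinary Python. The sketch below counts tags and collects nouns from a copy of the output above. Treating every tag that starts with 'NN' as a noun tag is an assumption based on the tag names shown here, not something the tagset documentation in this README confirms.

```python
from collections import Counter

# Tagger output copied from the example above.
tagged = [['همدارنګه', 'RB'], ['تیره', 'JJ'], ['شپه', 'NNF'], ['او', 'CC'],
          ['ورځ', 'NNM'], ['په', 'IN'], ['هیواد', 'NNM'], ['کې', 'PT'],
          ['د', 'IN'], ['کرونا ویروس', 'NNP'], ['له امله', 'RB'], ['۵', 'NB'],
          ['تنه', 'NNS'], ['مړه', 'JJ'], ['شوي', 'VBDX']]

# Frequency of each tag in the sentence.
tag_counts = Counter(tag for _, tag in tagged)

# Assumption: tags beginning with 'NN' (NNF, NNM, NNP, NNS) mark nouns.
nouns = [word for word, tag in tagged if tag.startswith('NN')]

print(tag_counts['NNM'])  # 2
print(len(nouns))         # 5
```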

Offensive Language Detection

A fine-tuned BERT model for toxicity detection in Pashto text. It returns 1 for offensive text and 0 for normal text.

from nlpashto import pold

offensive_text = 'مړه یو کس وی صرف ځان شرموی او یو ستا غوندے جاهل وی چې قوم او ملت شرموی'
pold(offensive_text)

Output:: 1


normal_text = 'تاسو رښتیا وایئ خور 🙏'
pold(normal_text)

Output:: 0
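
One possible way to use the 0/1 labels is to partition a batch of texts. The sketch below assumes the classifier is called on one string at a time and returns 1 or 0 as in the examples above. `split_by_label` and `fake_classify` are hypothetical names; the stand-in classifier exists only so the sketch runs without the model, and in real use `pold` would be passed in its place.

```python
def split_by_label(texts, classify):
    """Partition texts into (clean, offensive) using a 0/1 classifier."""
    clean, offensive = [], []
    for text in texts:
        if classify(text) == 1:
            offensive.append(text)
        else:
            clean.append(text)
    return clean, offensive


def fake_classify(text):
    """Stand-in for pold: flags any text containing the word 'bad'."""
    return 1 if 'bad' in text else 0


clean, offensive = split_by_label(['good text', 'bad text'], fake_classify)
print(clean)      # ['good text']
print(offensive)  # ['bad text']
```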

Spammy Names Detection

A Naive Bayes classifier that predicts whether a string of characters is a valid name. It can be used to identify spammy profile names on social media.

from nlpashto import snd

not_a_name = 'مسافر لالی'
snd(not_a_name)

Output:: 0.2


valid_name = 'شاهد افريدی'
snd(valid_name)

Output:: 1.0
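
Because the examples above return a score rather than a label, a caller has to pick a cut-off. The sketch below assumes that a higher score means a more plausible name, as the two outputs suggest. `is_valid_name` and the 0.5 threshold are illustrative choices, not values recommended by NLPashto.

```python
def is_valid_name(score, threshold=0.5):
    """Treat scores at or above the (assumed) threshold as valid names."""
    return score >= threshold


# Scores copied from the two examples above.
print(is_valid_name(0.2))  # False
print(is_valid_name(1.0))  # True
```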
