Pashto Natural Language Processing Toolkit

NLPashto – NLP Toolkit for Pashto

NLPashto is a Python suite for Pashto Natural Language Processing, initiated at Shanghai Jiao Tong University.

Prerequisites

To use NLPashto you will need:

  • Python 3.8+

Installing NLPashto

NLPashto can be installed from PyPI with the following command:

pip install nlpashto

Downloading Models

Call the download() function and pass the model name as its argument.

import nlpashto

nlpashto.download('space_correct')

Valid model names: 'space_correct', 'pos_tag', 'word_segment', 'pold', 'snd'

If no model name is specified, all available models are downloaded.
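Because download() takes the model name as a plain string, a misspelled name is easy to pass by accident. The following is a minimal sketch, not part of NLPashto, that checks the requested names against the list above before downloading anything. The helper name download_models and its download_fn parameter are hypothetical; the downloader is passed in as an argument so that the sketch assumes nothing about the library beyond download(name).

```python
# Model names as listed in this README.
VALID_MODELS = ('space_correct', 'pos_tag', 'word_segment', 'pold', 'snd')


def download_models(names, download_fn):
    """Validate every name first, then call download_fn(name) for each.

    download_models and download_fn are hypothetical names used only in
    this sketch; pass nlpashto.download as download_fn in real use.
    """
    unknown = [name for name in names if name not in VALID_MODELS]
    if unknown:
        # Fail before any download starts, so nothing is fetched partially.
        raise ValueError(f"Unknown model name(s): {unknown}")
    for name in names:
        download_fn(name)
    return list(names)
```

For example, after import nlpashto, calling download_models(['word_segment', 'pos_tag'], nlpashto.download) would request only the two models needed for POS tagging.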

Basic Usage

Space Correction

The space correction module corrects space-omission and space-insertion errors: it removes extra spaces from the text and inserts spaces where they are missing. It is a beta version and is recommended only when the input text is extremely noisy.

from nlpashto import space_correct

noisy_text = 'ه  م  د  ا  ر  ن  ګ ه ت ی ر ه ش پ ه ا وورځپههیوادکېدکروناویروسلهامله۵تنهمړهشوي'
corrected = space_correct(noisy_text)
print(corrected)

Output:: همدارنګه تیره شپه او ورځ په هیواد کې د کرونا ویروس له امله ۵ تنه مړه شوي

Word Segmentation

from nlpashto import word_segment

text = 'همدارنګه تیره شپه او ورځ په هیواد کې د کرونا ویروس له امله ۵ تنه مړه شوي'
segmented_text = word_segment(text)
print(segmented_text)

Output:: ['همدارنګه', 'تیره', 'شپه', 'او', 'ورځ', 'په', 'هیواد', 'کې', 'د', 'کرونا ویروس', 'له امله', '۵', 'تنه', 'مړه', 'شوي']
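The output above shows that a single token may contain a space (for example 'کرونا ویروس'), so splitting the text on whitespace would not reproduce it. Here is a small sketch that picks out such multi-word tokens. It works on the output list copied from above rather than on a live call, and multiword_tokens is a hypothetical helper name.

```python
def multiword_tokens(tokens):
    """Return the tokens that contain at least one internal space."""
    return [tok for tok in tokens if ' ' in tok.strip()]


# Output of word_segment() copied from the example above.
segmented_text = ['همدارنګه', 'تیره', 'شپه', 'او', 'ورځ', 'په', 'هیواد',
                  'کې', 'د', 'کرونا ویروس', 'له امله', '۵', 'تنه', 'مړه',
                  'شوي']

print(multiword_tokens(segmented_text))
```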

Part-of-speech (POS) Tagging

For details about the POS tagger and the corpus used for training, see our paper, The Pashto Corpus and Machine Learning Model for Automatic POS Tagging.

from nlpashto import pos_tag, word_segment

text = 'همدارنګه تیره شپه او ورځ په هیواد کې د کرونا ویروس له امله ۵ تنه مړه شوي'
segmented_text = word_segment(text)
tagged = pos_tag(segmented_text)
print(tagged)

Output:: [['همدارنګه', 'RB'], ['تیره', 'JJ'], ['شپه', 'NNF'], ['او', 'CC'], ['ورځ', 'NNM'], ['په', 'IN'], ['هیواد', 'NNM'], ['کې', 'PT'], ['د', 'IN'], ['کرونا ویروس', 'NNP'], ['له امله', 'RB'], ['۵', 'NB'], ['تنه', 'NNS'], ['مړه', 'JJ'], ['شوي', 'VBDX']]
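Since the tagger returns a list of [token, tag] pairs, the result can be summarised with the standard library alone. The sketch below counts tags and filters tokens by tag, using the output copied from above rather than a live call. The helper names tag_counts and tokens_with_tag are hypothetical.

```python
from collections import Counter

# Output of pos_tag() copied from the example above.
tagged = [['همدارنګه', 'RB'], ['تیره', 'JJ'], ['شپه', 'NNF'], ['او', 'CC'],
          ['ورځ', 'NNM'], ['په', 'IN'], ['هیواد', 'NNM'], ['کې', 'PT'],
          ['د', 'IN'], ['کرونا ویروس', 'NNP'], ['له امله', 'RB'],
          ['۵', 'NB'], ['تنه', 'NNS'], ['مړه', 'JJ'], ['شوي', 'VBDX']]


def tag_counts(tagged_tokens):
    """Count how often each POS tag occurs."""
    return Counter(tag for _token, tag in tagged_tokens)


def tokens_with_tag(tagged_tokens, wanted):
    """Return the tokens carrying the given tag, in original order."""
    return [token for token, tag in tagged_tokens if tag == wanted]


print(tag_counts(tagged).most_common(4))
print(tokens_with_tag(tagged, 'NNM'))
```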

Offensive Language Detection

A fine-tuned BERT model for detecting toxicity in Pashto text. In the examples below, it returns 1 for the offensive text and 0 for the normal text.

from nlpashto import pold

offensive_text = 'مړه یو کس وی صرف ځان شرموی او یو ستا غوندے جاهل وی چې قوم او ملت شرموی'
pold(offensive_text)

Output:: 1


normal_text = 'تاسو رښتیا وایئ خور 🙏'
pold(normal_text)

Output:: 0
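A common use of such a classifier is to separate a batch of comments into offensive and normal ones. The sketch below assumes, based only on the two examples above, that the classifier returns 1 for offensive text and 0 otherwise. The helper split_by_label is hypothetical, and the classifier is passed in as an argument so the sketch does not depend on how pold is loaded.

```python
def split_by_label(texts, classify):
    """Partition texts into (offensive, normal) using classify(text).

    Assumption: classify returns 1 for offensive text and 0 for normal
    text, as in the pold examples above.
    """
    offensive, normal = [], []
    for text in texts:
        if classify(text) == 1:
            offensive.append(text)
        else:
            normal.append(text)
    return offensive, normal
```

In real use this would be called as split_by_label(comments, pold), where comments is a list of strings.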

Spammy Names Detection

A Naive Bayes classifier that predicts whether a string of characters is a valid name. It can be used to identify spammy profile names on social media.

from nlpashto import snd

not_a_name = 'مسافر لالی'
snd(not_a_name)

Output:: 0.2


valid_name = 'شاهد افريدی'
snd(valid_name)

Output:: 1.0
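The outputs 0.2 and 1.0 suggest that snd returns a score between 0 and 1, with higher values meaning the string looks more like a valid name. Under that assumption, a cut-off can turn the score into a yes/no decision. The helper flag_spammy and the default threshold of 0.5 are hypothetical choices for this sketch, not values taken from NLPashto.

```python
def flag_spammy(names, score_fn, threshold=0.5):
    """Return the names whose score falls below the threshold.

    Assumptions: score_fn(name) returns a number in [0, 1] where higher
    means "more likely a valid name"; 0.5 is an arbitrary cut-off.
    """
    return [name for name in names if score_fn(name) < threshold]
```

In real use this would be called as flag_spammy(profile_names, snd). The threshold is best tuned on labelled examples from the target platform.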
