Pashto Natural Language Processing Toolkit

NLPashto – NLP Toolkit for Pashto

NLPashto is a Python suite for Pashto Natural Language Processing. It provides tools for fundamental text processing tasks, such as text cleaning, tokenization, and chunking (word segmentation). Additionally, it includes state-of-the-art models for POS tagging and sentiment analysis (specifically offensive language detection).

Prerequisites

To use NLPashto, you will need:

  • Python 3.8+

Installing NLPashto

Install NLPashto via PyPI:

pip install nlpashto

Basic Usage

Text Cleaning

This module contains basic text cleaning utilities:

from nlpashto import Cleaner

cleaner = Cleaner()
noisy_txt = "په ژوند کی علم 📚🖖 , 🖊  او پيسي 💵.  💸💲 دواړه حاصل کړه پوهان به دی علم ته درناوی ولري اوناپوهان به دي پیسو ته... https://t.co/xIiEXFg"

cleaned_text = cleaner.clean(noisy_txt)
print(cleaned_text)
# Output: په ژوند کی علم , او پيسي دواړه حاصل کړه پوهان به دی علم ته درناوی ولري او ناپوهان به دي پیسو ته

Parameters of the clean method:

  • text (str or list): Input noisy text to clean.
  • split_into_sentences (bool): Split text into sentences.
  • remove_emojis (bool): Remove emojis.
  • normalize_nums (bool): Normalize Arabic numerals (1, 2, 3, ...) to Pashto numerals (۱، ۲، ۳، ...).
  • remove_puncs (bool): Remove punctuation.
  • remove_special_chars (bool): Remove special characters.
  • special_chars (list): List of special characters to keep.
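To illustrate what numeral normalization means in practice, here is a minimal, self-contained sketch of mapping Western digits to Pashto digits. This is our own illustration, not NLPashto's actual implementation; the function name to_pashto_digits is hypothetical:

```python
# Illustrative sketch (not NLPashto's internal code) of normalize_nums:
# map Western digits 0-9 to Pashto digits ۰-۹
# (Extended Arabic-Indic, U+06F0..U+06F9).
PASHTO_DIGITS = "۰۱۲۳۴۵۶۷۸۹"
_TO_PASHTO = str.maketrans("0123456789", PASHTO_DIGITS)

def to_pashto_digits(text: str) -> str:
    """Replace every Western digit in text with its Pashto counterpart."""
    return text.translate(_TO_PASHTO)

print(to_pashto_digits("کال 2023"))  # کال ۲۰۲۳
```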

Tokenization (Space Correction)

This module corrects space omission and space insertion errors, removing extraneous spaces and inserting missing ones:

from nlpashto import Tokenizer

tokenizer = Tokenizer()
noisy_txt = 'جلال اباد ښار کې هره ورځ لس ګونه کسانپهډلهییزهتوګهدنشهيي توکو کارولو ته ا د ا م ه و رک وي'

tokenized_text = tokenizer.tokenize(noisy_txt)
print(tokenized_text)
# Output: [['جلال', 'اباد', 'ښار', 'کې', 'هره', 'ورځ', 'لسګونه', 'کسان', 'په', 'ډله', 'ییزه', 'توګه', 'د', 'نشه', 'يي', 'توکو', 'کارولو', 'ته', 'ادامه', 'ورکوي']]
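Note that the output is nested: tokenize returns a list of sentences, each of which is a list of tokens. Flattening it into a single token stream is plain Python (the short sample sentences below are stand-ins, independent of NLPashto):

```python
# tokenize() returns one token list per sentence; a list comprehension
# flattens the nested structure into a single stream of tokens.
sentences = [['جلال', 'اباد', 'ښار', 'کې'], ['هره', 'ورځ']]
flat_tokens = [token for sentence in sentences for token in sentence]
print(len(flat_tokens))  # 6
```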

Chunking (Word Segmentation)

To retrieve full compound words instead of space-delimited tokens, use the Segmenter:

from nlpashto import Segmenter

segmenter = Segmenter()
segmented_text = segmenter.segment(tokenized_text)
print(segmented_text)
# Output: [['جلال اباد', 'ښار', 'کې', 'هره', 'ورځ', 'لسګونه', 'کسان', 'په', 'ډله ییزه', 'توګه', 'د', 'نشه يي', 'توکو', 'کارولو', 'ته', 'ادامه', 'ورکوي']]

Specify batch size for multiple sentences:

segmenter = Segmenter(batch_size=32)  # Default is 16

Part-of-speech (POS) Tagging

For a detailed explanation of the POS tagger, refer to the POS tagging paper listed under Citations below:

from nlpashto import POSTagger

pos_tagger = POSTagger()
pos_tagged = pos_tagger.tag(segmented_text)
print(pos_tagged)
# Output: [[('جلال اباد', 'NNP'), ('ښار', 'NNM'), ('کې', 'PT'), ('هره', 'JJ'), ('ورځ', 'NNF'), ...]]
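The tagger's output mirrors the segmenter's structure: a list of sentences, each a list of (token, tag) tuples. A small helper (our own, hypothetical, not part of the NLPashto API) renders one tagged sentence in the common token/TAG form:

```python
# Pretty-print a POS-tagged sentence: each element is a (token, tag)
# tuple, joined as token/TAG with spaces between words.
def format_tagged(sentence):
    return " ".join(f"{token}/{tag}" for token, tag in sentence)

tagged = [[('جلال اباد', 'NNP'), ('ښار', 'NNM'), ('کې', 'PT')]]
print(format_tagged(tagged[0]))  # جلال اباد/NNP ښار/NNM کې/PT
```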

Sentiment Analysis (Offensive Language Detection)

Detect offensive language using a fine-tuned PsBERT model:

from nlpashto import POLD

sentiment_analysis = POLD()

# Offensive example
offensive_text = 'مړه یو کس وی صرف ځان شرموی او یو ستا غوندے جاهل وی چې قوم او ملت شرموی'
sentiment = sentiment_analysis.predict(offensive_text)
print(sentiment)
# Output: 1

# Normal example
normal_text = 'تاسو رښتیا وایئ خور 🙏'
sentiment = sentiment_analysis.predict(normal_text)
print(sentiment)
# Output: 0
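As the two examples show, predict returns an integer class label. A readable mapping might look like the sketch below; the label names are our own inference from the examples above (1 = offensive, 0 = normal), not part of the POLD API:

```python
# Map POLD's integer predictions to human-readable labels
# (label names assumed from the documented examples).
LABELS = {0: "normal", 1: "offensive"}

def label_name(prediction: int) -> str:
    return LABELS[prediction]

print(label_name(1))  # offensive
print(label_name(0))  # normal
```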

Other Resources

Pretrained Models

Datasets and Examples

Citations

NLPashto: NLP Toolkit for Low-resource Pashto Language

I. Haq, W. Qiu, J. Guo, and P. Tang, "NLPashto: NLP Toolkit for Low-resource Pashto Language," International Journal of Advanced Computer Science and Applications, vol. 14, no. 6, pp. 1345-1352, 2023.

  • BibTeX

    @article{haq2023nlpashto,
      title={NLPashto: NLP Toolkit for Low-resource Pashto Language},
      author={Ijazul Haq and Weidong Qiu and Jie Guo and Peng Tang},
      journal={International Journal of Advanced Computer Science and Applications},
      issn={2156-5570},
      volume={14},
      number={6},
      pages={1345-1352},
      year={2023},
      doi={10.14569/IJACSA.2023.01406142}
    }
    

Correction of Whitespace and Word Segmentation in Noisy Pashto Text using CRF

I. Haq, W. Qiu, J. Guo, and P. Tang, "Correction of whitespace and word segmentation in noisy Pashto text using CRF," Speech Communication, vol. 153, p. 102970, 2023.

  • BibTeX

    @article{HAQ2023102970,
      title={Correction of whitespace and word segmentation in noisy Pashto text using CRF},
      journal={Speech Communication},
      issn={1872-7182},
      volume={153},
      pages={102970},
      year={2023},
      doi={10.1016/j.specom.2023.102970},
      author={Ijazul Haq and Weidong Qiu and Jie Guo and Peng Tang}
    }
    

POS Tagging of Low-resource Pashto Language: Annotated Corpus and BERT-based Model

I. Haq, W. Qiu, J. Guo, and P. Tang, "POS Tagging of Low-resource Pashto Language: Annotated Corpus and BERT-based Model," Preprint, 2023.

  • BibTeX

    @article{haq2023pashto,
      title={POS Tagging of Low-resource Pashto Language: Annotated Corpus and Bert-based Model},
      author={Ijazul Haq and Weidong Qiu and Jie Guo and Peng Tang},
      journal={Preprint},
      year={2023},
      doi={10.21203/rs.3.rs-2712906/v1}
    }
    

Pashto Offensive Language Detection: A Benchmark Dataset and Monolingual Pashto BERT

I. Haq, W. Qiu, J. Guo, and P. Tang, "Pashto Offensive Language Detection: A Benchmark Dataset and Monolingual Pashto BERT," PeerJ Computer Science, vol. 9, p. e1617, 2023.

  • BibTeX

    @article{haq2023pold,
      title={Pashto Offensive Language Detection: A Benchmark Dataset and Monolingual Pashto BERT},
      author={Ijazul Haq and Weidong Qiu and Jie Guo and Peng Tang},
      journal={PeerJ Computer Science},
      issn={2376-5992},
      volume={9},
      pages={e1617},
      year={2023},
      doi={10.7717/peerj-cs.1617}
    }
    

Social Media’s Dark Secrets: A Propagation, Lexical and Psycholinguistic Oriented Deep Learning Approach for Fake News Proliferation

K. Ahmed, M. A. Khan, I. Haq, A. Al Mazroa, M. Syam, N. Innab, et al., "Social media’s dark secrets: A propagation, lexical and psycholinguistic oriented deep learning approach for fake news proliferation," Expert Systems with Applications, vol. 255, p. 124650, 2024.

  • BibTeX

    @article{AHMED2024124650,
      title={Social media’s dark secrets: A propagation, lexical and psycholinguistic oriented deep learning approach for fake news proliferation},
      author={Kanwal Ahmed and Muhammad Asghar Khan and Ijazul Haq and Alanoud Al Mazroa and Syam M.S. and Nisreen Innab and Masoud Alajmi and Hend Khalid Alkahtani},
      journal={Expert Systems with Applications},
      volume={255},
      pages={124650},
      year={2024},
      issn={0957-4174},
      doi={10.1016/j.eswa.2024.124650}
    }
    

Contact
