BidNLP
A Comprehensive Persian (Farsi) Natural Language Processing Library
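BidNLP is a production-ready Python library for Persian text processing. It covers preprocessing, tokenization, stemming and lemmatization, POS tagging, classification, and supporting utilities, all designed around the particular challenges of Persian text such as ZWNJ handling and mixed scripts.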
✨ Features
🔧 Preprocessing (100% Complete)
- Text Normalization: Arabic to Persian character conversion, diacritic removal, ZWNJ normalization
- Text Cleaning: URL, email, HTML tag removal, emoji handling
- Number Processing: Persian ↔ English ↔ Arabic-Indic digit conversion
- Date Normalization: Jalali date handling and formatting
- Punctuation: Persian and Latin punctuation normalization
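To make the digit conversion item concrete, here is a minimal sketch built on Python's `str.translate`. The table and function names (`TO_ENGLISH`, `to_english_digits`, and so on) are illustrative assumptions, not BidNLP's API; the library's own converter may behave differently.

```python
# Illustrative tables and helper names; not BidNLP's API.
ENGLISH_DIGITS = "0123456789"
PERSIAN_DIGITS = "".join(chr(0x06F0 + i) for i in range(10))       # U+06F0..U+06F9
ARABIC_INDIC_DIGITS = "".join(chr(0x0660 + i) for i in range(10))  # U+0660..U+0669

# One translation table per target script; both source scripts map onto it.
TO_ENGLISH = str.maketrans(PERSIAN_DIGITS + ARABIC_INDIC_DIGITS, ENGLISH_DIGITS * 2)
TO_PERSIAN = str.maketrans(ENGLISH_DIGITS + ARABIC_INDIC_DIGITS, PERSIAN_DIGITS * 2)


def to_english_digits(text):
    """Map Persian and Arabic-Indic digits to ASCII digits."""
    return text.translate(TO_ENGLISH)


def to_persian_digits(text):
    """Map ASCII and Arabic-Indic digits to Persian digits."""
    return text.translate(TO_PERSIAN)


print(to_english_digits("سال ۱۴۰۳"))  # → سال 1403
```

Characters outside the digit tables pass through unchanged, so the helpers can be applied to whole sentences.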
✂️ Tokenization (100% Complete)
- Word Tokenizer: ZWNJ-aware, handles compound words and mixed scripts
- Sentence Tokenizer: Smart boundary detection with abbreviation support
- Character Tokenizer: Character-level tokenization with diacritic handling
- Morpheme Tokenizer: Prefix/suffix detection and morphological analysis
- Syllable Tokenizer: Persian syllable segmentation
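As a rough illustration of what "ZWNJ-aware" can mean, the sketch below keeps parts joined by a zero-width non-joiner (U+200C) in one token and splits punctuation off. This is one possible policy under stated assumptions, not BidNLP's algorithm; the name `tokenize` and the regular expression are hypothetical, and the library's tokenizer may segment the same text differently.

```python
import re
from typing import List

ZWNJ = "\u200c"  # zero-width non-joiner

# A token is a run of word characters, optionally continued across ZWNJ,
# or a single character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(rf"\w+(?:{ZWNJ}\w+)*|[^\w\s]")


def tokenize(text: str) -> List[str]:
    """Split text into words and punctuation, keeping ZWNJ-joined parts together."""
    return TOKEN_RE.findall(text)


tokens = tokenize("کتاب\u200cها را خواندم.")
print(len(tokens))  # → 4
```

A plain `text.split()` would also keep ZWNJ-joined parts together, but it would leave the final period attached to the last word.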
🔍 Stemming & Lemmatization (100% Complete)
- Stemming: Conservative suffix removal with minimum stem length
- Lemmatization: Dictionary-based lemmatization with irregular form support
- Arabic Plural Handling: Special support for Arabic broken plurals
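The idea of conservative suffix removal with a minimum stem length can be sketched as follows. The suffix list, the threshold of two characters, and the function name `stem` are assumptions chosen for illustration; BidNLP's actual rules and suffix inventory are not shown in this document.

```python
# Small illustrative suffix list; a real stemmer would use a larger inventory.
SUFFIXES = ("های", "ها", "ترین", "تر", "ان")
MIN_STEM_LENGTH = 2  # assumed threshold: never leave fewer than 2 characters
ZWNJ = "\u200c"


def stem(word):
    """Strip the longest matching suffix, unless the remaining stem is too short."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            # Drop a ZWNJ left dangling at the end of the stem.
            return word[: -len(suffix)].rstrip(ZWNJ)
    return word


print(stem("کتابها"))  # → کتاب
```

The length check is what makes the stemmer conservative: a short word that merely ends in a suffix-like sequence is returned unchanged.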
🏷️ POS Tagging (100% Complete)
- Rule-Based Tagger: Dictionary and morphology-based POS tagging
- HMM Tagger: Statistical Hidden Markov Model-based tagging with training support
- Comprehensive Tag Set: 30+ Persian-specific POS tags
- Custom Dictionaries: Extensible with custom words and tags
📊 Classification (100% Complete)
- Sentiment Analysis: Keyword-based with 100+ sentiment keywords and negation handling
- Text Classification: Keyword-based multi-class categorization
- Feature Extraction: Bag-of-Words, TF-IDF, N-gram extraction
🛠️ Utilities (100% Complete)
- Character Utils: Persian alphabet, character type detection, diacritic handling
- Statistics: Word count, sentence count, lexical diversity, n-gram frequency
- Stop Words: 100+ Persian stop words with custom support
- Validators: Text quality scoring, normalization checking
- Metrics: Precision, Recall, F1, BLEU, edit distance, and more
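For readers unfamiliar with the metrics listed above, here is a self-contained sketch of precision, recall, F1, and Levenshtein edit distance. The function names and signatures are hypothetical and are not taken from `bidnlp.utils`.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one label treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def edit_distance(a, b):
    """Levenshtein distance using a rolling row of the dynamic-programming table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution or match
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # → 3
```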
📦 Installation
pip install bidnlp
From source:
git clone https://github.com/aghabidareh/bidnlp.git
cd bidnlp
pip install -e .
🚀 Quick Start
Preprocessing
from bidnlp.preprocessing import PersianNormalizer, PersianTextCleaner
# Normalize text
normalizer = PersianNormalizer()
text = normalizer.normalize("كتاب يک") # Output: کتاب یک
# Clean text
cleaner = PersianTextCleaner(remove_urls=True, remove_emojis=True)
clean_text = cleaner.clean("سلام 😊 https://test.com") # Output: سلام
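To show what Arabic-to-Persian character conversion and diacritic removal involve at the code-point level, here is a minimal sketch. It handles only two letters and the basic harakat range; the mapping, the range, and the name `normalize_chars` are assumptions for illustration, and `PersianNormalizer` very likely covers more cases.

```python
# Illustrative mapping; a real normalizer handles more characters.
ARABIC_TO_PERSIAN = str.maketrans({
    "\u064a": "\u06cc",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06a9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})

# Fathatan (U+064B) through sukun (U+0652): the basic Arabic diacritics.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def normalize_chars(text):
    """Replace Arabic yeh/kaf with their Persian forms and drop basic diacritics."""
    text = text.translate(ARABIC_TO_PERSIAN)
    return "".join(ch for ch in text if ch not in DIACRITICS)


normalized = normalize_chars("كتاب")  # Arabic kaf becomes Persian keheh
print(normalized == "\u06a9\u062a\u0627\u0628")  # → True
```

The two forms of each letter look almost identical on screen but compare as different strings, which is why this step matters before dictionary lookups or keyword matching.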
Tokenization
from bidnlp.tokenization import PersianWordTokenizer, PersianSentenceTokenizer
# Word tokenization
tokenizer = PersianWordTokenizer()
words = tokenizer.tokenize("من به دانشگاه میروم")
# Output: ['من', 'به', 'دانشگاه', 'می', 'روم']
# Sentence tokenization
sent_tokenizer = PersianSentenceTokenizer()
sentences = sent_tokenizer.tokenize("سلام. چطوری؟")
# Output: ['سلام.', 'چطوری؟']
POS Tagging
from bidnlp.pos import RuleBasedPOSTagger, HMMPOSTagger
# Rule-based POS tagging
tagger = RuleBasedPOSTagger()
tagged = tagger.tag("من به دانشگاه میروم")
# Output: [('من', 'PRO_PERS'), ('به', 'PREP'), ('دانشگاه', 'N'), ('میروم', 'V_PRES')]
# HMM-based tagging
hmm_tagger = HMMPOSTagger()
# Train with tagged data
training_data = [
[("من", "PRO_PERS"), ("به", "PREP"), ("خانه", "N"), ("میروم", "V_PRES")],
# ... more training examples
]
hmm_tagger.train(training_data)
tagged = hmm_tagger.tag("او کتاب میخواند")
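For intuition about what training and tagging with a Hidden Markov Model involve, the sketch below counts start, transition, and emission events from data in the same shape as `training_data` above and decodes with the Viterbi algorithm. It is a generic textbook formulation with add-one smoothing, written under stated assumptions; `train_hmm` and `viterbi` are hypothetical names and do not describe how `HMMPOSTagger` is implemented.

```python
import math
from collections import Counter, defaultdict


def train_hmm(sentences):
    """Count start, transition, and emission events from tagged sentences."""
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for sentence in sentences:
        prev = None
        for word, tag in sentence:
            emit[tag][word] += 1
            if prev is None:
                start[tag] += 1
            else:
                trans[prev][tag] += 1
            prev = tag
    return start, trans, emit


def viterbi(words, start, trans, emit):
    """Return the most likely tag sequence for a non-empty list of words."""
    tags = list(emit)
    vocab_size = len({w for c in emit.values() for w in c}) + 1  # +1 for unseen words

    def logp(counter, key, n_outcomes):
        # Add-one smoothed log-probability.
        return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))

    # best[t] = (log-score of the best path ending in tag t, that path)
    best = {t: (logp(start, t, len(tags)) + logp(emit[t], words[0], vocab_size), [t])
            for t in tags}
    for word in words[1:]:
        new_best = {}
        for t in tags:
            score, path = max((best[p][0] + logp(trans[p], t, len(tags)), best[p][1])
                              for p in tags)
            new_best[t] = (score + logp(emit[t], word, vocab_size), path + [t])
        best = new_best
    return max(best.values())[1]


data = [[("من", "PRO_PERS"), ("به", "PREP"), ("خانه", "N"), ("میروم", "V_PRES")]]
model = train_hmm(data)
print(viterbi(["من", "به", "خانه", "میروم"], *model))
# → ['PRO_PERS', 'PREP', 'N', 'V_PRES']
```

With a single training sentence the model can only reproduce what it has seen; useful accuracy requires a substantially larger tagged corpus.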
Sentiment Analysis
from bidnlp.classification import PersianSentimentAnalyzer
analyzer = PersianSentimentAnalyzer()
# Simple sentiment
sentiment = analyzer.predict("این کتاب خیلی خوب است")
# Output: 'positive'
# Detailed analysis
result = analyzer.analyze("محصول عالی اما گران است")
# Output: {'sentiment': 'neutral', 'score': 0.0,
# 'positive_words': ['عالی'], 'negative_words': ['گران']}
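The neutral result above follows naturally from keyword counting: one positive and one negative keyword cancel out. The sketch below reproduces that arithmetic with a tiny lexicon. The word lists, the score formula `(pos - neg) / (pos + neg)`, the rule that a negator flips the word immediately before it, and the name `analyze` are all assumptions for illustration, not the library's documented behavior.

```python
# Tiny illustrative lexicons; the real analyzer ships with 100+ keywords.
POSITIVE = {"خوب", "عالی"}
NEGATIVE = {"بد", "گران"}
NEGATORS = {"نیست"}  # assumed: a negator flips the sentiment word just before it


def analyze(text):
    """Count sentiment keywords, flipping polarity when a negator follows."""
    tokens = text.split()
    pos, neg = [], []
    for i, token in enumerate(tokens):
        negated = i + 1 < len(tokens) and tokens[i + 1] in NEGATORS
        if token in POSITIVE:
            (neg if negated else pos).append(token)
        elif token in NEGATIVE:
            (pos if negated else neg).append(token)
    total = len(pos) + len(neg)
    score = (len(pos) - len(neg)) / total if total else 0.0
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return {"sentiment": label, "score": score,
            "positive_words": pos, "negative_words": neg}


print(analyze("این کتاب خوب نیست")["sentiment"])  # → negative
```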
Text Classification
from bidnlp.classification import KeywordClassifier
classifier = KeywordClassifier()
# Add categories
classifier.add_category('ورزش', {'فوتبال', 'بازیکن', 'تیم'})
classifier.add_category('تکنولوژی', {'کامپیوتر', 'نرمافزار', 'برنامه'})
# Classify
category = classifier.predict("تیم فوتبال برد گرفت")
# Output: 'ورزش'
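Keyword-based classification of this kind can be pictured as counting keyword overlaps per category and picking the largest. The function below is a hypothetical sketch of that idea, assuming whitespace tokenization and a plain dict of categories; it is not the implementation behind `KeywordClassifier`.

```python
def predict(text, categories):
    """Return the category whose keyword set overlaps the text most, or None."""
    if not categories:
        return None
    tokens = set(text.split())
    scores = {name: len(tokens & keywords) for name, keywords in categories.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


categories = {
    "ورزش": {"فوتبال", "بازیکن", "تیم"},
    "تکنولوژی": {"کامپیوتر", "نرمافزار", "برنامه"},
}
print(predict("تیم فوتبال برد گرفت", categories))  # → ورزش
```

Here the sports category wins with two matching keywords against zero; a text with no matching keywords returns `None` in this sketch.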
Text Statistics
from bidnlp.utils import PersianTextStatistics
stats = PersianTextStatistics()
text = "من به دانشگاه میروم. دانشگاه بزرگ است."
statistics = stats.get_statistics(text)
# Output: {
# 'words': 8, 'sentences': 2, 'characters': 35,
# 'average_word_length': 4.38, 'lexical_diversity': 0.875, ...
# }
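The lexical diversity figure can be checked by hand. Assuming the text is split into the eight tokens listed below (an assumption about the tokenizer, consistent with the reported word count), seven of them are distinct, so the type-token ratio is 7 / 8 = 0.875. The helper name is hypothetical.

```python
def lexical_diversity(tokens):
    """Type-token ratio: distinct tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Assumed tokenization of the example text; "دانشگاه" appears twice.
tokens = ["من", "به", "دانشگاه", "می", "روم", "دانشگاه", "بزرگ", "است"]
print(lexical_diversity(tokens))  # → 0.875
```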
Stop Words
from bidnlp.utils import PersianStopWords
stopwords = PersianStopWords()
# Remove stop words
text = "من از دانشگاه به خانه می روم"
filtered = stopwords.remove_stopwords(text)
# Output: "دانشگاه خانه می روم"
# Check if word is stop word
is_stop = stopwords.is_stopword('از') # True
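Stop word removal itself is a simple filter over tokens. The sketch below uses a three-word list and whitespace splitting, both of which are simplifying assumptions; `PersianStopWords` ships with a much larger list and its own tokenization.

```python
# Tiny illustrative list; not the library's stop word inventory.
STOP_WORDS = {"من", "از", "به"}


def remove_stopwords(text, stop_words=STOP_WORDS):
    """Drop whitespace-separated tokens that appear in the stop word set."""
    return " ".join(token for token in text.split() if token not in stop_words)


print(remove_stopwords("من از دانشگاه به خانه می روم"))  # → دانشگاه خانه می روم
```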
Feature Extraction
from bidnlp.classification import TfidfVectorizer, BagOfWords
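# Sample corpus (illustrative); replace with your own list of strings
documents = ["من به دانشگاه میروم", "دانشگاه بزرگ است"]
# TF-IDF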
tfidf = TfidfVectorizer(max_features=100)
vectors = tfidf.fit_transform(documents)
# Bag of Words
bow = BagOfWords(max_features=50)
vectors = bow.fit_transform(documents)
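As background on what a TF-IDF vectorizer computes, here is a from-scratch sketch using the plain, unsmoothed definition: term frequency times `log(N / df)`. Libraries commonly apply smoothing or normalization on top of this, so the numbers need not match BidNLP's `TfidfVectorizer`; the function name `tfidf` and the dict-based output are assumptions for illustration.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


weights = tfidf(["دانشگاه بزرگ", "دانشگاه کوچک"])
print(weights[0]["دانشگاه"])  # → 0.0
```

A term that appears in every document gets weight zero under this definition, which is the property that makes TF-IDF down-weight uninformative words.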
🧪 Testing
# Run all tests
pytest tests/
# Run specific module tests
pytest tests/preprocessing/ -v
pytest tests/tokenization/ -v
pytest tests/classification/ -v
pytest tests/pos/ -v
pytest tests/utils/ -v
# Run with coverage
pytest tests/ --cov=bidnlp
📊 Project Status
| Module | Status | Tests | Coverage |
|---|---|---|---|
| Preprocessing | ✅ Complete | 58/58 | 100% |
| Tokenization | ✅ Complete | 64/64 | 100% |
| Classification | ✅ Complete | 46/46 | 100% |
| POS Tagging | ✅ Complete | 109/109 | 100% |
| Utils | ✅ Complete | 117/117 | 100% |
| Stemming | ✅ Complete | 11/11 | 100% |
| Lemmatization | ✅ Complete | 11/11 | 100% |
| Overall | ✅ 100% | 415/415 | 88%+ |
🎯 Key Features
- Persian-Specific: Designed specifically for Persian language challenges
- ZWNJ Handling: Proper handling of zero-width non-joiner characters
- Mixed Script Support: Handles Persian, Arabic, and English text
- Production Ready: All 415 tests passing, with 88%+ overall code coverage
- Easy to Use: Simple, intuitive API with extensive documentation
- Extensible: Easy to extend and customize for your needs
🌟 Use Cases
- Text Preprocessing: Clean and normalize Persian text for ML pipelines
- Sentiment Analysis: Analyze sentiment in Persian reviews and social media
- Text Classification: Categorize Persian documents and news articles
- Information Extraction: Extract meaningful information from Persian text
- Search & Retrieval: Build Persian search engines with proper tokenization
- NLP Research: Foundation for Persian NLP research and experiments
🔄 CI/CD & Quality Assurance
BidNLP uses comprehensive automated workflows to ensure code quality and reliability:
Continuous Integration
- ✅ Multi-version Testing: Automated tests across Python 3.7-3.12 on Ubuntu, macOS, and Windows
- ✅ Code Coverage: Comprehensive coverage reporting with Codecov integration
- ✅ Code Quality: Automated checks with Black, isort, flake8, and mypy
- ✅ Security Scanning: Regular security audits with Bandit, Safety, and CodeQL
- ✅ Dependency Updates: Automated dependency management with Dependabot
Release Pipeline
- ✅ Automated PyPI Publishing: Seamless releases on version tags
- ✅ GitHub Releases: Automatic changelog and artifact generation
- ✅ Package Validation: Pre-release checks ensure package integrity
🤝 Contributing
We welcome contributions! Please see our Contributing Guide for details.
Quick Start:
- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Make your changes and add tests
- Ensure all tests pass (`pytest tests/`)
- Format code (`black . && isort .`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
See CONTRIBUTING.md for detailed guidelines.
📝 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
- Thanks to all contributors who have helped build this library
- Inspired by the need for comprehensive Persian NLP tools
- Built with ❤️ for the Persian NLP community
📧 Contact
For questions, issues, or suggestions, please open an issue on GitHub.
Made with ❤️ for Persian NLP
Download files
File details
Details for the file bidnlp-0.2.2.tar.gz.
File metadata
- Download URL: bidnlp-0.2.2.tar.gz
- Upload date:
- Size: 109.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cad0d69d6a483cddea96aa8071f10353d690ae1992136206544e8474b6ae92cf |
| MD5 | 03b3c144758ef16d3fe97a0d9581d459 |
| BLAKE2b-256 | 61fbba744ad42aeb4ceb9ab7470b62732eea7bb3eb1d74733aaedffe0f0f5983 |
Provenance
The following attestation bundles were made for bidnlp-0.2.2.tar.gz:
Publisher: release.yml on aghabidareh/bidnlp
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: bidnlp-0.2.2.tar.gz
- Subject digest: cad0d69d6a483cddea96aa8071f10353d690ae1992136206544e8474b6ae92cf
- Sigstore transparency entry: 699291650
- Permalink: aghabidareh/bidnlp@d83c3dd0c2515ebbfd6d852843fbd2f52d77157a
- Branch / Tag: refs/tags/v0.2.2
- Owner: https://github.com/aghabidareh
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@d83c3dd0c2515ebbfd6d852843fbd2f52d77157a
- Trigger Event: push
File details
Details for the file bidnlp-0.2.2-py3-none-any.whl.
File metadata
- Download URL: bidnlp-0.2.2-py3-none-any.whl
- Upload date:
- Size: 63.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cf27aed1215682bae4c794b4be231cf0e6ef13121f745c74d4688fc98b3c8811 |
| MD5 | b066dafed41179d40c948eddd0a8f402 |
| BLAKE2b-256 | e395b46463d4ad9c576f5d45a7ddb35d85743fcc15c80e36a6eb351451692921 |
Provenance
The following attestation bundles were made for bidnlp-0.2.2-py3-none-any.whl:
Publisher: release.yml on aghabidareh/bidnlp
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: bidnlp-0.2.2-py3-none-any.whl
- Subject digest: cf27aed1215682bae4c794b4be231cf0e6ef13121f745c74d4688fc98b3c8811
- Sigstore transparency entry: 699291654
- Permalink: aghabidareh/bidnlp@d83c3dd0c2515ebbfd6d852843fbd2f52d77157a
- Branch / Tag: refs/tags/v0.2.2
- Owner: https://github.com/aghabidareh
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@d83c3dd0c2515ebbfd6d852843fbd2f52d77157a
- Trigger Event: push