A Persian (Farsi) Natural Language Processing library

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: MIT License
Natural Language
- Persian
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

BidNLP

A Comprehensive Persian (Farsi) Natural Language Processing Library

BidNLP is a production-ready Python library for Persian text processing, offering a complete suite of NLP tools specifically designed for the unique challenges of Persian language processing.

✨ Features

🔧 Preprocessing (100% Complete)

Text Normalization: Arabic to Persian character conversion, diacritic removal, ZWNJ normalization
Text Cleaning: URL, email, HTML tag removal, emoji handling
Number Processing: Persian ↔ English ↔ Arabic-Indic digit conversion
Date Normalization: Jalali date handling and formatting
Punctuation: Persian and Latin punctuation normalization

✂️ Tokenization (100% Complete)

Word Tokenizer: ZWNJ-aware, handles compound words and mixed scripts
Sentence Tokenizer: Smart boundary detection with abbreviation support
Character Tokenizer: Character-level tokenization with diacritic handling
Morpheme Tokenizer: Prefix/suffix detection and morphological analysis
Syllable Tokenizer: Persian syllable segmentation

🔍 Stemming & Lemmatization (Partial)

Stemming: Conservative suffix removal with minimum stem length
Lemmatization: Dictionary-based lemmatization with irregular form support
Arabic Plural Handling: Special support for Arabic broken plurals

📊 Classification (100% Complete)

Sentiment Analysis: Keyword-based with 100+ sentiment keywords and negation handling
Text Classification: Keyword-based multi-class categorization
Feature Extraction: Bag-of-Words, TF-IDF, N-gram extraction

🛠️ Utilities (100% Complete)

Character Utils: Persian alphabet, character type detection, diacritic handling
Statistics: Word count, sentence count, lexical diversity, n-gram frequency
Stop Words: 100+ Persian stop words with custom support
Validators: Text quality scoring, normalization checking
Metrics: Precision, Recall, F1, BLEU, edit distance, and more

📦 Installation

pip install bidnlp

From source:

git clone https://github.com/aghabidareh/bidnlp.git
cd bidnlp
pip install -e .

🚀 Quick Start

Preprocessing

from bidnlp.preprocessing import PersianNormalizer, PersianTextCleaner

# Normalize text
normalizer = PersianNormalizer()
text = normalizer.normalize("كتاب يک")  # Converts: کتاب یک

# Clean text
cleaner = PersianTextCleaner(remove_urls=True, remove_emojis=True)
clean_text = cleaner.clean("سلام 😊 https://test.com")  # Output: سلام

Tokenization

from bidnlp.tokenization import PersianWordTokenizer, PersianSentenceTokenizer

# Word tokenization
tokenizer = PersianWordTokenizer()
words = tokenizer.tokenize("من به دانشگاه می‌روم")
# Output: ['من', 'به', 'دانشگاه', 'می', 'روم']

# Sentence tokenization
sent_tokenizer = PersianSentenceTokenizer()
sentences = sent_tokenizer.tokenize("سلام. چطوری؟")
# Output: ['سلام.', 'چطوری؟']

Sentiment Analysis

from bidnlp.classification import PersianSentimentAnalyzer

analyzer = PersianSentimentAnalyzer()

# Simple sentiment
sentiment = analyzer.predict("این کتاب خیلی خوب است")
# Output: 'positive'

# Detailed analysis
result = analyzer.analyze("محصول عالی اما گران است")
# Output: {'sentiment': 'neutral', 'score': 0.0,
#          'positive_words': ['عالی'], 'negative_words': ['گران']}

Text Classification

from bidnlp.classification import KeywordClassifier

classifier = KeywordClassifier()

# Add categories
classifier.add_category('ورزش', {'فوتبال', 'بازیکن', 'تیم'})
classifier.add_category('تکنولوژی', {'کامپیوتر', 'نرم‌افزار', 'برنامه'})

# Classify
category = classifier.predict("تیم فوتبال برد گرفت")
# Output: 'ورزش'

Text Statistics

from bidnlp.utils import PersianTextStatistics

stats = PersianTextStatistics()
text = "من به دانشگاه می‌روم. دانشگاه بزرگ است."

statistics = stats.get_statistics(text)
# Output: {
#   'words': 8, 'sentences': 2, 'characters': 35,
#   'average_word_length': 4.38, 'lexical_diversity': 0.875, ...
# }

Stop Words

from bidnlp.utils import PersianStopWords

stopwords = PersianStopWords()

# Remove stop words
text = "من از دانشگاه به خانه می روم"
filtered = stopwords.remove_stopwords(text)
# Output: "دانشگاه خانه می روم"

# Check if word is stop word
is_stop = stopwords.is_stopword('از')  # True

Feature Extraction

from bidnlp.classification import TfidfVectorizer, BagOfWords

# TF-IDF
tfidf = TfidfVectorizer(max_features=100)
vectors = tfidf.fit_transform(documents)

# Bag of Words
bow = BagOfWords(max_features=50)
vectors = bow.fit_transform(documents)

📚 Documentation

For detailed documentation and examples, see:

Quick Start Guide - Get started quickly
Roadmap - Full project documentation and development guide
Session Summary - Latest development updates
Examples - Comprehensive usage examples

🧪 Testing

# Run all tests
pytest tests/

# Run specific module tests
pytest tests/preprocessing/ -v
pytest tests/tokenization/ -v
pytest tests/classification/ -v
pytest tests/utils/ -v

# Run with coverage
pytest tests/ --cov=bidnlp

📊 Project Status

Module	Status	Tests	Coverage
Preprocessing	✅ Complete	58/58	100%
Tokenization	✅ Complete	64/64	100%
Classification	✅ Complete	46/46	100%
Utils	✅ Complete	117/117	100%
Stemming	⚠️ Partial	7/14	50%
Lemmatization	⚠️ Partial	9/20	45%
Overall	94.1%	302/321	94.1%

🎯 Key Features

Persian-Specific: Designed specifically for Persian language challenges
ZWNJ Handling: Proper handling of zero-width non-joiner characters
Mixed Script Support: Handles Persian, Arabic, and English text
Production Ready: 94.1% test coverage with comprehensive testing
Easy to Use: Simple, intuitive API with extensive documentation
Extensible: Easy to extend and customize for your needs

🌟 Use Cases

Text Preprocessing: Clean and normalize Persian text for ML pipelines
Sentiment Analysis: Analyze sentiment in Persian reviews and social media
Text Classification: Categorize Persian documents and news articles
Information Extraction: Extract meaningful information from Persian text
Search & Retrieval: Build Persian search engines with proper tokenization
NLP Research: Foundation for Persian NLP research and experiments

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Fork the repository
Create your feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Thanks to all contributors who have helped build this library
Inspired by the need for comprehensive Persian NLP tools
Built with ❤️ for the Persian NLP community

📧 Contact

For questions, issues, or suggestions, please open an issue on GitHub.

Made with ❤️ for Persian NLP

Project details

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: MIT License
Natural Language
- Persian
Operating System
- OS Independent
Programming Language
- Python :: 3

Release history Release notifications | RSS feed

0.2.2

Nov 13, 2025

0.1.4

Oct 9, 2025

This version

0.1.0

Oct 2, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bidnlp-0.1.0.tar.gz (82.1 kB view details)

Uploaded Oct 2, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

bidnlp-0.1.0-py3-none-any.whl (79.2 kB view details)

Uploaded Oct 2, 2025 Python 3

File details

Details for the file bidnlp-0.1.0.tar.gz.

File metadata

Download URL: bidnlp-0.1.0.tar.gz
Upload date: Oct 2, 2025
Size: 82.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for bidnlp-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`3a74f68e5d50a3d60fd7a8014abb5fb3cc89af899d6f9e4e57b4e89eff82d56e`
MD5	`f7232624a3c5968ee690ef8f52a343af`
BLAKE2b-256	`cd6a172a1cc01cd41ad7c1087e1f5f7abf7b7ebdad6a441a99da5e49553b9974`

See more details on using hashes here.

File details

Details for the file bidnlp-0.1.0-py3-none-any.whl.

File metadata

Download URL: bidnlp-0.1.0-py3-none-any.whl
Upload date: Oct 2, 2025
Size: 79.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for bidnlp-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6f0d58baa185ed4488f963fcfb6caee9e1bbc7931bf4fd9787b2ab054a8bcc9e`
MD5	`754b08c9e2888889d6eb5d0890801a2e`
BLAKE2b-256	`7dd0dcdcc175f8884428a82315d4b1b384f87cae7131a12a49a9cf7c1a5efc10`

See more details on using hashes here.

bidnlp 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

BidNLP

✨ Features

🔧 Preprocessing (100% Complete)

✂️ Tokenization (100% Complete)

🔍 Stemming & Lemmatization (Partial)

📊 Classification (100% Complete)

🛠️ Utilities (100% Complete)

📦 Installation

🚀 Quick Start

Preprocessing

Tokenization

Sentiment Analysis

Text Classification

Text Statistics

Stop Words

Feature Extraction

📚 Documentation

🧪 Testing

📊 Project Status

🎯 Key Features

🌟 Use Cases

🤝 Contributing

📝 License

🙏 Acknowledgments

📧 Contact

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes