A bilingual text preprocessing toolkit for English and Persian.
English and Persian Text Preprocessing Pipeline
This project provides a robust preprocessing pipeline for English and Persian text, designed for a variety of Natural Language Processing (NLP) tasks such as translation, sentiment analysis, named entity recognition (NER), and more. It includes tools for data cleaning, normalization, frequency analysis, and dataset preparation for machine learning models.
Features
- Task-Specific Preprocessing:
- Supports tasks like translation, sentiment, ner, spam_detection, topic_modeling, and summarization.
- Language-Specific Preprocessing:
- Persian: Diacritic removal, numeral normalization, punctuation handling.
- English: Spelling correction, contraction expansion, lemmatization (see the short spaCy sketch after this list).
- Dataset Splitting:
- Splits data into train, validation, and test sets with configurable ratios.
- Frequency Analysis:
- Word and character frequency analysis with export to CSV and Excel files.
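For context on the English-specific steps, the snippet below shows what spaCy-based lemmatization looks like on its own, using the en_core_web_sm model required under Prerequisites. It is an illustration of the technique, not MorphoPreText's internal code.
import spacy

# Requires: python -m spacy download en_core_web_sm (see Prerequisites)
nlp = spacy.load("en_core_web_sm")

doc = nlp("I'm testing the preprocessing pipelines.")
# Lowercased lemmas with punctuation dropped; the contraction "'m" lemmatizes to "be".
print([t.lemma_.lower() for t in doc if not t.is_punct])
# roughly: ['i', 'be', 'test', 'the', 'preprocessing', 'pipeline']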
Prerequisites
Python Version
- Requires Python 3.8 or higher.
Install Dependencies
Install required libraries:
pip install -r requirements.txt
Download the SpaCy model for English processing:
python -m spacy download en_core_web_sm
Usage
Step 1: Preprocess Data
Run the main.py script to preprocess data for a specific task. Example for the translation task:
python main.py --task translation --input translation_data.csv --output output_directory
Arguments:
- --task: The NLP task (translation, sentiment, ner, etc.).
- --input: Path to the input CSV file.
- --output: Directory to save the cleaned data.
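A quick way to sanity-check the result is to load the cleaned CSV with pandas. The file name below follows the pattern shown under Project Structure, and the column names assume the translation task.
import pandas as pd

# Cleaned output written by main.py for the translation task.
df = pd.read_csv("output_directory/cleaned_data_translation.csv")
print(df.columns.tolist())  # for the translation task: ['English', 'Persian']
print(df.head())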
Step 2: Split Dataset (Optional)
Use separate_train_test_validation.py to split the preprocessed dataset into train, validation, and test sets:
python separate_train_test_validation.py \
--input output_directory/cleaned_data_translation.csv \
--target Persian \
--train_ratio 0.7 \
--val_ratio 0.2 \
--test_ratio 0.1 \
--output_dir output_directory
Arguments:
- --input: Path to the preprocessed file.
- --target: The target column (e.g., Persian for translation).
- --train_ratio, --val_ratio, --test_ratio: Ratios for dataset splitting.
- --output_dir: Directory to save the train/val/test splits.
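For reference, the sketch below shows the general shape of a 70/20/10 split in pandas. It is a generic illustration of how the ratios apply, not the script's actual implementation.
import pandas as pd

# Shuffle the rows, then cut at the 70% and 90% marks.
df = pd.read_csv("output_directory/cleaned_data_translation.csv")
df = df.sample(frac=1.0, random_state=42).reset_index(drop=True)

n = len(df)
train_end = int(0.7 * n)
val_end = train_end + int(0.2 * n)

df.iloc[:train_end].to_csv("output_directory/train.csv", index=False)
df.iloc[train_end:val_end].to_csv("output_directory/validation.csv", index=False)
df.iloc[val_end:].to_csv("output_directory/test.csv", index=False)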
Step 3: Frequency Analysis (Optional)
Analyze word and character frequencies using character_word_count.py:
from character_word_count import WordCharacterCount
# Example dataset
data = ["Hello world!", "Welcome to preprocessing."]
# Initialize the tool
counter = WordCharacterCount(output_directory="output_directory")
# Generate word frequency report
word_freq = counter.word_count(data, file_name="example_word_frequency")
# Generate character frequency report
char_freq = counter.character_count(data, file_name="example_char_frequency")
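With the names above, the reports are written to the output directory passed at initialization, appearing as example_word_frequency_WordsCount.csv and example_char_frequency_CharactersCount.csv (see Project Structure below).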
Project Structure
After running the scripts, the directory structure will look like this:
.
├── main.py # Main preprocessing script.
├── english_text_preprocessor.py # English-specific preprocessing utilities.
├── persian_text_preprocessor.py # Persian-specific preprocessing utilities.
├── Dictionaries_En.py # English dictionaries and mappings.
├── Dictionaries_Fa.py # Persian dictionaries and mappings.
├── character_word_count.py # Word and character frequency analysis tool.
├── separate_train_test_validation.py # Dataset splitting script.
├── stopwords.txt # Persian stopword list.
├── requirements.txt # Dependencies list.
├── translation_data.csv # Sample input dataset.
├── output_directory/ # Directory containing generated outputs.
│ ├── cleaned_data_translation.csv # Cleaned dataset (CSV format).
│ ├── cleaned_data_translation.xlsx # Cleaned dataset (Excel format).
│ ├── train.csv # Training set.
│ ├── validation.csv # Validation set.
│ ├── test.csv # Test set.
│ ├── example_word_frequency_WordsCount.csv # Word frequency report (CSV).
│ ├── example_char_frequency_CharactersCount.csv # Character frequency report (CSV).
├── README.md # Project documentation.
Supported Tasks
- Translation: Processes datasets with English and Persian columns; retains minimal normalization to preserve translation context.
- Sentiment Analysis: Cleans data by removing emojis, punctuation, and stopwords.
- Named Entity Recognition (NER): Retains entity-specific context while applying basic normalization.
- Topic Modeling: Removes stopwords and applies lemmatization for better topic clustering.
- Spam Detection: Prepares datasets for binary spam vs. non-spam classification.
- Summarization: Retains sentence structure and punctuation for summary generation.
- Default Task: Applies general-purpose text cleaning and normalization.
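Each task is selected with the same command-line flags shown in Step 1. For example, a sentiment dataset could be processed as follows (the input file name is a placeholder for your own data):
python main.py --task sentiment --input sentiment_data.csv --output output_directory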
Sample Input and Output
Input: translation_data.csv
English,Persian
"Hello, world!", "سلام دنیا!"
"This is an example.", "این یک مثال است."
Preprocessed Output
Saved in output_directory/cleaned_data_translation.csv:
English,Persian
"hello world", "سلام دنیا"
"this is an example", "این یک مثال است"
Dataset Splits
Saved in output_directory/:
- train.csv
- validation.csv
- test.csv
Customization
Task Configurations
- Modify preprocessing settings in english_text_preprocessor.py and persian_text_preprocessor.py.
- Adjust configurations for punctuation, stopword removal, or specific tasks (a small stopword-list example follows below).
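As one small, concrete example, extra Persian stopwords could be appended to stopwords.txt before running the pipeline. This assumes the file is a plain list with one word per line, which this README does not state explicitly.
# Append extra Persian stopwords (assumes one word per line in stopwords.txt).
extra_stopwords = ["مثلا", "حتی"]
with open("stopwords.txt", "a", encoding="utf-8") as f:
    for word in extra_stopwords:
        f.write(word + "\n")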
Download files
Source Distribution
- MorphoPreText-0.1.0.tar.gz (20.1 kB)

Built Distribution
- MorphoPreText-0.1.0-py3-none-any.whl (25.1 kB)
File details
Details for the file MorphoPreText-0.1.0.tar.gz.
File metadata
- Download URL: MorphoPreText-0.1.0.tar.gz
- Upload date:
- Size: 20.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f5413eabae8c856cf722d5f77e6452cfea72682714a9c131d84dc11638661ac2 |
| MD5 | 62cb6414aac838b06705fa4dd5c14ad4 |
| BLAKE2b-256 | b37464be61b8b56c80ca77caaeab88fe8fd25b9f7a7c3ed94bcea30a26c5b315 |
File details
Details for the file MorphoPreText-0.1.0-py3-none-any.whl.
File metadata
- Download URL: MorphoPreText-0.1.0-py3-none-any.whl
- Upload date:
- Size: 25.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0742d11d395ee01b375301a044c5bf608a4302c6ae967c0ace3a25fee7b57838 |
| MD5 | ae62838e3af860120dcd4262474920d1 |
| BLAKE2b-256 | 57e83fea015db9478c8a7bdf7a7dbc637b5046f0f5e26dcbf31229bbdced74bd |