
A bilingual text preprocessing toolkit for English and Persian.

Project description

English and Persian Text Preprocessing Pipeline

This project provides a robust preprocessing pipeline for English and Persian text, designed for a variety of Natural Language Processing (NLP) tasks such as translation, sentiment analysis, and named entity recognition (NER). It includes tools for data cleaning, normalization, frequency analysis, and dataset preparation for machine learning models.


Features

  • Task-Specific Preprocessing:
    • Supports tasks like translation, sentiment, ner, spam_detection, topic_modeling, and summarization.
  • Language-Specific Preprocessing:
    • Persian: Diacritic removal, numeral normalization, punctuation handling.
    • English: Spelling correction, contractions expansion, lemmatization.
  • Dataset Splitting:
    • Splits data into train, validation, and test sets with configurable ratios.
  • Frequency Analysis:
    • Word and character frequency analysis with export to CSV and Excel files.

Prerequisites

Python Version

  • Requires Python 3.8 or higher.

Install Dependencies

Install required libraries:

pip install -r requirements.txt

Download the spaCy model for English processing:

python -m spacy download en_core_web_sm
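
To verify the model installed correctly, an optional quick check using the standard spaCy API:

import spacy

# Loads the small English pipeline; raises OSError if the model is missing.
nlp = spacy.load("en_core_web_sm")
print(nlp("The cats are running.")[3].lemma_)  # prints "run"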

Usage

Step 1: Preprocess Data

Run the main.py script to preprocess data for a specific task. Example for the translation task:

python main.py --task translation --input translation_data.csv --output output_directory

Arguments:

  • --task: The NLP task (translation, sentiment, ner, etc.).
  • --input: Path to the input CSV file.
  • --output: Directory to save the cleaned data.
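
The cleaned file can then be inspected with pandas (a quick sketch; assumes pandas is installed and uses the example output path above):

import pandas as pd

# Load the cleaned dataset written by main.py.
df = pd.read_csv("output_directory/cleaned_data_translation.csv")
print(df.head())                # preview the first rows
print(df.columns.tolist())      # for translation data: ["English", "Persian"]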

Step 2: Split Dataset (Optional)

Use separate_train_test_validation.py to split the preprocessed dataset into train, validation, and test sets:

python separate_train_test_validation.py \
  --input output_directory/cleaned_data_translation.csv \
  --target Persian \
  --train_ratio 0.7 \
  --val_ratio 0.2 \
  --test_ratio 0.1 \
  --output_dir output_directory

Arguments:

  • --input: Path to the preprocessed file.
  • --target: The target column (e.g., Persian for translation).
  • --train_ratio, --val_ratio, --test_ratio: Ratios for dataset splitting.
  • --output_dir: Directory to save the train/val/test splits.
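
The three ratios should sum to 1. The split is equivalent to two chained random splits; here is a minimal sketch of the same 70/20/10 partition using scikit-learn (for illustration only, not necessarily the script's internal implementation):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("output_directory/cleaned_data_translation.csv")

# First cut: 70% train, 30% held out.
train, holdout = train_test_split(df, test_size=0.3, random_state=42)

# Second cut: split the 30% holdout into 20% validation and 10% test,
# i.e., 2/3 and 1/3 of the holdout.
val, test = train_test_split(holdout, test_size=1/3, random_state=42)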

Step 3: Frequency Analysis (Optional)

Analyze word and character frequencies using character_word_count.py:

from character_word_count import WordCharacterCount

# Example dataset
data = ["Hello world!", "Welcome to preprocessing."]

# Initialize the tool
counter = WordCharacterCount(output_directory="output_directory")

# Generate word frequency report
word_freq = counter.word_count(data, file_name="example_word_frequency")

# Generate character frequency report
char_freq = counter.character_count(data, file_name="example_char_frequency")
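
Each call writes its report into the given output_directory; with the example names above, the word and character reports land in output_directory/example_word_frequency_WordsCount.csv and output_directory/example_char_frequency_CharactersCount.csv (see the project structure below).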

Project Structure

After running the scripts, the directory structure will look like this:

.
├── main.py                     # Main preprocessing script.
├── english_text_preprocessor.py # English-specific preprocessing utilities.
├── persian_text_preprocessor.py # Persian-specific preprocessing utilities.
├── Dictionaries_En.py          # English dictionaries and mappings.
├── Dictionaries_Fa.py          # Persian dictionaries and mappings.
├── character_word_count.py     # Word and character frequency analysis tool.
├── separate_train_test_validation.py # Dataset splitting script.
├── stopwords.txt               # Persian stopword list.
├── requirements.txt            # Dependencies list.
├── translation_data.csv        # Sample input dataset.
├── output_directory/           # Directory containing generated outputs.
│   ├── cleaned_data_translation.csv   # Cleaned dataset (CSV format).
│   ├── cleaned_data_translation.xlsx  # Cleaned dataset (Excel format).
│   ├── train.csv                       # Training set.
│   ├── validation.csv                  # Validation set.
│   ├── test.csv                        # Test set.
│   ├── example_word_frequency_WordsCount.csv    # Word frequency report (CSV).
│   └── example_char_frequency_CharactersCount.csv # Character frequency report (CSV).
└── README.md                   # Project documentation.

Supported Tasks

  1. Translation:
    • Processes datasets with English and Persian columns.
    • Retains minimal normalization to preserve translation context.
  2. Sentiment Analysis:
    • Cleans data by removing emojis, punctuation, and stopwords.
  3. Named Entity Recognition (NER):
    • Retains entity-specific context while applying basic normalization.
  4. Topic Modeling:
    • Removes stopwords and applies lemmatization for better topic clustering.
  5. Spam Detection:
    • Prepares datasets for binary spam vs. non-spam classification.
  6. Summarization:
    • Retains sentence structure and punctuation for summary generation.
  7. Default Task:
    • Applies general-purpose text cleaning and normalization.
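
Each task maps to a different combination of the cleaning steps above; the actual options live in english_text_preprocessor.py and persian_text_preprocessor.py. As a rough sketch only (the option names below are hypothetical, not the package's real API):

# Hypothetical illustration of per-task settings; the real configuration
# lives in english_text_preprocessor.py and persian_text_preprocessor.py.
TASK_CONFIGS = {
    "translation":    {"remove_stopwords": False, "lemmatize": False},  # minimal normalization
    "sentiment":      {"remove_stopwords": True,  "remove_emojis": True},
    "ner":            {"remove_stopwords": False, "lemmatize": False},  # keep entity context
    "topic_modeling": {"remove_stopwords": True,  "lemmatize": True},
    "summarization":  {"keep_punctuation": True},  # preserve sentence structure
}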

Sample Input and Output

Input: translation_data.csv

English,Persian
"Hello, world!","سلام دنیا!"
"This is an example.","این یک مثال است."

Preprocessed Output

Saved in output_directory/cleaned_data_translation.csv:

English,Persian
"hello world","سلام دنیا"
"this is an example","این یک مثال است"

Dataset Splits

Saved in output_directory/:

  • train.csv
  • validation.csv
  • test.csv

Customization

Task Configurations

  • Modify preprocessing settings in english_text_preprocessor.py and persian_text_preprocessor.py.
  • Adjust configurations for punctuation, stopword removal, or specific tasks.
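
For example, the Persian stopword list can be extended by appending entries to stopwords.txt (a minimal sketch; the added words are placeholder examples):

# Append custom entries to the Persian stopword list read by the preprocessor.
extra_stopwords = ["مثلا", "البته"]  # hypothetical additions
with open("stopwords.txt", "a", encoding="utf-8") as f:
    for word in extra_stopwords:
        f.write(word + "\n")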

Download files

Download the file for your platform.

Source Distribution

MorphoPreText-0.1.0.tar.gz (20.1 kB)

Built Distribution

MorphoPreText-0.1.0-py3-none-any.whl (25.1 kB)

File details

Details for the file MorphoPreText-0.1.0.tar.gz.

File metadata

  • Download URL: MorphoPreText-0.1.0.tar.gz
  • Size: 20.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.10.12

File hashes

Hashes for MorphoPreText-0.1.0.tar.gz:

  • SHA256: f5413eabae8c856cf722d5f77e6452cfea72682714a9c131d84dc11638661ac2
  • MD5: 62cb6414aac838b06705fa4dd5c14ad4
  • BLAKE2b-256: b37464be61b8b56c80ca77caaeab88fe8fd25b9f7a7c3ed94bcea30a26c5b315

File details

Details for the file MorphoPreText-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: MorphoPreText-0.1.0-py3-none-any.whl
  • Size: 25.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.10.12

File hashes

Hashes for MorphoPreText-0.1.0-py3-none-any.whl:

  • SHA256: 0742d11d395ee01b375301a044c5bf608a4302c6ae967c0ace3a25fee7b57838
  • MD5: ae62838e3af860120dcd4262474920d1
  • BLAKE2b-256: 57e83fea015db9478c8a7bdf7a7dbc637b5046f0f5e26dcbf31229bbdced74bd
