A utility for normalizing Persian, Arabic, and English texts
Piraye: Advanced NLP Utilities for Persian, Arabic, and English
Piraye is a Python library providing flexible text normalization and tokenization utilities for Persian, Arabic, and English NLP tasks. With comprehensive type hints, extensive documentation, and a clean architecture, Piraye is production-ready for modern NLP pipelines.
📑 Table of Contents
- Key Features
- Installation
- Quick Start: Text Normalization
- Normalizer Output
- Position Mapping
- Configuration Options
- Tokenization Framework
- Built-in Tokenizers
- TokenizerPipeline
- Project Structure
- License
- Maintainers
- Show Your Support
🚀 Key Features
| Feature | Description |
|---|---|
| Multi-Language Normalization | Normalize alphabets, digits, punctuation, and whitespace for Persian, Arabic, and English. |
| Advanced Tokenization | Regex-based, NLTK-based, Spacy-based, and custom tokenizers with hierarchical support. |
| Tokenizer Pipeline | Chain multiple tokenizers for sophisticated text processing workflows. |
| Position Tracking | Map positions between original and normalized text. |
| Multi-Lingual Detection | Automatic language detection and appropriate normalization. |
| Type Safe | Complete type hints for modern Python development. |
| Well Documented | Comprehensive documentation and usage examples. |
| Production Ready | Clean architecture, extensive testing, and easy integration. |
📦 Installation
Basic Installation
pip install piraye
Full Installation (with Spacy support)
pip install piraye[full]
Requirements: Python 3.11+
🧠 Quick Start: Text Normalization
Normalize Persian text by correcting and standardizing letters, digits, and punctuation, performing tokenization, and removing extra spaces to produce clean, consistent text ready for NLP processing.
Basic Normalization (Builder Pattern)
from piraye import NormalizerBuilder
text = "این یک متن تسة اسﺘ , 24/12/1400 "
normalizer = (NormalizerBuilder()
              .alphabet_fa()
              .digit_fa()
              .punctuation_fa()
              .tokenizing()
              .remove_extra_spaces()
              .build())
# normalize() returns tuple: (normalized_text, NormalizationResult)
normalized_text, result = normalizer.normalize(text)
print(normalized_text) # "این یک متن تست است ، ۲۴/۱۲/۱۴۰۰"
print(result.shifts) # Position shifts for mapping
print(result.punc_positions) # Punctuation locations in normalized text
Using Config Constructor
from piraye import NormalizerBuilder
from piraye.tasks.normalizer.normalizer_builder import Config
text = "این یک متن تسة اسﺘ , 24/12/1400 "
normalizer = NormalizerBuilder(
    configs=[Config.PUNCTUATION_FA, Config.ALPHABET_FA, Config.DIGIT_FA],
    remove_extra_spaces=True,
    tokenization=True
).build()
normalized_text, result = normalizer.normalize(text)
print(normalized_text) # "این یک متن تست است ، ۲۴/۱۲/۱۴۰۰"
📖 For more examples and usage patterns, see Normalizer Examples.
📊 Normalizer Output
The normalize() method returns a tuple containing the normalized text and a NormalizationResult object with metadata.
Return Value Structure
normalized_text, result = normalizer.normalize(text)
# Returns: tuple[str, NormalizationResult]
NormalizationResult Properties
| Property | Type | Description |
|---|---|---|
| shifts | list[tuple[int, int]] | Position shifts tracking character position changes during normalization. Format: (position, shift) |
| punc_positions | list[int] | List of punctuation character positions in the normalized text |
Example
from piraye import NormalizerBuilder
normalizer = (NormalizerBuilder()
              .alphabet_fa()
              .punctuation_fa()
              .digit_fa()
              .remove_extra_spaces()
              .build())
text = "سلام، این ۱۲۳ است."
normalized_text, result = normalizer.normalize(text)
# Normalized text
print(normalized_text)
# Output: "سلام، این ۱۲۳ است."
# Shifts for position mapping
print(result.shifts)
# Output: [(4, 0), (9, 1), (13, 2), (17, 3)]
# Each tuple represents (position_in_normalized_text, cumulative_shift_from_original)
# Punctuation positions
print(result.punc_positions)
# Output: [4, 17]
# Positions where punctuation characters (، and .) are located in normalized text
# Access individual punctuation characters
for pos in result.punc_positions:
    char = normalized_text[pos]
    print(f"Punctuation at position {pos}: '{char}'")
# Output:
# Punctuation at position 4: '،'
# Punctuation at position 17: '.'
🔢 Position Mapping After Normalization
When normalizing text, characters may be added, removed, or replaced. Piraye tracks these changes and provides utilities to map positions between normalized and original text.
Methods
| Method | Description |
|---|---|
| calc_original_position(shifts, position) | Returns the original position for a single index in normalized text. |
| calc_original_positions(shifts, positions) | Returns original positions for multiple indices (must be sorted). |
Example
from piraye import NormalizerBuilder
normalizer = (NormalizerBuilder()
              .space_normal()
              .remove_extra_spaces()
              .alphabet_en()
              .punctuation_en()
              .build())
text = "Hello , World !"
normalized_text, result = normalizer.normalize(text)
# Access shifts from NormalizationResult
shifts = result.shifts
print(f"Shifts: {shifts}")
# Map single position
original_pos = normalizer.calc_original_position(shifts, 7)
print(f"Position 7 in normalized text was at position {original_pos} in original")
# Map multiple positions (must be sorted)
positions = [3, 7, 12]
original_positions = normalizer.calc_original_positions(shifts, positions)
print(f"Positions {positions} map to {original_positions} in original text")
Working with Punctuation Positions
from piraye import NormalizerBuilder
normalizer = (NormalizerBuilder()
              .alphabet_fa()
              .punctuation_fa()
              .build())
text = "سلام، این یک متن است."
normalized_text, result = normalizer.normalize(text)
# Access punctuation positions
print(f"Punctuation found at positions: {result.punc_positions}")
# Get the actual punctuation characters
punc_chars = [normalized_text[pos] for pos in result.punc_positions]
print(f"Punctuation characters: {punc_chars}")
💡 Tip: Use position mapping to align annotations, highlight text, or track character positions through normalization.
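For example, a span located in the normalized text can be projected back onto the original string. A minimal sketch using the methods above (the regex search is just an illustrative way to obtain a span):
import re
from piraye import NormalizerBuilder

normalizer = (NormalizerBuilder()
              .alphabet_en()
              .punctuation_en()
              .remove_extra_spaces()
              .build())

original = "Hello ,   World !"
normalized_text, result = normalizer.normalize(original)

# Locate a span in the normalized text.
match = re.search(r"World", normalized_text)
if match:
    # calc_original_positions expects a sorted list of indices.
    start, end = normalizer.calc_original_positions(result.shifts, list(match.span()))
    print(original[start:end])  # the same span as it appeared in the original text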
⚙️ Configuration Options
Piraye provides various configurations for text normalization:
| Config | Function | Description |
|---|---|---|
| ALPHABET_AR | alphabet_ar | Maps alphabet characters to Arabic |
| ALPHABET_EN | alphabet_en | Maps alphabet characters to English |
| ALPHABET_FA | alphabet_fa | Maps alphabet characters to Persian |
| DIGIT_AR | digit_ar | Converts digits to Arabic digits |
| DIGIT_EN | digit_en | Converts digits to English digits |
| DIGIT_FA | digit_fa | Converts digits to Persian digits |
| DIACRITIC_DELETE | diacritic_delete | Removes all diacritics |
| SPACE_DELETE | space_delete | Removes all spaces |
| SPACE_NORMAL | space_normal | Normalizes spaces (e.g., NO-BREAK SPACE, Tab, etc.) |
| SPACE_KEEP | space_keep | Maps spaces and keeps them as-is |
| PUNCTUATION_AR | punctuation_ar | Maps punctuations to Arabic punctuations |
| PUNCTUATION_FA | punctuation_fa | Maps punctuations to Persian punctuations |
| PUNCTUATION_EN | punctuation_en | Maps punctuations to English punctuations |
Other attributes:
- remove_extra_spaces: Collapses multiple consecutive spaces into a single space.
- tokenization: Converts punctuation characters into separate tokens.
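Each configuration has a matching builder method, so they can be combined freely. A minimal sketch pairing two entries from the table (the sample text is illustrative):
from piraye import NormalizerBuilder

# Strip diacritics and map digits to English, per the table above.
normalizer = (NormalizerBuilder()
              .diacritic_delete()
              .digit_en()
              .build())

normalized_text, result = normalizer.normalize("عربیٌ ۱۲۳")
print(normalized_text)  # diacritic removed, digits converted to "123"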
✂️ Tokenization Framework
All tokenizers inherit from the Tokenizer abstract base class and produce Token objects with rich metadata.
Token Structure
| Field | Type | Description |
|---|---|---|
| content | str | The text content of the token. |
| type | str | The type or name of the tokenizer that created it. |
| position | tuple[int, int] | Start and end indices of the token in the original text. |
| sub_tokens | List[Token] | A list of child tokens (for hierarchical tokenization). |
Base Methods
- tokenize(text: str) -> List[Token] – Main tokenization method
- merge(text: str, previous_tokens: List[Token]) -> List[Token] – Merge tokens hierarchically
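As a hedged sketch of how the two methods compose, the word tokenizer below refines sentence tokens. This assumes NltkWordTokenizer is importable alongside NltkSentenceTokenizer (see the pipeline example below) and that merge() nests the finer tokens under sub_tokens:
from piraye.tasks.tokenizer import NltkSentenceTokenizer, NltkWordTokenizer

text = "Piraye normalizes text. It also tokenizes it."

# tokenize() produces the top-level tokens.
sentences = NltkSentenceTokenizer().tokenize(text)

# merge() refines the previous tokens hierarchically.
tokens = NltkWordTokenizer().merge(text, sentences)
for token in tokens:
    print(token.content, [sub.content for sub in token.sub_tokens])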
🔤 Built-in Tokenizers
NLTK-based Tokenizers
- NltkWordTokenizer – Word-level tokenization using NLTK
- NltkSentenceTokenizer – Sentence-level tokenization using the Punkt algorithm
Spacy-based Tokenizers
- SpacyWordTokenizer – Word-level tokenization using Spacy
- SpacySentenceTokenizer – Sentence-level tokenization using Spacy
Regex-based Tokenizers
- RegexTokenizer – Generic regex pattern tokenizer
- URLTokenizer – Extract URLs from text
- EmailTokenizer – Extract email addresses from text
- HTMLTokenizer – Extract HTML tags from text
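Each regex tokenizer can also be used on its own, as in this short sketch with URLTokenizer (the printed output is indicative, not a guaranteed repr):
from piraye.tasks.tokenizer import URLTokenizer

text = "Docs live at https://example.com/docs and nowhere else."
tokens = URLTokenizer().tokenize(text)
print([(t.content, t.position) for t in tokens])
# e.g. [('https://example.com/docs', (13, 37))]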
Structural Tokenizers
- ParagraphTokenizer – Split text into paragraphs
🔄 TokenizerPipeline: Hierarchical Tokenization
The TokenizerPipeline class provides a modular and sequential approach to text tokenization. It allows you to chain multiple tokenizers together, where the output of one tokenizer can be merged or refined by the next. This design makes it easy to combine tokenizers (e.g., sentences, words, emojis, URLs) into a unified pipeline for flexible and powerful text preprocessing.
How It Works
The pipeline starts with the first tokenizer, which processes the raw text. Each subsequent tokenizer is applied sequentially, refining or extending the previous tokens. The final result is a merged list of Token objects representing a fully tokenized text.
Example Usage
from piraye.tasks.tokenizer import NltkSentenceTokenizer
from piraye.tasks.tokenizer import URLTokenizer
from piraye.tasks.tokenizer.pipeline import TokenizerPipeline
pipeline = TokenizerPipeline([
    NltkSentenceTokenizer(),
    URLTokenizer()
])
text = "Contact us at support@arusha.dev or info@piraye.ai."
tokens = pipeline(text)
print([t.content for t in tokens])
# ["Contact us at support@arusha.dev or info@piraye.ai."]
Paragraph Tokenizer Example
from piraye.tasks.tokenizer import ParagraphTokenizer
text = "First paragraph.\nSecond paragraph.\nThird paragraph."
tokenizer = ParagraphTokenizer()
tokens = tokenizer.tokenize(text)
for token in tokens:
    print(token)
# Token(content='First paragraph.', type='Paragraph', position=(0, 16), sub_tokens=[])
# Token(content='\nSecond paragraph.', type='Paragraph', position=(16, 34), sub_tokens=[])
# Token(content='\nThird paragraph.', type='Paragraph', position=(34, 51), sub_tokens=[])
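Structural and sentence tokenizers can also be combined in a pipeline. A sketch, assuming merge() attaches each paragraph's sentences as sub_tokens:
from piraye.tasks.tokenizer import NltkSentenceTokenizer, ParagraphTokenizer
from piraye.tasks.tokenizer.pipeline import TokenizerPipeline

pipeline = TokenizerPipeline([
    ParagraphTokenizer(),
    NltkSentenceTokenizer()
])

text = "First paragraph. It has two sentences.\nSecond paragraph."
for paragraph in pipeline(text):
    print(paragraph.content)
    for sentence in paragraph.sub_tokens:
        print("  -", sentence.content)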
📖 For more examples and usage patterns, see Tokenizing Examples.
📁 Project Structure
piraye/
├── piraye/
│ ├── __init__.py
│ ├── constants.py
│ └── tasks/
│ ├── normalizer/
│ │ ├── __init__.py
│ │ ├── char_config.py
│ │ ├── character_normalizer.py
│ │ ├── mappings.py
│ │ ├── multi_lingual_normalizer.py
│ │ ├── multi_lingual_normalizer_builder.py
│ │ ├── normalizer.py
│ │ ├── normalizer_builder.py
│ │ └── data/
│ │ ├── alphabets/
│ │ ├── digits/
│ │ ├── others/
│ │ └── puncs/
│ └── tokenizer/
│ ├── __init__.py
│ ├── pipeline.py
│ ├── token.py
│ └── tokenizers/
│ ├── __init__.py
│ ├── base_tokenizer.py
│ ├── nltk_tokenizer.py
│ ├── spacy_tokenizer.py
│ ├── regex_tokenizer.py
│ ├── paragraph_tokenizer.py
│ └── regex_tokenizers/
│ ├── __init__.py
│ ├── base_regex_tokenizer.py
│ ├── url_tokenizer.py
│ ├── email_tokenizer.py
│ ├── html_tokenizer.py
│ └── README.md
├── tests/
│ ├── test_normalizer.py
│ ├── test_ml_normalizer.py
│ ├── test_tokenizer.py
│ ├── test_tokenizer_pipeline.py
│ ├── test_html_tokenizer.py
│ └── ...
├── README.md
├── LICENSE
└── pyproject.toml
📄 License
GNU Lesser General Public License v2.1
See LICENSE
❤️ Maintainers
Piraye is maintained by Arusha.
Authors:
- Hamed Khademi Khaledi
- HosseiN Khademi Khaledi
- Majid Asgari Bidhendi
For questions or support, please open an issue on GitHub or contact us at info@arusha.dev.
🌟 Show Your Support
If you find Piraye useful, please consider:
- ⭐ Starring the repository on GitHub
- 📢 Sharing it with others who might benefit
- 🐛 Reporting bugs or suggesting features
- 🤝 Contributing to the codebase
Thank you for using Piraye! 🎉