Arabic text post-processing for LLM outputs — diacritics restoration, number-to-word conversion, and LangChain integration
langchain-arabic
Arabic text post-processing for LLM outputs. Diacritics (tashkeel) restoration, number-to-word conversion, and native LangChain integration. Supports both dictionary-based and neural auto-diacritization via CATT.
Problem
LLMs produce Arabic text without diacritics (tashkeel) 60-70% of the time. This causes mispronunciation in text-to-speech (TTS) pipelines. Numbers in digit form (e.g. 2030, 95%) are also read incorrectly by TTS engines.
langchain-arabic fixes both issues as a post-processing step on LLM output.
Installation
# Dictionary mode only (lightweight, no PyTorch)
pip install langchain-arabic
# With neural auto-diacritization (installs catt-tashkeel + PyTorch)
pip install langchain-arabic[catt]
Quick Start
Dictionary-Based Diacritics
Provide a mapping of plain Arabic words to their diacritized forms. The library applies longest-first replacement to avoid partial matches.
from langchain_arabic import apply_diacritics, parse_diacritics_map
# From a dictionary
diacritics_map = {
    "تقنية": "تِقْنِيَة",
    "شركة": "شَرِكَة",
    "علم الحاسوب": "عِلْمُ الحَاسُوبِ",
}
text = "شركة تقنية في علم الحاسوب"
result = apply_diacritics(text, diacritics_map)
# -> "شَرِكَة تِقْنِيَة في عِلْمُ الحَاسُوبِ"
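The longest-first rule described above can be illustrated with a minimal sketch. This is not the library's implementation; `apply_longest_first` is a hypothetical name, and the single-pass regex alternation is one assumed way to get the behavior. Note that this sketch matches substrings and does not check word boundaries.

```python
import re


def apply_longest_first(text: str, mapping: dict[str, str]) -> str:
    """Replace each plain key with its diacritized form, preferring longer keys."""
    if not mapping:
        return text
    # Longer keys come first so a phrase wins over a word it contains.
    keys = sorted(mapping, key=len, reverse=True)
    # One alternation pass: text that was already replaced is never rescanned.
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: mapping[match.group(0)], text)


mapping = {
    "علم": "عِلْم",
    "علم الحاسوب": "عِلْمُ الحَاسُوبِ",
}
# The two-word phrase is replaced as a unit; the shorter key does not split it.
print(apply_longest_first("درس علم الحاسوب", mapping))
```

Sorting by length matters because regex alternation takes the first alternative that matches at a position, not the longest one.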
You can also parse mappings from a markdown file (useful for persona/prompt files):
# Parse from markdown with "- WORD → DIACRITIZED" format
diacritics_map = parse_diacritics_map("path/to/persona.md")
# Or from a markdown string
markdown = """
- تقنية → تِقْنِيَة
- شركة → شَرِكَة
"""
diacritics_map = parse_diacritics_map(markdown)
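For readers curious how such a list format can be read, here is a small sketch of a line parser. It is an assumption about the format, not the library's code: `parse_mapping_lines` is a hypothetical helper, and accepting both `→` and `->` as separators is a guess.

```python
import re

# Assumed line shape: "- PLAIN → DIACRITIZED" (an ASCII "->" is accepted too).
_MAPPING_LINE = re.compile(
    r"^\s*-\s*(?P<plain>.+?)\s*(?:→|->)\s*(?P<marked>.+?)\s*$"
)


def parse_mapping_lines(markdown: str) -> dict[str, str]:
    """Collect plain -> diacritized pairs from markdown bullet lines."""
    mapping: dict[str, str] = {}
    for line in markdown.splitlines():
        match = _MAPPING_LINE.match(line)
        if match:  # headings, prose, and blank lines are skipped
            mapping[match.group("plain")] = match.group("marked")
    return mapping


sample = """
# Pronunciation
- تقنية → تِقْنِيَة
- شركة → شَرِكَة
"""
print(parse_mapping_lines(sample))
```

Lines that do not match the bullet-and-arrow shape are ignored, so the mapping can live inside a larger persona or prompt file.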
Auto-Diacritization with CATT
For neural auto-diacritization (no manual dictionary needed), use the CATT backend. CATT is a character-level transformer model whose authors report that it outperforms GPT-4-turbo on Arabic diacritization benchmarks.
pip install langchain-arabic[catt]
from langchain_arabic import ArabicTextOutputParser
parser = ArabicTextOutputParser(
    backend="catt",
    catt_model="encoder_only",  # faster; or "encoder_decoder" for higher accuracy
    convert_numbers=True,
)
result = parser.parse("شركة تقنية في علم الحاسوب")
# CATT auto-diacritizes the entire text
Hybrid Mode: CATT + Dictionary Overrides
The most powerful setup: let CATT handle general text, then override domain-specific terms (proper nouns, brand names) with your dictionary:
parser = ArabicTextOutputParser(
    backend="catt",
    catt_model="encoder_decoder",  # higher accuracy
    diacritics_map={
        "علم الحاسوب": "عِلْمُ الحَاسُوبِ",  # domain term override
    },
    convert_numbers=True,
)
# prompt and llm are defined as in the "With LangChain" section
chain = prompt | llm | parser
CATT runs first, then dictionary overrides are applied on top.
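One subtlety of applying overrides after a neural pass is that the text already carries diacritics, so a plain dictionary key no longer appears verbatim. The sketch below shows one possible way to handle that: match each key while tolerating diacritics after every character. It is an illustration under that assumption, not a description of how the library does it; `loose` and `apply_overrides` are hypothetical names, and the neural pass is faked.

```python
import re

# Arabic diacritic code points: tanwin, short vowels, shadda, sukun, dagger alef.
TASHKEEL = "[\u064B-\u0652\u0670]*"


def loose(plain: str) -> str:
    """Regex source matching `plain` with any diacritics after each character."""
    return "".join(re.escape(ch) + TASHKEEL for ch in plain)


def apply_overrides(diacritized: str, overrides: dict[str, str]) -> str:
    """Swap in dictionary forms for terms that were already diacritized."""
    if not overrides:
        return diacritized
    keys = sorted(overrides, key=len, reverse=True)  # longest key wins
    pattern = re.compile(
        "|".join(f"(?P<k{i}>{loose(key)})" for i, key in enumerate(keys))
    )

    def pick(match: re.Match) -> str:
        # lastgroup is "k<i>", where i is the index of the key that matched.
        return overrides[keys[int(match.lastgroup[1:])]]

    return pattern.sub(pick, diacritized)


# Stand-in for a neural pass: put a fatha on every letter (deliberately wrong).
plain = "علم الحاسوب"
neural_guess = "".join(ch + "\u064E" if ch != " " else ch for ch in plain)
fixed = apply_overrides(neural_guess, {plain: "عِلْمُ الحَاسُوبِ"})
print(fixed)
```

Like the dictionary sketch earlier, this matches substrings without word-boundary checks, which a production version would likely need.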
Number-to-Word Conversion
convert_numbers_in_text detects numbers in the text and converts each one according to its context (percentage, currency, phone number, or plain number).
from langchain_arabic import convert_numbers_in_text
# Arabic
convert_numbers_in_text("نسبة 95%", language="ar")
# -> "نسبة خمسة و تسعون بالمائة"
convert_numbers_in_text("المبلغ 500 ريال", language="ar")
# -> "المبلغ خمسمائة ريال"
convert_numbers_in_text("اتصل على 920000247", language="ar")
# -> "اتصل على تسعة اثنان صفر صفر صفر صفر اثنان أربعة سبعة"
# English
convert_numbers_in_text("about 95%", language="en")
# -> "about ninety-five percent"
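To make the idea of context-dependent conversion concrete, here is a rough sketch of how a number's context might be classified and how a phone number can be read digit by digit. The heuristics are assumptions for illustration only (in particular the 9-digit phone threshold and the short currency word list); `classify_number`, `find_number_contexts`, and `read_digits_ar` are hypothetical names, not part of the library.

```python
import re

AR_DIGITS = [
    "صفر", "واحد", "اثنان", "ثلاثة", "أربعة",
    "خمسة", "ستة", "سبعة", "ثمانية", "تسعة",
]


def read_digits_ar(number: str) -> str:
    """Read a digit string one digit at a time, as for a phone number."""
    return " ".join(AR_DIGITS[int(digit)] for digit in number)


def classify_number(text: str, match: re.Match) -> str:
    """Guess the context of one digit run from what follows it."""
    after = text[match.end():]
    if after.startswith("%"):
        return "percentage"
    if re.match(r"\s*(ريال|درهم|دينار)", after):  # assumed currency words
        return "currency_ar"
    if len(match.group()) >= 9:  # assumed threshold for phone numbers
        return "phone"
    return "plain"


def find_number_contexts(text: str) -> list[tuple[str, str]]:
    """Return (digits, context) for every digit run in the text."""
    return [(m.group(), classify_number(text, m)) for m in re.finditer(r"\d+", text)]


print(find_number_contexts("نسبة 95% من 2030"))
print(read_digits_ar("920000247"))
```

Classifying first and converting second keeps each rule small: a percentage and a phone number share the same digits-to-words table but differ in how the digits are grouped.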
With LangChain
Use ArabicTextOutputParser as a drop-in replacement for StrOutputParser in any LCEL chain:
from langchain_arabic import ArabicTextOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o-mini")
prompt = ChatPromptTemplate.from_template("أجب بالعربية: {question}")
parser = ArabicTextOutputParser(
    diacritics_map={"تقنية": "تِقْنِيَة", "شركة": "شَرِكَة"},
    convert_numbers=True,
    language="ar",
)
chain = prompt | llm | parser
result = chain.invoke({"question": "ما هي التقنية؟"})
# Output has diacritics restored and numbers converted to words
Streaming note:
ArabicTextOutputParser buffers all chunks before processing because diacritics restoration and number conversion require complete words. When using chain.stream(), the processed result is yielded as a single chunk once the LLM finishes generating.
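The buffering behavior can be pictured with a plain-Python generator. This is a simplified sketch of the idea, not the parser's actual code; `buffered_stream` is a hypothetical helper and `str.upper` stands in for the real post-processing step.

```python
from typing import Callable, Iterable, Iterator


def buffered_stream(
    chunks: Iterable[str], process: Callable[[str], str]
) -> Iterator[str]:
    """Collect every chunk, then yield the processed text exactly once."""
    buffer: list[str] = []
    for chunk in chunks:
        buffer.append(chunk)  # nothing is emitted while chunks are arriving
    yield process("".join(buffer))


# Chunks may split a word in the middle, so per-chunk processing would be unsafe.
pieces = ["hel", "lo wor", "ld"]
print(list(buffered_stream(pieces, str.upper)))
```

The trade-off is latency: consumers of the stream see no output until generation completes, in exchange for processing that always sees whole words.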
API Reference
Diacritics
| Function / Class | Description |
|---|---|
| parse_diacritics_map(source) | Parse mappings from dict, markdown string, or file path |
| apply_diacritics(text, diacritics_map) | Apply longest-first replacement |
| DiacriticsProcessor(source) | Stateful wrapper with .process(text) method |
Numbers
| Function / Class | Description |
|---|---|
| convert_numbers_in_text(text, language, contexts) | Convert digits to words in context |
| NumbersProcessor(language, contexts) | Stateful wrapper with .process(text) method |
Supported contexts: "percentage", "currency_ar", "currency_en", "phone", "plain"
LangChain Integration
| Class | Description |
|---|---|
| ArabicTextOutputParser | LangChain Runnable combining diacritics + numbers |
Parameters:
| Parameter | Default | Description |
|---|---|---|
| backend | "dictionary" | "dictionary" or "catt" |
| catt_model | "encoder_only" | "encoder_only" (faster) or "encoder_decoder" (more accurate) |
| diacritics_map | {} | Plain -> diacritized mapping (overrides when using CATT) |
| convert_numbers | True | Convert digit sequences to words |
| language | "ar" | "ar" or "en" |
| number_contexts | None (all) | Set of contexts to enable |
CATT Backend
| Class | Description |
|---|---|
| CATTBackend(model) | Direct access to CATT auto-diacritization |
Requires pip install langchain-arabic[catt].
Examples
See the examples/ directory for runnable scripts:
- quickstart.py: Dictionary mode, CATT mode, hybrid mode, number conversion
- langchain_chain.py: Full LangChain LCEL chain integration
Benchmarks
See benchmarks/ for DER/WER evaluation of different diacritization modes.
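As background on the two metrics, DER (diacritic error rate) is commonly computed as the share of characters whose diacritics differ from the reference, and WER (word error rate) as the share of words containing at least one such error. The sketch below follows that common definition under simplifying assumptions: both texts must have identical base letters, and every character counts toward DER. The scripts in benchmarks/ may define or filter these differently; `split_marks` and `der_wer` are hypothetical names.

```python
# Arabic diacritic code points: tanwin, short vowels, shadda, sukun, dagger alef.
DIACRITICS = set("\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652\u0670")


def split_marks(word: str) -> list[tuple[str, str]]:
    """Pair each base character with the diacritics that follow it."""
    pairs: list[tuple[str, str]] = []
    for ch in word:
        if ch not in DIACRITICS:
            pairs.append((ch, ""))
        elif pairs:  # a mark with no preceding base character is dropped
            base, marks = pairs[-1]
            pairs[-1] = (base, marks + ch)
    return pairs


def der_wer(reference: str, hypothesis: str) -> tuple[float, float]:
    """Return (DER, WER) for two texts that share the same base letters."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if len(ref_words) != len(hyp_words):
        raise ValueError("texts must contain the same number of words")
    if not ref_words:
        return 0.0, 0.0
    chars = char_errors = word_errors = 0
    for ref_word, hyp_word in zip(ref_words, hyp_words):
        ref_pairs, hyp_pairs = split_marks(ref_word), split_marks(hyp_word)
        if [b for b, _ in ref_pairs] != [b for b, _ in hyp_pairs]:
            raise ValueError("base letters differ between the texts")
        # Sorting makes the comparison insensitive to mark order (e.g. shadda + vowel).
        wrong = sum(
            sorted(r) != sorted(h) for (_, r), (_, h) in zip(ref_pairs, hyp_pairs)
        )
        chars += len(ref_pairs)
        char_errors += wrong
        word_errors += wrong > 0
    return char_errors / max(chars, 1), word_errors / len(ref_words)


reference = "شَرِكَة تِقْنِيَة"
print(der_wer(reference, reference))
```

Comparing a text against itself is a quick sanity check, since both rates should be zero.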
Development
git clone https://github.com/louaychoum/langchain-arabic.git
cd langchain-arabic
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest --cov=langchain_arabic
ruff check src/
License
MIT