Arabic text post-processing for LLM outputs — diacritics restoration, number-to-word conversion, dialect support, and LangChain integration

langchain-arabic

Arabic text post-processing for LLM outputs: diacritics (tashkeel) restoration, number-to-word conversion, and native LangChain integration. Supports dictionary-based diacritization as well as neural auto-diacritization via CATT.

Problem

LLMs produce Arabic text without diacritics (tashkeel) 60-70% of the time. This causes mispronunciation in text-to-speech (TTS) pipelines. Numbers in digit form (e.g. 2030, 95%) are also read incorrectly by TTS engines.

langchain-arabic fixes both issues as a post-processing step on LLM output.

Installation

# Dictionary mode only (lightweight, no PyTorch)
pip install langchain-arabic

# With neural auto-diacritization (installs catt-tashkeel + PyTorch)
pip install langchain-arabic[catt]

Quick Start

Dictionary-Based Diacritics

Provide a mapping of plain Arabic words to their diacritized forms. The library applies longest-first replacement to avoid partial matches.

from langchain_arabic import apply_diacritics, parse_diacritics_map

# From a dictionary
diacritics_map = {
    "تقنية": "تِقْنِيَة",
    "شركة": "شَرِكَة",
    "علم الحاسوب": "عِلْمُ الحَاسُوبِ",
}

text = "شركة تقنية في علم الحاسوب"
result = apply_diacritics(text, diacritics_map)
# -> "شَرِكَة تِقْنِيَة في عِلْمُ الحَاسُوبِ"
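
The longest-first rule can be illustrated with a minimal sketch. This is not the library's implementation; `longest_first_replace` is a hypothetical helper that assumes plain substring matching with a single regex pass.

```python
import re


def longest_first_replace(text: str, mapping: dict[str, str]) -> str:
    """Replace plain words with diacritized forms, trying longer keys first."""
    if not mapping:
        return text
    # Sort keys by length so a multi-word term is tried before its parts.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    # A single pass means already-replaced text is never matched again.
    return pattern.sub(lambda match: mapping[match.group(0)], text)


mapping = {
    "علم": "عِلْم",
    "علم الحاسوب": "عِلْمُ الحَاسُوبِ",
    "شركة": "شَرِكَة",
}
print(longest_first_replace("شركة في علم الحاسوب", mapping))
```

Because the longer key comes first in the alternation, "علم الحاسوب" is replaced as a unit instead of being split by the shorter "علم" entry. The sketch matches substrings and does not check word boundaries.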

You can also parse mappings from a markdown file (useful for persona/prompt files):

# Parse from a markdown file with "- WORD → DIACRITIZED" list items
diacritics_map = parse_diacritics_map("path/to/persona.md")

# Or from a markdown string
markdown = """
- تقنية → تِقْنِيَة
- شركة → شَرِكَة
"""
diacritics_map = parse_diacritics_map(markdown)
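
As a rough illustration of what parsing such list items involves, here is a small sketch. It is an assumption about the format, not the library's parser: `parse_mapping_lines` is a hypothetical name, and accepting both "→" and "->" as separators is a choice made for this example.

```python
import re

# Matches list items such as "- WORD → DIACRITIZED" or "- WORD -> DIACRITIZED".
LINE_RE = re.compile(r"^\s*[-*]\s*(?P<plain>.+?)\s*(?:→|->)\s*(?P<marked>.+?)\s*$")


def parse_mapping_lines(markdown: str) -> dict[str, str]:
    """Collect plain -> diacritized pairs, ignoring lines that do not match."""
    mapping: dict[str, str] = {}
    for line in markdown.splitlines():
        match = LINE_RE.match(line)
        if match:
            mapping[match.group("plain")] = match.group("marked")
    return mapping


sample = """
# Persona notes
- تقنية → تِقْنِيَة
- شركة -> شَرِكَة
Some unrelated prose line.
"""
print(parse_mapping_lines(sample))
```

Headings and prose lines are skipped, which is what makes it practical to keep the mappings inside a larger persona or prompt file.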

Auto-Diacritization with CATT

For neural auto-diacritization (no manual dictionary needed), use the CATT backend. CATT is a character-level transformer whose authors report that it outperforms GPT-4-turbo on Arabic diacritization benchmarks.

pip install langchain-arabic[catt]

from langchain_arabic import ArabicTextOutputParser

parser = ArabicTextOutputParser(
    backend="catt",
    catt_model="encoder_only",   # faster; or "encoder_decoder" for higher accuracy
    convert_numbers=True,
)

result = parser.parse("شركة تقنية في علم الحاسوب")
# CATT auto-diacritizes the entire text

Hybrid Mode: CATT + Dictionary Overrides

The most powerful setup: let CATT handle general text, then override domain-specific terms (proper nouns, brand names) with your dictionary:

parser = ArabicTextOutputParser(
    backend="catt",
    catt_model="encoder_decoder",  # higher accuracy
    diacritics_map={
        "علم الحاسوب": "عِلْمُ الحَاسُوبِ",  # domain term override
    },
    convert_numbers=True,
)

# `prompt` and `llm` are any LangChain prompt and chat model (see "With LangChain")
chain = prompt | llm | parser

CATT runs first, then dictionary overrides are applied on top.
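
One subtlety of this ordering is that the text is already diacritized when the overrides run, so a plain dictionary key no longer matches literally. The sketch below shows one possible way to handle that, under stated assumptions: it is not the library's code, `fake_neural_diacritizer` is a stand-in for a real model, and `apply_overrides` is a hypothetical helper that assumes override keys contain no diacritics.

```python
import re

# Arabic combining marks: fathatan through sukun (U+064B-U+0652) and dagger alif (U+0670).
MARKS = "[\u064B-\u0652\u0670]"
MARKS_RE = re.compile(MARKS)


def loose(plain: str) -> str:
    """Regex source matching `plain` with optional marks after each character."""
    return "".join(re.escape(ch) + MARKS + "*" for ch in plain)


def apply_overrides(diacritized: str, overrides: dict[str, str]) -> str:
    """Replace terms in already-diacritized text with their dictionary forms."""
    if not overrides:
        return diacritized
    keys = sorted(overrides, key=len, reverse=True)  # longest term first
    pattern = re.compile("|".join(loose(key) for key in keys))
    # Strip marks from the matched text to recover the plain dictionary key.
    return pattern.sub(lambda m: overrides[MARKS_RE.sub("", m.group(0))], diacritized)


def fake_neural_diacritizer(text: str) -> str:
    """Stand-in for a model: puts a fatha after every Arabic letter."""
    return "".join(ch + "\u064E" if "\u0621" <= ch <= "\u064A" else ch for ch in text)


def hybrid(text: str, overrides: dict[str, str]) -> str:
    # Model first, dictionary overrides second.
    return apply_overrides(fake_neural_diacritizer(text), overrides)


print(hybrid("شركة تقنية", {"تقنية": "تَقْنِيَّة"}))
```

The override wins for "تقنية" while the rest of the sentence keeps whatever the model produced.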

Number-to-Word Conversion

Automatically detects and converts numbers based on context (percentages, currency, phone numbers, plain numbers).

from langchain_arabic import convert_numbers_in_text

# Arabic
convert_numbers_in_text("نسبة 95%", language="ar")
# -> "نسبة خمسة و تسعون بالمائة"

convert_numbers_in_text("المبلغ 500 ريال", language="ar")
# -> "المبلغ خمسمائة ريال"

convert_numbers_in_text("اتصل على 920000247", language="ar")
# -> "اتصل على تسعة اثنان صفر صفر صفر صفر اثنان أربعة سبعة"

# English
convert_numbers_in_text("about 95%", language="en")
# -> "about ninety-five percent"
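
To make the idea of context detection concrete, here is a minimal sketch covering only the phone case. The names and the rule that seven or more consecutive digits count as a phone number are assumptions for illustration and may differ from what the library does.

```python
import re

# Arabic names for the digits 0-9, indexed by digit value.
DIGIT_WORDS = ["صفر", "واحد", "اثنان", "ثلاثة", "أربعة",
               "خمسة", "ستة", "سبعة", "ثمانية", "تسعة"]

NUMBER_RE = re.compile(r"[0-9]+")
PHONE_MIN_DIGITS = 7  # assumption: 7+ digits in a row is read as a phone number


def detect_context(match: re.Match) -> str:
    """Guess how a digit run should be read from its length and suffix."""
    text, end = match.string, match.end()
    if text[end:end + 1] == "%":
        return "percentage"
    if len(match.group(0)) >= PHONE_MIN_DIGITS:
        return "phone"
    return "plain"


def spell_phone_numbers(text: str) -> str:
    """Read phone-like digit runs digit by digit; leave other numbers as-is."""
    def replace(match: re.Match) -> str:
        if detect_context(match) != "phone":
            return match.group(0)
        return " ".join(DIGIT_WORDS[int(d)] for d in match.group(0))
    return NUMBER_RE.sub(replace, text)


print(spell_phone_numbers("اتصل على 920000247"))
```

Percentages and plain numbers are left untouched here; a full converter would route each detected context to its own wording rules.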

Dialect Support

Use dialect="gulf" for Gulf Arabic (Khaleeji) number phrasing and dialect-keyed diacritics maps:

from langchain_arabic import ArabicTextOutputParser

# Gulf dialect — numbers use Gulf phrasing
parser = ArabicTextOutputParser(dialect="gulf", convert_numbers=True)
parser.parse("النسبة 100%")
# -> "النسبة مية بالمية" (not "مائة بالمائة")

# Dialect-keyed diacritics maps — different vocalization per dialect
parser = ArabicTextOutputParser(
    dialect="gulf",
    diacritics_map={
        "msa": {"تقنية": "تِقْنِيَة"},
        "gulf": {"تقنية": "تَقْنِيَّة"},
    },
    convert_numbers=False,
)

Supported dialects: "msa" (Modern Standard Arabic, default), "gulf".
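
Accepting both flat and dialect-keyed maps implies a resolution step. A sketch of how that might look follows; `resolve_map` is a hypothetical helper, and falling back to the "msa" entry when the requested dialect is missing is an assumption, not documented behavior.

```python
from collections.abc import Mapping


def resolve_map(source: Mapping, dialect: str = "msa", fallback: str = "msa") -> dict:
    """Return a flat plain -> diacritized dict for the requested dialect."""
    if not source:
        return {}
    # Treat the map as dialect-keyed when every value is itself a mapping.
    if all(isinstance(value, Mapping) for value in source.values()):
        chosen = source.get(dialect) or source.get(fallback) or {}
        return dict(chosen)
    # Otherwise it is already a flat map shared by all dialects.
    return dict(source)


flat = {"تقنية": "تِقْنِيَة"}
keyed = {"msa": {"تقنية": "تِقْنِيَة"}, "gulf": {"تقنية": "تَقْنِيَّة"}}

print(resolve_map(flat, "gulf"))
print(resolve_map(keyed, "gulf"))
```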

With LangChain

Use ArabicTextOutputParser as a drop-in replacement for StrOutputParser in any LCEL chain:

from langchain_arabic import ArabicTextOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")
prompt = ChatPromptTemplate.from_template("أجب بالعربية: {question}")

parser = ArabicTextOutputParser(
    diacritics_map={"تقنية": "تِقْنِيَة", "شركة": "شَرِكَة"},
    convert_numbers=True,
    language="ar",
)

chain = prompt | llm | parser
result = chain.invoke({"question": "ما هي التقنية؟"})
# Output has diacritics restored and numbers converted to words

Streaming note: ArabicTextOutputParser buffers all chunks before processing because diacritics and number conversion require complete words. When using chain.stream(), the processed result is yielded as a single chunk once the LLM finishes generating.
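
The buffering behavior can be pictured with a small generator. This is a sketch of the general pattern only, not the parser's internals; `buffer_then_process` and `spell_year` are hypothetical names, and the replacement text is a placeholder.

```python
from typing import Callable, Iterable, Iterator


def buffer_then_process(chunks: Iterable[str], process: Callable[[str], str]) -> Iterator[str]:
    """Collect every chunk, then yield one processed result at the end."""
    buffer: list[str] = []
    for chunk in chunks:
        buffer.append(chunk)
    yield process("".join(buffer))


def spell_year(text: str) -> str:
    # Placeholder conversion, standing in for real number-to-word logic.
    return text.replace("2030", "[2030 in words]")


# The number arrives split across chunks ("20" + "30"), so it is only
# recognizable once the chunks have been joined.
chunks = ["رؤية 20", "30 ", "طموحة"]
print(list(buffer_then_process(chunks, spell_year)))
```

Processing each chunk on its own would miss the split number, which is the trade-off behind yielding a single chunk.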

API Reference

Diacritics

| Function / Class | Description |
| --- | --- |
| `parse_diacritics_map(source)` | Parse mappings from dict, markdown string, or file path |
| `apply_diacritics(text, diacritics_map)` | Apply longest-first replacement |
| `DiacriticsProcessor(source)` | Stateful wrapper with `.process(text)` method |

Numbers

| Function / Class | Description |
| --- | --- |
| `convert_numbers_in_text(text, language, contexts)` | Convert digits to words in context |
| `NumbersProcessor(language, contexts)` | Stateful wrapper with `.process(text)` method |

Supported contexts: "percentage", "currency_ar", "currency_en", "phone", "plain"

LangChain Integration

| Class | Description |
| --- | --- |
| `ArabicTextOutputParser` | LangChain `Runnable` combining diacritics + numbers |

Parameters:

| Parameter | Default | Description |
| --- | --- | --- |
| `backend` | `"dictionary"` | `"dictionary"` or `"catt"` |
| `catt_model` | `"encoder_only"` | `"encoder_only"` (faster) or `"encoder_decoder"` (more accurate) |
| `dialect` | `"msa"` | `"msa"` or `"gulf"`; affects number phrasing and dialect-keyed map resolution |
| `diacritics_map` | `{}` | Plain -> diacritized mapping, or dialect-keyed `{"msa": {...}, "gulf": {...}}` |
| `convert_numbers` | `True` | Convert digit sequences to words |
| `language` | `"ar"` | `"ar"` or `"en"` |
| `number_contexts` | `None` (all) | Set of contexts to enable |

Dialects

| Function / Constant | Description |
| --- | --- |
| `SUPPORTED_DIALECTS` | `frozenset` of supported dialect codes |
| `apply_dialect_number_overrides(text, dialect)` | Apply dialect-specific number word substitutions |
| `resolve_dialect_diacritics_map(map, dialect)` | Resolve flat or dialect-keyed map for a given dialect |

CATT Backend

| Class | Description |
| --- | --- |
| `CATTBackend(model)` | Direct access to CATT auto-diacritization |

Requires pip install langchain-arabic[catt].

Examples

See the examples/ directory for runnable scripts:

quickstart.py — Dictionary mode, CATT mode, hybrid mode, number conversion
langchain_chain.py — Full LangChain LCEL chain integration

Benchmarks

See benchmarks/ for DER/WER evaluation of different diacritization modes.
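
For readers unfamiliar with these metrics: DER is commonly defined as the share of characters whose diacritics differ from the reference, and WER as the share of words containing at least one such error. The sketch below follows that common definition. It is not taken from the benchmarks/ scripts, whose exact conventions (for example, how undiacritized characters or case endings are counted) may differ.

```python
# Arabic combining marks: U+064B-U+0652 plus dagger alif (U+0670).
MARKS = {chr(code) for code in range(0x064B, 0x0653)} | {"\u0670"}


def split_marks(text: str) -> list[tuple[str, str]]:
    """Pair each base character with the diacritics that follow it."""
    pairs: list[tuple[str, str]] = []
    for ch in text:
        if ch in MARKS:
            if pairs:  # a mark with no preceding base character is ignored
                base, marks = pairs[-1]
                pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def der(reference: str, hypothesis: str) -> float:
    """Diacritic error rate over non-space characters."""
    ref, hyp = split_marks(reference), split_marks(hypothesis)
    if [base for base, _ in ref] != [base for base, _ in hyp]:
        raise ValueError("reference and hypothesis differ in base characters")
    scored = [(r, h) for (base, r), (_, h) in zip(ref, hyp) if not base.isspace()]
    # Compare marks as sorted lists so shadda/vowel ordering does not matter.
    errors = sum(1 for r, h in scored if sorted(r) != sorted(h))
    return errors / len(scored) if scored else 0.0


def wer(reference: str, hypothesis: str) -> float:
    """Share of words with at least one diacritic error."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if len(ref_words) != len(hyp_words):
        raise ValueError("reference and hypothesis differ in word count")
    errors = sum(1 for r, h in zip(ref_words, hyp_words) if der(r, h) > 0)
    return errors / len(ref_words) if ref_words else 0.0


reference = "\u0643\u064E\u062A\u064E\u0628\u064E"   # kataba, fully vocalized
hypothesis = "\u0643\u064E\u062A\u064F\u0628\u064E"  # one wrong vowel on the second letter
print(der(reference, hypothesis), wer(reference, hypothesis))
```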

Development

git clone https://github.com/louaychoum/langchain-arabic.git
cd langchain-arabic
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest --cov=langchain_arabic
ruff check src/

License

MIT

Release history

0.3.0 (this version): Mar 11, 2026
0.2.0: Mar 6, 2026
