langchain-arabic

Arabic text post-processing for LLM outputs. Diacritics (tashkeel) restoration, number-to-word conversion, and native LangChain integration. Supports both dictionary-based and neural auto-diacritization via CATT.

Problem

LLMs produce Arabic text without diacritics (tashkeel) 60-70% of the time. This causes mispronunciation in text-to-speech (TTS) pipelines. Numbers in digit form (e.g. 2030, 95%) are also read incorrectly by TTS engines.

langchain-arabic fixes both issues as a post-processing step on LLM output.

Installation

# Dictionary mode only (lightweight, no PyTorch)
pip install langchain-arabic

# With neural auto-diacritization (installs catt-tashkeel + PyTorch)
# Quoting the extra keeps shells such as zsh from expanding the brackets
pip install "langchain-arabic[catt]"

Quick Start

Dictionary-Based Diacritics

Provide a mapping of plain Arabic words to their diacritized forms. The library applies longest-first replacement to avoid partial matches.

from langchain_arabic import apply_diacritics, parse_diacritics_map

# From a dictionary
diacritics_map = {
    "تقنية": "تِقْنِيَة",
    "شركة": "شَرِكَة",
    "علم الحاسوب": "عِلْمُ الحَاسُوبِ",
}

text = "شركة تقنية في علم الحاسوب"
result = apply_diacritics(text, diacritics_map)
# -> "شَرِكَة تِقْنِيَة في عِلْمُ الحَاسُوبِ"
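
To illustrate why longest-first ordering matters, the sketch below sorts the keys by length and applies them in a single regex pass. It is not the library's actual implementation: the helper name apply_longest_first and the extra short key "علم" are hypothetical and exist only for this example.

```python
import re


def apply_longest_first(text: str, mapping: dict[str, str]) -> str:
    """Replace mapping keys in text, preferring longer keys over shorter ones."""
    if not mapping:
        return text
    # Longest keys come first in the alternation, so a multi-word phrase
    # wins over any shorter key it contains.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    # One pass over the input: text that was already replaced is never rescanned.
    return pattern.sub(lambda match: mapping[match.group(0)], text)


# Hypothetical mapping where one key is a prefix of another.
mapping = {
    "علم": "عِلْم",
    "علم الحاسوب": "عِلْمُ الحَاسُوبِ",
}
print(apply_longest_first("علم الحاسوب", mapping))
# -> "عِلْمُ الحَاسُوبِ" (the phrase wins, not the shorter "علم")
```

Without the sort, the shorter key could match first and leave the rest of the phrase undiacritized.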

You can also parse mappings from a markdown file (useful for persona/prompt files):

# Parse from a markdown file whose list items use the "- WORD → DIACRITIZED" format
diacritics_map = parse_diacritics_map("path/to/persona.md")

# Or from a markdown string
markdown = """
- تقنية → تِقْنِيَة
- شركة → شَرِكَة
"""
diacritics_map = parse_diacritics_map(markdown)
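
A parser for this list format could look roughly like the sketch below. It is an illustration under assumptions, not the code behind parse_diacritics_map: the name parse_mapping_lines is hypothetical, and accepting both "→" and "->" as separators is a guess.

```python
import re

# Matches list items such as "- WORD → DIACRITIZED" or "- WORD -> DIACRITIZED".
# Accepting both arrow styles is an assumption made for this sketch.
LINE_RE = re.compile(
    r"^\s*[-*]\s+(?P<plain>.+?)\s*(?:→|->)\s*(?P<marked>.+?)\s*$"
)


def parse_mapping_lines(markdown: str) -> dict[str, str]:
    """Collect plain -> diacritized pairs from markdown list items."""
    mapping = {}
    for line in markdown.splitlines():
        match = LINE_RE.match(line)
        if match:  # headings, prose, and blank lines are skipped
            mapping[match.group("plain")] = match.group("marked")
    return mapping


markdown = """
# Pronunciation guide
- تقنية → تِقْنِيَة
- شركة → شَرِكَة
"""
print(parse_mapping_lines(markdown))
```

Lines that are not list items are ignored, so the mappings can sit inside a longer persona or prompt file.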

Auto-Diacritization with CATT

For neural auto-diacritization (no manual dictionary needed), use the CATT backend. CATT (Character-based Arabic Tashkeel Transformer) is a character-level transformer model; its authors report that it outperforms GPT-4-turbo on Arabic diacritization benchmarks.

# Shell: install the optional CATT extra first
pip install "langchain-arabic[catt]"

# Python
from langchain_arabic import ArabicTextOutputParser

parser = ArabicTextOutputParser(
    backend="catt",
    catt_model="encoder_only",   # faster; or "encoder_decoder" for higher accuracy
    convert_numbers=True,
)

result = parser.parse("شركة تقنية في علم الحاسوب")
# CATT auto-diacritizes the entire text

Hybrid Mode: CATT + Dictionary Overrides

The most powerful setup: let CATT handle general text, then override domain-specific terms (proper nouns, brand names) with your dictionary:

parser = ArabicTextOutputParser(
    backend="catt",
    catt_model="encoder_decoder",  # higher accuracy
    diacritics_map={
        "علم الحاسوب": "عِلْمُ الحَاسُوبِ",  # domain term override
    },
    convert_numbers=True,
)

# prompt and llm are any LangChain prompt and chat model (see With LangChain)
chain = prompt | llm | parser

CATT runs first, then dictionary overrides are applied on top.
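
The README does not say how a plain dictionary key is matched once the neural step has already added marks. One possible approach, sketched below purely as an assumption, is to build a pattern that tolerates diacritics between the letters of each key. The names marked_pattern and apply_overrides are hypothetical, and the input string stands in for CATT output.

```python
import re

# Tanwin, fatha, damma, kasra, shadda, sukun (U+064B to U+0652).
MARKS_CLASS = "[\u064B-\u0652]*"


def marked_pattern(plain_key: str) -> str:
    """Regex source matching plain_key with optional marks after each letter."""
    return "".join(re.escape(ch) + MARKS_CLASS for ch in plain_key)


def apply_overrides(diacritized: str, overrides: dict[str, str]) -> str:
    """Replace auto-diacritized spans with the dictionary's preferred forms."""
    if not overrides:
        return diacritized
    keys = sorted(overrides, key=len, reverse=True)  # longest key wins
    combined = re.compile(
        "|".join(f"(?P<k{i}>{marked_pattern(key)})" for i, key in enumerate(keys))
    )
    # m.lastgroup is "k<i>", which maps back to the key that matched.
    return combined.sub(lambda m: overrides[keys[int(m.lastgroup[1:])]], diacritized)


# Stand-in for text that the neural step has already diacritized.
auto_output = "شَرِكَة في عِلْمِ الحَاسُوبِ"
print(apply_overrides(auto_output, {"علم الحاسوب": "عِلْمُ الحَاسُوبِ"}))
```

Applying the overrides after the neural pass means the dictionary has the final say on the terms it lists.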

Number-to-Word Conversion

Automatically detects and converts numbers based on context (percentages, currency, phone numbers, plain numbers).

from langchain_arabic import convert_numbers_in_text

# Arabic
convert_numbers_in_text("نسبة 95%", language="ar")
# -> "نسبة خمسة و تسعون بالمائة"

convert_numbers_in_text("المبلغ 500 ريال", language="ar")
# -> "المبلغ خمسمائة ريال"

convert_numbers_in_text("اتصل على 920000247", language="ar")
# -> "اتصل على تسعة اثنان صفر صفر صفر صفر اثنان أربعة سبعة"

# English
convert_numbers_in_text("about 95%", language="en")
# -> "about ninety-five percent"
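
Context detection of this kind can be approximated with an ordered regex alternation. The sketch below only illustrates the idea and is not the library's code: classify_numbers and spell_digits_ar are hypothetical names, the 7-digit threshold for phone numbers is an assumption, and currency handling is left out.

```python
import re

AR_DIGITS = ["صفر", "واحد", "اثنان", "ثلاثة", "أربعة",
             "خمسة", "ستة", "سبعة", "ثمانية", "تسعة"]

# Order matters: the most specific context is tried first.
NUMBER_RE = re.compile(
    r"(?P<percentage>\d+(?:\.\d+)?%)"  # e.g. 95%
    r"|(?P<phone>\d{7,})"              # assumed: 7 or more digits is a phone number
    r"|(?P<plain>\d+)"                 # anything else
)


def classify_numbers(text: str) -> list[tuple[str, str]]:
    """Return (matched_text, context_name) for every number in text."""
    return [(m.group(0), m.lastgroup) for m in NUMBER_RE.finditer(text)]


def spell_digits_ar(number: str) -> str:
    """Read a digit string one digit at a time, as done for phone numbers."""
    return " ".join(AR_DIGITS[int(digit)] for digit in number)


print(classify_numbers("نسبة 95% واتصل على 920000247 في عام 2030"))
print(spell_digits_ar("920000247"))
```

Each context name can then select a different reading: digit by digit for phone numbers, a cardinal plus a percent word for percentages, and a cardinal for plain numbers.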

With LangChain

Use ArabicTextOutputParser as a drop-in replacement for StrOutputParser in any LCEL chain:

from langchain_arabic import ArabicTextOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")
prompt = ChatPromptTemplate.from_template("أجب بالعربية: {question}")

parser = ArabicTextOutputParser(
    diacritics_map={"تقنية": "تِقْنِيَة", "شركة": "شَرِكَة"},
    convert_numbers=True,
    language="ar",
)

chain = prompt | llm | parser
result = chain.invoke({"question": "ما هي التقنية؟"})
# Output has diacritics restored and numbers converted to words

Streaming note: ArabicTextOutputParser buffers all chunks before processing because diacritics and number conversion require complete words. When using chain.stream(), the processed result is yielded as a single chunk once the LLM finishes generating.
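
The buffering behaviour described in the streaming note can be sketched in plain Python. The generator below is only an illustration, not the parser's code; buffered_stream and its process argument are hypothetical names.

```python
from typing import Callable, Iterable, Iterator


def buffered_stream(
    chunks: Iterable[str], process: Callable[[str], str]
) -> Iterator[str]:
    """Collect every chunk, then yield one processed string."""
    buffer = []
    for chunk in chunks:
        # Chunks may split a word or number in half, so nothing is processed yet.
        buffer.append(chunk)
    # Only the complete text is safe to diacritize and convert.
    yield process("".join(buffer))


# "95" arrives split across two chunks; processing per chunk would see "9" and "5%".
for piece in buffered_stream(["نسبة 9", "5%"], lambda text: text.strip()):
    print(piece)
```

The trade-off is latency: downstream consumers such as a TTS engine receive nothing until generation has finished.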

API Reference

Diacritics

| Function / Class | Description |
| --- | --- |
| parse_diacritics_map(source) | Parse mappings from dict, markdown string, or file path |
| apply_diacritics(text, diacritics_map) | Apply longest-first replacement |
| DiacriticsProcessor(source) | Stateful wrapper with .process(text) method |

Numbers

| Function / Class | Description |
| --- | --- |
| convert_numbers_in_text(text, language, contexts) | Convert digits to words in context |
| NumbersProcessor(language, contexts) | Stateful wrapper with .process(text) method |

Supported contexts: "percentage", "currency_ar", "currency_en", "phone", "plain"

LangChain Integration

| Class | Description |
| --- | --- |
| ArabicTextOutputParser | LangChain Runnable combining diacritics + numbers |

Parameters:

| Parameter | Default | Description |
| --- | --- | --- |
| backend | "dictionary" | "dictionary" or "catt" |
| catt_model | "encoder_only" | "encoder_only" (faster) or "encoder_decoder" (more accurate) |
| diacritics_map | {} | Plain -> diacritized mapping (overrides when using CATT) |
| convert_numbers | True | Convert digit sequences to words |
| language | "ar" | "ar" or "en" |
| number_contexts | None (all) | Set of contexts to enable |

CATT Backend

| Class | Description |
| --- | --- |
| CATTBackend(model) | Direct access to CATT auto-diacritization |

Requires pip install langchain-arabic[catt].

Examples

See the examples/ directory for runnable scripts.

Benchmarks

See benchmarks/ for DER/WER evaluation of different diacritization modes.
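
For readers unfamiliar with the metrics: DER (diacritic error rate) is the share of letters whose marks differ from the reference, and WER (word error rate) is the share of words with at least one such letter. The sketch below computes a simplified version of both. It is not the code in benchmarks/, published evaluations differ in details such as how case endings and unmarked letters are counted, and the names split_marks and der_wer are hypothetical.

```python
# Tanwin, fatha, damma, kasra, shadda, sukun (U+064B to U+0652).
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def split_marks(word):
    """Return [base_char, marks] pairs for one word."""
    pairs = []
    for ch in word:
        if ch not in DIACRITICS:
            pairs.append([ch, ""])
        elif pairs:
            pairs[-1][1] += ch  # attach the mark to the preceding letter
    return pairs


def der_wer(reference, hypothesis):
    """Return (DER, WER) for two strings that share the same base letters."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if len(ref_words) != len(hyp_words):
        raise ValueError("word counts differ")
    if not ref_words:
        return 0.0, 0.0
    total_chars = char_errors = word_errors = 0
    for ref_word, hyp_word in zip(ref_words, hyp_words):
        ref_pairs, hyp_pairs = split_marks(ref_word), split_marks(hyp_word)
        if [b for b, _ in ref_pairs] != [b for b, _ in hyp_pairs]:
            raise ValueError("base letters differ")
        # Sorting makes shadda+fatha equal to fatha+shadda.
        errors = sum(
            sorted(r) != sorted(h) for (_, r), (_, h) in zip(ref_pairs, hyp_pairs)
        )
        total_chars += len(ref_pairs)
        char_errors += errors
        word_errors += errors > 0
    return char_errors / total_chars, word_errors / len(ref_words)


reference = "شَرِكَة تِقْنِيَة"
hypothesis = reference.replace("\u0652", "")  # simulate a system that drops every sukun
der, wer = der_wer(reference, hypothesis)
print(f"DER={der:.2%} WER={wer:.2%}")
```

Lower is better for both; WER is always at least as high as DER on the same data, because one wrong mark is enough to count the whole word as an error.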

Development

git clone https://github.com/louaychoum/langchain-arabic.git
cd langchain-arabic
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest --cov=langchain_arabic
ruff check src/

License

MIT
