Davat (دوات)
A Python library for normalizing and cleaning Persian text. Composable, single-purpose functions that can be used individually or combined via a pipeline.
Installation
pip install davat
Quick start
from davat import clean
text = """
متنی برای برسی تابع تمیز کردن متن
که #هشتگ_ها را خیلی عاااااللللییییی!!!! تبدیل به متن عادی میکند!
منشنها @mh_salari و لینکها www.mh-salari.ir را حذف میکند.
حروف غیر فارسی a b c d و اموجیها :( 🐈 را حذف میکند
علائم دستوری/نگارشی ?!٫ را حذف نمیکند
و ...
http://localhost:8888
"""
print(clean(text))
clean() applies the default Persian pipeline (PERSIAN_STEPS): remove links, mentions, hashtags, emojis, normalize Persian text, fix repeated punctuation, strip non-Persian characters, and collapse extra spaces.
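Conceptually, a step pipeline like this is just a left fold over functions: each step takes a string and returns a string, and the output of one feeds the next. A minimal sketch of that composition pattern (illustrative only, not davat's actual implementation, with toy steps standing in for davat's functions):

```python
from functools import reduce

def run_pipeline(text, steps):
    # Apply each step in order, feeding the output of one into the next
    return reduce(lambda acc, step: step(acc), steps, text)

# Toy steps standing in for davat's string-to-string functions
steps = [str.strip, str.lower, lambda s: s.replace("  ", " ")]
print(run_pipeline("  Hello  World  ", steps))  # hello world
```

Because every step has the same `str -> str` shape, any plain function (or `functools.partial` of one) can slot into the list.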
Normalize Persian text
from davat import normalize_persian
>>> normalize_persian("بِسْمِ اللَّهِ الرَّحْمنِ الرَّحِيمِ")
'بسم الله الرحمن الرحیم'
>>> normalize_persian("این یك متن تست است که حروف عربي ، کشیـــــده \n'اعداد 12345' و... دارد که می خواهیم آن را نرمالایز کنیم .")
"این یک متن تست است که حروف عربی، کشیده\n«اعداد ۱۲۳۴۵» و … دارد که میخواهیم آن را نرمالایز کنیم."
normalize_persian() handles: whitespace normalization, diacritic removal, keshide removal, Arabic-to-Persian character mapping, digit conversion, quotation normalization, ZWNJ fixes for common affixes (می/نمی, ها/های, تر/ترین, etc.), punctuation spacing, and repeated character collapse.
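As one illustration of the ZWNJ fixes, the verbal prefix می/نمی should attach to the following word with a zero-width non-joiner rather than a plain space. A rough sketch of such a rule (a hypothetical regex for illustration, not davat's actual pattern, and deliberately naive about standalone می):

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner

def fix_mi_prefix(text):
    # Replace the space after the verbal prefix می/نمی with a ZWNJ,
    # so "می خواهیم" becomes "می‌خواهیم"
    return re.sub(r"\b(ن?می) ", r"\1" + ZWNJ, text)

print(fix_mi_prefix("می خواهیم"))  # می‌خواهیم
```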
By default, exaggerated text like عاااللللییییی is collapsed to عالی (3+ repeated characters → single).
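The default collapse rule can be sketched with a single backreference regex (illustrative, not davat's actual code): any character repeated three or more times is reduced to one occurrence.

```python
import re

def simple_collapse(text):
    # Reduce runs of 3+ identical characters to a single character;
    # runs of exactly 2 are left alone
    return re.sub(r"(.)\1{2,}", r"\1", text)

print(simple_collapse("عاااللللییییی"))  # عالی
```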
Dictionary-aware repeated character collapse
Set use_dictionary=True to enable dictionary-aware collapse using a bundled 453K Persian word list. This preserves legitimate doubled letters in words like الله, موسسه, تردد:
# Default (no dictionary): simple collapse, 3+ → single
>>> normalize_persian("اللله")
'اله' # loses the legitimate لل
# With dictionary: preserves legitimate doubles
>>> normalize_persian("اللله", use_dictionary=True)
'الله' # dictionary knows الله has لل
>>> normalize_persian("موسسسسسه", use_dictionary=True)
'موسسه' # dictionary knows موسسه has سس
>>> normalize_persian("تردددد", use_dictionary=True)
'تردد' # dictionary knows تردد has دد
Tradeoff: because collapse candidates are checked against the dictionary, some informal/slang words may not collapse to what you'd expect. For example, نهههه collapses to نهه (not نه) because نهه is a valid dictionary word; likewise ههههههه becomes هه because هه is a real word. This is the price of preserving legitimate doubles like الله, موسسه, تردد, and محقق. In practice it is the better tradeoff: breaking a real word (الله → اله) is worse than keeping an extra letter in slang text.
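The behavior described above can be sketched as a candidate search: for each run of 3+ identical characters, try keeping a double before a single, and return the first candidate found in the dictionary, falling back to the simple collapse otherwise. This is a hypothetical reconstruction of the idea (the function name, candidate ordering, and tiny `vocab` set are assumptions for illustration, not davat's internals):

```python
import re
from itertools import product

def collapse_with_dictionary(word, vocab):
    # Split the word into runs of identical characters
    runs = [(m.group(1), len(m.group(0))) for m in re.finditer(r"(.)\1*", word)]
    # For each 3+ run, try keeping a double first, then a single;
    # shorter runs are kept as-is
    options = [[ch * 2, ch] if n >= 3 else [ch * n] for ch, n in runs]
    # Return the first combination that is a known word
    for candidate in product(*options):
        w = "".join(candidate)
        if w in vocab:
            return w
    # Fallback: simple collapse, 3+ -> single
    return re.sub(r"(.)\1{2,}", r"\1", word)

vocab = {"الله", "موسسه", "تردد"}  # stand-in for the bundled 453K-word list
print(collapse_with_dictionary("اللله", vocab))  # الله
```

Trying doubles before singles is what preserves words like الله, and also what produces نهه from نهههه when نهه happens to be in the dictionary.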
Individual functions
Every function takes a string and returns a string. Use them independently:
from davat import (
convert_digits,
remove_links,
remove_mentions,
remove_hashtags,
remove_emojis,
remove_markdown,
normalize_persian,
remove_punctuations,
fix_multiple_punctuations,
remove_ellipsis,
strip_characters,
remove_extra_spaces,
)
>>> remove_links("سلام https://example.com دنیا")
'سلام دنیا'
>>> remove_mentions("سلام @user دنیا")
'سلام دنیا'
>>> remove_hashtags("#سلام دنیا")
' سلام دنیا'
>>> remove_hashtags("#سلام دنیا", keep_text=False)
' دنیا'
>>> remove_emojis("سلام 😀 دنیا")
'سلام دنیا'
>>> remove_markdown("**bold** and [link](http://x.com)")
'bold and link'
>>> convert_digits("123", to="fa")
'۱۲۳'
>>> convert_digits("۱۲۳", to="en")
'123'
>>> strip_characters("hello سلام world", keep="fa")
' سلام '
>>> strip_characters("hello سلام world", keep=["fa", "en"])
'hello سلام world'
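Most of these functions are small, regex-shaped transforms. The sketches below show the kind of patterns such functions typically use (hypothetical stand-ins, not davat's actual regexes, which are likely more thorough; the `_sketch` names are invented for illustration):

```python
import re

def remove_links_sketch(text):
    # Drop http(s):// URLs and bare www. domains, then tidy spaces
    text = re.sub(r"(?:https?://|www\.)\S+", "", text)
    return re.sub(r"\s{2,}", " ", text).strip()

def remove_mentions_sketch(text):
    # Drop @handles (letters, digits, underscores), then tidy spaces
    text = re.sub(r"@\w+", "", text)
    return re.sub(r"\s{2,}", " ", text).strip()

# Digit conversion is a plain character-for-character translation
EN_TO_FA = str.maketrans("0123456789", "۰۱۲۳۴۵۶۷۸۹")
print("123".translate(EN_TO_FA))  # ۱۲۳
```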
Custom pipelines
Build your own pipeline with clean() and steps:
from functools import partial
from davat import clean, remove_links, remove_emojis, strip_characters, remove_extra_spaces
# Multilingual: keep Persian, English, and Arabic
steps = [
remove_links,
remove_emojis,
partial(strip_characters, keep=["fa", "en", "ar"]),
remove_extra_spaces,
]
>>> clean("hello سلام مرحبا https://x.com 😀 שלום", steps=steps)
'hello سلام مرحبا'
Preset pipelines
from davat import PERSIAN_STEPS, MINIMAL_STEPS
# PERSIAN_STEPS (default): full Persian cleaning pipeline
# MINIMAL_STEPS: just remove links, emojis, and extra spaces
>>> clean("سلام https://x.com 😀 hello", steps=MINIMAL_STEPS)
'سلام hello' # minimal doesn't strip non-Persian
Thanks to
- Persian-Words-Database for the Persian word list
- Persian poems corpus
- Hazm
- Parsivar