
Davat (دوات)

A Python library for normalizing and cleaning Persian text. It provides composable, single-purpose functions that can be used on their own or combined into a pipeline.

Installation

pip install davat

Quick start

from davat import clean

text = """
متنی برای برسی تابع تمیز کردن متن
که #هشتگ_ها را خیلی عاااااللللییییی!!!! تبدیل به متن عادی می‌کند!
منشن‌ها @mh_salari و لینک‌ها www.mh-salari.ir را حذف می‌کند.
حروف غیر فارسی  a b c d و اموجی‌ها :( 🐈‍ را حذف می‌کند
علائم دستوری/نگارشی ?!٫ را حذف نمی‌کند
و ...
http://localhost:8888
"""

print(clean(text))

clean() applies the default Persian pipeline (PERSIAN_STEPS): it removes links, mentions, hashtags, and emojis, normalizes the Persian text, fixes repeated punctuation, strips non-Persian characters, and collapses extra spaces.
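The pipeline idea itself is plain function composition. A minimal stand-alone sketch (with hypothetical stand-in steps, not davat's own implementations) looks like:

```python
import re
from functools import reduce

# Hypothetical stand-in steps; davat ships its own, more careful versions.
def drop_links(text):
    return re.sub(r"https?://\S+|www\.\S+", "", text)

def squeeze_spaces(text):
    return re.sub(r" {2,}", " ", text).strip()

def run_pipeline(text, steps):
    # Apply each step left to right, feeding each output into the next step.
    return reduce(lambda acc, step: step(acc), steps, text)

cleaned = run_pipeline("سلام  www.mh-salari.ir  دنیا", [drop_links, squeeze_spaces])
```

Because every step is just a `str -> str` function, adding, removing, or reordering steps is trivial.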

Normalize Persian text

from davat import normalize_persian

>>> normalize_persian("بِسْمِ اللَّهِ الرَّحْمنِ الرَّحِيمِ")
'بسم الله الرحمن الرحیم'

>>> normalize_persian("این یك متن تست است که حروف عربي ، کشیـــــده \n'اعداد 12345' و... دارد     که می خواهیم آن را نرمالایز کنیم .")
"این یک متن تست است که حروف عربی، کشیده\n«اعداد ۱۲۳۴۵» و …  دارد  که می‌خواهیم آن را نرمالایز کنیم."

normalize_persian() handles: whitespace normalization, diacritic removal, keshide removal, Arabic-to-Persian character mapping, digit conversion, quotation normalization, ZWNJ fixes for common affixes (می/نمی, ها/های, تر/ترین, etc.), punctuation spacing, and repeated character collapse.
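The ZWNJ fix for the می/نمی verb prefix, for instance, can be approximated with a single regex. This is a simplified illustration, not davat's actual rule set, which covers more affixes and edge cases:

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner

def fix_mi_prefix(text):
    # Replace the space after a standalone می/نمی prefix with a ZWNJ,
    # e.g. "می خواهیم" -> "می‌خواهیم". A real normalizer applies more checks.
    return re.sub(r"(?<!\S)(ن?می) (?=\S)", r"\1" + ZWNJ, text)
```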

By default, exaggerated text like عاااللللییییی is collapsed to عالی (3+ repeated characters → single).

Dictionary-aware repeated character collapse

Set use_dictionary=True to enable dictionary-aware collapse using a bundled 453K Persian word list. This preserves legitimate doubled letters in words like الله, موسسه, تردد:

# Default (no dictionary): simple collapse, 3+ → single
>>> normalize_persian("اللله")
'اله'                     # loses the legitimate لل

# With dictionary: preserves legitimate doubles
>>> normalize_persian("اللله", use_dictionary=True)
'الله'                    # dictionary knows الله has لل

>>> normalize_persian("موسسسسسه", use_dictionary=True)
'موسسه'                   # dictionary knows موسسه has سس

>>> normalize_persian("تردددد", use_dictionary=True)
'تردد'                    # dictionary knows تردد has دد

Tradeoff: dictionary lookup means some informal/slang words may not collapse to what you'd expect. For example, نهههه collapses to نهه (not نه) because نهه is a valid dictionary word; similarly, ههههههه becomes هه because هه is a real word. This is the price of preserving legitimate doubles such as الله, موسسه, تردد, and محقق. In practice it is the better tradeoff: breaking real words (الله → اله) is worse than keeping an extra letter in slang text.
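One plausible way to implement this behavior is a two-pass strategy: first collapse runs to doubles and keep the result if it is a dictionary word, otherwise fall back to collapsing to singles. This is a sketch under that assumption; davat's actual algorithm may differ:

```python
import re

# Toy dictionary standing in for the bundled 453K-word list.
WORDS = {"الله", "موسسه", "تردد"}

def collapse_with_dict(word, dictionary):
    doubled = re.sub(r"(.)\1{2,}", r"\1\1", word)   # runs of 3+ -> exactly 2
    if doubled in dictionary:
        return doubled                               # legitimate double, keep it
    return re.sub(r"(.)\1{2,}", r"\1", word)         # otherwise runs of 3+ -> 1
```

The fallback matches the default (non-dictionary) behavior, so unknown words still collapse fully.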

Individual functions

Every function takes a string and returns a string. Use them independently:

from davat import (
    convert_digits,
    remove_links,
    remove_mentions,
    remove_hashtags,
    remove_emojis,
    remove_markdown,
    normalize_persian,
    remove_punctuations,
    fix_multiple_punctuations,
    remove_ellipsis,
    strip_characters,
    remove_extra_spaces,
)

>>> remove_links("سلام https://example.com دنیا")
'سلام  دنیا'

>>> remove_mentions("سلام @user دنیا")
'سلام  دنیا'

>>> remove_hashtags("#سلام دنیا")
' سلام دنیا'

>>> remove_hashtags("#سلام دنیا", keep_text=False)
' دنیا'

>>> remove_emojis("سلام 😀 دنیا")
'سلام  دنیا'

>>> remove_markdown("**bold** and [link](http://x.com)")
'bold and link'

>>> convert_digits("123", to="fa")
'۱۲۳'

>>> convert_digits("۱۲۳", to="en")
'123'
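Digit conversion is at heart a character-for-character mapping, which in plain Python can be expressed with str.translate (an equivalent stdlib sketch, not davat's source):

```python
EN_DIGITS = "0123456789"
FA_DIGITS = "۰۱۲۳۴۵۶۷۸۹"

# Build translation tables in both directions.
TO_FA = str.maketrans(EN_DIGITS, FA_DIGITS)
TO_EN = str.maketrans(FA_DIGITS, EN_DIGITS)

fa = "123".translate(TO_FA)
en = "۱۲۳".translate(TO_EN)
```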

>>> strip_characters("hello سلام world", keep="fa")
' سلام '

>>> strip_characters("hello سلام world", keep=["fa", "en"])
'hello سلام world'
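Stripping by script can be approximated with Unicode block ranges. A rough sketch for keep="fa", assuming the Arabic block U+0600–U+06FF (which covers Persian letters); davat's exact character ranges are its own:

```python
import re

def keep_persian(text):
    # Drop everything outside the Arabic Unicode block, keeping whitespace.
    return re.sub(r"[^\u0600-\u06FF\s]", "", text)
```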

Custom pipelines

Build your own pipeline by passing a custom steps list to clean():

from functools import partial
from davat import clean, remove_links, remove_emojis, strip_characters, remove_extra_spaces

# Multilingual: keep Persian, English, and Arabic
steps = [
    remove_links,
    remove_emojis,
    partial(strip_characters, keep=["fa", "en", "ar"]),
    remove_extra_spaces,
]

>>> clean("hello سلام مرحبا https://x.com 😀 שלום", steps=steps)
'hello سلام مرحبا'

Preset pipelines

from davat import PERSIAN_STEPS, MINIMAL_STEPS

# PERSIAN_STEPS (default): full Persian cleaning pipeline
# MINIMAL_STEPS: just remove links, emojis, and extra spaces

>>> clean("سلام https://x.com 😀 hello", steps=MINIMAL_STEPS)
'سلام  hello'  # minimal doesn't strip non-Persian
