mahad

An Arabic text processing library intended for use in NLP applications.

Project description

Maha is a text processing library specially developed to deal with Arabic text. The beta version can be used to clean and parse text, files, and folders with or without streaming capability.

If you need help or want to discuss topics related to Maha, feel free to join our Discord server. To submit a bug report or feature request, please open an issue.

Installation

Simply run the following to install Maha:

pip install mahad # pronounced maha d

For source installation, check the documentation.

Quick start

Maha currently provides three main modules: cleaners, parsers, and processors.

Cleaners

Cleaners, as the name suggests, are a set of functions for cleaning text. They can keep, remove, or replace specific characters, normalize characters, and check whether the text contains specific characters.

Examples

>>> from maha.cleaners.functions import keep, remove, contains, replace
>>> sample_text = """
... 1. بِسْمِ اللَّـهِ الرَّحْمَـٰـــنِ الرَّحِيمِ
... 2. In the name of God, the most gracious, the most merciful
... """
>>> keep(sample_text, arabic=True)
'بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ'
>>> keep(sample_text, arabic_letters=True)
'بسم الله الرحمن الرحيم'
>>> keep(sample_text, arabic_letters=True, harakat=True)
'بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ'
>>> remove(sample_text, numbers=True, punctuations=True)
'بِسْمِ اللَّـهِ الرَّحْمَـٰـــنِ الرَّحِيمِ\n In the name of God the most gracious the most merciful'
>>> contains(sample_text, numbers=True)
True
>>> contains(sample_text, hashtags=True, arabic=True, emails=True)
{'arabic': True, 'emails': False, 'hashtags': False}
>>> replace(keep(sample_text, english_letters=True), "God", "Allah")
'In the name of Allah the most gracious the most merciful'

>>> from maha.cleaners.functions import keep, normalize
>>> sample_text = "وَمَا أَرْسَلْنَاكَ إِلَّا رَحْمَةً لِّلْعَالَمِينَ"
>>> keep(sample_text, arabic_letters=True)
'وما أرسلناك إلا رحمة للعالمين'
>>> normalize(keep(sample_text, arabic_letters=True), alef=True, teh_marbuta=True)
'وما ارسلناك الا رحمه للعالمين'
>>> sample_text = 'ﷺ'
>>> normalize(sample_text, ligatures=True)
'صلى الله عليه وسلم'
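To make the effect of the alef and teh_marbuta options more concrete, here is a minimal standard-library sketch of the same kind of character mapping. It is not Maha's implementation: the function name normalize_sketch and the exact set of alef variants covered are assumptions chosen for illustration.

```python
# Illustrative sketch only; Maha's normalize() may cover more characters.
ALEF_VARIANTS = "\u0622\u0623\u0625"  # alef with madda, hamza above, hamza below

NORMALIZATION_TABLE = str.maketrans({
    **{ch: "\u0627" for ch in ALEF_VARIANTS},  # map each variant to bare alef
    "\u0629": "\u0647",                        # teh marbuta -> heh
})


def normalize_sketch(text: str) -> str:
    """Return text with alef variants and teh marbuta replaced (hypothetical helper)."""
    return text.translate(NORMALIZATION_TABLE)


print(normalize_sketch("وما أرسلناك إلا رحمة للعالمين"))
```

A translation table keeps the mapping in one place and leaves every other character untouched, which is why Latin text passes through unchanged.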

>>> from maha.cleaners.functions import reduce_repeated_substring, remove_arabic_letter_dots, connect_single_letter_word
>>> sample_text = "ههههههههههه أضحكني"
>>> reduce_repeated_substring(sample_text)
'هه أضحكني'
>>> remove_arabic_letter_dots(sample_text)
'ههههههههههه أصحكٮى'
>>> connect_single_letter_word('محمد و احمد', waw=True)
'محمد واحمد'
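The idea behind reducing repeated substrings can be sketched with a regular expression backreference. This is only an approximation of what reduce_repeated_substring does in the example above: the helper name and the min_repeated and reduce_to parameters are hypothetical and are not taken from Maha's API.

```python
import re


def reduce_repeats_sketch(text: str, min_repeated: int = 3, reduce_to: int = 2) -> str:
    """Collapse any substring repeated at least min_repeated times in a row.

    Hypothetical helper for illustration; not Maha's implementation.
    """
    # (.+?) lazily captures the shortest unit, \1{n,} requires n more copies of it.
    pattern = re.compile(r"(.+?)\1{%d,}" % (min_repeated - 1))
    return pattern.sub(lambda m: m.group(1) * reduce_to, text)


print(reduce_repeats_sketch("hahahaha lol"))  # haha lol
print(reduce_repeats_sketch("loooool"))       # lool
```

The lazy capture group matters: it makes the pattern prefer the shortest repeating unit, so a long run of one character collapses to two copies of that character rather than to two copies of a longer chunk.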

Parsers

Parsers include a set of rules for extracting values from text. All rules are available through two main functions, parse and parse_dimension.

Examples

Parse characters and simple expressions.

>>> from maha.parsers.functions import parse
>>> sample_text = '@Maha مها هي مكتبة لمساعدتك في التعامل مع النص العربي @مها test@example.com'
>>> parse(sample_text, emails=True)
[Dimension(body=test@example.com, value=test@example.com, start=59, end=75, dimension_type=DimensionType.EMAILS)]
>>> parse(sample_text, english_mentions=True)
[Dimension(body=@Maha, value=@Maha, start=0, end=5, dimension_type=DimensionType.ENGLISH_MENTIONS)]
>>> parse(sample_text, mentions=True, emails=True)
[Dimension(body=test@example.com, value=test@example.com, start=59, end=75, dimension_type=DimensionType.EMAILS), Dimension(body=@Maha, value=@Maha, start=0, end=5, dimension_type=DimensionType.MENTIONS), Dimension(body=@مها, value=@مها, start=54, end=58, dimension_type=DimensionType.MENTIONS)]

Parse time.

>>> from maha.parsers.functions import parse_dimension
>>> from datetime import datetime
>>> now = datetime(2021, 9, 1)
>>> sample_text = 'الثالث من شباط بعد ثلاث سنين يوم السبت الساعة خمسة واثنين واربعين دقيقة العصر'
>>> output_time = parse_dimension(sample_text, time=True)[0]
>>> output_time
Dimension(body=الثالث من شباط بعد ثلاث سنين يوم السبت الساعة خمسة واثنين واربعين دقيقة العصر, value=TimeValue(years=3, am_pm='PM', month=2, day=3, weekday=SA, hour=17, minute=42, second=0, microsecond=0), start=0, end=77, dimension_type=DimensionType.TIME)
>>> output_time.value.is_hours_set()
True
>>> output_time.value + now
datetime.datetime(2024, 2, 3, 17, 42)

>>> from maha.parsers.functions import parse_dimension
>>> from datetime import datetime
>>> now = datetime(2021, 9, 1)
>>> now + parse_dimension('غدا الساعة الحادية عشر', time=True)[0].value
datetime.datetime(2021, 9, 2, 11, 0)
>>> now + parse_dimension('الخميس الأسبوع الجاي عالوحدة ونص المسا', time=True)[0].value
datetime.datetime(2021, 9, 9, 13, 30)
>>> now + parse_dimension('عام الفين وواحد', time=True)[0].value
datetime.datetime(2001, 9, 1, 0, 0)
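The first time example combines a relative field (years=3) with absolute fields (month=2, day=3, hour=17, minute=42). The sketch below reproduces that arithmetic with the standard library only, to show how the two kinds of fields interact. The helper apply_time_fields is hypothetical and is not how TimeValue is implemented; it also ignores edge cases such as shifting February 29 into a non-leap year.

```python
from datetime import datetime


def apply_time_fields(base: datetime, years: int = 0, **absolute: int) -> datetime:
    """Shift base by a relative number of years, then overwrite absolute fields.

    Hypothetical helper for illustration only.
    """
    # An explicit absolute year wins over the relative shift.
    absolute.setdefault("year", base.year + years)
    return base.replace(**absolute)


now = datetime(2021, 9, 1)
result = apply_time_fields(now, years=3, month=2, day=3, hour=17, minute=42)
print(result)            # 2024-02-03 17:42:00
print(result.weekday())  # 5, i.e. Saturday (Monday is 0)
```

The computed date falls on a Saturday, which agrees with the weekday=SA field in the parsed value.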

Parse duration.

>>> from maha.parsers.functions import parse_dimension
>>> output = parse_dimension('شهرين واربعين يوم', duration=True)[0].value
>>> output
DurationValue(values=[ValueUnit(value=2, unit=<DurationUnit.MONTHS: 6>), ValueUnit(value=40, unit=<DurationUnit.DAYS: 4>)], normalized_unit=<DurationUnit.SECONDS: 1>)
>>> print('2 months and 40 days in seconds:', output.normalized_value.value)
2 months and 40 days in seconds: 8640000
>>> parse_dimension('الف وخمسمية دقيقة وساعة', duration=True)[0].value
DurationValue(values=[ValueUnit(value=1, unit=<DurationUnit.HOURS: 3>), ValueUnit(value=1500, unit=<DurationUnit.MINUTES: 2>)], normalized_unit=<DurationUnit.SECONDS: 1>)
>>> parse_dimension('30 مليون ثانية', duration=True)[0].value
DurationValue(values=[ValueUnit(value=30000000, unit=<DurationUnit.SECONDS: 1>)], normalized_unit=<DurationUnit.SECONDS: 1>)
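The normalized value in the first duration example (8640000 seconds for 2 months and 40 days) is consistent with treating a month as 30 days. The following sketch shows that arithmetic; the unit table and the to_seconds helper are assumptions made for illustration and are not part of Maha.

```python
# Assumed conversion factors; a 30-day month matches the README output above.
SECONDS_PER_UNIT = {
    "seconds": 1,
    "minutes": 60,
    "hours": 60 * 60,
    "days": 24 * 60 * 60,
    "months": 30 * 24 * 60 * 60,
}


def to_seconds(parts):
    """Sum (value, unit) pairs into seconds (hypothetical helper)."""
    return sum(value * SECONDS_PER_UNIT[unit] for value, unit in parts)


print(to_seconds([(2, "months"), (40, "days")]))     # 8640000
print(to_seconds([(1, "hours"), (1500, "minutes")]))  # 93600
```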

Parse numeral.

>>> from maha.parsers.functions import parse_dimension
>>> parse_dimension('عشرة', numeral=True)
[Dimension(body=عشرة, value=10, start=0, end=4, dimension_type=DimensionType.NUMERAL)]
>>> parse_dimension('عشرين الف وخمسمية وثلاثة واربعين', numeral=True)[0].value
20543
>>> parse_dimension('حدعشر', numeral=True)[0].value
11
>>> parse_dimension('200 وعشرين', numeral=True)[0].value
220
>>> parse_dimension('عشرين فاصلة اربعة', numeral=True)[0].value
20.4
>>> parse_dimension('10.5 الف', numeral=True)[0].value
10500.0
>>> parse_dimension('مليون وستمية وعشرة', numeral=True)[0].value
1000610
>>> parse_dimension('اطنعش', numeral=True)[0].value
12
>>> parse_dimension('عشرة وعشرين', numeral=True)[0].value
30

Parse ordinal.

>>> from maha.parsers.functions import parse_dimension
>>> parse_dimension('الأول', ordinal=True)
[Dimension(body=الأول, value=1, start=0, end=5, dimension_type=DimensionType.ORDINAL)]
>>> parse_dimension('العاشر', ordinal=True)[0].value
10
>>> parse_dimension('التاسع والخمسين', ordinal=True)[0].value
59
>>> parse_dimension('المئة والثالث والثلاثون', ordinal=True)[0].value
133
>>> parse_dimension('المليون', ordinal=True)[0].value
1000000

Processors

Processors are wrappers around cleaners for cleaning text, files, and folders. There are two types of processors: the simple TextProcessor and FileProcessor, and the streaming StreamTextProcessor and StreamFileProcessor.

Examples

We can use the sample data that comes with Maha.

>>> from pathlib import Path
>>> import maha

>>> resource_path = Path(maha.__file__).parents[1] / "sample_data/surah_al-ala.txt"
>>> data = resource_path.read_text()
>>> print(data)
﷽
   سَبِّحِ اسْمَ رَبِّكَ الْأَعْلَى ﴿1﴾
الَّذِي خَلَقَ فَسَوَّىٰ ﴿2﴾
وَالَّذِي قَدَّرَ فَهَدَىٰ ﴿3﴾
وَالَّذِي أَخْرَجَ الْمَرْعَىٰ ﴿4﴾
فَجَعَلَهُ غُثَاءً أَحْوَىٰ ﴿5﴾
سَنُقْرِئُكَ فَلَا تَنْسَىٰ ﴿6﴾
إِلَّا مَا شَاءَ اللَّهُ ۚ إِنَّهُ يَعْلَمُ الْجَهْرَ وَمَا يَخْفَىٰ ﴿7﴾
وَنُيَسِّرُكَ لِلْيُسْرَىٰ ﴿8﴾
فَذَكِّرْ إِنْ نَفَعَتِ الذِّكْرَىٰ ﴿9﴾
سَيَذَّكَّرُ مَنْ يَخْشَىٰ ﴿10﴾
وَيَتَجَنَّبُهَا الْأَشْقَى ﴿11﴾
الَّذِي يَصْلَى النَّارَ الْكُبْرَىٰ ﴿12﴾
ثُمَّ لَا يَمُوتُ فِيهَا وَلَا يَحْيَىٰ ﴿13﴾
قَدْ أَفْلَحَ مَنْ تَزَكَّىٰ ﴿14﴾
وَذَكَرَ اسْمَ رَبِّهِ فَصَلَّىٰ ﴿15﴾
بَلْ تُؤْثِرُونَ الْحَيَاةَ الدُّنْيَا ﴿16﴾
وَالْآخِرَةُ خَيْرٌ وَأَبْقَىٰ ﴿17﴾
إِنَّ هَٰذَا لَفِي الصُّحُفِ الْأُولَىٰ ﴿18﴾
صُحُفِ إِبْرَاهِيمَ وَمُوسَىٰ ﴿19﴾
<BLANKLINE>
<BLANKLINE>
<BLANKLINE>

>>> from maha.processors import FileProcessor
>>> proc = FileProcessor(resource_path)
>>> cleaned = proc.normalize(all=True).keep(arabic_letters=True).drop_empty_lines()
>>> print(cleaned.text)
بسم الله الرحمن الرحيم
سبح اسم ربك الاعلي
الذي خلق فسوي
والذي قدر فهدي
والذي اخرج المرعي
فجعله غثاء احوي
سنقريك فلا تنسي
الا ما شاء الله انه يعلم الجهر وما يخفي
ونيسرك لليسري
فذكر ان نفعت الذكري
سيذكر من يخشي
ويتجنبها الاشقي
الذي يصلي النار الكبري
ثم لا يموت فيها ولا يحيي
قد افلح من تزكي
وذكر اسم ربه فصلي
بل توثرون الحياه الدنيا
والاخره خير وابقي
ان هذا لفي الصحف الاولي
صحف ابراهيم وموسي

>>> unique_char = cleaned.get(unique_characters=True)
>>> unique_char.sort()
>>> unique_char
[' ', 'ء', 'ا', 'ب', 'ت', 'ث', 'ج', 'ح', 'خ', 'د', 'ذ', 'ر', 'ز', 'س', 'ش', 'ص', 'ع', 'غ', 'ف', 'ق', 'ك', 'ل', 'م', 'ن', 'ه', 'و', 'ي']

An additional step is required for stream processors: call the process_and_save function after calling at least one cleaning function.

...
from maha.processors import StreamFileProcessor

proc = StreamFileProcessor(resource_path)
cleaned = proc.normalize(all=True).keep(arabic_letters=True).drop_empty_lines()
# ----------------
cleaned.process_and_save('output_file.txt')
# ----------------
...
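Conceptually, a streaming processor records the cleaning steps and only runs them when the data is read and written, so the whole file never has to sit in memory. The sketch below illustrates that pattern with the standard library alone. The stream_clean function, its parameters, and the per-line processing are assumptions for illustration; Maha's StreamFileProcessor may batch and process text differently.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable


def stream_clean(src: Path, dst: Path, steps: Iterable[Callable[[str], str]]) -> int:
    """Apply each step to every line of src, write non-empty results to dst.

    Hypothetical helper; returns the number of lines written.
    """
    steps = list(steps)
    written = 0
    with src.open(encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
        for line in fin:  # the file is read lazily, one line at a time
            line = line.rstrip("\n")
            for step in steps:
                line = step(line)
            if line.strip():  # drop lines that ended up empty
                fout.write(line + "\n")
                written += 1
    return written


def drop_digits(line: str) -> str:
    return "".join(ch for ch in line if not ch.isdigit())


with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp) / "in.txt", Path(tmp) / "out.txt"
    src.write_text("a1\n\n b2 \n", encoding="utf-8")
    kept = stream_clean(src, dst, [drop_digits, str.strip])
    result = dst.read_text(encoding="utf-8")

print(kept, repr(result))
```

Deferring the work to a single save step is what makes chained calls cheap: nothing is read or written until the pipeline is complete.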

Documentation

Documentation is hosted on Read the Docs.

Contributing

Maha welcomes contributions from everyone. See the contribution guidelines in the documentation to get started.

License

Maha is BSD-licensed.

Release history

0.3.0 (Apr 4, 2022)
0.2.0 (Nov 16, 2021)
0.1.2 (Sep 18, 2021)
0.1.1 (Sep 17, 2021), this version

Download files

Source Distribution

mahad-0.1.1.tar.gz (214.8 kB), uploaded Sep 17, 2021

Built Distribution

mahad-0.1.1-py3-none-any.whl (283.7 kB), uploaded Sep 17, 2021, Python 3

Hashes for mahad-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`1b42a3bc28fdbad524a0ca7a29b3f1e4acd9adfed5edd4125bb77eb11d0b13fa`
MD5	`c6b22b0a35649442f3842ba4b1ed2765`
BLAKE2b-256	`37e369228a650fe6ebef270789cd9110c94aa3c5c9dc45ae25fa3f1cdf75907b`

Hashes for mahad-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ae63d0f2510d63bcc6a67e375f86bc41d8f3bfc1f836e0af78bf7bdc678ffd8f`
MD5	`3ca67584e654e2a87cf372812e56e171`
BLAKE2b-256	`6976b9ed771c28812cf4211098ea6283d7be68067331f9d47ca4693b777ae69a`