Arabic text processor tool.

Matn | مَتن

A shared space for Arabic text processors.

1. Getting started

pip install matn

2. Counters

2.1. Jummal | حِسَاب ٱلْجُمَّل

Also known as Abjad numerals: a decimal alphabetic numeral system in which each of the 28 letters of the Arabic alphabet is assigned a numerical value. They have been used in the Arabic-speaking world since before the eighth century, when positional Arabic numerals were adopted.

2.1.1. Methods

There are several conventions for computing jummal, and they yield different values.

  1. The normal method, which doesn't count the hamza.
  2. The method that counts the hamza as a separate character.
  3. The tarkeeb method, used to express numbers from 2000 to 1,000,000 with a rule based on the letter "غ". The rule is simple: the value of any character that comes directly before "غ" is multiplied by 1000 instead of being added to it.
  4. The normalized hamzas method, which treats all hamza forms as a regular alef instead of the letter the hamza appears on. This option defaults to False.
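
The four methods above can be sketched in plain Python. This is an illustration only, not matn's actual implementation: the name jummal_sketch, the VARIANTS table, and the values given to hamza carriers, ة and ى are assumptions chosen to be consistent with the usage examples in this README. The sketch also applies the tarkeeb rule only when a letter is immediately followed by "غ" in the raw string, so diacritics between the two letters would defeat it.

```python
# Standard abjad values for the 28 letters.
ABJAD = {
    "ا": 1, "ب": 2, "ج": 3, "د": 4, "ه": 5, "و": 6, "ز": 7, "ح": 8, "ط": 9,
    "ي": 10, "ك": 20, "ل": 30, "م": 40, "ن": 50, "س": 60, "ع": 70, "ف": 80,
    "ص": 90, "ق": 100, "ر": 200, "ش": 300, "ت": 400, "ث": 500, "خ": 600,
    "ذ": 700, "ض": 800, "ظ": 900, "غ": 1000,
}
# Assumption: a letter variant takes the value of the letter it is written on.
VARIANTS = {
    "أ": "ا", "إ": "ا", "آ": "ا", "ٱ": "ا",
    "ؤ": "و", "ئ": "ي", "ى": "ي", "ة": "ه",
}
HAMZA_FORMS = {"أ", "إ", "آ", "ؤ", "ئ"}


def letter_value(ch, use_hamza=False, normalize_hamza=False):
    """Value of one character; anything outside the tables counts as 0."""
    if ch == "ء":
        # Methods 1 and 2: the standalone hamza is worth 0 or 1.
        return 1 if use_hamza else 0
    if normalize_hamza and ch in HAMZA_FORMS:
        # Method 4: every hamza form counts as a regular alef.
        return ABJAD["ا"]
    return ABJAD.get(VARIANTS.get(ch, ch), 0)


def jummal_sketch(text, use_hamza=False, normalize_hamza=False, use_tarkeeb=False):
    total, i = 0, 0
    while i < len(text):
        value = letter_value(text[i], use_hamza, normalize_hamza)
        next_is_ghayn = text[i + 1 : i + 2] == "غ"
        if use_tarkeeb and value and text[i] != "غ" and next_is_ghayn:
            # Method 3: the letter before "غ" is multiplied by 1000,
            # and the "غ" itself is consumed rather than added.
            total += value * 1000
            i += 2
        else:
            total += value
            i += 1
    return total


print(jummal_sketch("شغل"))                    # 300 + 1000 + 30
print(jummal_sketch("شغل", use_tarkeeb=True))  # 300 * 1000 + 30
```

Spaces, tatweel, and diacritics fall through to a value of 0, so they do not need to be stripped beforehand in this sketch.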

2.1.2. Usage

Python
>>> from matn.counters import jummal

>>> text = "شغل الدموع عن الديار بكاؤنا   لبكاء فاطمــة على أولادها"

>>> jummal(text)
2_273  # شغ's value is 1000 + 300 and hamza value is 0

# To include Hamza count
>>> jummal(text, use_hamza=True)
2_274  # شغ's value is 1000 + 300 and hamza value is 1

# To include hamza normalization
>>> jummal(text, normalize_hamza=True)
2_268  # شغ's value is 1000 + 300, hamza value is 0, and ؤ value is 1

# To use tarkeeb
>>> jummal(text, use_tarkeeb=True)
300_973  # شغ's value is 300 * 1000 and hamza value is 0

# To use hamza and tarkeeb
>>> jummal(text, use_hamza=True, use_tarkeeb=True)
300_974  # شغ's value is 300 * 1000 and hamza value is 1
CLI
matn jummal "شغل الدموع عن الديار بكاؤنا   لبكاء فاطمــة على أولادها"

# To include Hamza count
matn jummal --use-hamza "شغل الدموع عن الديار بكاؤنا   لبكاء فاطمــة على أولادها"

# To use tarkeeb
matn jummal --use-tarkeeb "شغل الدموع عن الديار بكاؤنا   لبكاء فاطمــة على أولادها"

# To normalize hamza
matn jummal --normalize-hamza "شغل الدموع عن الديار بكاؤنا   لبكاء فاطمــة على أولادها"

# All methods at once
matn jummal -z -n -t  "شغل الدموع عن الديار بكاؤنا   لبكاء فاطمــة على أولادها"

2.2. Word Count

Counts the number of words in a given string.

2.2.1. Methods

The basic method is straightforward. However, some researchers split certain words into multiple parts. The only such word handled so far is بعدما. The word_count method lets you either split it into two words or count it as one.
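
A minimal sketch of this idea follows, assuming words are separated by whitespace and that بعدما is recognized after removing diacritics. The names word_count_sketch and strip_marks are hypothetical, and matn's real matching rules (for example, whether prefixed forms are also split) may differ.

```python
import unicodedata

BADAMA = "\u0628\u0639\u062f\u0645\u0627"  # the bare letters of بعدما


def strip_marks(word):
    """Drop combining marks (Unicode category Mn), i.e. the diacritics."""
    return "".join(c for c in word if unicodedata.category(c) != "Mn")


def word_count_sketch(text, split_badama=False):
    words = text.split()
    count = len(words)
    if split_badama:
        # Each occurrence of بعدما counts as two words instead of one.
        count += sum(1 for w in words if strip_marks(w) == BADAMA)
    return count
```

Comparing against the stripped form means the same check works for both vocalized and unvocalized text.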

2.2.2. Usage

Python
>>> from matn.counters import word_count

>>> text = "فَمَنۢ بَدَّلَهُۥ بَعۡدَمَا سَمِعَهُۥ"

>>> word_count(text)
4

# To split badama
>>> word_count(text, split_badama=True)
5  # بَعۡدَمَا was split into two words
CLI
matn wc "فَمَنۢ بَدَّلَهُۥ بَعۡدَمَا سَمِعَهُۥ"

# To split badama
matn wc --split-badama "فَمَنۢ بَدَّلَهُۥ بَعۡدَمَا سَمِعَهُۥ"

2.3. Char Count

Counts the number of characters in a given string.

2.3.1. Methods

  • In some cases we need to count spaces as separate characters; in others we don't.
  • In some cases we count the hamza-madda (أٓ) character as two characters. This character appears in the word الأٓخرة, for example.
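
The two options can be sketched as below. This is an assumption-laden illustration, not matn's code: char_count_sketch is a hypothetical name, diacritics are taken to be the characters in Unicode category Mn, and hamza-madda is taken to be the sequence U+0623 followed by U+0653.

```python
import unicodedata

# Assumed encoding of hamza-madda: alef with hamza above + combining maddah.
HAMZA_MADDA = "\u0623\u0653"


def char_count_sketch(text, include_spaces=False, hamza_madda=False):
    count = 0
    for ch in text:
        if ch.isspace():
            # Spaces are counted only on request.
            count += 1 if include_spaces else 0
        elif unicodedata.category(ch) != "Mn":
            # Base letters count; combining diacritics do not.
            count += 1
    if hamza_madda:
        # Count each hamza-madda as two characters instead of one.
        count += text.count(HAMZA_MADDA)
    return count
```

Note that this sketch would also count tatweel and other non-combining signs as characters, which a real implementation may want to exclude.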

2.3.2. Usage

Python
>>> from matn.counters import char_count

>>> text = "ٱلدَّارُ ٱلۡأٓخِرَةُ"

>>> char_count(text)
11

# To Include spaces
>>> char_count(text, include_spaces=True)
12

# To Include hamza-madda
>>> char_count(text, hamza_madda=True)
12

# To Include hamza-madda and spaces
>>> char_count(text, include_spaces=True, hamza_madda=True)
13
CLI
matn cc "ٱلدَّارُ ٱلۡأٓخِرَةُ"

# To Include hamza-madda
matn cc --hamza-madda "ٱلدَّارُ ٱلۡأٓخِرَةُ"

# To Include spaces
matn cc --include-spaces "ٱلدَّارُ ٱلۡأٓخِرَةُ"

# To Include hamza-madda and spaces
matn cc --include-spaces --hamza-madda "ٱلدَّارُ ٱلۡأٓخِرَةُ"
