Skip to main content

maskouk: Arabic Dictionary for Collocations - python + sqlite

Project description

Arabic collocations library and data for Python +SQLite API maskouk logo

downloads downloads2

Developpers: Taha Zerrouki: http://tahadz.com taha dot zerrouki at gmail dot com

Feature s

value

Authors

Authors.md

Release

0.1

License

GPL

Tracker

linuxscout/maskouk/Issues

Website

http://maskouk.sourceforge.net

Source

Github

Downloa d

sourceforge

Feedbac ks

Comments

Account s

[@Twitter](https://twitter.com/linuxscout) [@Sourceforge](http://sourceforge.net/projects/maskouk/)

Description

Maskouk is a database of arab ic collocations extracted from Wikipedia.

Arabic wikipedia data base 2011-Jun-21.

install

pip install maskouk-pysqlite

Usage

import

>>> import pyarabic.araby as araby
>>> import maskouk.collocations as msk
>>> mydict = msk.CollocationClass()

Test if collocation exists in database

>>> wlist = [u"كرة", u"القدم"]
>>> # test if collocation exists
>>> results = mydict.is_collocated(wlist)
>>> print("inuput:", wlist)
>>> print("output:",results)
inuput: ['كرة', 'القدم']
output: كرة القدم
>>> wlist = [u"شمس", u"النهار"]
>>> results = mydict.is_collocated(wlist)
>>> print("inuput:", wlist)
>>> print("output:",results)
inuput: ['شمس', 'النهار']
output: False

Test if a word has collocations in database

>>> # get all collocations for a specific word
>>> word1 = u"كرة"
>>> results  = mydict.is_collocated_word(word1)
>>> print("inuput:", word1)
>>> print("output:",results)
inuput: كرة
output: {'القدم': 'كُرَة الْقَدَمِ'}
>>>
>>> word = u"بيت"
>>> # get all collocations for a specific word
>>> results  = mydict.is_collocated_word(word)
>>> print("inuput:", word)
>>> print("output:",results)
inuput: بيت
output: {'العدة': 'بَيْت الْعِدَّةِ', 'المستأجر': 'بَيْت الْمُسْتَأْجِرِ', 'المشتري': 'بَيْتِ الْمُشْتَرِي', 'الرجل': 'بَيْت الرَّجُلِ', 'البناء': 'بَيْت الْبِنَاءِ', 'الزوج': 'بَيْت الزَّوْجِ', 'المال': 'بيت المال', 'المقدس': 'بَيْت الْمَقْدِسِ', 'البائع': 'بَيْت الْبَائِعِ', 'الخلاء': 'بَيْت الْخَلَاءِ', 'الأب': 'بَيْت الْأَبِ', 'الله': 'بَيْت اللّهِ'}

Detect collocation in a phrase

It can be presented asseparated lists or tagged forms

>>> # detect collocations in phrase
>>> text = u"لعبنا مباراة كرة القدم في بيت المقدس"
>>> wordlist = araby.tokenize(text)
>>> results  = mydict.ngramfinder(2, wordlist)
>>> print("inuput:", text)
>>> print("output:",results)
inuput: لعبنا مباراة كرة القدم في بيت المقدس
output: ['لعبنا', 'مباراة', 'كرة القدم', 'في', 'بيت المقدس']
>>> # detect collocations in phrase
>>> text = u"لعبنا مباراة كرة القدم في بيت المقدس"
>>> wordlist = araby.tokenize(text)
>>> results   = mydict.lookup(wordlist)
>>> print("inuput:", text)
>>> print("output:",results)
inuput: لعبنا مباراة كرة القدم في بيت المقدس
output: (['لعبنا', 'مباراة', 'كُرَة', 'الْقَدَمِ', 'في', 'بَيْت', 'الْمَقْدِسِ'], ['CO', 'CO', 'CB', 'CI', 'CO', 'CB', 'CI'])
>>>

detect long collocations in a phrase

Some collocations are too long to be used in a bigrams database like “بسم الله الرحمن الرحيم” “السلام عليكم ورحمة الله وبركاته” “أهلا وسهلا بكم”

>>> # get Long collocations
... text = u" قلت لهم السلام عليكم ورحمة الله تعالى وبركاته ثم رجعت"
>>> results  = mydict.lookup4long_collocations(text)
>>> print("inuput:", text)
inuput:  قلت لهم السلام عليكم ورحمة الله تعالى وبركاته ثم رجعت
>>> print("output:",results)
output:  قلت لهم السّلامُ عَلَيكُمْ وَرَحْمَةُ اللهِ تَعَالَى وبركاته ثم رجعت

Detect candidate collocations in phrase

The candidate collocation doesn’t exists in the database, this feature is used to extract collocations based on rules. It returns a rule code, 100 as default (no collocation)

>>> text = u"ظهر رئيس الوزراء السيد عبد الملك بن عامر ومعه أمير دولة غرناطة ونهر النيل انطلاق السباق"
>>> wordlist = araby.tokenize(text)
>>> previous = "__"
>>> for wrd in wordlist:
...     wlist = [previous, wrd]
...     results  = mydict.is_possible_collocation(wlist, lenght = 2)
...     print("inuput:", wlist)
...     print("output:", results)
...     previous  = wrd
...
inuput: ['__', 'ظهر']
output: 100
inuput: ['ظهر', 'رئيس']
output: 100
inuput: ['رئيس', 'الوزراء']
output: 100
inuput: ['الوزراء', 'السيد']
output: 20
inuput: ['السيد', 'عبد']
output: 100
inuput: ['عبد', 'الملك']
output: 15
inuput: ['الملك', 'بن']
output: 100
inuput: ['بن', 'عامر']
output: 15
inuput: ['عامر', 'ومعه']
output: 100
inuput: ['ومعه', 'أمير']
output: 100
inuput: ['أمير', 'دولة']
output: 100
inuput: ['دولة', 'غرناطة']
output: 10
inuput: ['غرناطة', 'ونهر']
output: 100
inuput: ['ونهر', 'النيل']
output: 100
inuput: ['النيل', 'انطلاق']
output: 100
inuput: ['انطلاق', 'السباق']
output: 100
>>>

[requirement]

1- pyarabic
2. sqlite

Data Structure:

Colocations database

CREATE TABLE "collocations" (
    "id" INTEGER PRIMARY KEY  NOT NULL ,
    "vocalized" VARCHAR,
    "unvocalized" VARCHAR,
    "rule" VARCHAR,
    "category" VARCHAR,
    "note" VARCHAR,
    "first" VARCHAR,
    "second" VARCHAR
    );

CSV Structure:

  1. id : id unique in the database

  2. vocalized : vocalized collocation

  3. unvocalized : unvocalized collocation

  4. rule : the extraction rule number

  5. category : collocation category

  6. note :

  7. first: first word

  8. second: second word

Project details


Release history Release notifications | RSS feed

This version

0.1

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

maskouk_pysqlite-0.1.tar.gz (2.5 MB view hashes)

Uploaded source

Built Distributions

maskouk_pysqlite-0.1-py3-none-any.whl (6.0 MB view hashes)

Uploaded py3

maskouk_pysqlite-0.1-py2-none-any.whl (6.0 MB view hashes)

Uploaded py2

Supported by

AWS AWS Cloud computing Datadog Datadog Monitoring Facebook / Instagram Facebook / Instagram PSF Sponsor Fastly Fastly CDN Google Google Object Storage and Download Analytics Huawei Huawei PSF Sponsor Microsoft Microsoft PSF Sponsor NVIDIA NVIDIA PSF Sponsor Pingdom Pingdom Monitoring Salesforce Salesforce PSF Sponsor Sentry Sentry Error logging StatusPage StatusPage Status page