maskouk: Arabic Dictionary for Collocations - python + sqlite
Arabic collocations library and data for Python, with an SQLite API
Developers: Taha Zerrouki: http://tahadz.com taha dot zerrouki at gmail dot com
Features | value |
---|---|
Authors | |
Release | 0.1 |
License | |
Tracker | |
Website | |
Source | |
Download | |
Feedbacks | |
Accounts | [@Twitter](https://twitter.com/linuxscout) [@Sourceforge](http://sourceforge.net/projects/maskouk/) |
Description
Maskouk is a database of Arabic collocations extracted from Wikipedia.
Source data: Arabic Wikipedia database of 2011-Jun-21.
Install
pip install maskouk-pysqlite
Usage
Import
>>> import pyarabic.araby as araby
>>> import maskouk.collocations as msk
>>> mydict = msk.CollocationClass()
Test if collocation exists in database
>>> wlist = [u"كرة", u"القدم"]
>>> # test if collocation exists
>>> results = mydict.is_collocated(wlist)
>>> print("input:", wlist)
>>> print("output:", results)
input: ['كرة', 'القدم']
output: كرة القدم
>>> wlist = [u"شمس", u"النهار"]
>>> results = mydict.is_collocated(wlist)
>>> print("input:", wlist)
>>> print("output:", results)
input: ['شمس', 'النهار']
output: False
Test if a word has collocations in database
>>> # get all collocations for a specific word
>>> word1 = u"كرة"
>>> results = mydict.is_collocated_word(word1)
>>> print("input:", word1)
>>> print("output:", results)
input: كرة
output: {'القدم': 'كُرَة الْقَدَمِ'}
>>>
>>> word = u"بيت"
>>> # get all collocations for a specific word
>>> results = mydict.is_collocated_word(word)
>>> print("input:", word)
>>> print("output:", results)
input: بيت
output: {'العدة': 'بَيْت الْعِدَّةِ', 'المستأجر': 'بَيْت الْمُسْتَأْجِرِ', 'المشتري': 'بَيْتِ الْمُشْتَرِي', 'الرجل': 'بَيْت الرَّجُلِ', 'البناء': 'بَيْت الْبِنَاءِ', 'الزوج': 'بَيْت الزَّوْجِ', 'المال': 'بيت المال', 'المقدس': 'بَيْت الْمَقْدِسِ', 'البائع': 'بَيْت الْبَائِعِ', 'الخلاء': 'بَيْت الْخَلَاءِ', 'الأب': 'بَيْت الْأَبِ', 'الله': 'بَيْت اللّهِ'}
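The values in the returned dictionary are vocalized (they carry diacritics), while the keys are plain words. The following is a minimal sketch, not part of maskouk, of how a vocalized value could be reduced to its unvocalized form using only the standard library. The helper name `strip_marks` is hypothetical; it assumes that removing every Unicode nonspacing mark (category `Mn`) is an acceptable way to drop Arabic diacritics. pyarabic also offers `araby.strip_tashkeel` for the same purpose.

```python
import unicodedata


def strip_marks(text):
    """Remove Unicode nonspacing marks (category Mn).

    Arabic short vowels, shadda and sukun all fall in this category,
    so the result is the unvocalized spelling. Hypothetical helper,
    not a maskouk function.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# Sample entry copied from the output shown above.
word = "كرة"
results = {"القدم": "كُرَة الْقَدَمِ"}

for second, vocalized in results.items():
    plain = strip_marks(vocalized)
    # The unvocalized collocation should be "<word> <second>".
    print(plain == word + " " + second)
```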
Detect collocation in a phrase
The result can be returned either as a list with each collocation merged into one item, or as separate word and tag lists.
>>> # detect collocations in phrase
>>> text = u"لعبنا مباراة كرة القدم في بيت المقدس"
>>> wordlist = araby.tokenize(text)
>>> results = mydict.ngramfinder(2, wordlist)
>>> print("input:", text)
>>> print("output:", results)
input: لعبنا مباراة كرة القدم في بيت المقدس
output: ['لعبنا', 'مباراة', 'كرة القدم', 'في', 'بيت المقدس']
>>> # detect collocations in phrase
>>> text = u"لعبنا مباراة كرة القدم في بيت المقدس"
>>> wordlist = araby.tokenize(text)
>>> results = mydict.lookup(wordlist)
>>> print("input:", text)
>>> print("output:", results)
input: لعبنا مباراة كرة القدم في بيت المقدس
output: (['لعبنا', 'مباراة', 'كُرَة', 'الْقَدَمِ', 'في', 'بَيْت', 'الْمَقْدِسِ'], ['CO', 'CO', 'CB', 'CI', 'CO', 'CB', 'CI'])
>>>
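The tagged output of `lookup` can be turned back into merged units. The sketch below assumes, based only on the sample output above, that `CB` marks the first word of a collocation, `CI` a following word inside it, and `CO` a word outside any collocation. The function `group_collocations` is a hypothetical helper, not part of maskouk.

```python
def group_collocations(words, tags):
    """Merge each CB word with the CI words that follow it.

    Assumed tag meanings (inferred from the sample output):
    CB = collocation begins, CI = inside a collocation, CO = outside.
    """
    units = []
    inside = False  # True while the last unit is an open collocation
    for word, tag in zip(words, tags):
        if tag == "CI" and inside:
            units[-1] += " " + word
        else:
            units.append(word)
            inside = (tag == "CB")
    return units


# Word and tag lists copied from the lookup output shown above.
words = ['لعبنا', 'مباراة', 'كُرَة', 'الْقَدَمِ', 'في', 'بَيْت', 'الْمَقْدِسِ']
tags = ['CO', 'CO', 'CB', 'CI', 'CO', 'CB', 'CI']
units = group_collocations(words, tags)
print(len(units))
```

A stray `CI` that does not follow a `CB` is kept as its own unit rather than merged into the previous word.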
Detect long collocations in a phrase
Some collocations are too long to be stored in a bigram database, for example “بسم الله الرحمن الرحيم”, “السلام عليكم ورحمة الله وبركاته” and “أهلا وسهلا بكم”.
>>> # get Long collocations
... text = u" قلت لهم السلام عليكم ورحمة الله تعالى وبركاته ثم رجعت"
>>> results = mydict.lookup4long_collocations(text)
>>> print("input:", text)
input: قلت لهم السلام عليكم ورحمة الله تعالى وبركاته ثم رجعت
>>> print("output:",results)
output: قلت لهم السّلامُ عَلَيكُمْ وَرَحْمَةُ اللهِ تَعَالَى وبركاته ثم رجعت
Detect candidate collocations in phrase
A candidate collocation does not exist in the database; this feature extracts candidates based on rules. It returns a rule code, which defaults to 100 (no collocation).
>>> text = u"ظهر رئيس الوزراء السيد عبد الملك بن عامر ومعه أمير دولة غرناطة ونهر النيل انطلاق السباق"
>>> wordlist = araby.tokenize(text)
>>> previous = "__"
>>> for wrd in wordlist:
...     wlist = [previous, wrd]
...     results = mydict.is_possible_collocation(wlist, lenght=2)
...     print("input:", wlist)
...     print("output:", results)
...     previous = wrd
...
input: ['__', 'ظهر']
output: 100
input: ['ظهر', 'رئيس']
output: 100
input: ['رئيس', 'الوزراء']
output: 100
input: ['الوزراء', 'السيد']
output: 20
input: ['السيد', 'عبد']
output: 100
input: ['عبد', 'الملك']
output: 15
input: ['الملك', 'بن']
output: 100
input: ['بن', 'عامر']
output: 15
input: ['عامر', 'ومعه']
output: 100
input: ['ومعه', 'أمير']
output: 100
input: ['أمير', 'دولة']
output: 100
input: ['دولة', 'غرناطة']
output: 10
input: ['غرناطة', 'ونهر']
output: 100
input: ['ونهر', 'النيل']
output: 100
input: ['النيل', 'انطلاق']
output: 100
input: ['انطلاق', 'السباق']
output: 100
>>>
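Since most pairs return the default code 100, it can be convenient to keep only the pairs that matched a rule. The following is a minimal sketch under stated assumptions: `candidate_bigrams` and `fake_rule` are hypothetical names, and `fake_rule` is a stand-in that hard-codes two results from the transcript above so the example runs without maskouk installed. The meaning of the individual rule codes (10, 15, 20) is not documented here, so the sketch only reports them.

```python
NO_COLLOCATION = 100  # default code, as stated in this README


def candidate_bigrams(tokens, rule_of, start="__"):
    """Return ((previous, word), code) for adjacent pairs that matched a rule.

    rule_of is any callable that takes a two-word list and returns a
    rule code; pairs scored NO_COLLOCATION are skipped.
    """
    found = []
    previous = start
    for token in tokens:
        pair = [previous, token]
        code = rule_of(pair)
        if code != NO_COLLOCATION:
            found.append((tuple(pair), code))
        previous = token
    return found


def fake_rule(pair):
    """Stand-in for the library call; values copied from the transcript."""
    known = {("دولة", "غرناطة"): 10, ("عبد", "الملك"): 15}
    return known.get(tuple(pair), NO_COLLOCATION)


tokens = ["عبد", "الملك", "بن"]
candidates = candidate_bigrams(tokens, fake_rule)
print(len(candidates))
```

With the library available, the stand-in could presumably be replaced by `lambda pair: mydict.is_possible_collocation(pair, lenght=2)`, which mirrors the call used in the loop above.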
Requirements
1. pyarabic
2. sqlite
Data Structure:
Collocations database
CREATE TABLE "collocations" (
"id" INTEGER PRIMARY KEY NOT NULL ,
"vocalized" VARCHAR,
"unvocalized" VARCHAR,
"rule" VARCHAR,
"category" VARCHAR,
"note" VARCHAR,
"first" VARCHAR,
"second" VARCHAR
);
CSV Structure:
id : id unique in the database
vocalized : vocalized collocation
unvocalized : unvocalized collocation
rule : the extraction rule number
category : collocation category
note :
first: first word
second: second word
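To illustrate how this schema can be queried, here is a minimal sketch using Python's standard `sqlite3` module with an in-memory database. It does not open the database shipped with maskouk. The single sample row is illustrative, the helper `find_collocation` is hypothetical, and the sketch assumes that the `first` and `second` columns hold unvocalized words.

```python
import sqlite3

# In-memory database with the schema described above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE "collocations" (
        "id" INTEGER PRIMARY KEY NOT NULL,
        "vocalized" VARCHAR,
        "unvocalized" VARCHAR,
        "rule" VARCHAR,
        "category" VARCHAR,
        "note" VARCHAR,
        "first" VARCHAR,
        "second" VARCHAR
    )
""")

# Illustrative row; rule, category and note are left NULL because
# this README does not show their values.
conn.execute(
    'INSERT INTO "collocations" ("vocalized", "unvocalized", "first", "second") '
    "VALUES (?, ?, ?, ?)",
    ("كُرَة الْقَدَمِ", "كرة القدم", "كرة", "القدم"),
)


def find_collocation(conn, first, second):
    """Return the vocalized form for a word pair, or None if absent."""
    row = conn.execute(
        'SELECT "vocalized" FROM "collocations" '
        'WHERE "first" = ? AND "second" = ?',
        (first, second),
    ).fetchone()
    return row[0] if row else None


print(find_collocation(conn, "كرة", "القدم") is not None)
```

Looking up by the `first` and `second` columns avoids string splitting at query time; an index on those two columns would be a natural addition for a large table.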