
maskouk: Arabic Dictionary for Collocations - python + sqlite

Project description

Arabic collocations library and data for Python + SQLite API


Developers: Taha Zerrouki, http://tahadz.com, taha dot zerrouki at gmail dot com

| Features | Value |
|---|---|
| Authors | Authors.md |
| Release | 0.1 |
| License | GPL |
| Tracker | linuxscout/maskouk/Issues |
| Website | http://maskouk.sourceforge.net |
| Source | Github |
| Download | sourceforge |
| Feedbacks | Comments |
| Accounts | [@Twitter](https://twitter.com/linuxscout) [@Sourceforge](http://sourceforge.net/projects/maskouk/) |

Description

Maskouk is a database of Arabic collocations extracted from Wikipedia.

Source data: Arabic Wikipedia database, 2011-Jun-21.

Install

pip install maskouk-pysqlite

Usage

Import

>>> import pyarabic.araby as araby
>>> import maskouk.collocations as msk
>>> mydict = msk.CollocationClass()

Test if collocation exists in database

>>> wlist = [u"كرة", u"القدم"]
>>> # test if collocation exists
>>> results = mydict.is_collocated(wlist)
>>> print("input:", wlist)
>>> print("output:", results)
input: ['كرة', 'القدم']
output: كرة القدم
>>> wlist = [u"شمس", u"النهار"]
>>> results = mydict.is_collocated(wlist)
>>> print("input:", wlist)
>>> print("output:", results)
input: ['شمس', 'النهار']
output: False

Test if a word has collocations in database

>>> # get all collocations for a specific word
>>> word1 = u"كرة"
>>> results = mydict.is_collocated_word(word1)
>>> print("input:", word1)
>>> print("output:", results)
input: كرة
output: {'القدم': 'كُرَة الْقَدَمِ'}
>>> word = u"بيت"
>>> # get all collocations for a specific word
>>> results = mydict.is_collocated_word(word)
>>> print("input:", word)
>>> print("output:", results)
input: بيت
output: {'العدة': 'بَيْت الْعِدَّةِ', 'المستأجر': 'بَيْت الْمُسْتَأْجِرِ', 'المشتري': 'بَيْتِ الْمُشْتَرِي', 'الرجل': 'بَيْت الرَّجُلِ', 'البناء': 'بَيْت الْبِنَاءِ', 'الزوج': 'بَيْت الزَّوْجِ', 'المال': 'بيت المال', 'المقدس': 'بَيْت الْمَقْدِسِ', 'البائع': 'بَيْت الْبَائِعِ', 'الخلاء': 'بَيْت الْخَلَاءِ', 'الأب': 'بَيْت الْأَبِ', 'الله': 'بَيْت اللّهِ'}

Detect collocation in a phrase

Results can be returned either as separated lists or as tagged forms.

>>> # detect collocations in a phrase, returned as a list of merged tokens
>>> text = u"لعبنا مباراة كرة القدم في بيت المقدس"
>>> wordlist = araby.tokenize(text)
>>> results = mydict.ngramfinder(2, wordlist)
>>> print("input:", text)
>>> print("output:", results)
input: لعبنا مباراة كرة القدم في بيت المقدس
output: ['لعبنا', 'مباراة', 'كرة القدم', 'في', 'بيت المقدس']
>>> # detect collocations in a phrase, returned as tokens plus a tag list
>>> text = u"لعبنا مباراة كرة القدم في بيت المقدس"
>>> wordlist = araby.tokenize(text)
>>> results = mydict.lookup(wordlist)
>>> print("input:", text)
>>> print("output:", results)
input: لعبنا مباراة كرة القدم في بيت المقدس
output: (['لعبنا', 'مباراة', 'كُرَة', 'الْقَدَمِ', 'في', 'بَيْت', 'الْمَقْدِسِ'], ['CO', 'CO', 'CB', 'CI', 'CO', 'CB', 'CI'])
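The tagged form can be folded back into phrases with a small helper. The following is a minimal sketch that does not call maskouk at all; it assumes, based only on the sample output above, that `CB` opens a collocation, `CI` continues it, and `CO` marks a word outside any collocation. The function name `group_tagged` is hypothetical.

```python
def group_tagged(tokens, tags):
    """Merge tokens into phrases using begin/inside tags.

    Assumption (inferred from the sample output, not from library docs):
    'CB' starts a collocation, 'CI' continues it, anything else stands alone.
    """
    phrases = []
    in_collocation = False
    for token, tag in zip(tokens, tags):
        if tag == "CI" and in_collocation:
            # Continue the collocation opened by the previous 'CB'.
            phrases[-1] += " " + token
        else:
            phrases.append(token)
            in_collocation = (tag == "CB")
    return phrases


# Token and tag lists copied from the lookup() output shown above.
tokens = ['لعبنا', 'مباراة', 'كُرَة', 'الْقَدَمِ', 'في', 'بَيْت', 'الْمَقْدِسِ']
tags = ['CO', 'CO', 'CB', 'CI', 'CO', 'CB', 'CI']
phrases = group_tagged(tokens, tags)
print(phrases)
```

A stray `CI` that does not follow a `CB` is kept as a separate token rather than glued to the previous word.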

Detect long collocations in a phrase

Some collocations are too long to be stored in a bigram database, such as “بسم الله الرحمن الرحيم”, “السلام عليكم ورحمة الله وبركاته” and “أهلا وسهلا بكم”.

>>> # get long collocations
>>> text = u" قلت لهم السلام عليكم ورحمة الله تعالى وبركاته ثم رجعت"
>>> results = mydict.lookup4long_collocations(text)
>>> print("input:", text)
input:  قلت لهم السلام عليكم ورحمة الله تعالى وبركاته ثم رجعت
>>> print("output:", results)
output:  قلت لهم السّلامُ عَلَيكُمْ وَرَحْمَةُ اللهِ تَعَالَى وبركاته ثم رجعت

Detect candidate collocations in phrase

A candidate collocation is a word pair that does not exist in the database; this feature extracts such collocations based on rules. It returns a rule code, with 100 as the default (no collocation).

>>> text = u"ظهر رئيس الوزراء السيد عبد الملك بن عامر ومعه أمير دولة غرناطة ونهر النيل انطلاق السباق"
>>> wordlist = araby.tokenize(text)
>>> previous = "__"
>>> for wrd in wordlist:
...     wlist = [previous, wrd]
...     # note: the keyword is spelled "lenght" in the API
...     results = mydict.is_possible_collocation(wlist, lenght = 2)
...     print("input:", wlist)
...     print("output:", results)
...     previous = wrd
...
input: ['__', 'ظهر']
output: 100
input: ['ظهر', 'رئيس']
output: 100
input: ['رئيس', 'الوزراء']
output: 100
input: ['الوزراء', 'السيد']
output: 20
input: ['السيد', 'عبد']
output: 100
input: ['عبد', 'الملك']
output: 15
input: ['الملك', 'بن']
output: 100
input: ['بن', 'عامر']
output: 15
input: ['عامر', 'ومعه']
output: 100
input: ['ومعه', 'أمير']
output: 100
input: ['أمير', 'دولة']
output: 100
input: ['دولة', 'غرناطة']
output: 10
input: ['غرناطة', 'ونهر']
output: 100
input: ['ونهر', 'النيل']
output: 100
input: ['النيل', 'انطلاق']
output: 100
input: ['انطلاق', 'السباق']
output: 100
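Since most pairs return the default code, it is often convenient to keep only the pairs that matched a rule. The sketch below is one possible way to do that; `candidate_bigrams` and `sample_rule_code` are hypothetical names, and the scoring function is passed in as a callable so the example runs without the database. With maskouk installed, the callable could presumably be `lambda pair: mydict.is_possible_collocation(pair, lenght=2)`.

```python
NO_COLLOCATION = 100  # default rule code described above


def candidate_bigrams(tokens, rule_code):
    """Return ((first, second), code) for adjacent pairs that matched a rule.

    rule_code is any callable that takes a two-word list and returns an int.
    """
    found = []
    for first, second in zip(tokens, tokens[1:]):
        code = rule_code([first, second])
        if code != NO_COLLOCATION:
            found.append(((first, second), code))
    return found


# Stand-in scorer: codes copied from the sample output above, so the
# sketch does not depend on the maskouk database being available.
sample_codes = {("عبد", "الملك"): 15, ("دولة", "غرناطة"): 10}


def sample_rule_code(pair):
    return sample_codes.get(tuple(pair), NO_COLLOCATION)


tokens = ["السيد", "عبد", "الملك", "بن"]
candidates = candidate_bigrams(tokens, sample_rule_code)
print(candidates)
```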

Requirements

1. pyarabic
2. sqlite

Data Structure:

Collocations database

CREATE TABLE "collocations" (
    "id" INTEGER PRIMARY KEY NOT NULL,
    "vocalized" VARCHAR,
    "unvocalized" VARCHAR,
    "rule" VARCHAR,
    "category" VARCHAR,
    "note" VARCHAR,
    "first" VARCHAR,
    "second" VARCHAR
);
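To make the schema concrete, here is a minimal sketch that builds the table in an in-memory SQLite database with Python's standard `sqlite3` module and looks up collocations by their first word. It is not necessarily how maskouk queries its own database; the helper `find_by_first` is hypothetical, and the single row reuses values from the usage output above with the remaining columns left empty.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "collocations" ('
    '"id" INTEGER PRIMARY KEY NOT NULL, '
    '"vocalized" VARCHAR, "unvocalized" VARCHAR, "rule" VARCHAR, '
    '"category" VARCHAR, "note" VARCHAR, "first" VARCHAR, "second" VARCHAR)'
)

# Illustrative row: (vocalized, unvocalized, first, second).
row = ("كُرَة الْقَدَمِ", "كرة القدم", "كرة", "القدم")
conn.execute(
    'INSERT INTO "collocations" ("vocalized", "unvocalized", "first", "second") '
    "VALUES (?, ?, ?, ?)",
    row,
)


def find_by_first(connection, first_word):
    """Map each second word to the vocalized collocation for first_word."""
    cursor = connection.execute(
        'SELECT "second", "vocalized" FROM "collocations" WHERE "first" = ?',
        (first_word,),
    )
    return dict(cursor.fetchall())


result = find_by_first(conn, "كرة")
print(result)
```

The column names are double-quoted in the queries, as in the schema, so they cannot clash with SQL keywords.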

CSV Structure:

  1. id: unique identifier in the database

  2. vocalized: vocalized collocation

  3. unvocalized: unvocalized collocation

  4. rule: the extraction rule number

  5. category: collocation category

  6. note:

  7. first: first word

  8. second: second word
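A file with these eight columns can be read with Python's standard `csv` module. The sketch below makes two assumptions that the text does not confirm: the file is tab-delimited and has no header row. The function `read_collocations` and the sample line are hypothetical; adjust `delimiter` to match the actual data file.

```python
import csv
import io

# Column order as listed above.
FIELDS = ["id", "vocalized", "unvocalized", "rule",
          "category", "note", "first", "second"]


def read_collocations(handle, delimiter="\t"):
    """Read rows into dicts keyed by the column names above.

    Assumes no header row and a tab delimiter (both unverified).
    """
    reader = csv.DictReader(handle, fieldnames=FIELDS, delimiter=delimiter)
    return list(reader)


# One illustrative line; rule, category and note are left empty.
sample_line = "\t".join(
    ["1", "كُرَة الْقَدَمِ", "كرة القدم", "", "", "", "كرة", "القدم"]
) + "\n"
rows = read_collocations(io.StringIO(sample_line))
print(rows[0]["first"], rows[0]["second"])
```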
