An Arabic morphological analyzer and lexicon
Project description
# pyaramorph
*An Arabic morphological analyzer and lexicon*
## Introduction
**pyaramorph** is a morphological analyzer and lexicon for the Arabic
language. It is a loose port of the [Buckwalter Arabic Morphological
Analyzer Version
1.0](http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002L49),
though it does not implement all of that program’s functionality.
This software is supposed to provide quick successive analyses of single
words or short phrases. Buckwalter’s original Perl script only supported
input in the `cp1256` encoding, and I really did not want to refit it
for UTF-8. (Also, given how long it has been since I worked with Perl
and my preference for Python, it seemed worth it to do a Python
rewrite.) The Java port of the same script,
[AraMorph](http://www.nongnu.org/aramorph/), does accept UTF-8, but it
only processes specified input files, and its dictionary loading is
quite slow. It’s great for analyzing full texts, but not so much for
interactive analysis.
That’s why I wrote this port. The script itself is quite simple – Tim
Buckwalter really did all the hard work by putting together the
dictionary and table files – so all credit for the functionality
provided by this program should go to him! I have simply re-written the
program to suit my own needs.
## Requirements
1. [python](http://www.python.org/). Currently you need Python 3.
2. A terminal emulator with UTF-8 and BiDi support. I use
[mlterm](http://mlterm.sourceforge.net/) with
[unifont](http://www.unifoundry.com/index.html) or
[DejaVu Sans Mono](http://dejavu-fonts.org/wiki/Main_Page).
(Here are some old though perhaps useful
[setup instructions](http://lists.arabeyes.org/archives/general/2004/February/msg00004.html).)
3. Ability to type in UTF-8 Arabic text. Linux/Unix users can try the
Arabic layout included in my
[Classical Input Methods for M17N](https://bitbucket.org/alexlee/m17n-classical),
which are intended for use with
[IBus](https://github.com/ibus/ibus/wiki).
## Installation
You can install using `pip`, or from source with `python setup.py
install`.
## Usage
Once you have the software installed and your BiDi-enabled, UTF-8
capable terminal up and running, you simply need to run the `pyaramorph`
command. At the prompt, enter an Arabic word or phrase, using Unicode.
Words not written in the Arabic script will be ignored.
The session output below should give you an idea of how it works:
alexlee@sartorius:~$ pyaramorph
loading dictPrefixes ... loaded 299 entries
loading dictStems ... loaded 38600 lemmas and 82158 entries
loading dictSuffixes ... loaded 618 entries
Unicode Arabic Morphological Analyzer (press ctrl-d to exit)
$ كتب كتابا في المكتب
analysis for: كتب ktb
solution: (كَتَبَ kataba) [katab-u_1]
pos: katab/VERB_PERFECT+a/PVSUFF_SUBJ:3MS
gloss: ___ + write + he/it <verb>
solution: (كُتِبَ kutiba) [katab-u_1]
pos: kutib/VERB_PERFECT+a/PVSUFF_SUBJ:3MS
gloss: ___ + be written;be fated;be destined + he/it <verb>
solution: (كُتُب kutub) [kitAb_1]
pos: kutub/NOUN
gloss: ___ + books + ___
analysis for: كتابا ktAbA
solution: (كِتاباً kitAbAF) [kitAb_1]
pos: kitAb/NOUN+AF/NSUFF_MASC_SG_ACC_INDEF
gloss: ___ + book + [acc.indef.]
solution: (كِتابا kitAbA) [kitAb_1]
pos: kitAb/NOUN+A/NSUFF_MASC_DU_NOM_POSS
gloss: ___ + book + two
solution: (كُتّاباً kut~AbAF) [kut~Ab_1]
pos: kut~Ab/NOUN+AF/NSUFF_MASC_SG_ACC_INDEF
gloss: ___ + kuttab (village school);Quran school + [acc.indef.]
solution: (كُتّاباً kut~AbAF) [kAtib_1]
pos: kut~Ab/NOUN+AF/NSUFF_MASC_SG_ACC_INDEF
gloss: ___ + authors;writers + [acc.indef.]
analysis for: في fy
solution: (فِي fiy) [fiy_1]
pos: fiy/PREP
gloss: ___ + in + ___
solution: (فِيَّ fiy~a) [fiy_1]
pos: fiy/PREP+~a/PRON_1S
gloss: ___ + in + me
solution: (فِي fiy) [fiy_2]
pos: Viy/ABBREV
gloss: ___ + V. + ___
analysis for: المكتب Almktb
solution: (المَكْتَب Almakotab) [makotab_1]
pos: Al/DET+makotab/NOUN
gloss: the + bureau;office;department + ___
$
## Todo
Diacritics are ignored for now. It would be nice to use the
user-supplied diacritics to filter through the generated solutions. That
way if you enter something like `dar~ast` (دَرَّست), it won’t return any
results from the `daras` (دَرَس) root.
In his original Perl script, Buckwalter applies a number of spelling
substitutions if a given word does not generate any solutions. This
functionality should be easy to add, but I didn’t get around to it.
A simple GUI would be nice, for a better choice of fonts (like the
[SIL Arabic fonts](http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=ArabicFonts))
and for Windows support.
## Contact
If you have any comments, suggestions, fixes, contributions, etc., please
contact Alex Lee (alexlee at fastmail net). Thanks!
*An Arabic morphological analyzer and lexicon*
## Introduction
**pyaramorph** is a morphological analyzer and lexicon for the Arabic
language. It is a loose port of the [Buckwalter Arabic Morphological
Analyzer Version
1.0](http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002L49),
though it does not implement all of that program’s functionality.
This software is supposed to provide quick successive analyses of single
words or short phrases. Buckwalter’s original Perl script only supported
input in the `cp1256` encoding, and I really did not want to refit it
for UTF-8. (Also, given how long it has been since I worked with Perl
and my preference for Python, it seemed worth it to do a Python
rewrite.) The Java port of the same script,
[AraMorph](http://www.nongnu.org/aramorph/), does accept UTF-8, but it
only processes specified input files, and its dictionary loading is
quite slow. It’s great for analyzing full texts, but not so much for
interactive analysis.
That’s why I wrote this port. The script itself is quite simple – Tim
Buckwalter really did all the hard work by putting together the
dictionary and table files – so all credit for the functionality
provided by this program should go to him! I have simply re-written the
program to suit my own needs.
## Requirements
1. [python](http://www.python.org/). Currently you need Python 3.
2. A terminal emulator with UTF-8 and BiDi support. I use
[mlterm](http://mlterm.sourceforge.net/) with
[unifont](http://www.unifoundry.com/index.html) or
[DejaVu Sans Mono](http://dejavu-fonts.org/wiki/Main_Page).
(Here are some old though perhaps useful
[setup instructions](http://lists.arabeyes.org/archives/general/2004/February/msg00004.html).)
3. Ability to type in UTF-8 Arabic text. Linux/Unix users can try the
Arabic layout included in my
[Classical Input Methods for M17N](https://bitbucket.org/alexlee/m17n-classical),
which are intended for use with
[IBus](https://github.com/ibus/ibus/wiki).
## Installation
You can install using `pip`, or from source with `python setup.py
install`.
## Usage
Once you have the software installed and your BiDi-enabled, UTF-8
capable terminal up and running, you simply need to run the `pyaramorph`
command. At the prompt, enter an Arabic word or phrase, using Unicode.
Words not written in the Arabic script will be ignored.
The session output below should give you an idea of how it works:
alexlee@sartorius:~$ pyaramorph
loading dictPrefixes ... loaded 299 entries
loading dictStems ... loaded 38600 lemmas and 82158 entries
loading dictSuffixes ... loaded 618 entries
Unicode Arabic Morphological Analyzer (press ctrl-d to exit)
$ كتب كتابا في المكتب
analysis for: كتب ktb
solution: (كَتَبَ kataba) [katab-u_1]
pos: katab/VERB_PERFECT+a/PVSUFF_SUBJ:3MS
gloss: ___ + write + he/it <verb>
solution: (كُتِبَ kutiba) [katab-u_1]
pos: kutib/VERB_PERFECT+a/PVSUFF_SUBJ:3MS
gloss: ___ + be written;be fated;be destined + he/it <verb>
solution: (كُتُب kutub) [kitAb_1]
pos: kutub/NOUN
gloss: ___ + books + ___
analysis for: كتابا ktAbA
solution: (كِتاباً kitAbAF) [kitAb_1]
pos: kitAb/NOUN+AF/NSUFF_MASC_SG_ACC_INDEF
gloss: ___ + book + [acc.indef.]
solution: (كِتابا kitAbA) [kitAb_1]
pos: kitAb/NOUN+A/NSUFF_MASC_DU_NOM_POSS
gloss: ___ + book + two
solution: (كُتّاباً kut~AbAF) [kut~Ab_1]
pos: kut~Ab/NOUN+AF/NSUFF_MASC_SG_ACC_INDEF
gloss: ___ + kuttab (village school);Quran school + [acc.indef.]
solution: (كُتّاباً kut~AbAF) [kAtib_1]
pos: kut~Ab/NOUN+AF/NSUFF_MASC_SG_ACC_INDEF
gloss: ___ + authors;writers + [acc.indef.]
analysis for: في fy
solution: (فِي fiy) [fiy_1]
pos: fiy/PREP
gloss: ___ + in + ___
solution: (فِيَّ fiy~a) [fiy_1]
pos: fiy/PREP+~a/PRON_1S
gloss: ___ + in + me
solution: (فِي fiy) [fiy_2]
pos: Viy/ABBREV
gloss: ___ + V. + ___
analysis for: المكتب Almktb
solution: (المَكْتَب Almakotab) [makotab_1]
pos: Al/DET+makotab/NOUN
gloss: the + bureau;office;department + ___
$
## Todo
Diacritics are ignored for now. It would be nice to use the
user-supplied diacritics to filter through the generated solutions. That
way if you enter something like `dar~ast` (دَرَّست), it won’t return any
results from the `daras` (دَرَس) root.
In his original Perl script, Buckwalter applies a number of spelling
substitutions if a given word does not generate any solutions. This
functionality should be easy to add, but I didn’t get around to it.
A simple GUI would be nice, for a better choice of fonts (like the
[SIL Arabic fonts](http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=ArabicFonts))
and for Windows support.
## Contact
If you have any comments, suggestions, fixes, contributions, etc., please
contact Alex Lee (alexlee at fastmail net). Thanks!
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
pyaramorph-0.2.tar.gz
(1.1 MB
view details)
File details
Details for the file pyaramorph-0.2.tar.gz
.
File metadata
- Download URL: pyaramorph-0.2.tar.gz
- Upload date:
- Size: 1.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 1903e4a4b2e07d8b768b23590fb030b91fe310eac2b4a38bba5a98254b3520b2 |
|
MD5 | a8e92a57a8766cc50b4408a72de9df24 |
|
BLAKE2b-256 | bd7ddb168a32e7a656c60c8a5b11917bb4b9ae33191bb4dc532c39cacdf93df5 |