Arabic language processing toolkit
Arabic Natural Language Toolkit (ANLTK)
ANLTK is a set of Arabic natural language processing tools, developed with a focus on performance.
ANLTK is a C++ library with Python bindings.
Installation
For Python:
pip install anltk
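To confirm that the install succeeded, a small check like the following can help. It is only a sketch built on the standard library: `installed_version` is a hypothetical helper name, not part of anltk, and it reports whatever version metadata pip recorded.

```python
import importlib.util
from importlib import metadata


def installed_version(package: str):
    """Return the installed version of `package`, or None if it cannot be imported."""
    if importlib.util.find_spec(package) is None:
        return None
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        # Importable, but no distribution metadata (for example a local build).
        return "unknown"


if __name__ == "__main__":
    print("anltk:", installed_version("anltk") or "not installed")
```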
Building
Note: Currently tested only on Linux. Prebuilt Python wheels are available for Linux, Windows, and macOS on PyPI.
Dependencies:
- utfcpp, automatically downloaded.
- utf8proc, automatically downloaded.
- A C++ compiler that supports C++17.
- Python 3, meson, ninja
pip install meson
pip install ninja
git clone --recurse-submodules https://github.com/Abdullah-AlAttar/anltk.git \
&& cd anltk/ \
&& meson build --buildtype=release -Dbuild_tests=false \
&& cd build \
&& ninja \
&& cd ../ \
&& python3 setup.py install
Usage examples:
C++ API:
#include "anltk/anltk.hpp"
#include <iostream>
#include <string>
int main()
{
std::string ar_text = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ";
std::cout << anltk::transliterate(ar_text, anltk::CharMapping::AR2BW) << '\n';
// >bjd hwz HTy klmn sEfS qr$t vx* DZg
std::string text = "فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ.";
std::cout << anltk::remove_tashkeel(text) << '\n';
// فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
// The second parameter is a stop list; characters in this list won't be removed
std::cout << anltk::remove_non_alpha(text, " ") << '\n';
// فراشة ملونة تطير في البستان حلوة مهندمة تدهش الإنسان
}
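The stop-list idea used by `remove_non_alpha` above can be illustrated in plain Python. This is a sketch under stated assumptions, not anltk's implementation: `is_arabic_letter` and `keep_letters` are hypothetical names, and "letter" is assumed here to mean the basic Arabic letter ranges U+0621 to U+063A and U+0641 to U+064A, which may differ from what the library treats as alphabetic.

```python
def is_arabic_letter(ch: str) -> bool:
    """True for basic Arabic letters (assumed ranges, excluding tatweel and diacritics)."""
    cp = ord(ch)
    return 0x0621 <= cp <= 0x063A or 0x0641 <= cp <= 0x064A


def keep_letters(text: str, stop_list: str = "") -> str:
    """Drop every character that is neither an Arabic letter nor in stop_list."""
    keep = set(stop_list)
    return "".join(ch for ch in text if is_arabic_letter(ch) or ch in keep)


if __name__ == "__main__":
    sample = "حُلْوَةٌ، مُهَنْدَمَةٌ."
    # With " " in the stop list, spaces survive; diacritics and punctuation do not.
    print(keep_letters(sample, " "))
    # With an empty stop list, the space is removed as well.
    print(keep_letters(sample))
```

Passing the characters to preserve as a string keeps the call shape close to the C++ example, where `" "` is the stop list.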
Python API:
import anltk
ar = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"
bw = anltk.transliterate(ar, anltk.AR2BW)
print(bw)
# >bjd hwz HTy klmn sEfS qr$t vx* DZg
print(anltk.remove_tashkeel("فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ."))
# فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
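For readers who want to see what these two operations amount to, here is a pure-Python sketch that does not use anltk at all. The names `AR2BW_TABLE`, `to_buckwalter`, and `strip_tashkeel` are hypothetical, the table covers only the core Buckwalter mapping for U+0621 to U+063A and U+0640 to U+0652, and tashkeel is assumed to mean the marks U+064B to U+0652. The library may handle more characters and is far faster.

```python
# Core Buckwalter mapping, built from two contiguous Unicode ranges.
# U+0621..U+063A: hamza forms, alef, beh ... ghain
AR2BW_TABLE = {chr(0x0621 + i): c for i, c in enumerate("'|>&<}AbptvjHxd*rzs$SDTZEg")}
# U+0640..U+0652: tatweel, feh ... yeh, then the diacritics fathatan ... sukun
AR2BW_TABLE.update({chr(0x0640 + i): c for i, c in enumerate("_fqklmnhwYyFNKaui~o")})

# Diacritics (tashkeel) assumed for this sketch: U+064B through U+0652.
TASHKEEL = {chr(cp) for cp in range(0x064B, 0x0653)}


def to_buckwalter(text: str) -> str:
    """Map each known Arabic character to its Buckwalter symbol; pass others through."""
    return "".join(AR2BW_TABLE.get(ch, ch) for ch in text)


def strip_tashkeel(text: str) -> str:
    """Remove the diacritic marks listed in TASHKEEL."""
    return "".join(ch for ch in text if ch not in TASHKEEL)


if __name__ == "__main__":
    print(to_buckwalter("أبجد هوز"))
    # >bjd hwz
    print(strip_tashkeel("فَرَاشَةٌ"))
```

A character-by-character lookup is enough here because the Buckwalter scheme maps one Arabic character to one ASCII character.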
For a list of features, see Features.md.
Benchmarks
Processing a file containing 500,000 lines, 6,787,731 words, and 112,704,541 characters. The task is to remove diacritics or transliterate to Buckwalter.
Buckwalter transliteration
Method | Time
---|---
anltk python-api | 1.379 seconds
python camel_tools | 11.46 seconds
Remove Diacritics
Method | Time
---|---
anltk python-api | 0.989 seconds
python camel_tools | 4.892 seconds
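A timing harness along the following lines could be used to run a comparison of this kind on your own data. It is a sketch, not the script behind the numbers above: `time_per_line` is a hypothetical helper, and calling the function once per line is an assumption about how the file is processed.

```python
import time
from typing import Callable, Iterable, Tuple


def time_per_line(func: Callable[[str], str], lines: Iterable[str]) -> Tuple[float, int]:
    """Apply func to every line; return (elapsed seconds, number of lines)."""
    start = time.perf_counter()
    count = 0
    for line in lines:
        func(line)
        count += 1
    return time.perf_counter() - start, count


if __name__ == "__main__":
    # str.upper is only a stand-in so the sketch runs without anltk installed.
    # In a real run, pass e.g. anltk.remove_tashkeel and the lines of a corpus file.
    elapsed, n = time_per_line(str.upper, ["line one", "line two", "line three"])
    print(f"{n} lines in {elapsed:.6f} seconds")
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock, which suits short wall-clock measurements.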