Arabic language processing toolkit
Arabic Natural Language Toolkit (ANLTK)
ANLTK is a set of Arabic natural language processing tools, developed with a focus on performance.
ANLTK is a C++ library with Python bindings.
Installation
For Python:
pip install pybind11
pip install anltk
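After installing, one quick way to confirm that the package is visible to the interpreter is to ask the import system for it. This is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of ANLTK.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Reports whether the anltk wheel was picked up by this Python environment.
    print("anltk importable:", is_installed("anltk"))
```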
Building
Note: Building from source is currently only tested on Linux. Prebuilt Python wheels are available on PyPI for Linux, Windows, and macOS.
Dependencies:
- utfcpp, automatically downloaded.
- utf8proc, automatically downloaded.
- A C++ compiler that supports C++17.
- Python 3, meson, ninja
pip install meson
pip install ninja
git clone --recurse-submodules https://github.com/Abdullah-AlAttar/anltk.git \
&& cd anltk/ \
&& meson build --buildtype=release -Dbuild_tests=false \
&& cd build \
&& ninja \
&& cd ../ \
&& python3 setup.py install
Usage Examples:
C++ API:
#include "anltk/anltk.hpp"
#include <iostream>
#include <string>
int main()
{
std::string ar_text = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ";
std::cout << anltk::transliterate(ar_text, anltk::CharMapping::AR2BW) << '\n';
// >bjd hwz HTy klmn sEfS qr$t vx* DZg
std::string text = "فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ.";
std::cout << anltk::remove_tashkeel(text) << '\n';
// فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
// The second argument is a stop list; characters in this list won't be removed
std::cout << anltk::remove_non_alpha(text, " ") << '\n';
// فراشة ملونة تطير في البستان حلوة مهندمة تدهش الإنسان
}
Python API:
import anltk
ar = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"
bw = anltk.transliterate(ar, anltk.AR2BW)
print(bw)
# >bjd hwz HTy klmn sEfS qr$t vx* DZg
print(anltk.remove_tashkeel("فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ."))
# فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
For a list of features, see Features.md.
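To make the two operations in the examples above concrete, here is a pure-Python sketch of what they do at the character level. It is an illustration only and not how ANLTK implements them: the names `to_buckwalter`, `strip_tashkeel`, `AR2BW_TABLE`, and `TASHKEEL_TABLE` are hypothetical, the table covers only the core Buckwalter symbols, and tashkeel is assumed to mean the code points U+064B through U+0652.

```python
# Illustrative sketch only; these helpers are NOT part of the anltk API.

# Buckwalter symbols for U+0621..U+063A, listed in code-point order.
_FIRST_BLOCK = "'|>&<}AbptvjHxd*rzs$SDTZEg"
# Buckwalter symbols for U+0641..U+0652 (letters, then diacritics).
_SECOND_BLOCK = "fqklmnhwYyFNKaui~o"

AR2BW_TABLE = {0x0621 + i: ch for i, ch in enumerate(_FIRST_BLOCK)}
AR2BW_TABLE.update({0x0641 + i: ch for i, ch in enumerate(_SECOND_BLOCK)})
# Tatweel, superscript alef, and alef wasla sit outside the two runs above.
AR2BW_TABLE.update({0x0640: "_", 0x0670: "`", 0x0671: "{"})

# Mapping a code point to None makes str.translate delete it.
TASHKEEL_TABLE = {cp: None for cp in range(0x064B, 0x0653)}


def to_buckwalter(text: str) -> str:
    """Replace each mapped Arabic character; leave everything else untouched."""
    return text.translate(AR2BW_TABLE)


def strip_tashkeel(text: str) -> str:
    """Drop the diacritics U+064B..U+0652 and keep the base letters."""
    return text.translate(TASHKEEL_TABLE)


if __name__ == "__main__":
    # The first word of the diacritized sentence used in the examples above.
    word = "\u0641\u064e\u0631\u064e\u0627\u0634\u064e\u0629\u064c"
    print(strip_tashkeel(word))
    print(to_buckwalter(strip_tashkeel(word)))
```

Both helpers are single table lookups per character, which is why this kind of task benefits so much from a compiled implementation when the input runs to millions of lines.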
Benchmarks
Processing a file containing 500,000 lines, 6,787,731 words, and 112,704,541 characters. The tasks are removing diacritics and transliterating to Buckwalter.
Buckwalter transliteration
Method | Time
---|---
anltk python-api | 1.379 seconds
python camel_tools | 11.46 seconds
Remove Diacritics
Method | Time
---|---
anltk python-api | 0.989 seconds
python camel_tools | 4.892 seconds
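The README does not say how these timings were collected. As a rough sketch of one way to reproduce a comparison like this, the helper below measures the wall-clock time of applying a function to every line of a corpus. The name `time_per_line` is hypothetical, and the sample data and `str.upper` are stand-ins; in a real run you would pass the lines of your own file and a function such as `anltk.remove_tashkeel`.

```python
import time
from typing import Callable, Iterable


def time_per_line(func: Callable[[str], str], lines: Iterable[str]) -> float:
    """Return the wall-clock seconds spent calling func on every line."""
    start = time.perf_counter()
    for line in lines:
        func(line)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in corpus and function; replace with real data and the call under test.
    sample = ["example line"] * 100_000
    elapsed = time_per_line(str.upper, sample)
    print(f"{elapsed:.3f} seconds")
```

Reading the file into memory before starting the timer keeps disk I/O out of the measurement, so the numbers reflect the processing itself.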