Arabic language processing toolkit
Arabic Natural Language Toolkit (ANLTK)
ANLTK is a set of Arabic natural language processing tools, developed with a focus on performance.
ANLTK is a C++ library with Python bindings.
Installation
For Python:
pip install pybind11
pip install anltk
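To confirm that the package is visible to the interpreter after installation, a small check like the following can help. This is only a sketch: `is_importable` is a hypothetical helper name, not part of anltk, and it relies solely on the standard library.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if the interpreter can locate a top-level module with this name."""
    # find_spec returns None when no such top-level module is found.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Reports whether the pip install above made anltk available.
    print("anltk importable:", is_importable("anltk"))
```

If this reports False, the package was probably installed into a different Python environment than the one running the script.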
Building
Note: Currently only tested on Linux.
The library depends on https://github.com/nemtrif/utfcpp.git, which is cloned automatically.
You also need a modern C++ compiler that supports C++17.
Meson and Ninja must also be installed.
Both can be installed with pip:
pip install meson
pip install ninja
git clone --recurse-submodules https://github.com/Abdullah-AlAttar/anltk.git \
&& cd anltk/ \
&& meson build --buildtype=release -Dbuild_tests=false \
&& cd build \
&& ninja \
&& cd ../ \
&& python3 setup.py install
Usage Examples
C++ API
#include "anltk/anltk.hpp"
#include <iostream>
#include <string>
int main()
{
std::string ar_text = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ";
std::cout << anltk::transliterate(ar_text, anltk::CharMapping::AR2BW) << '\n';
// >bjd hwz HTy klmn sEfS qr$t vx* DZg
std::string text = "فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ.";
std::cout << anltk::remove_tashkeel(text) << '\n';
// فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
// The second parameter is a stop list; characters in this list won't be removed
std::cout << anltk::remove_non_alpha(text, " ") << '\n';
// فراشة ملونة تطير في البستان حلوة مهندمة تدهش الإنسان
}
Python API
import anltk
ar = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"
bw = anltk.transliterate(ar, anltk.AR2BW)
print(bw)
# >bjd hwz HTy klmn sEfS qr$t vx* DZg
print(anltk.remove_tashkeel("فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ."))
# فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
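For readers who want to see what these two operations do conceptually, here is a minimal pure-Python sketch. It is not how anltk implements them (the library is written in C++). The function names `transliterate_sketch` and `remove_tashkeel_sketch` are hypothetical, the Buckwalter table is deliberately partial (only the letters from the example above), and the sketch assumes "tashkeel" means the eight marks U+064B through U+0652, which may be narrower than what the library removes.

```python
import re

# Partial Arabic-to-Buckwalter table: only the letters used in the example above.
AR2BW_SKETCH = {
    "أ": ">", "ب": "b", "ج": "j", "د": "d",
    "ه": "h", "و": "w", "ز": "z",
    "ح": "H", "ط": "T", "ي": "y",
    "ك": "k", "ل": "l", "م": "m", "ن": "n",
    "س": "s", "ع": "E", "ف": "f", "ص": "S",
    "ق": "q", "ر": "r", "ش": "$", "ت": "t",
    "ث": "v", "خ": "x", "ذ": "*",
    "ض": "D", "ظ": "Z", "غ": "g",
}
_TABLE = str.maketrans(AR2BW_SKETCH)

# Assumption: the diacritics are the eight marks from fathatan (U+064B) to sukun (U+0652).
_TASHKEEL = re.compile("[\u064B-\u0652]")


def transliterate_sketch(text: str) -> str:
    """Replace each letter found in the table; leave every other character unchanged."""
    return text.translate(_TABLE)


def remove_tashkeel_sketch(text: str) -> str:
    """Delete the assumed diacritic marks and keep everything else."""
    return _TASHKEEL.sub("", text)


if __name__ == "__main__":
    print(transliterate_sketch("أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"))
    print(remove_tashkeel_sketch("حُلْوَةٌ مُهَنْدَمَةٌ"))
```

Both operations are per-character passes over the input, which is the kind of work that benefits from a compiled implementation.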
For a list of features, see Features.md.
Benchmarks
Processing a file containing 500,000 lines, 6,787,731 words, and 112,704,541 characters. The task is to remove diacritics or transliterate to Buckwalter.
Buckwalter transliteration
Method | Time
---|---
anltk python-api | 1.379 seconds
python (camel_tools) | 11.46 seconds
Remove Diacritics
Method | Time
---|---
anltk python-api | 0.989 seconds
python (camel_tools) | 4.892 seconds
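The README does not say how these timings were collected. One way to run a comparable measurement is sketched below, assuming the function under test is applied line by line and wall-clock time is measured with `time.perf_counter`. The helper `time_per_line` is a hypothetical name, and the workload in the demo is a stand-in rather than the benchmark file described above.

```python
import time
from typing import Callable, Iterable


def time_per_line(func: Callable[[str], object], lines: Iterable[str]) -> float:
    """Return the wall-clock seconds spent applying func to every line."""
    start = time.perf_counter()
    for line in lines:
        func(line)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in workload; replace str.upper with the function to measure
    # (for example anltk.remove_tashkeel) and the list with lines read from a file.
    sample = ["فَرَاشَةٌ مُلَوَّنَةٌ"] * 10_000
    elapsed = time_per_line(str.upper, sample)
    print(f"{elapsed:.3f} seconds")
```

Running each candidate several times and keeping the best result reduces noise from disk caching and other processes.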