Arabic language processing toolkit
Arabic Natural Language Toolkit (ANLTK)
ANLTK is a set of Arabic natural language processing tools, developed with a focus on performance.
ANLTK is a C++ library with Python bindings.
Installation
For Python:
pip install pybind11
pip install anltk
Building
Note: Building is currently tested only on Linux. Prebuilt Python wheels are available for Linux, Windows, and macOS on [PyPI](https://pypi.org/project/anltk/).
Dependencies:
- utfcpp, automatically downloaded.
- utf8proc, automatically downloaded.
- A C++ compiler that supports C++17.
- Python 3, Meson, Ninja
pip install meson
pip install ninja
git clone --recurse-submodules https://github.com/Abdullah-AlAttar/anltk.git \
&& cd anltk/ \
&& meson build --buildtype=release -Dbuild_tests=false \
&& cd build \
&& ninja \
&& cd ../ \
&& python3 setup.py install
Usage Examples:
C++ API:
#include "anltk/anltk.hpp"
#include <iostream>
#include <string>
int main()
{
std::string ar_text = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ";
std::cout << anltk::transliterate(ar_text, anltk::CharMapping::AR2BW) << '\n';
// >bjd hwz HTy klmn sEfS qr$t vx* DZg
std::string text = "فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ.";
std::cout << anltk::remove_tashkeel(text) << '\n';
// فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
// The second parameter is a stop list; characters in this list won't be removed
std::cout << anltk::remove_non_alpha(text, " ") << '\n';
// فراشة ملونة تطير في البستان حلوة مهندمة تدهش الإنسان
}
Python API
import anltk
ar = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"
bw = anltk.transliterate(ar, anltk.AR2BW)
print(bw)
# >bjd hwz HTy klmn sEfS qr$t vx* DZg
print(anltk.remove_tashkeel("فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ."))
# فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
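To make the transliteration step concrete, here is a minimal pure-Python sketch that derives a partial Arabic-to-Buckwalter table from the sample input and output shown above. It is not ANLTK's implementation: `to_buckwalter` and `AR2BW_TABLE` are hypothetical names, and the table covers only the 28 letters in the sample, so other characters pass through unchanged.

```python
import unicodedata

# Sample pair taken from the example above. NFC keeps each letter a single
# code point so the two strings line up character by character.
AR = unicodedata.normalize("NFC", "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ")
BW = ">bjd hwz HTy klmn sEfS qr$t vx* DZg"
assert len(AR) == len(BW)

# Partial table (hypothetical name): only the letters present in the sample.
AR2BW_TABLE = str.maketrans(dict(zip(AR, BW)))


def to_buckwalter(text: str) -> str:
    """Map each known Arabic letter to its Buckwalter symbol; pass others through."""
    return unicodedata.normalize("NFC", text).translate(AR2BW_TABLE)


print(to_buckwalter(AR))
```

A full table would also need the remaining hamza forms, taa marbuta, alef maqsura, and the diacritics, which this sketch deliberately leaves out.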
For a list of features, see Features.md.
Benchmarks
The benchmark processes a file containing 500,000 lines, 6,787,731 words, and 112,704,541 characters. The tasks are removing diacritics and transliterating to Buckwalter.
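The benchmark script itself is not shown here. A timing harness along the following lines is one way such a measurement could be set up; `time_per_corpus` is a hypothetical helper, and the corpus and workload below are stand-ins rather than the file or functions used for the numbers reported in this section.

```python
import time
from typing import Callable, Iterable


def time_per_corpus(func: Callable[[str], str], lines: Iterable[str]) -> float:
    """Return the wall-clock seconds spent applying func to every line."""
    start = time.perf_counter()
    for line in lines:
        func(line)
    return time.perf_counter() - start


# Stand-in corpus and workload (assumptions): swap in the real file's lines
# and the function under test to measure an actual task.
corpus = ["line %d" % i for i in range(10_000)]
elapsed = time_per_corpus(str.upper, corpus)
print(f"{len(corpus)} lines in {elapsed:.3f} seconds")
```

Timing the whole loop rather than each call keeps the timer overhead out of the measurement.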
Buckwalter transliteration
Method | Time
---|---
anltk python-api | 1.379 seconds
python camel_tools | 11.46 seconds
Remove Diacritics
Method | Time
---|---
anltk python-api | 0.989 seconds
python camel_tools | 4.892 seconds
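For reference, the diacritic-removal task can be sketched in pure Python as below. This is an illustration of the task, not ANLTK's code: `strip_tashkeel` is a hypothetical name, and treating tashkeel as exactly the eight marks U+064B through U+0652 is an assumption; the library's own set may differ.

```python
# Assumption: "tashkeel" means the eight marks U+064B..U+0652
# (fathatan through sukun). Mapping a code point to None deletes it.
TASHKEEL = dict.fromkeys(range(0x064B, 0x0653))


def strip_tashkeel(text: str) -> str:
    """Delete every code point in TASHKEEL and leave everything else untouched."""
    return text.translate(TASHKEEL)


# First word of the example sentence, written with escapes so the marks are visible.
word = "\u0641\u064E\u0631\u064E\u0627\u0634\u064E\u0629\u064C"
print(strip_tashkeel(word))
```

`str.translate` with a prebuilt table avoids a Python-level loop over characters, which matters on inputs of the size described above.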