Arabic language processing toolkit
Project description
Arabic Natural Language Toolkit (ANLTK)
ANLTK is a set of Arabic natural language processing tools, developed with a focus on performance.
ANLTK is a C++ library with Python bindings.
Installation
For Python:
pip install anltk
Building
Note: Building from source is currently only tested on Linux. Prebuilt Python wheels are available on PyPI for Linux, Windows, and macOS.
Dependencies:
- utfcpp, automatically downloaded.
- utf8proc, automatically downloaded.
- C++ compiler that supports C++17.
- Python3, meson, ninja
pip install meson
pip install ninja
git clone https://github.com/Abdullah-AlAttar/anltk.git \
&& cd anltk/ \
&& meson build --buildtype=release -Dbuild_tests=false \
&& cd build \
&& ninja \
&& cd ../ \
&& python3 setup.py install
Usage Examples:
C++ API
#include "anltk/anltk.hpp"
#include <iostream>
#include <string>
int main()
{
    std::string ar_text = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ";
    std::cout << anltk::transliterate(ar_text, anltk::CharMapping::AR2BW) << '\n';
    // >bjd hwz HTy klmn sEfS qr$t vx* DZg
    std::string text = "فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ.";
    std::cout << anltk::remove_tashkeel(text) << '\n';
    // فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
    // The second parameter is a stop list; characters in this list won't be removed
    std::cout << anltk::remove_non_alpha(text, " ") << '\n';
    // فراشة ملونة تطير في البستان حلوة مهندمة تدهش الإنسان
    anltk::TafqitOptions opts;
    std::cout << anltk::tafqit(15000120, opts) << '\n';
    // خمسة عشر مليونًا ومائة وعشرون
}
Python API
import anltk
ar = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"
bw = anltk.transliterate(ar, anltk.AR2BW)
print(bw)
# >bjd hwz HTy klmn sEfS qr$t vx* DZg
print(anltk.remove_tashkeel("فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ."))
# فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
print(anltk.tafqit(15000120))
# خمسة عشر مليونًا ومائة وعشرون
For a list of features, see Features.md.
Benchmarks
The benchmark processes a file containing 500,000 lines, 6,787,731 words, and 112,704,541 characters. The tasks are removing diacritics and transliterating to Buckwalter.
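The measurement method is not described here, so the following is only a sketch of how such a timing could be reproduced with the standard library. The helper name `time_task` is hypothetical, and `str.upper` is a stand-in for the function under test (for example `anltk.remove_tashkeel`, if anltk is installed).

```python
import time
from typing import Callable, Iterable


def time_task(fn: Callable[[str], str], lines: Iterable[str], repeats: int = 3) -> float:
    """Return the best wall-clock time, in seconds, of applying fn to every line."""
    data = list(lines)  # materialize once so reading the input is not timed
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in data:
            fn(line)
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in workload: 1000 copies of one diacritized Arabic word.
sample = ["\u0641\u064e\u0631\u064e\u0627\u0634\u064e\u0629\u064c"] * 1000
elapsed = time_task(str.upper, sample)  # replace str.upper with the function under test
print(f"{elapsed:.6f} seconds for {len(sample)} lines")
```

Taking the best of several runs reduces noise from other processes; the absolute numbers will depend on the machine and the input file.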
Buckwalter transliteration
Method | Time |
---|---|
anltk python-api | 1.379 seconds |
python camel_tools | 11.46 seconds |
Remove Diacritics
Method | Time |
---|---|
anltk python-api | 0.989 seconds |
python camel_tools | 4.892 seconds |
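For context on what the diacritics-removal task involves, here is a minimal pure-Python sketch that drops Unicode nonspacing marks. It is an assumption-laden illustration, not the code used by anltk or camel_tools: the function name `strip_marks` is hypothetical, and filtering on category `Mn` is broader than Arabic tashkeel alone.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Drop every Unicode nonspacing mark (category Mn).

    This covers the Arabic harakat (U+064B..U+0652) but also removes
    nonspacing marks from other scripts, so it is a broad approximation.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# The word "farasha" with diacritics, written with escapes.
diacritized = "\u0641\u064e\u0631\u064e\u0627\u0634\u064e\u0629\u064c"
plain = strip_marks(diacritized)
print(plain == "\u0641\u0631\u0627\u0634\u0629")
# True
```

A per-character Python loop like this is a simple baseline to time on the same input when comparing approaches.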
Download files
Source Distribution
- anltk-0.5.14.tar.gz (163.6 kB)
Built Distributions
- anltk-0.5.14-cp39-cp39-win_amd64.whl (152.6 kB)
- anltk-0.5.14-cp39-cp39-win32.whl (135.7 kB)
- anltk-0.5.14-cp38-cp38-win_amd64.whl (155.5 kB)
- anltk-0.5.14-cp38-cp38-win32.whl (135.7 kB)
- anltk-0.5.14-cp37-cp37m-win_amd64.whl (155.0 kB)
- anltk-0.5.14-cp37-cp37m-win32.whl (136.5 kB)
- anltk-0.5.14-cp36-cp36m-win_amd64.whl (155.1 kB)
- anltk-0.5.14-cp36-cp36m-win32.whl (136.4 kB)
- anltk-0.5.14-cp35-cp35m-win_amd64.whl (155.1 kB)
- anltk-0.5.14-cp35-cp35m-win32.whl (136.4 kB)