Arabic language processing toolkit
Arabic Natural Language Toolkit (ANLTK)
ANLTK is a set of Arabic natural language processing tools, developed with a focus on simplicity and performance.
ANLTK is a C++ library with Python bindings.
Installation
For Python:
pip install anltk
Building
Note: The build is currently only tested on Linux. Prebuilt Python wheels are available on PyPI for Linux, Windows, and macOS.
Dependencies:
- utfcpp, automatically downloaded.
- utf8proc, automatically downloaded.
- A C++ compiler that supports C++17.
- Python 3, meson, ninja
pip install meson
pip install ninja
git clone https://github.com/Abdullah-AlAttar/anltk.git \
&& cd anltk/ \
&& meson build --buildtype=release -Dbuild_tests=false \
&& cd build \
&& ninja \
&& cd ../ \
&& pip install -e .
Usage Examples:
C++ API:
#include "anltk/anltk.hpp"
#include <iostream>
#include <string>
int main()
{
std::string ar_text = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ";
std::cout << anltk::transliterate(ar_text, anltk::CharMapping::AR2BW) << '\n';
// >bjd hwz HTy klmn sEfS qr$t vx* DZg
std::string text = "فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ.";
std::cout << anltk::remove_tashkeel(text) << '\n';
// فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
// The second parameter is a stop list; characters in this list won't be removed
std::cout << anltk::remove_non_alpha(text, " ") << '\n';
// فراشة ملونة تطير في البستان حلوة مهندمة تدهش الإنسان
anltk::TafqitOptions opts;
std::cout << anltk::tafqit(15000120, opts) << '\n';
// خمسة عشر مليونًا ومائة وعشرون
}
Python API
import anltk
ar = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"
bw = anltk.transliterate(ar, anltk.AR2BW)
print(bw)
# >bjd hwz HTy klmn sEfS qr$t vx* DZg
print(anltk.remove_tashkeel("فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ."))
# فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
print(anltk.tafqit(15000120))
# خمسة عشر مليونًا ومائة وعشرون
For a list of features, see Features.md.
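The AR2BW mapping used in the examples above is the Buckwalter scheme, which assigns one ASCII character to each Arabic character. The following is a minimal pure-Python sketch of that idea, not ANLTK's implementation: the names `to_buckwalter` and `to_arabic` are hypothetical, and the table covers only the code points U+0621..U+063A and U+0640..U+0652.

```python
# Buckwalter symbols for consecutive Unicode code points.
# Illustrative table only; it is not taken from ANLTK's source.
_LETTERS = "'|>&<}AbptvjHxd*rzs$SDTZEg"  # U+0621 (hamza) .. U+063A (ghain)
_REST = "_fqklmnhwYyFNKaui~o"            # U+0640 (tatweel) .. U+0652 (sukun)

AR2BW = {0x0621 + i: c for i, c in enumerate(_LETTERS)}
AR2BW.update({0x0640 + i: c for i, c in enumerate(_REST)})

# Reverse table: Buckwalter symbol -> Arabic character.
BW2AR = {ord(c): chr(cp) for cp, c in AR2BW.items()}


def to_buckwalter(text: str) -> str:
    """Replace every mapped Arabic character; leave everything else as-is."""
    return text.translate(AR2BW)


def to_arabic(text: str) -> str:
    """Inverse of to_buckwalter for text that contains only Buckwalter symbols."""
    return text.translate(BW2AR)


sample = "\u0623\u0628\u062c\u062f \u0647\u0648\u0632"  # أبجد هوز
print(to_buckwalter(sample))
# >bjd hwz
```

Because the mapping is one-to-one, the reverse direction is just the inverted table. The reverse function would also rewrite ordinary Latin text, so it is only meaningful on input that is already Buckwalter-encoded.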
Benchmarks
The benchmark processes a file containing 500,000 lines, 6,787,731 words, and 112,704,541 characters. The tasks are removing diacritics and transliterating to Buckwalter.
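The README does not include the benchmark script itself. One way such a measurement could be taken is sketched below, assuming the file is read into memory first and each line is processed in turn. `time_per_line` is a hypothetical helper, and `str.upper` stands in for the function under test (for example `anltk.remove_tashkeel`) so that the snippet runs without any third-party package.

```python
import time


def time_per_line(func, lines):
    """Apply func to every line; return (elapsed_seconds, results)."""
    start = time.perf_counter()
    results = [func(line) for line in lines]
    elapsed = time.perf_counter() - start
    return elapsed, results


# Stand-in data and function; swap in the real corpus and the call to measure.
lines = ["sample line of text"] * 1000
elapsed, results = time_per_line(str.upper, lines)
print(f"{len(lines)} lines processed in {elapsed:.3f} seconds")
```

Timing only the processing loop keeps file I/O out of the measurement, which matters when comparing libraries on the same input.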
Buckwalter transliteration
Method | Time |
---|---|
anltk python-api | 1.379 seconds |
python camel_tools | 11.46 seconds |
Remove Diacritics
Method | Time |
---|---|
anltk python-api | 0.989 seconds |
python camel_tools | 4.892 seconds |
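For context on what the diacritic-removal task involves, here is a minimal pure-Python sketch. It assumes "diacritics" means the tashkeel marks U+064B (fathatan) through U+0652 (sukun). ANLTK may treat a different set of marks, and `strip_tashkeel` is a hypothetical name, not part of the library.

```python
# Map each tashkeel code point to None so that str.translate deletes it.
# Assumption: the range U+064B..U+0652 is the set of marks to drop.
_TASHKEEL = dict.fromkeys(range(0x064B, 0x0653))


def strip_tashkeel(text: str) -> str:
    """Return text with the tashkeel marks removed; other characters are kept."""
    return text.translate(_TASHKEEL)


sample = "\u0641\u064e\u0631\u064e\u0627\u0634\u064e\u0629\u064c"  # فَرَاشَةٌ
print(strip_tashkeel(sample))
# فراشة
```

A function like this could serve as a plain-Python baseline when timing the library calls on the same input.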