Arabic language processing toolkit
Arabic Natural Language Toolkit (ANLTK)
ANLTK is a set of Arabic natural language processing tools, developed with a focus on performance.
ANLTK is a C++ library with Python bindings.
Installation
For Python:
pip install anltk
Building
Note: Building from source is currently only tested on Linux. Prebuilt Python wheels are available for Linux, Windows, and macOS on PyPI.
Dependencies:
- utfcpp, automatically downloaded.
- utf8proc, automatically downloaded.
- A C++ compiler that supports C++17.
- Python 3, meson, ninja
pip install meson
pip install ninja
git clone https://github.com/Abdullah-AlAttar/anltk.git \
&& cd anltk/ \
&& meson build --buildtype=release -Dbuild_tests=false \
&& cd build \
&& ninja \
&& cd ../ \
&& python3 setup.py install
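After either the pip install or the source build, a quick smoke test can confirm that the bindings are importable. This is only a sketch: the helper name `anltk_available` is hypothetical, and the only library call used is `anltk.remove_tashkeel`, which appears in the usage examples in this README.

```python
import importlib.util


def anltk_available() -> bool:
    """Return True if the anltk module can be found by the current interpreter."""
    return importlib.util.find_spec("anltk") is not None


if anltk_available():
    import anltk

    # remove_tashkeel is taken from the usage examples in this README.
    print(anltk.remove_tashkeel("فَرَاشَةٌ"))
else:
    print("anltk is not importable; check the pip install or build output")
```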
Usage Examples:
C++ API:
#include "anltk/anltk.hpp"
#include <iostream>
#include <string>
int main()
{
std::string ar_text = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ";
std::cout << anltk::transliterate(ar_text, anltk::CharMapping::AR2BW) << '\n';
// >bjd hwz HTy klmn sEfS qr$t vx* DZg
std::string text = "فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ.";
std::cout << anltk::remove_tashkeel(text) << '\n';
// فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
// The second parameter is a stop_list; characters in this list won't be removed
std::cout << anltk::remove_non_alpha(text, " ") << '\n';
// فراشة ملونة تطير في البستان حلوة مهندمة تدهش الإنسان
anltk::TafqitOptions opts;
std::cout << anltk::tafqit(15000120, opts) << '\n';
// خمسة عشر مليونًا ومائة وعشرون
}
Python API
import anltk
ar = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"
bw = anltk.transliterate(ar, anltk.AR2BW)
print(bw)
# >bjd hwz HTy klmn sEfS qr$t vx* DZg
print(anltk.remove_tashkeel("فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ."))
# فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
print(anltk.tafqit(15000120))
# خمسة عشر مليونًا ومائة وعشرون
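To make the `AR2BW` output above easier to read, here is a minimal pure-Python sketch of Buckwalter transliteration. It is not ANLTK's implementation: the function name `transliterate_ar2bw` is hypothetical, the table covers only the core letters and diacritics of the standard Buckwalter scheme, and passing unmapped characters through unchanged is an assumption that may differ from what the library does.

```python
def _build_ar2bw_table() -> dict:
    """Map Arabic code points to their Buckwalter symbols."""
    table = {}
    ranges = (
        # U+0621 (hamza) .. U+063A (ghain)
        (0x0621, "'|>&<}AbptvjHxd*rzs$SDTZEg"),
        # U+0640 (tatweel) .. U+0652 (sukun), including the diacritics
        (0x0640, "_fqklmnhwYyFNKaui~o"),
        # U+0670 (superscript alef)
        (0x0670, "`"),
    )
    for start, symbols in ranges:
        for offset, symbol in enumerate(symbols):
            table[start + offset] = symbol
    return table


_AR2BW = _build_ar2bw_table()


def transliterate_ar2bw(text: str) -> str:
    """Transliterate Arabic text to Buckwalter; other characters are kept as-is."""
    return text.translate(_AR2BW)


print(transliterate_ar2bw("أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"))
# >bjd hwz HTy klmn sEfS qr$t vx* DZg
```

Because the scheme maps one code point to one ASCII symbol, a single `str.translate` pass is enough for this sketch.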
For a list of features, see Features.md.
Benchmarks
The benchmark processes a file containing 500,000 lines, 6,787,731 words, and 112,704,541 characters. The two tasks are removing diacritics and transliterating to Buckwalter.
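The benchmark script and input file are not included here, so the following is only a sketch of how such a per-line measurement could be set up. The names `strip_tashkeel` and `time_over_lines` are hypothetical, the pure-Python diacritic stripper is a stand-in rather than ANLTK's code, and it assumes the diacritics of interest are U+064B through U+0652.

```python
import time

# Assumption: tanween, harakat, shadda and sukun (U+064B..U+0652) are the marks to strip.
_TASHKEEL = {code_point: None for code_point in range(0x064B, 0x0653)}


def strip_tashkeel(text: str) -> str:
    """Pure-Python stand-in that deletes Arabic diacritics from text."""
    return text.translate(_TASHKEEL)


def time_over_lines(func, lines, repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds for applying func to every line."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in lines:
            func(line)
        best = min(best, time.perf_counter() - start)
    return best


# A small in-memory sample; the real benchmark reads a 500,000-line file.
sample = ["فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ"] * 10_000
print(f"{time_over_lines(strip_tashkeel, sample):.4f} seconds")
```

To compare libraries, the same `time_over_lines` call can be repeated with each library's function in place of `strip_tashkeel`, using identical input lines.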
Buckwalter transliteration
Method | Time
---|---
anltk python-api | 1.379 seconds
python camel_tools | 11.46 seconds
Remove Diacritics
Method | Time
---|---
anltk python-api | 0.989 seconds
python camel_tools | 4.892 seconds
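The relative speedups implied by the two tables can be worked out directly from the reported times. The snippet below is plain arithmetic on those numbers; the `speedup` helper is a hypothetical name used only for illustration.

```python
# Times in seconds, copied from the benchmark tables above.
timings = {
    "Buckwalter transliteration": {"anltk": 1.379, "camel_tools": 11.46},
    "Remove diacritics": {"anltk": 0.989, "camel_tools": 4.892},
}


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


for task, seconds in timings.items():
    ratio = speedup(seconds["camel_tools"], seconds["anltk"])
    print(f"{task}: {ratio:.1f}x")
# Buckwalter transliteration: 8.3x
# Remove diacritics: 4.9x
```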