Arabic language processing toolkit
Arabic Natural Language Toolkit (ANLTK)
ANLTK is a set of Arabic natural language processing tools, developed with a focus on performance.
ANLTK is a C++ library with Python bindings.
Installation
For Python:
pip install anltk
Building
Note: Building from source is currently only tested on Linux. Prebuilt Python wheels are available on PyPI for Linux, Windows, and macOS.
Dependencies:
- utfcpp, downloaded automatically.
- utf8proc, downloaded automatically.
- A C++ compiler that supports C++17.
- Python 3, meson, ninja
pip install meson
pip install ninja
git clone https://github.com/Abdullah-AlAttar/anltk.git \
&& cd anltk/ \
&& meson build --buildtype=release -Dbuild_tests=false \
&& cd build \
&& ninja \
&& cd ../ \
&& python3 setup.py install
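After either installation route, a quick check that the package is visible to the interpreter can save some debugging. The sketch below uses only the standard library; the helper name `is_installed` is hypothetical and not part of ANLTK.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found by the current interpreter."""
    # find_spec returns None when no such top-level module is on the import path.
    return importlib.util.find_spec(module_name) is not None


# Reports whether the anltk package is importable in this environment.
print("anltk importable:", is_installed("anltk"))
```

If this prints `False` after a source build, the install step probably targeted a different Python interpreter than the one running the check.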
Usage Examples:
C++ API:
#include "anltk/anltk.hpp"
#include <iostream>
#include <string>
int main()
{
    std::string ar_text = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ";
    std::cout << anltk::transliterate(ar_text, anltk::CharMapping::AR2BW) << '\n';
    // >bjd hwz HTy klmn sEfS qr$t vx* DZg

    std::string text = "فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ.";
    std::cout << anltk::remove_tashkeel(text) << '\n';
    // فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.

    // The second argument is a stop_list; characters in this list won't be removed
    std::cout << anltk::remove_non_alpha(text, " ") << '\n';
    // فراشة ملونة تطير في البستان حلوة مهندمة تدهش الإنسان

    anltk::TafqitOptions opts;
    std::cout << anltk::tafqit(15000120, opts) << '\n';
    // خمسة عشر مليونًا ومائة وعشرون
}
Python API
import anltk
ar = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"
bw = anltk.transliterate(ar, anltk.AR2BW)
print(bw)
# >bjd hwz HTy klmn sEfS qr$t vx* DZg
print(anltk.remove_tashkeel("فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ."))
# فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
print(anltk.tafqit(15000120))
# خمسة عشر مليونًا ومائة وعشرون
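To make the transliteration step more concrete, here is a minimal pure-Python sketch of a character-by-character mapping. It is not ANLTK's implementation: the table is derived only from the example input/output pair shown above, so it covers just those 28 letters, and the names `AR_EXAMPLE`, `BW_EXAMPLE`, `AR2BW_SKETCH`, and `transliterate_sketch` are hypothetical.

```python
# Example pair copied from the usage section above.
AR_EXAMPLE = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"
BW_EXAMPLE = ">bjd hwz HTy klmn sEfS qr$t vx* DZg"

# Both strings must line up one character to one character.
assert len(AR_EXAMPLE) == len(BW_EXAMPLE)

# Pair the strings position by position to build a translation table.
AR2BW_SKETCH = str.maketrans(dict(zip(AR_EXAMPLE, BW_EXAMPLE)))


def transliterate_sketch(text: str) -> str:
    """Map each known Arabic letter to its Buckwalter symbol.

    Characters missing from the table (for example a bare alif, which the
    example pair does not contain) are passed through unchanged.
    """
    return text.translate(AR2BW_SKETCH)


print(transliterate_sketch("قلم"))
# qlm
```

A table lookup like this is only an illustration of the idea; the library's `AR2BW` mapping covers the full Buckwalter scheme.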
For a list of features, see Features.md.
Benchmarks
The benchmark processes a file containing 500,000 lines, 6,787,731 words, and 112,704,541 characters. The two tasks are removing diacritics and transliterating to Buckwalter.
Buckwalter transliteration
Method | Time
---|---
anltk python-api | 1.379 seconds
python camel_tools | 11.46 seconds
Remove Diacritics
Method | Time
---|---
anltk python-api | 0.989 seconds
python camel_tools | 4.892 seconds
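The benchmark script itself is not shown here. As a rough sketch of how a per-line timing like the ones above could be taken, the code below times a function over a list of lines with `time.perf_counter`. The workload is a pure-Python stand-in that strips the code points U+064B to U+0652; that range is an assumption about what counts as a diacritic and may differ from what `anltk.remove_tashkeel` removes. All names here are hypothetical.

```python
import time

# Assumed diacritic range: fathatan (U+064B) through sukun (U+0652).
TASHKEEL_TABLE = {cp: None for cp in range(0x064B, 0x0653)}


def strip_tashkeel_sketch(line: str) -> str:
    """Stand-in workload: delete the assumed diacritic code points."""
    return line.translate(TASHKEEL_TABLE)


def time_over_lines(func, lines):
    """Apply func to every line and return (elapsed_seconds, results)."""
    start = time.perf_counter()
    results = [func(line) for line in lines]
    return time.perf_counter() - start, results


# A tiny synthetic corpus: one diacritized word repeated 1000 times.
sample = ["\u0645\u064F\u0644\u064E\u0648\u0651\u064E\u0646\u064E\u0629\u064C"] * 1000
elapsed, cleaned = time_over_lines(strip_tashkeel_sketch, sample)
print(f"processed {len(cleaned)} lines in {elapsed:.6f} seconds")
```

To compare libraries, the same `time_over_lines` call could be given each library's function in turn, using the same input lines for both.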