Arabic language processing toolkit
Arabic Natural Language Toolkit (ANLTK)
ANLTK is a set of Arabic natural language processing tools, developed with a focus on simplicity and performance.
ANLTK is a C++ library with Python bindings.
Installation
For Python:
pip install anltk
Building
Note: Building from source is currently tested only on Linux. Prebuilt Python wheels are available on PyPI for Linux, Windows, and macOS.
Dependencies:
- utfcpp, automatically downloaded.
- utf8proc, automatically downloaded.
- A C++ compiler that supports C++17.
- Python 3, meson, ninja
pip install meson
pip install ninja
git clone https://github.com/Abdullah-AlAttar/anltk.git \
&& cd anltk/ \
&& meson build --buildtype=release -Dbuild_tests=false \
&& cd build \
&& ninja \
&& cd ../ \
&& pip install -e .
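After either installation route, a quick check that the module is importable can save debugging time. The snippet below is a minimal sketch: `is_installed` is a hypothetical helper written for this example, not part of ANLTK, and the `anltk.tafqit(15000120)` call is the one from the usage examples in this README.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found on the import path."""
    # is_installed is a hypothetical helper for this example, not an ANLTK function.
    return importlib.util.find_spec(module_name) is not None


if is_installed("anltk"):
    import anltk

    # Same call as in the Python usage example of this README.
    print(anltk.tafqit(15000120))
else:
    print("anltk is not importable; check the pip install or build step.")
```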
Usage Examples:
C++ API :
#include "anltk/anltk.hpp"
#include <iostream>
#include <string>
int main()
{
std::string ar_text = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ";
std::cout << anltk::transliterate(ar_text, anltk::CharMapping::AR2BW) << '\n';
// >bjd hwz HTy klmn sEfS qr$t vx* DZg
std::string text = "فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ.";
std::cout << anltk::remove_tashkeel(text) << '\n';
// فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
// The second parameter is a stop list; characters in this list won't be removed
std::cout << anltk::remove_non_alpha(text, " ") << '\n';
// فراشة ملونة تطير في البستان حلوة مهندمة تدهش الإنسان
anltk::TafqitOptions opts;
std::cout << anltk::tafqit(15000120, opts) << '\n';
// خمسة عشر مليونًا ومائة وعشرون
}
Python API
import anltk
ar = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"
bw = anltk.transliterate(ar, anltk.AR2BW)
print(bw)
# >bjd hwz HTy klmn sEfS qr$t vx* DZg
print(anltk.remove_tashkeel("فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ."))
# فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
print(anltk.tafqit(15000120))
# خمسة عشر مليونًا ومائة وعشرون
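To make clearer what the transliteration and diacritic-removal calls do, here is a pure-Python approximation. It is a sketch for illustration only, not ANLTK's implementation: the names `transliterate_sketch`, `remove_tashkeel_sketch`, and `AR2BW_SKETCH` are made up for this example. It assumes the standard Buckwalter table for the basic letters U+0621 to U+063A and U+0641 to U+064A, and treats only the eight marks U+064B to U+0652 as tashkeel. The library may cover more characters.

```python
def _table(first_code_point: int, symbols: str) -> dict:
    """Map consecutive code points, starting at first_code_point, to symbols."""
    return {first_code_point + i: s for i, s in enumerate(symbols)}


# Assumed Buckwalter mapping (illustrative; not taken from ANLTK's source).
AR2BW_SKETCH = {
    **_table(0x0621, "'|>&<}AbptvjHxd*rzs$SDTZEg"),  # hamza .. ghain
    **_table(0x0641, "fqklmnhwYy"),                  # feh .. yeh
    **_table(0x064B, "FNKaui~o"),                    # fathatan .. sukun
}

# Code points treated as tashkeel in this sketch: U+064B through U+0652.
TASHKEEL = set(range(0x064B, 0x0653))


def transliterate_sketch(text: str) -> str:
    """Replace mapped Arabic characters; everything else passes through."""
    return text.translate(AR2BW_SKETCH)


def remove_tashkeel_sketch(text: str) -> str:
    """Drop the diacritic marks listed in TASHKEEL."""
    return "".join(ch for ch in text if ord(ch) not in TASHKEEL)


if __name__ == "__main__":
    print(transliterate_sketch("أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"))
    print(remove_tashkeel_sketch("فَرَاشَةٌ مُلَوَّنَةٌ"))
```

A per-character Python loop like this is far slower than a compiled implementation on large inputs, so it is only useful for understanding the operations.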
For a list of features, see Features.md.
Benchmarks
Each benchmark processes a file containing 500,000 lines, 6,787,731 words, and 112,704,541 characters. The two tasks are transliterating to Buckwalter and removing diacritics.
Buckwalter transliteration
Method | Time |
---|---|
anltk python-api | 1.379 seconds |
python camel_tools | 11.46 seconds |
Remove Diacritics
Method | Time |
---|---|
anltk python-api | 0.989 seconds |
python camel_tools | 4.892 seconds |
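The benchmark script itself is not shown here. A comparable measurement could be taken with a small timing helper like the one below. This is a sketch under assumptions: `time_over_lines` is a hypothetical name, and the function being timed is passed in by the caller, for example `anltk.remove_tashkeel` if the package is installed.

```python
import time
from typing import Callable, Iterable, List, Tuple


def time_over_lines(
    func: Callable[[str], str], lines: Iterable[str]
) -> Tuple[float, List[str]]:
    """Apply func to every line and return (elapsed seconds, results)."""
    # time_over_lines is a hypothetical helper, not part of ANLTK.
    start = time.perf_counter()
    results = [func(line) for line in lines]
    elapsed = time.perf_counter() - start
    return elapsed, results


if __name__ == "__main__":
    # Stand-in workload; replace str.strip with the function to benchmark
    # and the sample list with the lines of the input file.
    sample = ["  some text  "] * 1000
    elapsed, _ = time_over_lines(str.strip, sample)
    print(f"{elapsed:.3f} seconds for {len(sample)} lines")
```

Reading the file into memory before starting the timer keeps disk I/O out of the measurement, which makes the comparison between libraries fairer.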