Arabic language processing toolkit
Arabic Natural Language Toolkit (ANLTK)
ANLTK is a set of Arabic natural language processing tools, developed with a focus on simplicity and performance.
ANLTK is a C++ library with Python bindings.
Installation
For Python:
pip install anltk
Building
Note: Currently tested only on Linux; prebuilt Python wheels for Linux, Windows, and macOS are available on PyPI.
Dependencies:
- utfcpp (downloaded automatically).
- utf8proc (downloaded automatically).
- A C++ compiler that supports C++17.
- Python 3, meson, ninja
pip install meson
pip install ninja
git clone https://github.com/Abdullah-AlAttar/anltk.git \
&& cd anltk/ \
&& meson build --buildtype=release -Dbuild_tests=false \
&& cd build \
&& ninja \
&& cd ../ \
&& pip install -e .
Usage Examples:
C++ API:
#include "anltk/anltk.hpp"
#include <iostream>
#include <string>
int main()
{
    std::string ar_text = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ";
    std::cout << anltk::transliterate(ar_text, anltk::CharMapping::AR2BW) << '\n';
    // >bjd hwz HTy klmn sEfS qr$t vx* DZg

    std::string text = "فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ.";
    std::cout << anltk::remove_tashkeel(text) << '\n';
    // فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.

    // The second argument is a stop_list; characters in this list won't be removed
    std::cout << anltk::remove_non_alpha(text, " ") << '\n';
    // فراشة ملونة تطير في البستان حلوة مهندمة تدهش الإنسان

    anltk::TafqitOptions opts;
    std::cout << anltk::tafqit(15000120, opts) << '\n';
    // خمسة عشر مليونًا ومائة وعشرون
}
Python API
import anltk
ar = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"
bw = anltk.transliterate(ar, anltk.AR2BW)
print(bw)
# >bjd hwz HTy klmn sEfS qr$t vx* DZg
print(anltk.remove_tashkeel("فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ."))
# فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
print(anltk.tafqit(15000120))
# خمسة عشر مليونًا ومائة وعشرون
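To make the two operations above concrete, here is a minimal pure-Python sketch of what Buckwalter transliteration and diacritic removal do at the character level. It is only an illustration for cross-checking outputs: the names `AR2BW_SKETCH`, `transliterate_sketch`, and `remove_tashkeel_sketch` are hypothetical, the table covers letters only, and nothing here is claimed to match how anltk implements these functions internally.

```python
# Partial Buckwalter table (letters only). Illustrative assumption,
# not anltk's internal mapping.
AR2BW_SKETCH = {
    "ء": "'", "آ": "|", "أ": ">", "ؤ": "&", "إ": "<", "ئ": "}",
    "ا": "A", "ب": "b", "ة": "p", "ت": "t", "ث": "v", "ج": "j",
    "ح": "H", "خ": "x", "د": "d", "ذ": "*", "ر": "r", "ز": "z",
    "س": "s", "ش": "$", "ص": "S", "ض": "D", "ط": "T", "ظ": "Z",
    "ع": "E", "غ": "g", "ف": "f", "ق": "q", "ك": "k", "ل": "l",
    "م": "m", "ن": "n", "ه": "h", "و": "w", "ى": "Y", "ي": "y",
}

# Code points U+064B..U+0652: fathatan through sukun (the common tashkeel marks).
TASHKEEL_SKETCH = {chr(cp) for cp in range(0x064B, 0x0653)}


def transliterate_sketch(text):
    """Map each Arabic letter to its Buckwalter symbol; leave other characters unchanged."""
    return "".join(AR2BW_SKETCH.get(ch, ch) for ch in text)


def remove_tashkeel_sketch(text):
    """Drop the diacritic marks listed in TASHKEEL_SKETCH and keep everything else."""
    return "".join(ch for ch in text if ch not in TASHKEEL_SKETCH)


if __name__ == "__main__":
    print(transliterate_sketch("أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"))
    # >bjd hwz HTy klmn sEfS qr$t vx* DZg
    print(remove_tashkeel_sketch("\u0641\u064e\u0631\u064e\u0627\u0634\u064e\u0629\u064c"))
    # فراشة
```

A per-character lookup like this is easy to read but slow in pure Python on large files, which is the kind of workload the benchmarks below measure.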
For a list of features, see Features.md.
Benchmarks
Processing a file containing 500,000 lines, 6,787,731 words, and 112,704,541 characters. The tasks are removing diacritics and transliterating to Buckwalter.
Buckwalter transliteration
Method | Time |
---|---|
anltk python-api | 1.379 seconds |
python camel_tools | 11.46 seconds |
Remove Diacritics
Method | Time |
---|---|
anltk python-api | 0.989 seconds |
python camel_tools | 4.892 seconds |
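The script used to produce these timings is not shown here, so the following is only a sketch of how a comparable measurement could be set up. The helper name `time_per_line`, the repeat count, and the stand-in workload are assumptions; to approximate the tables above, one would pass `anltk.remove_tashkeel` (or the equivalent function from another library) as `func` and the lines of the test file as `lines`.

```python
import time


def time_per_line(func, lines, repeats=3):
    """Return the best wall-clock time in seconds for applying func to every line."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in lines:
            func(line)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    # Stand-in workload: str.strip replaces the real text-processing function.
    sample_lines = ["أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"] * 10_000
    elapsed = time_per_line(str.strip, sample_lines)
    print(f"{elapsed:.3f} seconds")
```

Taking the minimum over several repeats reduces noise from other processes; file reading is kept outside the timed loop so that only the text processing is measured.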