Arabic language processing toolkit
Arabic Natural Language Toolkit (ANLTK)
ANLTK is a set of Arabic natural language processing tools, developed with a focus on performance.
ANLTK is a C++ library with Python bindings.
Installation
For Python:
pip install pybind11
pip install anltk
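To confirm that both packages are visible to the interpreter after installation, a check along these lines may help. It is a minimal sketch that uses only the standard library; the helper name `installed_version` is made up for illustration and is not part of anltk.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names taken from the pip commands above.
    for name in ("pybind11", "anltk"):
        print(name, installed_version(name) or "not installed")
```

The script only reports what the current interpreter can see, so run it with the same Python that pip installed into.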
Building
Note: currently tested only on Linux.
The library depends on https://github.com/nemtrif/utfcpp.git, which is cloned automatically.
You also need a modern C++ compiler that supports C++17.
Meson and Ninja also need to be installed; the simplest way is with pip:
pip install meson
pip install ninja
git clone --recurse-submodules https://github.com/Abdullah-AlAttar/anltk.git \
&& cd anltk/anltk \
&& meson build --buildtype=release -Dbuild_tests=false \
&& cd build \
&& ninja \
&& cd ../../ \
&& python3 setup.py install
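Before running the build, it can be useful to check that the tools it relies on are on the PATH. The following is a sketch under assumptions: the compiler executable names (`c++`, `g++`, `clang++`) are guesses at common setups, the helper names are hypothetical, and the check only detects presence, not C++17 support.

```python
import shutil
from typing import Dict, Iterable, List, Optional

# Tools named in the build instructions above.
REQUIRED_TOOLS = ("git", "meson", "ninja")
# Assumed compiler executable names; adjust for your system.
COMPILER_CANDIDATES = ("c++", "g++", "clang++")


def find_tools(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_prerequisites() -> List[str]:
    """Return the build prerequisites that could not be found."""
    missing = [name for name, path in find_tools(REQUIRED_TOOLS).items() if path is None]
    # Any one of the candidate compilers is enough.
    if not any(find_tools(COMPILER_CANDIDATES).values()):
        missing.append("C++ compiler")
    return missing


if __name__ == "__main__":
    problems = missing_prerequisites()
    print("missing:", ", ".join(problems) if problems else "nothing")
```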
Usage Examples:
C++ API:
#include "anltk/anltk.hpp"
#include <iostream>
#include <string>
int main()
{
std::string ar_text = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ";
std::cout << anltk::transliterate(ar_text, anltk::CharMapping::AR2BW) << '\n';
// >bjd hwz HTy klmn sEfS qr$t vx* DZg
std::string text = "فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ.";
std::cout << anltk::remove_tashkeel(text) << '\n';
// فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
// The second parameter is a stop_list; characters in this list won't be removed
std::cout << anltk::remove_non_alpha(text, " ") << '\n';
// فراشة ملونة تطير في البستان حلوة مهندمة تدهش الإنسان
}
Python API
import anltk
ar = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"
bw = anltk.transliterate(ar, anltk.AR2BW)
print(bw)
# >bjd hwz HTy klmn sEfS qr$t vx* DZg
print(anltk.remove_tashkeel("فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ."))
# فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
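For readers who want to see what these two operations amount to, here is a pure-Python sketch. It is not anltk's implementation: the function names `to_buckwalter` and `strip_tashkeel` are hypothetical, the mapping is the standard Buckwalter table for the basic Arabic block, and treating U+064B through U+0652 as the full set of tashkeel is an assumption about what the library removes.

```python
# Standard Buckwalter symbols for U+0621..U+063A and U+0640..U+0652.
_AR2BW = dict(zip(range(0x0621, 0x063B), "'|>&<}AbptvjHxd*rzs$SDTZEg"))
_AR2BW.update(zip(range(0x0640, 0x0653), "_fqklmnhwYyFNKaui~o"))

# Assumed diacritic range: fathatan (U+064B) through sukun (U+0652).
_TASHKEEL = dict.fromkeys(range(0x064B, 0x0653))


def to_buckwalter(text: str) -> str:
    """Replace mapped Arabic characters; leave everything else unchanged."""
    return text.translate(_AR2BW)


def strip_tashkeel(text: str) -> str:
    """Delete the diacritics in the assumed range."""
    return text.translate(_TASHKEEL)


if __name__ == "__main__":
    print(to_buckwalter("أبجد هوز"))
    print(strip_tashkeel("فَرَاشَةٌ"))
```

Both helpers rely on `str.translate`, so each call is a single pass over the string. They can serve as a readable reference when checking the library's output on small inputs.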
For a list of features, see Features.md
Benchmarks
Processing a file containing 2,499,995 lines and 563,522,705 characters. The task is to remove diacritics.
Reading entire file into a string then a single call to remove_tashkeel:
Method | Time
---|---
anltk python-api | 5.001 seconds
anltk cpp-api | 3.507 seconds
python (camel_tools) | 23.46 seconds
Processing the file line by line:
Method | Time
---|---
anltk python-api | 7.636 seconds
anltk cpp-api | 3.601 seconds
python (camel_tools) | 22.37 seconds
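The two measurement modes can be set up roughly as follows. This is a sketch, not the script used for the numbers above: it times an in-memory synthetic sample rather than a file, the helper names are hypothetical, and `strip_tashkeel` is a pure-Python stand-in (assuming diacritics are U+064B through U+0652) so the example runs without anltk installed.

```python
import time
from typing import Callable, Tuple

# Assumed diacritic range: fathatan (U+064B) through sukun (U+0652).
_TASHKEEL = dict.fromkeys(range(0x064B, 0x0653))


def strip_tashkeel(text: str) -> str:
    """Pure-Python stand-in for the function under test."""
    return text.translate(_TASHKEEL)


def time_whole(func: Callable[[str], str], text: str) -> Tuple[float, str]:
    """Time a single call on the entire text."""
    start = time.perf_counter()
    result = func(text)
    return time.perf_counter() - start, result


def time_by_line(func: Callable[[str], str], text: str) -> Tuple[float, str]:
    """Time one call per line; splitting and rejoining are included."""
    start = time.perf_counter()
    result = "\n".join(func(line) for line in text.split("\n"))
    return time.perf_counter() - start, result


if __name__ == "__main__":
    # Synthetic sample: one diacritized word repeated on many lines.
    sample = "\u0641\u064e\u0631\u064e\u0627\u0634\u064e\u0629\u064c\n" * 100_000
    for label, runner in (("whole string", time_whole), ("line by line", time_by_line)):
        elapsed, _ = runner(strip_tashkeel, sample)
        print(f"{label}: {elapsed:.3f} seconds")
```

To time the library itself, pass `anltk.remove_tashkeel` in place of the stand-in. Per-call overhead across the binding boundary is one plausible reason the line-by-line mode is slower for the Python API.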