Arabic language processing toolkit
Arabic Natural Language Toolkit (ANLTK)
ANLTK is a set of Arabic natural language processing tools, developed with a focus on performance.
ANLTK is a C++ library with Python bindings.
Installation
For Python:
pip install pybind11
pip install anltk
Building
Note: currently tested only on Linux.
The library depends on https://github.com/nemtrif/utfcpp.git, which is cloned automatically.
You also need a modern C++ compiler that supports C++17.
Meson and Ninja must also be installed. Both are available through pip:
pip install meson
pip install ninja
git clone --recurse-submodules https://github.com/Abdullah-AlAttar/anltk.git \
&& cd anltk/anltk \
&& meson build --buildtype=release -Dbuild_tests=false \
&& cd build \
&& ninja \
&& cd ../../ \
&& python3 setup.py install
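Before running the build command above, it can help to confirm that the command-line tools it relies on are on the PATH. The following is a minimal sketch using only the Python standard library; the helper name missing_tools and the tool list are illustrative assumptions, not part of ANLTK, and the check does not verify that the C++ compiler supports C++17.

```python
import shutil

# Tools the build steps above invoke directly (assumed list, adjust as needed).
REQUIRED_TOOLS = ("git", "meson", "ninja")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Missing build tools:", ", ".join(missing))
else:
    print("All required build tools were found.")
```

An empty result only means the executables exist; version requirements still need to be checked by hand.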
Usage examples:
C++ API:
#include "anltk/anltk.hpp"
#include <iostream>
#include <string>
int main()
{
    std::string ar_text = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ";
    std::cout << anltk::transliterate(ar_text, anltk::CharMapping::AR2BW) << '\n';
    // >bjd hwz HTy klmn sEfS qr$t vx* DZg

    std::string text = "فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ.";
    std::cout << anltk::remove_tashkeel(text) << '\n';
    // فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.

    // The stop_list argument (" " here) lists characters that won't be removed
    std::cout << anltk::remove_non_alpha(text, " ") << '\n';
    // فراشة ملونة تطير في البستان حلوة مهندمة تدهش الإنسان
}
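For readers who prefer Python, the filtering behavior of remove_non_alpha shown in the C++ example can be approximated in plain Python. This is a rough sketch for intuition, not ANLTK's implementation: the function names and the choice of letter ranges (U+0621 to U+063A and U+0641 to U+064A) are assumptions, and the library's own definition of an alphabetic character may differ.

```python
def is_arabic_letter(ch: str) -> bool:
    """True for basic Arabic letters (assumed ranges; tatweel and diacritics excluded)."""
    return "\u0621" <= ch <= "\u063A" or "\u0641" <= ch <= "\u064A"


def keep_arabic_letters(text: str, stop_list: str = "") -> str:
    """Drop every character that is neither an Arabic letter nor in stop_list."""
    keep = set(stop_list)
    return "".join(ch for ch in text if is_arabic_letter(ch) or ch in keep)


# Passing " " as the stop list preserves word boundaries, as in the C++ call.
print(keep_arabic_letters("فَرَاشَةٌ مُلَوَّنَةٌ، حُلْوَةٌ.", " "))
```

Because diacritics and punctuation fall outside the letter ranges, they are removed along with everything else that is not explicitly kept.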
Python API
import anltk
ar = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"
bw = anltk.transliterate(ar, anltk.AR2BW)
print(bw)
# >bjd hwz HTy klmn sEfS qr$t vx* DZg
print(anltk.remove_tashkeel("فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ."))
# فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
For a list of features, see Features.md.
Benchmarks
Processing a file containing 2,499,995 lines and 563,522,705 characters. The task is to remove diacritics.
Reading the entire file into a string, then making a single call to remove_tashkeel:
Method | Time
---|---
anltk python-api | 5.001 seconds
anltk cpp-api | 3.507 seconds
python (camel_tools) | 23.46 seconds
Processing the file line by line:
Method | Time
---|---
anltk python-api | 7.636 seconds
anltk cpp-api | 3.601 seconds
python (camel_tools) | 22.37 seconds
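The two measurement modes above (one call on the whole text versus one call per line) can be reproduced with a small timing harness. The sketch below is an assumption about how such a comparison might be set up, not the script used to produce the numbers in the tables: it uses a regex-based stand-in named strip_tashkeel instead of anltk.remove_tashkeel, and a small synthetic in-memory sample instead of the benchmark file.

```python
import re
import time

# Stand-in for the function under test; swap in anltk.remove_tashkeel to measure ANLTK.
_TASHKEEL = re.compile("[\u064B-\u0652]")


def strip_tashkeel(text: str) -> str:
    return _TASHKEEL.sub("", text)


def time_whole(func, text):
    """Time a single call on the full text."""
    start = time.perf_counter()
    out = func(text)
    return out, time.perf_counter() - start


def time_by_line(func, text):
    """Time one call per line, keeping line endings so the outputs are comparable."""
    start = time.perf_counter()
    out = "".join(func(line) for line in text.splitlines(keepends=True))
    return out, time.perf_counter() - start


# Synthetic sample: one diacritized word repeated on 10,000 lines.
sample = "\u062d\u064f\u0644\u0652\u0648\u064e\u0629\u064c\n" * 10_000

whole_out, whole_secs = time_whole(strip_tashkeel, sample)
line_out, line_secs = time_by_line(strip_tashkeel, sample)

assert whole_out == line_out  # both modes must produce identical text
print(f"whole text:   {whole_secs:.4f} s")
print(f"line by line: {line_secs:.4f} s")
```

The per-line mode pays a function-call cost for every line, which is consistent with the larger gap between the two modes for the Python API than for the C++ API in the tables above.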