Arabic language processing toolkit
Arabic Natural Language Toolkit (ANLTK)
ANLTK is a set of Arabic natural language processing tools developed with a focus on performance.
ANLTK is a C++ library with Python bindings.
Installation
For Python:
pip install pybind11
pip install anltk
Building
Note: currently tested only on Linux.
The library depends on https://github.com/nemtrif/utfcpp.git, which is cloned automatically.
You also need a modern C++ compiler that supports C++17.
Meson and Ninja must also be installed. Both are available through pip:
pip install meson
pip install ninja
git clone --recurse-submodules https://github.com/Abdullah-AlAttar/anltk.git \
&& cd anltk/anltk \
&& meson build --buildtype=release -Dbuild_tests=false \
&& cd build \
&& ninja \
&& cd ../../ \
&& python3 setup.py install
Usage Examples:
C++ API:
#include "anltk/anltk.hpp"
#include <iostream>
#include <string>
int main()
{
std::string ar_text = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ";
std::cout << anltk::transliterate(ar_text, anltk::CharMapping::AR2BW) << '\n';
// >bjd hwz HTy klmn sEfS qr$t vx* DZg
std::string text = "فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ.";
std::cout << anltk::remove_tashkeel(text) << '\n';
// فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
// The second argument is a stop list; characters in this list won't be removed
std::cout << anltk::remove_non_alpha(text, " ") << '\n';
// فراشة ملونة تطير في البستان حلوة مهندمة تدهش الإنسان
}
Python API
import anltk
ar = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"
bw = anltk.transliterate(ar, anltk.AR2BW)
print(bw)
# >bjd hwz HTy klmn sEfS qr$t vx* DZg
print(anltk.remove_tashkeel("فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ."))
# فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
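To make the AR2BW output above easier to read, here is a minimal pure-Python sketch of a table-driven Buckwalter mapping. It is not ANLTK's implementation: the names `AR2BW_TABLE` and `to_buckwalter` are hypothetical, and the table covers only the code points U+0621 to U+063A and U+0641 to U+0652, which is an assumption about scope. Characters outside the table pass through unchanged.

```python
# Buckwalter symbols laid out in Unicode order, one symbol per code point.
# Assumption: only these three ranges are mapped; this is an illustration,
# not the mapping table that ANLTK ships.
_LETTERS_1 = "'|>&<}AbptvjHxd*rzs$SDTZEg"  # U+0621 .. U+063A (hamza .. ghain)
_LETTERS_2 = "fqklmnhwYy"                  # U+0641 .. U+064A (feh .. yeh)
_DIACRITICS = "FNKaui~o"                   # U+064B .. U+0652 (fathatan .. sukun)


def _build_table():
    """Build a code point -> Buckwalter symbol table for str.translate."""
    table = {}
    ranges = ((0x0621, _LETTERS_1), (0x0641, _LETTERS_2), (0x064B, _DIACRITICS))
    for start, symbols in ranges:
        for offset, symbol in enumerate(symbols):
            table[start + offset] = symbol
    return table


AR2BW_TABLE = _build_table()  # hypothetical name, not part of the anltk module


def to_buckwalter(text: str) -> str:
    """Replace mapped Arabic code points; leave everything else as is."""
    return text.translate(AR2BW_TABLE)


if __name__ == "__main__":
    # First two words of the example sentence, written as escapes.
    print(to_buckwalter("\u0623\u0628\u062C\u062F \u0647\u0648\u0632"))
    # >bjd hwz
```

A lookup table keyed by code point keeps the conversion a single pass over the string, which is the usual shape for this kind of character mapping.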
For a list of features, see Features.md.
Benchmarks
The benchmark processes a file containing 2,499,995 lines (563,522,705 characters). The task is to remove diacritics.
Reading the entire file into a string, then making a single call to remove_tashkeel:
Method | Time
---|---
anltk python-api | 5.001 seconds
anltk cpp-api | 3.507 seconds
python (camel_tools) | 23.46 seconds
Processing the file line by line:
Method | Time
---|---
anltk python-api | 7.636 seconds
anltk cpp-api | 3.601 seconds
python (camel_tools) | 22.37 seconds
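The two measurement modes can be sketched as a small timing harness. This is an illustration under stated assumptions, not the script used to produce the numbers above: `strip_tashkeel` is a pure-Python stand-in that removes U+064B to U+0652, the helper names are hypothetical, and the timings here include file reading, which the original benchmark may or may not have done.

```python
import os
import re
import tempfile
import time

# Stand-in for the function under test. Assumption: U+064B..U+0652 covers
# the diacritics of interest; this is not ANLTK's implementation.
_TASHKEEL = re.compile("[\u064B-\u0652]")


def strip_tashkeel(text):
    """Remove Arabic diacritics in the range U+064B..U+0652."""
    return _TASHKEEL.sub("", text)


def time_whole_file(path, fn):
    """Read the file into one string, call fn once; return (seconds, result)."""
    start = time.perf_counter()
    with open(path, encoding="utf-8") as f:
        result = fn(f.read())
    return time.perf_counter() - start, result


def time_line_by_line(path, fn):
    """Call fn once per line; return (seconds, joined result)."""
    start = time.perf_counter()
    with open(path, encoding="utf-8") as f:
        result = "".join(fn(line) for line in f)
    return time.perf_counter() - start, result


if __name__ == "__main__":
    # Small synthetic input: one diacritized word per line.
    word = "\u062D\u064F\u0644\u0652\u0648\u064E\u0629\u064C\n"
    with tempfile.NamedTemporaryFile(
        "w", encoding="utf-8", suffix=".txt", delete=False
    ) as tmp:
        tmp.write(word * 10_000)
    try:
        for name, bench in (("whole file", time_whole_file),
                            ("line by line", time_line_by_line)):
            seconds, _ = bench(tmp.name, strip_tashkeel)
            print(f"{name}: {seconds:.3f} s")
    finally:
        os.remove(tmp.name)
```

Any callable that takes and returns a string can be passed as `fn`, so `anltk.remove_tashkeel` from the Python example can be swapped in for the stand-in when the package is installed.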