Arabic language processing toolkit
Project description
Arabic Natural Language Toolkit (ANLTK)
ANLTK is a set of Arabic natural language processing tools. developed with focus on performance.
ANLTK is a C++ library, with python bindings.
Building
Note: Currently only tested on Linux
The Library depends on https://github.com/nemtrif/utfcpp.git, which is cloned automatically.
you also need a modern C++ Compiler, which supports C++17
also meson and ninja needs to be installed.
simply with pip
pip3 install meson
pip3 install ninja
git clone --recurse-submodules https://github.com/Abdullah-AlAttar/anltk.git \
&& cd anltk/anltk \
&& meson build --default-library=shared --buildtype=release -Dbuild_tests=false \
&& cd build \
&& ninja \
&& cd ../../python \
&& python3 setup.py install
Usage Examples:
C++ API :
#include "anltk/anltk.hpp"
#include <iostream>
#include <string>
int main()
{
std::string ar_text = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ";
std::cout << anltk::transliterate(ar_text, anltk::CharMapping::AR2BW) << '\n';
// >bjd hwz HTy klmn sEfS qr$t vx* DZg
std::string text = "فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ.";
std::cout << anltk::remove_tashkeel(text) << '\n';
// فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
// Third paramters is a stop_list, charactres in this list won't be removed
std::cout << anltk::remove_non_alpha(text, " ") << '\n';
// فراشة ملونة تطير في البستان حلوة مهندمة تدهش الإنسان
}
Python API
import anltk
ar = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"
bw = anltk.transliterate(ar, anltk.AR2BW)
print(bw)
# >bjd hwz HTy klmn sEfS qr$t vx* DZg
print(anltk.remove_tashkeel("فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ."))
# فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
Benchmarks
Processing a file containing 2499995 line, 563522705 characters. the task is to remove diacritics.
Reading entire file into a string then a single call to remove_tashkeel:
Method | Time | ||
---|---|---|---|
anltk python-api | 5.001 seconds | ||
anltk cpp-api | 3.507 seconds | ||
python (camel_tools) | 23.46 seconds |
Processing the file line by line:
Method | Time | ||
---|---|---|---|
anltk python-api | 7.636 seconds | ||
anltk cpp-api | 3.601 seconds | ||
python (camel_tools) | 22.37 seconds |
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
anltk-0.1.4.tar.gz
(373.5 kB
view details)
Built Distribution
File details
Details for the file anltk-0.1.4.tar.gz
.
File metadata
- Download URL: anltk-0.1.4.tar.gz
- Upload date:
- Size: 373.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.6.3 pkginfo/1.6.1 requests/2.24.0 requests-toolbelt/0.9.1 tqdm/4.50.2 CPython/3.8.5
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 5c771db9fd4dfb086bdcca54781202da77bb9909fd834d6dd2d91b313ba754c1 |
|
MD5 | b64f7aa31bb29edf4ea043a28e7f577b |
|
BLAKE2b-256 | 38af3062a4f48fab6c864c79d748b2c462d7f3e06d24f1c100875f3ca4901c51 |
File details
Details for the file anltk-0.1.4-py3.8-linux-x86_64.egg
.
File metadata
- Download URL: anltk-0.1.4-py3.8-linux-x86_64.egg
- Upload date:
- Size: 1.5 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.6.3 pkginfo/1.6.1 requests/2.24.0 requests-toolbelt/0.9.1 tqdm/4.50.2 CPython/3.8.5
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 7b8fc2d67dec8f59fafc68020b0a642c7acc04a3295760087ce9fd6d40e7a4b5 |
|
MD5 | c23ef232dfb7ac621da1e5b182810e07 |
|
BLAKE2b-256 | 98acfd5e1de844ab6e2201632b0b8fe61962a6a60c527eca6cb41227118f937f |