Arabic language processing toolkit
Arabic Natural Language Toolkit (ANLTK)
ANLTK is a set of Arabic natural language processing tools, developed with a focus on performance.
ANLTK is a C++ library with Python bindings.
Installation
For Python:
pip install pybind11
pip install anltk
Building
Note: Currently only tested on Linux.
The library depends on https://github.com/nemtrif/utfcpp.git, which is cloned automatically.
You also need a modern C++ compiler that supports C++17.
Meson and Ninja also need to be installed. Both can be installed with pip:
pip install meson
pip install ninja
git clone --recurse-submodules https://github.com/Abdullah-AlAttar/anltk.git \
&& cd anltk/anltk \
&& meson build --buildtype=release -Dbuild_tests=false \
&& cd build \
&& ninja \
&& cd ../../ \
&& python3 setup.py install
Usage Examples:
C++ API:
#include "anltk/anltk.hpp"
#include <iostream>
#include <string>
int main()
{
    std::string ar_text = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ";
    std::cout << anltk::transliterate(ar_text, anltk::CharMapping::AR2BW) << '\n';
    // >bjd hwz HTy klmn sEfS qr$t vx* DZg
    std::string text = "فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ.";
    std::cout << anltk::remove_tashkeel(text) << '\n';
    // فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
    // The second parameter is a stop list; characters in this list won't be removed.
    std::cout << anltk::remove_non_alpha(text, " ") << '\n';
    // فراشة ملونة تطير في البستان حلوة مهندمة تدهش الإنسان
}
Python API
import anltk
ar = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"
bw = anltk.transliterate(ar, anltk.AR2BW)
print(bw)
# >bjd hwz HTy klmn sEfS qr$t vx* DZg
print(anltk.remove_tashkeel("فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ."))
# فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
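To make the `AR2BW` output above easier to read, here is a minimal pure-Python sketch of the Buckwalter letter mapping. It is not ANLTK's implementation: the names `AR2BW_SKETCH` and `to_buckwalter` are hypothetical, and the table covers only the basic letters and diacritics in U+0621..U+063A and U+0640..U+0652.

```python
# Illustrative Buckwalter table for two contiguous ranges of the Arabic block.
# Assumption: this is a simplified stand-in, not the mapping shipped with ANLTK.
_RANGE_1 = "'|>&<}AbptvjHxd*rzs$SDTZEg"  # U+0621 (hamza) .. U+063A (ghain)
_RANGE_2 = "_fqklmnhwYyFNKaui~o"         # U+0640 (tatweel) .. U+0652 (sukun)

# str.translate accepts a dict that maps code points to replacement strings.
AR2BW_SKETCH = {0x0621 + i: bw for i, bw in enumerate(_RANGE_1)}
AR2BW_SKETCH.update({0x0640 + i: bw for i, bw in enumerate(_RANGE_2)})


def to_buckwalter(text: str) -> str:
    """Map Arabic letters to Buckwalter symbols; other characters pass through."""
    return text.translate(AR2BW_SKETCH)


# U+0623 U+0628 U+062C U+062F spell the first word of the example, "أبجد".
print(to_buckwalter("\u0623\u0628\u062C\u062F"))  # >bjd
```

Characters outside the two ranges, such as spaces and Latin text, are left unchanged, which is why word boundaries survive in the transliterated output.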
For a list of features, see Features.md.
Benchmarks
Processing a file containing 2,499,995 lines and 563,522,705 characters. The task is to remove diacritics.
Reading entire file into a string then a single call to remove_tashkeel:
Method | Time |
---|---|
anltk python-api | 5.001 seconds |
anltk cpp-api | 3.507 seconds |
python (camel_tools) | 23.46 seconds |
Processing the file line by line:
Method | Time |
---|---|
anltk python-api | 7.636 seconds |
anltk cpp-api | 3.601 seconds |
python (camel_tools) | 22.37 seconds |
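The two measurement modes can be reproduced with a small timing harness along these lines. This is a sketch under stated assumptions, not the script used to produce the numbers above: `strip_tashkeel` is a hypothetical pure-Python stand-in that removes only U+064B..U+0652, the timings include file I/O (the source does not say whether the published ones do), and the helper names are illustrative.

```python
import re
import time
from typing import Callable

# Stand-in diacritic remover covering fathatan (U+064B) through sukun (U+0652).
# Assumption: ANLTK's remove_tashkeel may handle a different set of marks.
_TASHKEEL = re.compile("[\u064B-\u0652]")


def strip_tashkeel(text: str) -> str:
    return _TASHKEEL.sub("", text)


def time_whole_file(path: str, fn: Callable[[str], str]) -> float:
    """Read the entire file into one string, then call fn once."""
    start = time.perf_counter()
    with open(path, encoding="utf-8") as f:
        fn(f.read())
    return time.perf_counter() - start


def time_line_by_line(path: str, fn: Callable[[str], str]) -> float:
    """Call fn once per line while streaming the file."""
    start = time.perf_counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            fn(line)
    return time.perf_counter() - start
```

Any callable that takes and returns a string can be passed as `fn`, so with the package installed, `anltk.remove_tashkeel` can be timed in place of the stand-in. The line-by-line mode pays a per-call overhead for every line, which is consistent with the larger gap between the Python and C++ rows in the second table.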