Arabic language processing toolkit
Project description
Arabic Natural Language Toolkit (ANLTK)
ANLTK is a set of Arabic natural language processing tools. developed with focus on performance.
ANLTK is a C++ library, with python bindings.
Installation
for python :
pip install pybind11
pip install anltk
Building
Note: Currently only tested on Linux
The Library depends on https://github.com/nemtrif/utfcpp.git, which is cloned automatically.
you also need a modern C++ Compiler, which supports C++17
also meson and ninja needs to be installed.
simply with pip
pip install meson
pip install ninja
git clone --recurse-submodules https://github.com/Abdullah-AlAttar/anltk.git \
&& cd anltk/anltk \
&& meson build --buildtype=release -Dbuild_tests=false \
&& cd build \
&& ninja \
&& cd ../../ \
&& python3 setup.py install
Usage Examples:
C++ API :
#include "anltk/anltk.hpp"
#include <iostream>
#include <string>
int main()
{
std::string ar_text = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ";
std::cout << anltk::transliterate(ar_text, anltk::CharMapping::AR2BW) << '\n';
// >bjd hwz HTy klmn sEfS qr$t vx* DZg
std::string text = "فَ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ.";
std::cout << anltk::remove_tashkeel(text) << '\n';
// فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
// Third paramters is a stop_list, charactres in this list won't be removed
std::cout << anltk::remove_non_alpha(text, " ") << '\n';
// فراشة ملونة تطير في البستان حلوة مهندمة تدهش الإنسان
}
Python API
import anltk
ar = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"
bw = anltk.transliterate(ar, anltk.AR2BW)
print(bw)
# >bjd hwz HTy klmn sEfS qr$t vx* DZg
print(anltk.remove_tashkeel("فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ."))
# فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
For list of features see Features.md
Benchmarks
Processing a file containing 2499995 line, 563522705 characters. the task is to remove diacritics.
Reading entire file into a string then a single call to remove_tashkeel:
Method | Time | ||
---|---|---|---|
anltk python-api | 5.001 seconds | ||
anltk cpp-api | 3.507 seconds | ||
python (camel_tools) | 23.46 seconds |
Processing the file line by line:
Method | Time | ||
---|---|---|---|
anltk python-api | 7.636 seconds | ||
anltk cpp-api | 3.601 seconds | ||
python (camel_tools) | 22.37 seconds |
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
anltk-0.3.3.tar.gz
(173.1 kB
view details)
Built Distribution
anltk-0.3.3-py3.8-linux-x86_64.egg
(169.4 kB
view details)
File details
Details for the file anltk-0.3.3.tar.gz
.
File metadata
- Download URL: anltk-0.3.3.tar.gz
- Upload date:
- Size: 173.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.6.3 pkginfo/1.6.1 requests/2.24.0 requests-toolbelt/0.9.1 tqdm/4.50.2 CPython/3.8.5
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | fc7abc0c9a1c7b242abefe9566072aa306417cf7ed8a1de027bbc6761faa66e6 |
|
MD5 | 71b1ff11f56406aab32cf6f7e409ddee |
|
BLAKE2b-256 | 0e4181fe90221364f29217596aa3a8077ec3e051b95d2d867537123c2c7552f1 |
File details
Details for the file anltk-0.3.3-py3.8-linux-x86_64.egg
.
File metadata
- Download URL: anltk-0.3.3-py3.8-linux-x86_64.egg
- Upload date:
- Size: 169.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.6.3 pkginfo/1.6.1 requests/2.24.0 requests-toolbelt/0.9.1 tqdm/4.50.2 CPython/3.8.5
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | ff389e3b7cf30b9f1453411d2d298c6077b038ffc5b2e96aec2c96b589ef3cb7 |
|
MD5 | c1e23bae6d1c1ae42fc9769196cf1e8b |
|
BLAKE2b-256 | 19a513899110843564910c4832c1a2ea9dcad187f0055040d1332a20df6bcfb7 |