Arabic language processing toolkit
Arabic Natural Language Toolkit (ANLTK)
ANLTK is a set of Arabic natural language processing tools, developed with a focus on performance.
ANLTK is a C++ library with Python bindings.
Installation
For Python:
pip install pybind11
pip install anltk
Building
Note: currently tested only on Linux.
The library depends on https://github.com/nemtrif/utfcpp.git, which is cloned automatically.
You also need a modern C++ compiler that supports C++17.
Meson and Ninja must also be installed; both are available through pip:
pip install meson
pip install ninja
git clone --recurse-submodules https://github.com/Abdullah-AlAttar/anltk.git \
&& cd anltk/ \
&& meson build --buildtype=release -Dbuild_tests=false \
&& cd build \
&& ninja \
&& cd ../ \
&& python3 setup.py install
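After either installation route, it can help to confirm that the bindings are visible to the interpreter you intend to use. The following is a minimal sketch using only the standard library; the helper name `is_importable` is an assumption made for illustration and is not part of ANLTK.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module with this name can be located."""
    # find_spec returns None when no matching top-level module is found.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "anltk" is the package name used in the pip command above.
    print("anltk importable:", is_importable("anltk"))
```

If this prints `False`, the package was most likely installed for a different Python interpreter than the one running the script.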
Usage examples:
C++ API:
#include "anltk/anltk.hpp"
#include <iostream>
#include <string>
int main()
{
std::string ar_text = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ";
std::cout << anltk::transliterate(ar_text, anltk::CharMapping::AR2BW) << '\n';
// >bjd hwz HTy klmn sEfS qr$t vx* DZg
std::string text = "فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ.";
std::cout << anltk::remove_tashkeel(text) << '\n';
// فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
// The second argument is a stop_list; characters in this list won't be removed
std::cout << anltk::remove_non_alpha(text, " ") << '\n';
// فراشة ملونة تطير في البستان حلوة مهندمة تدهش الإنسان
}
Python API
import anltk
ar = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"
bw = anltk.transliterate(ar, anltk.AR2BW)
print(bw)
# >bjd hwz HTy klmn sEfS qr$t vx* DZg
print(anltk.remove_tashkeel("فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ."))
# فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
For a list of features, see Features.md.
Benchmarks
The benchmark processes a file containing 500,000 lines, 6,787,731 words, and 112,704,541 characters. The tasks are removing diacritics and transliterating to Buckwalter.
Buckwalter transliteration
Method | Time |
---|---|
anltk python-api | 1.379 seconds |
python (camel_tools) | 11.46 seconds |
Remove Diacritics
Method | Time |
---|---|
anltk python-api | 0.989 seconds |
python (camel_tools) | 4.892 seconds |
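The README does not state how these timings were collected. As a rough sketch of how a comparable measurement could be taken, the harness below times any line-by-line transform with `time.perf_counter`. The function name `time_transform` and the stand-in corpus are assumptions made for illustration; the demo uses `str.upper` so that it runs without ANLTK installed.

```python
import time
from typing import Callable, Iterable


def time_transform(transform: Callable[[str], str], lines: Iterable[str]) -> float:
    """Apply `transform` to every line and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    for line in lines:
        transform(line)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in corpus and transform; swap in real text and a real function to benchmark.
    sample_lines = ["placeholder line of text"] * 100_000
    elapsed = time_transform(str.upper, sample_lines)
    print(f"{elapsed:.3f} seconds")
```

To time ANLTK itself, one could pass `anltk.remove_tashkeel`, or `lambda s: anltk.transliterate(s, anltk.AR2BW)`, together with the lines of a corpus file. Results will vary with hardware and corpus, so they should not be expected to match the figures above exactly.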