Arabic language processing toolkit

Arabic Natural Language Toolkit (ANLTK)

ANLTK is a set of Arabic natural language processing tools, developed with a focus on performance.

ANLTK is a C++ library with Python bindings.

Installation

For Python:

pip install pybind11
pip install anltk
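
After installing, it can be useful to confirm that the package is visible to the interpreter you intend to use. The following is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of ANLTK.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Reports whether the anltk package is importable in this environment.
print("anltk available:", is_installed("anltk"))
```

If this prints `False`, the package was most likely installed for a different Python interpreter or virtual environment.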

Building

Note: Currently only tested on Linux.
The library depends on https://github.com/nemtrif/utfcpp.git, which is cloned automatically.
You also need a modern C++ compiler that supports C++17. Meson and Ninja must also be installed, which can be done with pip:

pip install meson
pip install ninja
git clone --recurse-submodules https://github.com/Abdullah-AlAttar/anltk.git \
    && cd anltk/ \
    && meson build --buildtype=release -Dbuild_tests=false \
    && cd build \
    && ninja \
    && cd ../ \
    && python3 setup.py install

Usage Examples:

C++ API:

#include "anltk/anltk.hpp"
#include <iostream>
#include <string>

int main()
{

    std::string ar_text = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ";

    std::cout << anltk::transliterate(ar_text, anltk::CharMapping::AR2BW) << '\n';
    // >bjd hwz HTy klmn sEfS qr$t vx* DZg

    std::string text = "فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ.";

    std::cout << anltk::remove_tashkeel(text) << '\n';
    // فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.

    // The second argument is a stop list; characters in this list won't be removed
    std::cout << anltk::remove_non_alpha(text, " ") << '\n';
    // فراشة ملونة تطير في البستان حلوة مهندمة تدهش الإنسان
}
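
The stop-list behaviour mentioned in the last comment can be illustrated with a short pure-Python sketch (Python is used here only for brevity). This is not ANLTK's implementation: the function name `keep_letters` is hypothetical, and it assumes that "alpha" means any character whose Unicode general category is a letter.

```python
import unicodedata


def keep_letters(text, stop_list=""):
    """Drop every character that is neither a letter nor in stop_list.

    Assumption: a "letter" is any character whose Unicode category
    starts with "L". Diacritics (category "Mn") and punctuation are dropped.
    """
    return "".join(
        ch for ch in text
        if ch in stop_list or unicodedata.category(ch).startswith("L")
    )


# Spaces survive only because they are listed in the stop list.
print(keep_letters("حُلْوَةٌ، مُهَنْدَمَةٌ.", " "))
```

Passing an empty stop list would also remove the spaces, which is why the example above passes `" "`.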

Python API

import anltk


ar = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"
bw = anltk.transliterate(ar, anltk.AR2BW)
print(bw)
# >bjd hwz HTy klmn sEfS qr$t vx* DZg

print(anltk.remove_tashkeel("فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ."))

# فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
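
To show what the Arabic-to-Buckwalter mapping does, here is a pure-Python sketch built on `str.translate`. It is an illustration of the standard Buckwalter scheme, not ANLTK's code: the names `build_ar2bw_table` and `to_buckwalter` are hypothetical, and the table covers only the core range U+0621 to U+0652 plus tatweel.

```python
def build_ar2bw_table():
    """Build a code-point table for the core Buckwalter mapping."""
    table = {}
    # U+0621..U+063A: hamza forms, alef, and the letters up to ghain.
    for offset, latin in enumerate("'|>&<}AbptvjHxd*rzs$SDTZEg"):
        table[0x0621 + offset] = latin
    # U+0640: tatweel (kashida).
    table[0x0640] = "_"
    # U+0641..U+0652: feh through yeh, then the eight diacritics.
    for offset, latin in enumerate("fqklmnhwYyFNKaui~o"):
        table[0x0641 + offset] = latin
    return table


AR2BW = build_ar2bw_table()


def to_buckwalter(text):
    """Transliterate Arabic text; unmapped characters pass through unchanged."""
    return text.translate(AR2BW)


print(to_buckwalter("أبجد هوز"))
# >bjd hwz
```

Because `str.translate` leaves unmapped code points untouched, spaces and Latin text pass through as they are.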

For a list of features, see Features.md.

Benchmarks

Processing a file containing 500,000 lines, 6,787,731 words, and 112,704,541 characters. The task is to remove diacritics or transliterate to Buckwalter.

Buckwalter transliteration

Method                  Time
anltk python-api        1.379 seconds
python (camel_tools)    11.46 seconds

Remove Diacritics

Method                  Time
anltk python-api        0.989 seconds
python (camel_tools)    4.892 seconds
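
A timing run of this kind can be approximated with a small harness like the one below. This is a sketch under stated assumptions and not the script behind the numbers above: `strip_diacritics` is a hypothetical pure-Python stand-in that drops Unicode nonspacing marks, and `time_over_lines` is a hypothetical helper that times any per-line function.

```python
import time
import unicodedata


def strip_diacritics(text):
    """Pure-Python baseline: drop nonspacing marks (category "Mn"),
    which includes the Arabic diacritics U+064B..U+0652."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def time_over_lines(func, lines):
    """Apply func to every line and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = [func(line) for line in lines]
    elapsed = time.perf_counter() - start
    return results, elapsed


# Synthetic input: one diacritized word repeated many times.
sample = ["\u0641\u064e\u0631\u064e\u0627\u0634\u064e\u0629\u064c"] * 10000
_, seconds = time_over_lines(strip_diacritics, sample)
print(f"{len(sample)} lines in {seconds:.3f} seconds")
```

To compare against ANLTK, pass `anltk.remove_tashkeel` as `func` in place of the stand-in and read the lines from your own corpus file.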

Download files

Source Distribution

anltk-0.4.7.tar.gz (23.7 kB)

Built Distributions

anltk-0.4.7-py3.6-linux-x86_64.egg (216.5 kB)

anltk-0.4.7-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (207.3 kB; Python 3, manylinux: glibc 2.12+ x86-64)
