
Arabic language processing toolkit

Arabic Natural Language Toolkit (ANLTK)

ANLTK is a set of Arabic natural language processing tools, developed with a focus on performance.

ANLTK is a C++ library with Python bindings.

Installation

For Python:

pip install anltk

Building

Note: Currently only tested on Linux.
The library depends on https://github.com/nemtrif/utfcpp.git, which is cloned automatically.
You also need a modern C++ compiler that supports C++17. Meson and Ninja must also be installed;
both can be installed with pip:

pip install meson
pip install ninja
git clone --recurse-submodules https://github.com/Abdullah-AlAttar/anltk.git \
    && cd anltk/anltk \
    && meson build --buildtype=release -Dbuild_tests=false \
    && cd build \
    && ninja \
    && cd ../../python \
    && python3 setup.py install

Usage Examples:

C++ API:

#include "anltk/anltk.hpp"
#include <iostream>
#include <string>

int main()
{

    std::string ar_text = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ";

    std::cout << anltk::transliterate(ar_text, anltk::CharMapping::AR2BW) << '\n';
    // >bjd hwz HTy klmn sEfS qr$t vx* DZg

    std::string text = "فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ.";

    std::cout << anltk::remove_tashkeel(text) << '\n';
    // فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.

    // The second argument is a stop list: characters in this list won't be removed
    std::cout << anltk::remove_non_alpha(text, " ") << '\n';
    // فراشة ملونة تطير في البستان حلوة مهندمة تدهش الإنسان
}
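
The stop-list behaviour can be illustrated with a small pure-Python sketch. This is not ANLTK's implementation: the function name `keep_arabic_letters` is hypothetical, and the definition of "alphabetic" used here (the basic Arabic letters U+0621 to U+063A and U+0641 to U+064A) is an assumption made for illustration only.

```python
def _is_arabic_letter(ch):
    """Assumed definition of 'alpha': basic Arabic letters only."""
    cp = ord(ch)
    return 0x0621 <= cp <= 0x063A or 0x0641 <= cp <= 0x064A


def keep_arabic_letters(text, stop_list=""):
    """Drop every character that is neither an Arabic letter nor in stop_list."""
    keep = set(stop_list)
    return "".join(ch for ch in text if ch in keep or _is_arabic_letter(ch))


sample = "فَرَاشَةٌ، حُلْوَةٌ."

# Spaces survive because they are in the stop list.
print(keep_arabic_letters(sample, " "))

# With an empty stop list the spaces are dropped as well.
print(keep_arabic_letters(sample))
```

Passing a longer stop list, for example " ،.", would keep the punctuation too.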

Python API

import anltk


ar = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"
bw = anltk.transliterate(ar, anltk.AR2BW)
print(bw)
# >bjd hwz HTy klmn sEfS qr$t vx* DZg

print(anltk.remove_tashkeel("فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ."))

# فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
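
The `AR2BW` output shown above appears to follow the Buckwalter transliteration scheme, which maps each Arabic character to a single ASCII symbol. The sketch below is a pure-Python illustration of that mapping, not ANLTK's code. The names `AR2BW_TABLE` and `ar2bw` are hypothetical, and only the core letters, tatweel, and the eight basic diacritics are covered; any other character is passed through unchanged.

```python
def _table(first_cp, symbols):
    """Map consecutive code points starting at first_cp to the given symbols."""
    return {chr(first_cp + i): s for i, s in enumerate(symbols)}


AR2BW_TABLE = {}
# U+0621..U+063A: hamza forms, alef, then beh through ghain
AR2BW_TABLE.update(_table(0x0621, "'|>&<}AbptvjHxd*rzs$SDTZEg"))
# U+0640..U+064A: tatweel, then feh through yeh
AR2BW_TABLE.update(_table(0x0640, "_fqklmnhwYy"))
# U+064B..U+0652: tanween forms, short vowels, shadda, sukun
AR2BW_TABLE.update(_table(0x064B, "FNKaui~o"))

_TRANS = str.maketrans(AR2BW_TABLE)


def ar2bw(text):
    """Transliterate Arabic text to Buckwalter; unmapped characters are kept."""
    return text.translate(_TRANS)


print(ar2bw("أبجد هوز"))
# >bjd hwz
```

Because every mapped symbol is unique, the reverse direction can be built by inverting the dictionary.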

Benchmarks

Processing a file containing 2,499,995 lines (563,522,705 characters). The task is to remove diacritics.

Reading entire file into a string then a single call to remove_tashkeel:

Method                 Time (seconds)
anltk python-api       5.001
anltk cpp-api          3.507
python (camel_tools)   23.46

Processing the file line by line:

Method                 Time (seconds)
anltk python-api       7.636
anltk cpp-api          3.601
python (camel_tools)   22.37
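
The two measurement modes can be sketched as a small timing harness. This is an illustration of the methodology described above, not the script used to produce these numbers. The helpers `time_whole` and `time_by_line` are hypothetical names, and `strip_tashkeel` is a pure-Python stand-in (removing U+064B to U+0652) so that the example runs without ANLTK installed; pass `anltk.remove_tashkeel` instead to measure the library itself.

```python
import re
import time

# Stand-in for the function under test (assumption: tashkeel = U+064B..U+0652).
_TASHKEEL = re.compile("[\u064B-\u0652]")


def strip_tashkeel(text):
    return _TASHKEEL.sub("", text)


def time_whole(func, text):
    """Time a single call on the entire text."""
    start = time.perf_counter()
    result = func(text)
    return result, time.perf_counter() - start


def time_by_line(func, text):
    """Time one call per line, then rejoin the lines."""
    start = time.perf_counter()
    result = "\n".join(func(line) for line in text.split("\n"))
    return result, time.perf_counter() - start


# Small synthetic input; a real run would read the benchmark file instead.
text = "\n".join(["\u0628\u064e\u0627\u0628\u064c"] * 10000)

whole, t_whole = time_whole(strip_tashkeel, text)
by_line, t_lines = time_by_line(strip_tashkeel, text)

assert whole == by_line  # both modes must produce the same output
print(f"whole string: {t_whole:.4f} s")
print(f"line by line: {t_lines:.4f} s")
```

The line-by-line mode pays a per-call overhead for every line, which is consistent with the gap between the two Python API timings above.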
