Arabic language processing toolkit

Arabic Natural Language Toolkit (ANLTK)

ANLTK is a set of Arabic natural language processing tools, developed with a focus on performance.

ANLTK is a C++ library with Python bindings.

Installation

For Python:

pip install pybind11
pip install anltk
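
After running the two commands above, it can be useful to confirm that both packages are visible to the interpreter you intend to use. The following is a minimal sketch using only the standard library; the helper name `is_installed` is made up for this example and is not part of anltk.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Report whether each package from the install step can be found.
for name in ("pybind11", "anltk"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either package is reported as missing, the usual cause is that pip installed into a different Python environment than the one running the script.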

Building

Note: Currently only tested on Linux.
The library depends on https://github.com/nemtrif/utfcpp.git, which is cloned automatically.
You also need a modern C++ compiler that supports C++17, as well as meson and ninja,
which can be installed with pip:

pip install meson
pip install ninja
git clone --recurse-submodules https://github.com/Abdullah-AlAttar/anltk.git \
    && cd anltk/anltk \
    && meson build --buildtype=release -Dbuild_tests=false \
    && cd build \
    && ninja \
    && cd ../../ \
    && python3 setup.py install

Usage Examples:

C++ API:

#include "anltk/anltk.hpp"
#include <iostream>
#include <string>

int main()
{

    std::string ar_text = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ";

    std::cout << anltk::transliterate(ar_text, anltk::CharMapping::AR2BW) << '\n';
    // >bjd hwz HTy klmn sEfS qr$t vx* DZg

    std::string text = "فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ.";

    std::cout << anltk::remove_tashkeel(text) << '\n';
    // فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.

    // The second parameter is a stop list; characters in this list won't be removed
    std::cout << anltk::remove_non_alpha(text, " ") << '\n';
    // فراشة ملونة تطير في البستان حلوة مهندمة تدهش الإنسان
}

Python API

import anltk


ar = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"
bw = anltk.transliterate(ar, anltk.AR2BW)
print(bw)
# >bjd hwz HTy klmn sEfS qr$t vx* DZg

print(anltk.remove_tashkeel("فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ."))

# فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
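
To show what the AR2BW mapping does, here is a rough pure-Python sketch of Buckwalter transliteration as a character-by-character table lookup. It is only an illustration and not how anltk implements it: the names `AR2BW_TABLE`, `BW2AR_TABLE` and `transliterate_sketch` are made up for this example, and the table covers only the basic letters, not diacritics or extended characters.

```python
# Partial Buckwalter table (Arabic letter -> ASCII symbol); illustrative only.
AR2BW_TABLE = {
    "ء": "'", "آ": "|", "أ": ">", "ؤ": "&", "إ": "<", "ئ": "}",
    "ا": "A", "ب": "b", "ة": "p", "ت": "t", "ث": "v", "ج": "j",
    "ح": "H", "خ": "x", "د": "d", "ذ": "*", "ر": "r", "ز": "z",
    "س": "s", "ش": "$", "ص": "S", "ض": "D", "ط": "T", "ظ": "Z",
    "ع": "E", "غ": "g", "ف": "f", "ق": "q", "ك": "k", "ل": "l",
    "م": "m", "ن": "n", "ه": "h", "و": "w", "ى": "Y", "ي": "y",
}

# The reverse direction is the inverted table (the mapping is one-to-one).
BW2AR_TABLE = {bw: ar for ar, bw in AR2BW_TABLE.items()}


def transliterate_sketch(text: str, table: dict) -> str:
    """Map each character through `table`; unknown characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)


ar = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"
bw = transliterate_sketch(ar, AR2BW_TABLE)
print(bw)
# >bjd hwz HTy klmn sEfS qr$t vx* DZg

# Round trip back to Arabic script.
print(transliterate_sketch(bw, BW2AR_TABLE) == ar)
```

Characters outside the table, such as spaces and punctuation, are left unchanged, which is why the word boundaries survive the conversion.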

For a list of features, see Features.md

Benchmarks

Processing a file containing 2,499,995 lines and 563,522,705 characters. The task is to remove diacritics.

Reading entire file into a string then a single call to remove_tashkeel:

Method                  Time
anltk python-api        5.001 seconds
anltk cpp-api           3.507 seconds
python (camel_tools)    23.46 seconds

Processing the file line by line:

Method                  Time
anltk python-api        7.636 seconds
anltk cpp-api           3.601 seconds
python (camel_tools)    22.37 seconds
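
The two measurement modes above (one call on the whole text versus one call per line) could be reproduced with a small harness like the one below. This is a sketch under assumptions, not the script used to produce the numbers: `strip_tashkeel` is a regex-based stand-in so the example runs without anltk installed, and the sample text is a short repeated line rather than the benchmark file. To measure the library itself, pass `anltk.remove_tashkeel` in place of the stand-in.

```python
import re
import time

# Stand-in for anltk.remove_tashkeel (NOT anltk's code): strips the common
# Arabic diacritics U+064B..U+0652 and the superscript alef U+0670.
_TASHKEEL = re.compile("[\u064B-\u0652\u0670]")


def strip_tashkeel(text: str) -> str:
    return _TASHKEEL.sub("", text)


def time_whole(text, fn):
    """Apply `fn` once to the entire text; return (result, seconds)."""
    start = time.perf_counter()
    out = fn(text)
    return out, time.perf_counter() - start


def time_by_line(text, fn):
    """Apply `fn` to each line separately; return (result, seconds)."""
    start = time.perf_counter()
    out = "\n".join(fn(line) for line in text.split("\n"))
    return out, time.perf_counter() - start


# Small synthetic input; the real benchmark used a much larger file.
sample = "فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ\n" * 1000

whole, t_whole = time_whole(sample, strip_tashkeel)
lines, t_lines = time_by_line(sample, strip_tashkeel)

# Both modes must produce identical text; only the timing differs.
print("same output:", whole == lines)
print(f"whole text:   {t_whole:.4f} s")
print(f"line by line: {t_lines:.4f} s")
```

The line-by-line mode pays a per-call overhead for every line, which is one plausible reason the Python API figures above differ more between the two modes than the C++ figures do.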
