Arabic language processing toolkit

Arabic Natural Language Toolkit (ANLTK)

ANLTK is a set of Arabic natural language processing tools, developed with a focus on performance.

ANLTK is a C++ library with Python bindings.

Installation

For Python:

pip install anltk
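
To confirm that the install landed in the interpreter you intend to use, a quick check along these lines can help. This is a minimal sketch using only the standard library; the helper name `is_installed` is made up for illustration and is not part of ANLTK.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("anltk"):
    print("anltk is importable")
else:
    print("anltk not found; try: pip install anltk")
```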

Building

Note: Currently only tested on Linux.
The library depends on https://github.com/nemtrif/utfcpp.git, which is cloned automatically.
You also need a modern C++ compiler that supports C++17, along with meson and ninja, both of which can be installed with pip:

pip install meson
pip install ninja
git clone --recurse-submodules https://github.com/Abdullah-AlAttar/anltk.git \
    && cd anltk/anltk \
    && meson build --buildtype=release -Dbuild_tests=false \
    && cd build \
    && ninja \
    && cd ../../ \
    && python3 setup.py install

Usage Examples:

C++ API:

#include "anltk/anltk.hpp"
#include <iostream>
#include <string>

int main()
{

    std::string ar_text = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ";

    std::cout << anltk::transliterate(ar_text, anltk::CharMapping::AR2BW) << '\n';
    // >bjd hwz HTy klmn sEfS qr$t vx* DZg

    std::string text = "فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ.";

    std::cout << anltk::remove_tashkeel(text) << '\n';
    // فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.

    // The second parameter is a stop list; characters in this list won't be removed
    std::cout << anltk::remove_non_alpha(text, " ") << '\n';
    // فراشة ملونة تطير في البستان حلوة مهندمة تدهش الإنسان
}

Python API:

import anltk


ar = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"
bw = anltk.transliterate(ar, anltk.AR2BW)
print(bw)
# >bjd hwz HTy klmn sEfS qr$t vx* DZg

print(anltk.remove_tashkeel("فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ."))

# فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
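
To make the AR2BW output above easier to read, here is a pure-Python sketch of the idea behind Buckwalter transliteration: each Arabic letter maps to one ASCII character. The table below covers only the 28 letters that appear in the example string, and the names `AR2BW_SKETCH` and `transliterate_sketch` are made up for illustration. It is not ANLTK's implementation and does not handle diacritics or the other hamza forms.

```python
# The 28 letters of the example string, in order, and their Buckwalter symbols.
ARABIC = "أبجدهوزحطيكلمنسعفصقرشتثخذضظغ"
BUCKWALTER = ">bjdhwzHTyklmnsEfSqr$tvx*DZg"

# str.maketrans pairs the two strings position by position.
AR2BW_SKETCH = str.maketrans(ARABIC, BUCKWALTER)


def transliterate_sketch(text):
    """Map each known Arabic letter to its Buckwalter symbol.

    Characters outside the partial table (spaces, punctuation,
    unlisted letters) pass through unchanged.
    """
    return text.translate(AR2BW_SKETCH)


print(transliterate_sketch("أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"))
# >bjd hwz HTy klmn sEfS qr$t vx* DZg
```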

For a list of features, see Features.md.

Benchmarks

The benchmark processes a file containing 2,499,995 lines (563,522,705 characters). The task is to remove diacritics.

Reading entire file into a string then a single call to remove_tashkeel:

Method                  Time
anltk python-api        5.001 seconds
anltk cpp-api           3.507 seconds
python (camel_tools)    23.46 seconds

Processing the file line by line:

Method                  Time
anltk python-api        7.636 seconds
anltk cpp-api           3.601 seconds
python (camel_tools)    22.37 seconds

