Arabic language processing toolkit

Arabic Natural Language Toolkit (ANLTK)

ANLTK is a set of Arabic natural language processing tools, developed with a focus on performance.

ANLTK is a C++ library with Python bindings.

Installation

For Python:

pip install pybind11
pip install anltk

Building

Note: The build is currently tested only on Linux.
The library depends on https://github.com/nemtrif/utfcpp.git, which is cloned automatically.
You also need a modern C++ compiler that supports C++17. meson and ninja must also be installed,
which can be done with pip:

pip install meson
pip install ninja
git clone --recurse-submodules https://github.com/Abdullah-AlAttar/anltk.git \
    && cd anltk/anltk \
    && meson build --buildtype=release -Dbuild_tests=false \
    && cd build \
    && ninja \
    && cd ../../ \
    && python3 setup.py install
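
After either install route, a quick check that the bindings are importable can save a confusing traceback later. The snippet below is a minimal sketch using only the standard library; the helper name `module_available` is made up for illustration and is not part of ANLTK.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module called `name` can be located."""
    return importlib.util.find_spec(name) is not None


# Report whether the anltk bindings are visible to this interpreter.
if module_available("anltk"):
    print("anltk is importable")
else:
    print("anltk not found; re-check the pip or meson/ninja steps above")
```

The check only locates the module without importing it, so it runs safely even when the build did not complete.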

Usage Examples:

C++ API:

#include "anltk/anltk.hpp"
#include <iostream>
#include <string>

int main()
{

    std::string ar_text = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ";

    std::cout << anltk::transliterate(ar_text, anltk::CharMapping::AR2BW) << '\n';
    // >bjd hwz HTy klmn sEfS qr$t vx* DZg

    std::string text = "فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ.";

    std::cout << anltk::remove_tashkeel(text) << '\n';
    // فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.

    // The second argument is a stop list; characters in this list won't be removed
    std::cout << anltk::remove_non_alpha(text, " ") << '\n';
    // فراشة ملونة تطير في البستان حلوة مهندمة تدهش الإنسان
}

Python API:

import anltk


ar = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"
bw = anltk.transliterate(ar, anltk.AR2BW)
print(bw)
# >bjd hwz HTy klmn sEfS qr$t vx* DZg

print(anltk.remove_tashkeel("فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ."))

# فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
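
To sanity-check the outputs above without the compiled extension, the two operations can be approximated in pure Python. This is a rough sketch, not ANLTK's implementation: the function names `ar2bw_reference` and `strip_tashkeel_reference` are hypothetical, the table follows the standard Buckwalter scheme for U+0621 to U+0652 only, and "tashkeel" is assumed to mean the marks U+064B to U+0652. The library may handle more characters than this.

```python
# Buckwalter symbols for the consecutive code points U+0621..U+063A
# (hamza forms, then alef through ghain) and U+0640..U+0652
# (tatweel, feh through yeh, then the eight diacritics).
_BLOCK_1 = "'|>&<}AbptvjHxd*rzs$SDTZEg"
_BLOCK_2 = "_fqklmnhwYyFNKaui~o"

AR2BW_TABLE = {0x0621 + i: ch for i, ch in enumerate(_BLOCK_1)}
AR2BW_TABLE.update({0x0640 + i: ch for i, ch in enumerate(_BLOCK_2)})

# Assumption: tashkeel = fathatan through sukun (U+064B..U+0652).
# Mapping a code point to None makes str.translate delete it.
TASHKEEL_TABLE = dict.fromkeys(range(0x064B, 0x0653))


def ar2bw_reference(text: str) -> str:
    """Map Arabic characters to Buckwalter; other characters pass through."""
    return text.translate(AR2BW_TABLE)


def strip_tashkeel_reference(text: str) -> str:
    """Delete the diacritic marks listed in TASHKEEL_TABLE."""
    return text.translate(TASHKEEL_TABLE)


# >bjd hwz
print(ar2bw_reference("أبجد هوز"))
# فراشة
print(strip_tashkeel_reference("فَرَاشَةٌ"))
```

Comparing these results with `anltk.transliterate` and `anltk.remove_tashkeel` on the same input is a cheap way to spot encoding problems in a pipeline.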

For a list of features, see Features.md.

Benchmarks

Processing a file containing 2,499,995 lines and 563,522,705 characters. The task is to remove diacritics.

Reading the entire file into a string, then making a single call to remove_tashkeel:

Method                  Time
anltk python-api        5.001 seconds
anltk cpp-api           3.507 seconds
python (camel_tools)    23.46 seconds

Processing the file line by line:

Method                  Time
anltk python-api        7.636 seconds
anltk cpp-api           3.601 seconds
python (camel_tools)    22.37 seconds
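
The two measurement modes described above (one call on the whole file contents versus one call per line) can be reproduced with a small timing harness. This is a sketch of the general approach, not the script used for the numbers in this README: `time_whole` and `time_by_line` are hypothetical names, and a pure-Python stand-in is used so the block runs without ANLTK. Passing `anltk.remove_tashkeel` as `func` should exercise the Python API instead.

```python
import time
from typing import Callable, Tuple

# Stand-in for the function under test: deletes U+064B..U+0652.
_MARKS = dict.fromkeys(range(0x064B, 0x0653))


def strip_marks(text: str) -> str:
    return text.translate(_MARKS)


def time_whole(func: Callable[[str], str], data: str) -> Tuple[str, float]:
    """Call `func` once on the entire text; return (result, seconds)."""
    start = time.perf_counter()
    result = func(data)
    return result, time.perf_counter() - start


def time_by_line(func: Callable[[str], str], data: str) -> Tuple[str, float]:
    """Call `func` once per line; return (joined result, seconds)."""
    start = time.perf_counter()
    result = "".join(func(line) for line in data.splitlines(keepends=True))
    return result, time.perf_counter() - start


# Tiny synthetic input; a real run would read the corpus file instead.
sample = "\u0641\u064E\u0631\u064E\u0627\u0634\u064E\u0629\u064C\n" * 1000

whole, t_whole = time_whole(strip_marks, sample)
by_line, t_line = time_by_line(strip_marks, sample)
assert whole == by_line  # both modes must produce identical text
print(f"whole: {t_whole:.4f}s, line by line: {t_line:.4f}s")
```

The line-by-line mode adds one function call per line, which is a likely reason the Python API slows down more than the C++ API between the two tables.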


Built Distributions

anltk-0.3.8-pp37-pypy37_pp73-manylinux2010_x86_64.whl (409.4 kB) - PyPy, manylinux (glibc 2.12+), x86-64
anltk-0.3.8-cp310-cp310-manylinux2010_x86_64.whl (207.3 kB) - CPython 3.10, manylinux (glibc 2.12+), x86-64
anltk-0.3.8-cp39-cp39-manylinux2010_x86_64.whl (207.6 kB) - CPython 3.9, manylinux (glibc 2.12+), x86-64
anltk-0.3.8-cp38-cp38-manylinux2010_x86_64.whl (206.2 kB) - CPython 3.8, manylinux (glibc 2.12+), x86-64
anltk-0.3.8-cp37-cp37m-manylinux2010_x86_64.whl (210.7 kB) - CPython 3.7m, manylinux (glibc 2.12+), x86-64
anltk-0.3.8-cp36-cp36m-manylinux2010_x86_64.whl (210.5 kB) - CPython 3.6m, manylinux (glibc 2.12+), x86-64
