
Arabic language processing toolkit


Arabic Natural Language Toolkit (ANLTK)

ANLTK is a set of Arabic natural language processing tools, developed with a focus on performance.

ANLTK is a C++ library with Python bindings.

Installation

For Python:

pip install pybind11
pip install anltk
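
To confirm that the package is importable after installation, a check along these lines may help. This is a minimal sketch that uses only the Python standard library; `is_installed` is a hypothetical helper and not part of anltk.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found by the import system."""
    return importlib.util.find_spec(package_name) is not None


# Hypothetical helper, shown here for the anltk package name used above.
print("anltk importable:", is_installed("anltk"))
```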

Building

Note: currently tested only on Linux.
The library depends on utfcpp (https://github.com/nemtrif/utfcpp.git), which is cloned automatically.
You also need a modern C++ compiler that supports C++17, along with meson and ninja.
Both can be installed with pip:

pip install meson
pip install ninja
git clone --recurse-submodules https://github.com/Abdullah-AlAttar/anltk.git \
    && cd anltk/anltk \
    && meson build --buildtype=release -Dbuild_tests=false \
    && cd build \
    && ninja \
    && cd ../../ \
    && python3 setup.py install

Usage Examples:

C++ API:

#include "anltk/anltk.hpp"
#include <iostream>
#include <string>

int main()
{

    std::string ar_text = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ";

    std::cout << anltk::transliterate(ar_text, anltk::CharMapping::AR2BW) << '\n';
    // >bjd hwz HTy klmn sEfS qr$t vx* DZg

    std::string text = "فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ.";

    std::cout << anltk::remove_tashkeel(text) << '\n';
    // فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.

    // The second argument is a stop list; characters in this list won't be removed
    std::cout << anltk::remove_non_alpha(text, " ") << '\n';
    // فراشة ملونة تطير في البستان حلوة مهندمة تدهش الإنسان
}

Python API:

import anltk


ar = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"
bw = anltk.transliterate(ar, anltk.AR2BW)
print(bw)
# >bjd hwz HTy klmn sEfS qr$t vx* DZg

print(anltk.remove_tashkeel("فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ."))

# فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
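
For readers who want to see what the two calls above do at the character level, here is a pure-Python sketch of both operations. It is not anltk's implementation. It assumes the standard Buckwalter table for code points U+0621..U+063A and U+0640..U+0652, and it assumes that "tashkeel" means the eight marks U+064B..U+0652; the set that anltk actually removes may differ. The names `to_buckwalter` and `strip_tashkeel` are hypothetical.

```python
# Buckwalter symbols listed in Unicode code point order for two contiguous blocks.
# Assumption: standard Buckwalter table; anltk's AR2BW mapping may differ in details.
_BW_BLOCKS = {
    0x0621: "'|>&<}AbptvjHxd*rzs$SDTZEg",  # hamza .. ghain
    0x0640: "_fqklmnhwYyFNKaui~o",         # tatweel, feh .. yeh, then diacritics
}
AR2BW = {
    start + offset: symbol
    for start, symbols in _BW_BLOCKS.items()
    for offset, symbol in enumerate(symbols)
}

# Assumption: tashkeel = fathatan (U+064B) through sukun (U+0652).
# Mapping a code point to None tells str.translate to delete it.
TASHKEEL = dict.fromkeys(range(0x064B, 0x0653))


def to_buckwalter(text: str) -> str:
    """Transliterate Arabic letters to Buckwalter; other characters pass through."""
    return text.translate(AR2BW)


def strip_tashkeel(text: str) -> str:
    """Delete Arabic diacritic marks and leave everything else untouched."""
    return text.translate(TASHKEEL)


print(to_buckwalter("أبجد هوز"))  # >bjd hwz
print(strip_tashkeel("حُلْوَةٌ"))   # حلوة
```

Both functions rely on `str.translate` with a dictionary keyed by code point, which handles each character in a single pass without regular expressions.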

For a list of features, see Features.md

Benchmarks

Processing a file containing 2,499,995 lines and 563,522,705 characters. The task is to remove diacritics.

Reading entire file into a string then a single call to remove_tashkeel:

Method                  Time
anltk python-api        5.001 seconds
anltk cpp-api           3.507 seconds
python (camel_tools)    23.46 seconds

Processing the file line by line:

Method                  Time
anltk python-api        7.636 seconds
anltk cpp-api           3.601 seconds
python (camel_tools)    22.37 seconds
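
The two measurement modes above (one call over the whole text versus one call per line) can be sketched with a small timing harness. This is not the script used to produce the numbers above. It runs on a small synthetic in-memory corpus and uses a pure-Python stand-in, `strip_tashkeel`, which assumes the diacritics are U+064B..U+0652. The helper names `time_whole` and `time_by_line` are hypothetical.

```python
import time

# Assumption: tashkeel = U+064B..U+0652. Pure-Python stand-in, not anltk's code.
TASHKEEL = dict.fromkeys(range(0x064B, 0x0653))


def strip_tashkeel(text: str) -> str:
    """Delete Arabic diacritic marks from the text."""
    return text.translate(TASHKEEL)


def time_whole(text: str, fn) -> float:
    """Time a single call of fn over the entire text, in seconds."""
    start = time.perf_counter()
    fn(text)
    return time.perf_counter() - start


def time_by_line(text: str, fn) -> float:
    """Time one call of fn per line of the text, in seconds."""
    start = time.perf_counter()
    for line in text.splitlines():
        fn(line)
    return time.perf_counter() - start


# Small synthetic corpus; the benchmark above used a file of about 2.5 million lines.
corpus = "\n".join(["حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ."] * 10_000)

print(f"whole text:   {time_whole(corpus, strip_tashkeel):.4f} s")
print(f"line by line: {time_by_line(corpus, strip_tashkeel):.4f} s")
```

To time the library itself, pass `anltk.remove_tashkeel` as `fn` in place of the stand-in. The per-line mode pays the Python-to-C++ call overhead once per line, which is one plausible reason the python-api figures differ more between the two tables than the cpp-api figures do.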
