
Arabic language processing toolkit


Arabic Natural Language Toolkit (ANLTK)

ANLTK is a set of Arabic natural language processing tools, developed with a focus on performance.

ANLTK is a C++ library with Python bindings.

Installation

For Python:

pip install pybind11
pip install anltk
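
After running the commands above, a quick way to confirm that both packages are visible to the current interpreter is to look them up without importing them. This is only a sketch: the helper name is_installed is hypothetical, and it relies solely on the standard library.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located by the import system."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Package names taken from the pip commands above.
    for name in ("pybind11", "anltk"):
        status = "found" if is_installed(name) else "missing"
        print(f"{name}: {status}")
```

If either line reports "missing", the pip commands were most likely run against a different Python environment.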

Building

Note: Currently only tested on Linux.
The library depends on https://github.com/nemtrif/utfcpp.git, which is cloned automatically.
You also need a modern C++ compiler that supports C++17. Meson and Ninja must also be installed, which can be done with pip:

pip install meson
pip install ninja
git clone --recurse-submodules https://github.com/Abdullah-AlAttar/anltk.git \
    && cd anltk/anltk \
    && meson build --buildtype=release -Dbuild_tests=false \
    && cd build \
    && ninja \
    && cd ../../ \
    && python3 setup.py install

Usage Examples:

C++ API:

#include "anltk/anltk.hpp"
#include <iostream>
#include <string>

int main()
{

    std::string ar_text = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ";

    std::cout << anltk::transliterate(ar_text, anltk::CharMapping::AR2BW) << '\n';
    // >bjd hwz HTy klmn sEfS qr$t vx* DZg

    std::string text = "فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ.";

    std::cout << anltk::remove_tashkeel(text) << '\n';
    // فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.

    // The last argument is a stop list: characters in it won't be removed
    std::cout << anltk::remove_non_alpha(text, " ") << '\n';
    // فراشة ملونة تطير في البستان حلوة مهندمة تدهش الإنسان
}
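
The stop-list idea in the last call can be sketched in plain Python. This is only an illustration, not anltk's implementation: the helper name keep_alpha is hypothetical, and it treats any Unicode letter as alphabetic, which may differ from what the library considers a letter.

```python
def keep_alpha(text: str, stop_list: str = "") -> str:
    """Drop every character that is neither a letter nor in stop_list."""
    keep = set(stop_list)  # characters that must survive even if non-alphabetic
    # str.isalpha() is False for diacritics (combining marks), digits,
    # punctuation and whitespace, so those are dropped unless listed in `keep`.
    return "".join(ch for ch in text if ch.isalpha() or ch in keep)


if __name__ == "__main__":
    text = "في البُسْتَانِ، حُلْوَةٌ."
    print(keep_alpha(text))       # no stop list: spaces are dropped too
    print(keep_alpha(text, " "))  # spaces preserved by the stop list
```

Passing " " as the stop list is what keeps the words separated in the C++ example above; without it, the spaces would be removed along with the punctuation.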

Python API:

import anltk


ar = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"
bw = anltk.transliterate(ar, anltk.AR2BW)
print(bw)
# >bjd hwz HTy klmn sEfS qr$t vx* DZg

print(anltk.remove_tashkeel("فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ."))

# فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
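
To make concrete what the two calls above compute, here is a pure-Python sketch that runs without anltk installed. It is not the library's implementation. The function names to_buckwalter and strip_tashkeel are hypothetical, the letter mapping is only the 28-letter subset that can be read off the example output above (the full Buckwalter scheme covers more characters), and "tashkeel" is assumed to mean the eight basic diacritics U+064B to U+0652.

```python
import re

# The 28 letters of the example sentence, in the order they appear there,
# paired with the Latin characters from the example output.
_AR = (
    "\u0623\u0628\u062c\u062f\u0647\u0648\u0632\u062d\u0637\u064a"
    "\u0643\u0644\u0645\u0646\u0633\u0639\u0641\u0635\u0642\u0631"
    "\u0634\u062a\u062b\u062e\u0630\u0636\u0638\u063a"
)
_BW = ">bjdhwzHTyklmnsEfSqr$tvx*DZg"
_AR2BW = str.maketrans(_AR, _BW)

# Assumption: fathatan (U+064B) through sukun (U+0652).
_TASHKEEL = re.compile("[\u064B-\u0652]")


def to_buckwalter(text: str) -> str:
    """Map the covered Arabic letters to Buckwalter; leave other characters as-is."""
    return text.translate(_AR2BW)


def strip_tashkeel(text: str) -> str:
    """Return text with the basic Arabic diacritics removed."""
    return _TASHKEEL.sub("", text)


if __name__ == "__main__":
    print(to_buckwalter("أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"))
    print(strip_tashkeel("حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ."))
```

Both helpers are single-pass character operations, which is why this kind of task is a natural fit for a compiled implementation such as the one anltk provides.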

For a list of features, see Features.md.

Benchmarks

Processing a file containing 2,499,995 lines and 563,522,705 characters. The task is to remove diacritics.

Reading the entire file into a string, then making a single call to remove_tashkeel:

Method                  Time (seconds)
anltk python-api        5.001
anltk cpp-api           3.507
python (camel_tools)    23.46

Processing the file line by line:

Method                  Time (seconds)
anltk python-api        7.636
anltk cpp-api           3.601
python (camel_tools)    22.37
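
The two measurement modes above can be reproduced with a small timing harness. The following is a sketch under stated assumptions, not the script used to produce the numbers: the function names are hypothetical, the timings here include file I/O (the original setup does not say whether it did), and strip_diacritics is a pure-Python stand-in so the example runs without anltk. With the package installed, anltk.remove_tashkeel could be passed as fn instead.

```python
import os
import re
import tempfile
import time
from typing import Callable

_DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_diacritics(text: str) -> str:
    """Stand-in workload (assumption): regex removal of basic diacritics."""
    return _DIACRITICS.sub("", text)


def time_whole_file(path: str, fn: Callable[[str], str]) -> float:
    """Read the entire file into one string, call fn once, return elapsed seconds."""
    start = time.perf_counter()
    with open(path, encoding="utf-8") as f:
        fn(f.read())
    return time.perf_counter() - start


def time_line_by_line(path: str, fn: Callable[[str], str]) -> float:
    """Call fn once per line of the file, return elapsed seconds."""
    start = time.perf_counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            fn(line)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Tiny hypothetical sample; the original benchmark file is not available.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", encoding="utf-8", delete=False
    ) as tmp:
        tmp.write("حُلْوَةٌ مُهَنْدَمَةٌ\n" * 1000)
    try:
        print(f"whole file:   {time_whole_file(tmp.name, strip_diacritics):.6f} s")
        print(f"line by line: {time_line_by_line(tmp.name, strip_diacritics):.6f} s")
    finally:
        os.remove(tmp.name)
```

The line-by-line mode makes one call per line, so any fixed per-call cost is paid millions of times on a file of the size quoted above. That is consistent with the python-api row slowing down between the two tables while the cpp-api row barely changes.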



