Arabic language processing toolkit

Arabic Natural Language Toolkit (ANLTK)

ANLTK is a set of Arabic natural language processing tools, developed with a focus on performance.

ANLTK is a C++ library with Python bindings.

Installation

For Python:

pip install pybind11
pip install anltk

Building

Note: Currently tested only on Linux.
The library depends on https://github.com/nemtrif/utfcpp.git, which is cloned automatically.
You also need a modern C++ compiler that supports C++17, and Meson and Ninja must be installed.
Both can be installed with pip:

pip install meson
pip install ninja
git clone --recurse-submodules https://github.com/Abdullah-AlAttar/anltk.git \
    && cd anltk/anltk \
    && meson build --buildtype=release -Dbuild_tests=false \
    && cd build \
    && ninja \
    && cd ../../ \
    && python3 setup.py install

Usage Examples:

C++ API

#include "anltk/anltk.hpp"
#include <iostream>
#include <string>

int main()
{

    std::string ar_text = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ";

    std::cout << anltk::transliterate(ar_text, anltk::CharMapping::AR2BW) << '\n';
    // >bjd hwz HTy klmn sEfS qr$t vx* DZg

    std::string text = "فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ.";

    std::cout << anltk::remove_tashkeel(text) << '\n';
    // فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.

    // The second parameter is a stop list; characters in this list won't be removed
    std::cout << anltk::remove_non_alpha(text, " ") << '\n';
    // فراشة ملونة تطير في البستان حلوة مهندمة تدهش الإنسان
}

Python API

import anltk


ar = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"
bw = anltk.transliterate(ar, anltk.AR2BW)
print(bw)
# >bjd hwz HTy klmn sEfS qr$t vx* DZg

print(anltk.remove_tashkeel("فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ."))

# فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
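
To make the AR2BW direction concrete, the sketch below reproduces the transliteration of the sample alphabet in pure Python. The table covers only the 28 letters that appear in the example, read off the output shown above. The names ARABIC, LATIN, AR2BW_SUBSET and to_buckwalter are hypothetical, and this is not how anltk implements the mapping.

```python
# Illustrative subset of the Arabic-to-Buckwalter mapping, limited to the
# letters in the sample string. Hypothetical helper, not part of anltk.
ARABIC = (
    "\u0623\u0628\u062C\u062F"  # alef-hamza, beh, jeem, dal
    "\u0647\u0648\u0632"        # heh, waw, zain
    "\u062D\u0637\u064A"        # hah, tah, yeh
    "\u0643\u0644\u0645\u0646"  # kaf, lam, meem, noon
    "\u0633\u0639\u0641\u0635"  # seen, ain, feh, sad
    "\u0642\u0631\u0634\u062A"  # qaf, reh, sheen, teh
    "\u062B\u062E\u0630"        # theh, khah, thal
    "\u0636\u0638\u063A"        # dad, zah, ghain
)
LATIN = ">bjd" "hwz" "HTy" "klmn" "sEfS" "qr$t" "vx*" "DZg"

# str.maketrans pairs the two equal-length strings character by character.
AR2BW_SUBSET = str.maketrans(ARABIC, LATIN)


def to_buckwalter(text: str) -> str:
    """Map each known Arabic letter to its Buckwalter symbol; leave the rest as-is."""
    return text.translate(AR2BW_SUBSET)


print(to_buckwalter(ARABIC))  # >bjdhwzHTyklmnsEfSqr$tvx*DZg
```

Characters outside the table, such as spaces and Latin text, pass through unchanged. A full mapping would also need hamza forms, teh marbuta and the diacritics.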

For a list of features, see Features.md.

Benchmarks

The test file contains 2,499,995 lines and 563,522,705 characters. The task is to remove diacritics.

Reading the entire file into a string, then making a single call to remove_tashkeel:

Method                  Time
anltk python-api        5.001 seconds
anltk cpp-api           3.507 seconds
python (camel_tools)    23.46 seconds

Processing the file line by line:

Method                  Time
anltk python-api        7.636 seconds
anltk cpp-api           3.601 seconds
python (camel_tools)    22.37 seconds
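
The two strategies measured above can be compared with a small timing harness. This is a sketch, not the script behind the published numbers. strip_tashkeel_py is a pure-Python stand-in that removes only the code points U+064B through U+0652, and time_whole and time_by_line are hypothetical helper names.

```python
import time

# Stand-in for a diacritics remover (assumption: covers U+064B..U+0652 only,
# i.e. fathatan through sukun). Swap in the real function when benchmarking.
_TASHKEEL = {cp: None for cp in range(0x064B, 0x0653)}


def strip_tashkeel_py(text: str) -> str:
    return text.translate(_TASHKEEL)


def time_whole(strip, text: str):
    """One call over the whole text; returns (result, seconds)."""
    start = time.perf_counter()
    result = strip(text)
    return result, time.perf_counter() - start


def time_by_line(strip, text: str):
    """One call per line; returns (result, seconds)."""
    start = time.perf_counter()
    result = "\n".join(strip(line) for line in text.split("\n"))
    return result, time.perf_counter() - start


# A small synthetic input: one diacritized word repeated on 1000 lines.
sample = "\u0641\u064E\u0631\u064E\u0627\u0634\u064E\u0629\u064C\n" * 1000
whole, t_whole = time_whole(strip_tashkeel_py, sample)
lines, t_lines = time_by_line(strip_tashkeel_py, sample)
print(f"whole: {t_whole:.4f}s  by line: {t_lines:.4f}s  same output: {whole == lines}")
```

With the package installed, passing anltk.remove_tashkeel as strip should exercise the Python binding in the same two ways. The per-call cost of crossing from Python into C++ is a plausible reason the line-by-line figure for the Python API is higher than the single-call one, though the source does not state the cause.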
