Arabic language processing toolkit

Arabic Natural Language Toolkit (ANLTK)

ANLTK is a set of Arabic natural language processing tools, developed with a focus on performance.

ANLTK is a C++ library with Python bindings.

Installation

For Python:

pip install pybind11
pip install anltk

Building

Note: Currently only tested on Linux.
The library depends on https://github.com/nemtrif/utfcpp.git, which is cloned automatically.
You also need a modern C++ compiler that supports C++17, as well as meson and ninja.
Both can be installed with pip:

pip install meson
pip install ninja
git clone --recurse-submodules https://github.com/Abdullah-AlAttar/anltk.git \
    && cd anltk/anltk \
    && meson build --buildtype=release -Dbuild_tests=false \
    && cd build \
    && ninja \
    && cd ../../ \
    && python3 setup.py install
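
Before starting the build, it can help to confirm that the required command-line tools are on PATH. The following is a minimal sketch and not part of ANLTK: `missing_tools` is a hypothetical helper name, and the default tool list (meson, ninja, git) is an assumption drawn from the commands above.

```python
import shutil


def missing_tools(tools=("meson", "ninja", "git")):
    """Return the names from `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing build tools:", ", ".join(missing))
    else:
        print("All build tools found.")
```

This only checks that the executables exist. It does not verify their versions or that the compiler supports C++17.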

Usage Examples

C++ API

#include "anltk/anltk.hpp"
#include <iostream>
#include <string>

int main()
{

    std::string ar_text = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ";

    std::cout << anltk::transliterate(ar_text, anltk::CharMapping::AR2BW) << '\n';
    // >bjd hwz HTy klmn sEfS qr$t vx* DZg

    std::string text = "فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ.";

    std::cout << anltk::remove_tashkeel(text) << '\n';
    // فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.

    // The second argument is a stop list: characters in this list won't be removed
    std::cout << anltk::remove_non_alpha(text, " ") << '\n';
    // فراشة ملونة تطير في البستان حلوة مهندمة تدهش الإنسان
}

Python API

import anltk


ar = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"
bw = anltk.transliterate(ar, anltk.AR2BW)
print(bw)
# >bjd hwz HTy klmn sEfS qr$t vx* DZg

print(anltk.remove_tashkeel("فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ."))

# فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
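
To make the AR2BW output above easier to read, here is a minimal pure-Python sketch of the standard Buckwalter mapping. It is an illustration only, not ANLTK's implementation: `AR2BW_TABLE` and `ar2bw` are hypothetical names, and the table covers only the basic letters (U+0621 to U+063A), tatweel, and the common diacritics (U+0640 to U+0652). Any other character is passed through unchanged.

```python
def _build_table():
    """Map contiguous Arabic code point runs to their Buckwalter symbols."""
    runs = (
        # U+0621 (hamza) through U+063A (ghain)
        (0x0621, "'|>&<}AbptvjHxd*rzs$SDTZEg"),
        # U+0640 (tatweel) through U+0652 (sukun)
        (0x0640, "_fqklmnhwYyFNKaui~o"),
    )
    table = {}
    for start, latin in runs:
        for offset, symbol in enumerate(latin):
            table[start + offset] = symbol
    return table


AR2BW_TABLE = _build_table()


def ar2bw(text):
    """Transliterate Arabic text to Buckwalter; unmapped characters are kept."""
    return text.translate(AR2BW_TABLE)


if __name__ == "__main__":
    print(ar2bw("أبجد هوز"))
    # >bjd hwz
```

A table-driven `str.translate` keeps the mapping in one place, so the reverse direction could be built by inverting the same dictionary.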

For a list of features, see Features.md.

Benchmarks

Processing a file containing 2,499,995 lines (563,522,705 characters). The task is to remove diacritics.

Reading the entire file into a string, then making a single call to remove_tashkeel:

Method                  Time
anltk python-api        5.001 seconds
anltk cpp-api           3.507 seconds
python (camel_tools)    23.46 seconds

Processing the file line by line:

Method                  Time
anltk python-api        7.636 seconds
anltk cpp-api           3.601 seconds
python (camel_tools)    22.37 seconds
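
The two measurement modes above (one call on the whole text versus one call per line) can be sketched with a small timing harness. This is an illustration under stated assumptions, not the script that produced the numbers: `strip_tashkeel` is a hypothetical pure-Python stand-in that deletes code points U+064B through U+0652, and `time_whole` and `time_by_line` are illustrative helper names. To time ANLTK itself, one would pass `anltk.remove_tashkeel` as `fn`.

```python
import time

# Assumed diacritic set: U+064B (fathatan) through U+0652 (sukun).
# Mapping a code point to None makes str.translate delete it.
_TASHKEEL = dict.fromkeys(range(0x064B, 0x0653))


def strip_tashkeel(text):
    """Pure-Python stand-in for a diacritic remover (not ANLTK's code)."""
    return text.translate(_TASHKEEL)


def time_whole(text, fn):
    """Time a single call of `fn` on the whole text, in seconds."""
    start = time.perf_counter()
    fn(text)
    return time.perf_counter() - start


def time_by_line(text, fn):
    """Time one call of `fn` per line of the text, in seconds."""
    lines = text.splitlines()
    start = time.perf_counter()
    for line in lines:
        fn(line)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Small synthetic sample: one diacritized word repeated on many lines.
    sample = "\u0641\u064e\u0631\u064e\u0627\u0634\u064e\u0629\u064c\n" * 100_000
    print(f"whole text:   {time_whole(sample, strip_tashkeel):.3f} seconds")
    print(f"line by line: {time_by_line(sample, strip_tashkeel):.3f} seconds")
```

The harness excludes file reading and line splitting from the timed region, so its results are not directly comparable to the table unless the original measurements were taken the same way.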
