Arabic language processing toolkit

Arabic Natural Language Toolkit (ANLTK)

ANLTK is a set of Arabic natural language processing tools, developed with a focus on performance.

ANLTK is a C++ library with Python bindings.

Installation

For Python:

pip install pybind11
pip install anltk
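
After running the two pip commands above, a quick way to confirm that both packages are visible to the interpreter is to ask the import system for them. This is only a minimal sketch: the helper name `is_installed` is hypothetical and is not part of ANLTK.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the import system can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Module names taken from the pip commands above.
    for name in ("pybind11", "anltk"):
        status = "found" if is_installed(name) else "missing"
        print(f"{name}: {status}")
```

The check only locates the modules; it does not import them, so it will not detect a broken binary extension.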

Building

Note: Currently tested only on Linux.
The library depends on utfcpp (https://github.com/nemtrif/utfcpp.git), which is cloned automatically.
You also need a modern C++ compiler that supports C++17. Meson and Ninja must also be installed, which can be done with pip:

pip install meson
pip install ninja
git clone --recurse-submodules https://github.com/Abdullah-AlAttar/anltk.git \
    && cd anltk/ \
    && meson build --buildtype=release -Dbuild_tests=false \
    && cd build \
    && ninja \
    && cd ../ \
    && python3 setup.py install
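
Before running the build commands above, it can help to check that the command-line tools they call are on your PATH. The following is a small sketch, not part of ANLTK. The helper name `missing_tools` is hypothetical, and the C++17 compiler is not checked because the compiler name varies by system.

```python
import shutil


def missing_tools(tools):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    # Executables invoked by the build commands above.
    missing = missing_tools(["git", "meson", "ninja", "python3"])
    if missing:
        print("Missing build tools:", ", ".join(missing))
    else:
        print("All build tools found.")
```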

Usage Examples:

C++ API:

#include "anltk/anltk.hpp"
#include <iostream>
#include <string>

int main()
{

    std::string ar_text = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ";

    std::cout << anltk::transliterate(ar_text, anltk::CharMapping::AR2BW) << '\n';
    // >bjd hwz HTy klmn sEfS qr$t vx* DZg

    std::string text = "فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ.";

    std::cout << anltk::remove_tashkeel(text) << '\n';
    // فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.

    // The second parameter is a stop list; characters in this list won't be removed
    std::cout << anltk::remove_non_alpha(text, " ") << '\n';
    // فراشة ملونة تطير في البستان حلوة مهندمة تدهش الإنسان
}

Python API:

import anltk


ar = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"
bw = anltk.transliterate(ar, anltk.AR2BW)
print(bw)
# >bjd hwz HTy klmn sEfS qr$t vx* DZg

print(anltk.remove_tashkeel("فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ."))

# فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
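
To make the two operations above concrete, here is a pure-Python sketch of what they do conceptually. It is not ANLTK's implementation and does not use the library. The names `transliterate_sketch` and `remove_tashkeel_sketch` are hypothetical, the Buckwalter table is partial (only the 28 letters in the example string), and the diacritic set is assumed to be the basic range U+064B to U+0652.

```python
# Partial Arabic-to-Buckwalter table: only the letters from the example above.
ARABIC = "أبجدهوزحطيكلمنسعفصقرشتثخذضظغ"
LATIN = ">bjdhwzHTyklmnsEfSqr$tvx*DZg"
AR2BW_PARTIAL = str.maketrans(ARABIC, LATIN)

# Basic Arabic diacritics (tashkeel): fathatan U+064B through sukun U+0652.
# Assumption: a full implementation may cover additional marks.
TASHKEEL = {chr(cp) for cp in range(0x064B, 0x0653)}


def transliterate_sketch(text: str) -> str:
    """Map each known Arabic letter to its Buckwalter symbol; leave the rest as-is."""
    return text.translate(AR2BW_PARTIAL)


def remove_tashkeel_sketch(text: str) -> str:
    """Drop every character that is in the basic diacritic set."""
    return "".join(ch for ch in text if ch not in TASHKEEL)


if __name__ == "__main__":
    print(transliterate_sketch("أبجد هوز"))
    print(remove_tashkeel_sketch("فَرَاشَةٌ مُلَوَّنَةٌ"))
```

Both operations are single passes over the characters with a table lookup, which is part of why a compiled implementation can run them quickly over large files.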

For a list of features, see Features.md.

Benchmarks

Processing a file containing 500,000 lines, 6,787,731 words, and 112,704,541 characters. The task is to remove diacritics or transliterate to Buckwalter.

Buckwalter transliteration

Method                 Time
anltk python-api       1.379 seconds
python (camel_tools)   11.46 seconds

Remove Diacritics

Method                 Time
anltk python-api       0.989 seconds
python (camel_tools)   4.892 seconds
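
The timings above work out to a speedup of roughly 8.3x for transliteration and 4.9x for diacritic removal. The sketch below shows one way such a comparison could be timed and the ratios computed. It is an assumption about the method, since the benchmark script is not shown here: the helper names `time_call` and `speedup` are hypothetical, and a stand-in function is used so the example runs without ANLTK or camel_tools installed.

```python
import time


def time_call(func, lines, repeats=3):
    """Return the best wall-clock time in seconds for applying func to every line."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in lines:
            func(line)
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


if __name__ == "__main__":
    # Ratios computed from the reported timings.
    print(f"Buckwalter transliteration: {speedup(11.46, 1.379):.1f}x")
    print(f"Remove diacritics: {speedup(4.892, 0.989):.1f}x")

    # Stand-in workload; replace str.upper with the function under test.
    sample = ["placeholder line"] * 1000
    print(f"stand-in timing: {time_call(str.upper, sample):.6f} seconds")
```

Taking the best of several repeats reduces noise from other processes. Reading the input file outside the timed region keeps disk I/O out of the measurement.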

