Arabic language processing toolkit

Arabic Natural Language Toolkit (ANLTK)

ANLTK is a set of Arabic natural language processing tools, developed with a focus on performance.

ANLTK is a C++ library with Python bindings.

Installation

For Python:

pip install pybind11
pip install anltk

Building

Note: Currently only tested on Linux.
The library depends on https://github.com/nemtrif/utfcpp.git, which is cloned automatically.
You also need a modern C++ compiler that supports C++17, and Meson and Ninja must be installed.
Both can be installed with pip:

pip install meson
pip install ninja
git clone --recurse-submodules https://github.com/Abdullah-AlAttar/anltk.git \
    && cd anltk/ \
    && meson build --buildtype=release -Dbuild_tests=false \
    && cd build \
    && ninja \
    && cd ../ \
    && python3 setup.py install

Usage Examples:

C++ API:

#include "anltk/anltk.hpp"
#include <iostream>
#include <string>

int main()
{

    std::string ar_text = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ";

    std::cout << anltk::transliterate(ar_text, anltk::CharMapping::AR2BW) << '\n';
    // >bjd hwz HTy klmn sEfS qr$t vx* DZg

    std::string text = "فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ.";

    std::cout << anltk::remove_tashkeel(text) << '\n';
    // فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.

    // The second parameter is a stop list; characters in this list won't be removed
    std::cout << anltk::remove_non_alpha(text, " ") << '\n';
    // فراشة ملونة تطير في البستان حلوة مهندمة تدهش الإنسان
}

Python API:

import anltk


ar = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"
bw = anltk.transliterate(ar, anltk.AR2BW)
print(bw)
# >bjd hwz HTy klmn sEfS qr$t vx* DZg

print(anltk.remove_tashkeel("فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ."))

# فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
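
To make the two operations above concrete without requiring the library, here is a minimal pure-Python sketch of what they compute. It is an illustration only and not ANLTK's implementation: the names `AR2BW_SUBSET`, `transliterate_sketch`, and `remove_tashkeel_sketch` are hypothetical, the mapping table is a small assumed subset of the Buckwalter scheme, and the diacritic range (U+064B to U+0652) is an assumption about which marks count as tashkeel.

```python
# Illustrative only: a pure-Python approximation, not ANLTK's implementation.

# Partial Arabic -> Buckwalter table (assumed subset, enough for the sample).
AR2BW_SUBSET = {
    "\u0623": ">",  # ALEF WITH HAMZA ABOVE
    "\u0628": "b",  # BEH
    "\u062C": "j",  # JEEM
    "\u062F": "d",  # DAL
    "\u0647": "h",  # HEH
    "\u0648": "w",  # WAW
    "\u0632": "z",  # ZAIN
}

# Assumed tashkeel set: FATHATAN (U+064B) through SUKUN (U+0652).
TASHKEEL = {chr(cp) for cp in range(0x064B, 0x0653)}


def transliterate_sketch(text, table=AR2BW_SUBSET):
    """Map each character through the table; unmapped characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)


def remove_tashkeel_sketch(text):
    """Drop every character that is in the assumed tashkeel set."""
    return "".join(ch for ch in text if ch not in TASHKEEL)


print(transliterate_sketch("أبجد هوز"))
# >bjd hwz

print(remove_tashkeel_sketch("مُلَوَّنَةٌ"))
# ملونة
```

Both functions are single passes over the string, which is the same shape of work the library performs in C++; the speed difference reported in the benchmarks presumably comes from where that loop runs rather than from a different algorithm.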

For a list of features, see Features.md.

Benchmarks

The benchmark processes a file containing 500,000 lines, 6,787,731 words, and 112,704,541 characters. The tasks are removing diacritics and transliterating to Buckwalter.

Buckwalter transliteration

Method                  Time
anltk python-api        1.379 seconds
python (camel_tools)    11.46 seconds

Remove Diacritics

Method                  Time
anltk python-api        0.989 seconds
python (camel_tools)    4.892 seconds
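
The README does not say how these timings were taken. As a rough sketch of how a similar comparison could be reproduced, the harness below times a function applied to every line and keeps the best of several runs. Everything here is an assumption for illustration: `time_over_lines` and `strip_diacritics` are hypothetical names, the workload is a pure-Python stand-in, and the sample data is synthetic rather than the benchmark file.

```python
import time


def time_over_lines(func, lines, repeats=3):
    """Return the best wall-clock time in seconds of applying func to every line."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in lines:
            func(line)
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in workload (assumption): strip Arabic diacritics in pure Python.
TASHKEEL = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_diacritics(line):
    return "".join(ch for ch in line if ch not in TASHKEEL)


# Synthetic input; a real run would read the lines of the benchmark file.
sample_lines = ["مُلَوَّنَةٌ"] * 10_000
elapsed = time_over_lines(strip_diacritics, sample_lines)
print(f"{len(sample_lines)} lines in {elapsed:.3f} seconds")
```

To compare libraries, pass each candidate (for example `anltk.remove_tashkeel` from the usage example above) as `func` over the same list of lines, so that file reading is excluded from the measurement.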
