Skip to main content

Arabic language processing toolkit

Project description

example workflow

Arabic Natural Language Toolkit (ANLTK)

ANLTK is a set of Arabic natural language processing tools. developed with focus on performance.

ANLTK is a C++ library, with python bindings.

Installation

for python :

pip install pybind11
pip install anltk

Building

Note: Currently only tested on Linux
The Library depends on https://github.com/nemtrif/utfcpp.git, which is cloned automatically.
you also need a modern C++ Compiler, which supports C++17 also meson and ninja needs to be installed.
simply with pip

pip install meson
pip install ninja
git clone --recurse-submodules https://github.com/Abdullah-AlAttar/anltk.git \
    && cd anltk/ \
    && meson build --buildtype=release -Dbuild_tests=false \
    && cd build \
    && ninja \
    && cd ../ \
    && python3 setup.py install

Usage Examples:

C++ API :

#include "anltk/anltk.hpp"
#include <iostream>
#include <string>

int main()
{

    std::string ar_text = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ";

    std::cout << anltk::transliterate(ar_text, anltk::CharMapping::AR2BW) << '\n';
    // >bjd hwz HTy klmn sEfS qr$t vx* DZg

    std::string text = "فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ.";

    std::cout << anltk::remove_tashkeel(text) << '\n';
    // فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.

    // Third paramters is a stop_list, charactres in this list won't be removed
    std::cout << anltk::remove_non_alpha(text, " ") << '\n';
    // فراشة ملونة تطير في البستان حلوة مهندمة تدهش الإنسان
}

Python API

import anltk


ar = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"
bw = anltk.transliterate(ar, anltk.AR2BW)
print(bw)
# >bjd hwz HTy klmn sEfS qr$t vx* DZg

print(anltk.remove_tashkeel("فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ."))

# فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.

For list of features see Features.md

Benchmarks

Processing a file containing 500000 Line, 6787731 Word, 112704541 Character. the task is to remove diacritics / transliterate to buckwalter

Buckwatler transliteration

Method Time
anltk python-api 1.379 seconds
python (camel_tools) 11.46 seconds

Remove Diacritics

Method Time
anltk python-api 0.989 seconds
python (camel_tools) 4.892 seconds

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

anltk-0.4.6-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (207.3 kB view details)

Uploaded Python 3 manylinux: glibc 2.12+ x86-64

File details

Details for the file anltk-0.4.6-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for anltk-0.4.6-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 52184673a188c021f92e37e1679ef78f07af3de8a7ad95df1c14f2afc49c2d89
MD5 439d57b4c51bddfccaa65b3d7217b7da
BLAKE2b-256 510399ec386a484f7e83026b45340dea2aafa98f27850ff20427e2d0b8866b2e

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page