Arabic language processing toolkit
Arabic Natural Language Toolkit (ANLTK)
ANLTK is a set of Arabic natural language processing tools, developed with a focus on performance.
ANLTK is a C++ library with Python bindings.
Installation
For Python:
pip install anltk
Building
Note: Building from source is currently tested only on Linux. Prebuilt Python wheels for Linux, Windows, and macOS are available on PyPI.
Dependencies:
- utfcpp, downloaded automatically.
- utf8proc, downloaded automatically.
- A C++ compiler that supports C++17.
- Python 3, meson, ninja
pip install meson
pip install ninja
git clone https://github.com/Abdullah-AlAttar/anltk.git \
&& cd anltk/ \
&& meson build --buildtype=release -Dbuild_tests=false \
&& cd build \
&& ninja \
&& cd ../ \
&& python3 setup.py install
Usage Examples:
C++ API:
#include "anltk/anltk.hpp"
#include <iostream>
#include <string>
int main()
{
    std::string ar_text = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ";
    std::cout << anltk::transliterate(ar_text, anltk::CharMapping::AR2BW) << '\n';
    // >bjd hwz HTy klmn sEfS qr$t vx* DZg

    std::string text = "فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ.";
    std::cout << anltk::remove_tashkeel(text) << '\n';
    // فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.

    // The argument after the text is a stop list; characters in this list won't be removed
    std::cout << anltk::remove_non_alpha(text, " ") << '\n';
    // فراشة ملونة تطير في البستان حلوة مهندمة تدهش الإنسان

    anltk::TafqitOptions opts;
    std::cout << anltk::tafqit(15000120, opts) << '\n';
    // خمسة عشر مليونًا ومائة وعشرون
}
Python API:
import anltk
ar = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"
bw = anltk.transliterate(ar, anltk.AR2BW)
print(bw)
# >bjd hwz HTy klmn sEfS qr$t vx* DZg
print(anltk.remove_tashkeel("فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ."))
# فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.
print(anltk.tafqit(15000120))
# خمسة عشر مليونًا ومائة وعشرون
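To cross-check transliteration output without the compiled library, the standard Buckwalter mapping can be sketched in plain Python. This is only an illustrative reference, not ANLTK's implementation: the helper names `build_buckwalter_table` and `to_buckwalter` are hypothetical, and the table covers only the basic Buckwalter code points.

```python
def build_buckwalter_table():
    """Build a str.translate table for Arabic -> Buckwalter (hypothetical helper)."""
    table = {}
    # U+0621 (hamza) through U+063A (ghain), in code point order.
    for offset, latin in enumerate("'|>&<}AbptvjHxd*rzs$SDTZEg"):
        table[0x0621 + offset] = latin
    # U+0640 (tatweel) through U+0652 (sukun), in code point order.
    for offset, latin in enumerate("_fqklmnhwYyFNKaui~o"):
        table[0x0640 + offset] = latin
    table[0x0670] = "`"  # superscript alef
    table[0x0671] = "{"  # alef wasla
    return table


_AR2BW = build_buckwalter_table()


def to_buckwalter(text):
    """Transliterate Arabic letters and diacritics; other characters pass through."""
    return text.translate(_AR2BW)


# U+0643 U+0644 U+0645 U+0646 spell the fourth word of the sample sentence.
print(to_buckwalter("\u0643\u0644\u0645\u0646"))  # prints: klmn
```

A table-driven `str.translate` keeps the sketch short, but it runs in the Python interpreter and is not expected to match the speed of the C++ library.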
For a list of features, see Features.md.
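The diacritic removal shown in the usage examples can be approximated in plain Python for comparison. The sketch below assumes that "tashkeel" means the combining marks U+064B through U+0652; ANLTK may handle a different set of marks, and `strip_tashkeel` is a hypothetical name, not part of the library.

```python
# Assumption: tashkeel = U+064B (fathatan) through U+0652 (sukun).
# ANLTK's remove_tashkeel may cover a different set of marks.
TASHKEEL = {chr(cp) for cp in range(0x064B, 0x0652 + 1)}


def strip_tashkeel(text):
    """Return text with the assumed diacritic code points removed."""
    return "".join(ch for ch in text if ch not in TASHKEEL)


# First word of the sample sentence, written with explicit escapes:
# feh, fatha, reh, fatha, alef, sheen, fatha, teh marbuta, dammatan.
word = "\u0641\u064E\u0631\u064E\u0627\u0634\u064E\u0629\u064C"
print(strip_tashkeel(word) == "\u0641\u0631\u0627\u0634\u0629")  # prints: True
```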
Benchmarks
The benchmarks process a file containing 500,000 lines, 6,787,731 words, and 112,704,541 characters. The tasks are to remove diacritics and to transliterate to Buckwalter.
Buckwalter transliteration
Method | Time |
---|---|
anltk python-api | 1.379 seconds |
python camel_tools | 11.46 seconds |
Remove Diacritics
Method | Time |
---|---|
anltk python-api | 0.989 seconds |
python camel_tools | 4.892 seconds |
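The README does not say how these timings were measured. One way to run a comparable measurement is a small harness that applies a function to every line and keeps the best wall-clock time; the function name `time_over_lines`, the repeat count, and the sample data below are all assumptions for illustration.

```python
import time


def time_over_lines(func, lines, repeats=3):
    """Return the best wall-clock time (seconds) to apply func to every line."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in lines:
            func(line)
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in workload so the sketch runs without extra packages; swap in
# anltk.remove_tashkeel (or another function) once the library is installed.
sample_lines = ["\u0641\u064E\u0631\u064E\u0627\u0634\u064E\u0629\u064C"] * 10_000
elapsed = time_over_lines(str.strip, sample_lines)
print(f"{len(sample_lines)} lines in {elapsed:.4f} seconds")
```

Taking the minimum over a few repeats reduces noise from other processes; absolute numbers will differ by machine and input file.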