AraNizer: Custom Tokenizers for Enhanced Arabic Language Processing

Project description

AraNizer

Description

AraNizer is a sophisticated toolkit of custom tokenizers tailored for Arabic language processing. Integrating advanced methodologies such as SentencePiece and Byte Pair Encoding (BPE), these tokenizers are specifically designed for seamless integration with the transformers and sentence_transformers libraries. The AraNizer suite offers a range of tokenizers, each optimized for distinct NLP tasks and accommodating varying vocabulary sizes to cater to a multitude of linguistic applications.

Key Features

  • Versatile Tokenization: Supports multiple tokenization strategies (BPE, SentencePiece) for varied NLP tasks.
  • Broad Vocabulary Range: Customizable tokenizers with vocabulary sizes ranging from 32k to 86k.
  • Seamless Integration: Compatible with popular libraries like transformers and sentence_transformers.
  • Optimized for Arabic: Specifically engineered for the intricacies of the Arabic language.

Installation

Install AraNizer effortlessly with pip:

pip install aranizer

Usage

Importing Tokenizers

Import your desired tokenizer from AraNizer. Available tokenizers include:

  • BPE variants: aranizer_bpe32k, aranizer_bpe50k, aranizer_bpe64k, aranizer_bpe86k
  • SentencePiece variants: aranizer_sp32k, aranizer_sp50k, aranizer_sp64k, aranizer_sp86k

from aranizer import aranizer_sp32k  # Replace with your chosen tokenizer

# Load your tokenizer:
tokenizer = aranizer_sp32k.get_tokenizer()
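
Because get_tokenizer() is designed for integration with the transformers library, the returned object can be used like any Hugging Face tokenizer. A minimal sketch, assuming it implements the standard PreTrainedTokenizer call interface (and that PyTorch is installed for return_tensors="pt"):

batch = tokenizer(
    ["مثال على النص العربي", "جملة عربية أخرى"],  # sample Arabic sentences
    padding=True,          # pad to the longest sequence in the batch
    truncation=True,       # truncate overly long inputs
    return_tensors="pt",   # return PyTorch tensors
)
print(batch["input_ids"].shape)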

Tokenizing Text

Tokenize Arabic text using the selected tokenizer:

text = "مثال على النص العربي"  # Example Arabic text
tokens = tokenizer.tokenize(text)
print(tokens)

Encoding and Decoding

Encode text into token ids and decode back to text.

Encoding: To encode text, use the encode method.

text = "مثال على النص العربي"  # Example Arabic text
encoded_output = tokenizer.encode(text, add_special_tokens=True)
print(encoded_output)

Decoding: To convert token ids back to text, use the decode method:

decoded_text = tokenizer.decode(encoded_output)
print(decoded_text)
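
Because encode was called with add_special_tokens=True, the decoded string may contain special tokens. Assuming the tokenizer follows the Hugging Face decode interface, they can be stripped on decode; a minimal sketch:

clean_text = tokenizer.decode(encoded_output, skip_special_tokens=True)  # drop special tokens
print(clean_text)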

Available Tokenizers

- aranizer_bpe32k: BPE tokenizer with a 32k vocabulary
- aranizer_bpe50k: BPE tokenizer with a 50k vocabulary
- aranizer_bpe64k: BPE tokenizer with a 64k vocabulary
- aranizer_bpe86k: BPE tokenizer with an 86k vocabulary
- aranizer_sp32k: SentencePiece tokenizer with a 32k vocabulary
- aranizer_sp50k: SentencePiece tokenizer with a 50k vocabulary
- aranizer_sp64k: SentencePiece tokenizer with a 64k vocabulary
- aranizer_sp86k: SentencePiece tokenizer with an 86k vocabulary
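
In general, a larger vocabulary segments text into fewer, coarser tokens at the cost of a bigger embedding table. A minimal sketch comparing two variants (hypothetical comparison code, assuming each module exposes get_tokenizer() as shown above):

from aranizer import aranizer_sp32k, aranizer_sp86k

text = "مثال على النص العربي"  # example Arabic text
for module in (aranizer_sp32k, aranizer_sp86k):
    tok = module.get_tokenizer()
    # Fewer tokens usually indicates coarser, more word-level segmentation.
    print(module.__name__, len(tok.tokenize(text)))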

System Requirements

  • Python 3.x
  • transformers library

Contact:

For queries or assistance, please contact riotu@psu.edu.sa.

Acknowledgments:

This work is maintained by the Robotics and Internet-of-Things Lab at Prince Sultan University.

Team:

  • Prof. Anis Koubaa (Lab Leader)
  • Dr. Lahouari Ghouti (NLP Team Leader)
  • Eng. Omar Najjar (AI Research Assistant)
  • Eng. Serry Sebai (NLP Research Engineer)

Version:

0.2.3

Citations:

Coming soon


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

aranizer-0.2.3.tar.gz (5.8 MB)

Built Distribution

aranizer-0.2.3-py3-none-any.whl (6.2 MB)

File details

Details for the file aranizer-0.2.3.tar.gz.

File metadata

  • Download URL: aranizer-0.2.3.tar.gz
  • Upload date:
  • Size: 5.8 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.5

File hashes

Hashes for aranizer-0.2.3.tar.gz:

  • SHA256: fc6a6c5a6fabb6e78cd81b8fcc55b0a2e71c9af16b724250e3f9ab301fe5f24f
  • MD5: f71a3d6ebd0b447a6fcf0b644d3cfe7f
  • BLAKE2b-256: b4f09b69c09172e2b1c7ea1633c05249b1d8b59133ca793b46e81eff8b8cd3f6

File details

Details for the file aranizer-0.2.3-py3-none-any.whl.

File metadata

  • Download URL: aranizer-0.2.3-py3-none-any.whl
  • Upload date:
  • Size: 6.2 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.5

File hashes

Hashes for aranizer-0.2.3-py3-none-any.whl:

  • SHA256: 3cb65e30fb080154370dfe3f12bc146556a0b9058fc2d584142d212cc4f293ec
  • MD5: 1dfacb8891892f5c645025e45f012212
  • BLAKE2b-256: 12860e6bd06f2bf9f1d602211454ac7c5f139cc9d5f62bed9033b07d625cf435

