Aranizer: A Custom Tokenizer for Enhanced Arabic Language Processing

Project description

AraNizer

Description

AraNizer is a collection of custom tokenizers designed specifically for Arabic language processing. Built with SentencePiece and Byte Pair Encoding (BPE) methodologies, these tokenizers are compatible with the transformers and sentence_transformers libraries. Each tokenizer in the AraNizer collection is optimized for different NLP tasks and offers a range of vocabulary sizes to suit various linguistic scenarios.

Installation

Install AraNizer effortlessly with pip:

pip install aranizer

Usage

Start by importing the desired tokenizer from AraNizer:

from aranizer import aranizer_sp50k
# Other available tokenizers: aranizer_bpe32k, aranizer_bpe50k, aranizer_bpe64k, aranizer_bpe86k, aranizer_sp32k, aranizer_sp64k, aranizer_sp86k

Load your tokenizer:

tokenizer = aranizer_sp50k.get_tokenizer()  # Replace aranizer_sp50k with your chosen tokenizer

Tokenizing Text: To split text into subword tokens, use the tokenize method:

text = "مثال على النص العربي"  # Example Arabic text
tokens = tokenizer.tokenize(text)
print(tokens)

Encoding Text: To encode text, use the encode method. This converts a text string into a sequence of token ids:

text = "مثال على النص العربي"  # Example Arabic text
encoded_output = tokenizer.encode(text, add_special_tokens=True)
print(encoded_output)
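
Whether you want the special tokens depends on the downstream model. A minimal sketch comparing both options, assuming the tokenizer follows the standard transformers encode signature:

text = "مثال على النص العربي"  # Example Arabic text
ids_with_special = tokenizer.encode(text, add_special_tokens=True)   # Includes special tokens
ids_plain = tokenizer.encode(text, add_special_tokens=False)         # Raw subword IDs only
print(len(ids_with_special), len(ids_plain))  # The first list is longer when special tokens are added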

Decoding Text: To convert token ids back to text, use the decode method:

decoded_text = tokenizer.decode(encoded_output)
print(decoded_text)
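
Putting the steps together, a small end-to-end sketch that uses only the tokenize, encode, and decode calls shown above:

from aranizer import aranizer_sp50k

tokenizer = aranizer_sp50k.get_tokenizer()

text = "مثال على النص العربي"  # Example Arabic text
tokens = tokenizer.tokenize(text)                      # Subword tokens
ids = tokenizer.encode(text, add_special_tokens=True)  # Token IDs
decoded = tokenizer.decode(ids)                        # Back to text (may include special tokens)

print(tokens)
print(ids)
print(decoded)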

AraNizers

- aranizer_bpe32k: BPE tokenizer with a vocabulary size of 32k
- aranizer_bpe50k: BPE tokenizer with a vocabulary size of 50k
- aranizer_bpe64k: BPE tokenizer with a vocabulary size of 64k
- aranizer_bpe86k: BPE tokenizer with a vocabulary size of 86k
- aranizer_sp32k: SentencePiece tokenizer with a vocabulary size of 32k
- aranizer_sp50k: SentencePiece tokenizer with a vocabulary size of 50k
- aranizer_sp64k: SentencePiece tokenizer with a vocabulary size of 64k
- aranizer_sp86k: SentencePiece tokenizer with a vocabulary size of 86k
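
Larger vocabularies typically produce fewer tokens per sentence at the cost of a larger embedding table. A rough sketch for comparing variants (the module names and get_tokenizer call are as documented above; the token-count comparison is only illustrative):

from aranizer import aranizer_sp32k, aranizer_sp86k

text = "مثال على النص العربي"  # Example Arabic text

for name, module in [("sp32k", aranizer_sp32k), ("sp86k", aranizer_sp86k)]:
    tok = module.get_tokenizer()
    print(name, len(tok.tokenize(text)))  # Fewer tokens usually indicates broader subword coverage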

Requirements:

  • transformers

Contact:

For queries or assistance, please contact onajar@psu.edu.sa.

Acknowledgments:

Special thanks to Prince Sultan University and Riotu Lab, under the guidance of Dr. Lahouari Ghouti and Dr. Anis Koubaa, for their invaluable support.

Version:

0.1.8

Citations:

If AraNizer benefits your research, please cite us:

@misc{AraNizer_2023,
  title={Aranizer: A Custom Tokenizer for Enhanced Arabic Language Processing},
  author={Najar, Omar and Sibaee, Serry and Ghouti, Lahouari and Koubaa, Anis},
  affiliation={Prince Sultan University, Riyadh, Saudi Arabia},
  year={2023},
  howpublished={\url{https://github.com/omarnj-lab/aranizer}}
}
Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

aranizer-0.1.8.tar.gz (5.7 MB)

Uploaded Source

File details

Details for the file aranizer-0.1.8.tar.gz.

File metadata

  • Download URL: aranizer-0.1.8.tar.gz
  • Upload date:
  • Size: 5.7 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.17

File hashes

Hashes for aranizer-0.1.8.tar.gz:

  • SHA256: f75e925e58a826e540907db7955ba226bdc6373b05ce06c99af3428d1d79b35a
  • MD5: 15f9fb9df81e881c2e68b07cac71b2c0
  • BLAKE2b-256: 04073829c34cb6510b088fcf0e1852ee11b74f0dc1ceed0a16dbb95aff050a05