Aranizer: A Custom Tokenizer for Enhanced Arabic Language Processing

AraNizer

Description

AraNizer is a collection of custom tokenizers designed specifically for Arabic language processing. Built with SentencePiece and Byte Pair Encoding (BPE), the tokenizers are compatible with the transformers and sentence_transformers libraries. The collection covers a range of vocabulary sizes (32k to 86k), so you can choose the variant that best suits your NLP task.

Installation

Install AraNizer effortlessly with pip:

pip install aranizer
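
To confirm the installation, you can try importing the package from the command line. This is a minimal check, assuming the import name matches the pip package name (aranizer):

python -c "import aranizer; print('aranizer imported successfully')"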

Usage

Start by importing the desired tokenizer from AraNizer:

from aranizer import aranizer_bpe32k
# Other available tokenizers: aranizer_bpe50k, aranizer_bpe64k, aranizer_bpe86k, aranizer_sp32k, aranizer_sp50k, aranizer_sp64k, aranizer_sp86k

Load your tokenizer:

tokenizer = aranizer_bpe32k.get_tokenizer()  # Replace aranizer_bpe32k with your chosen tokenizer

Example of tokenizing a text:

text = "مثال على النص العربي"  # Example Arabic text
tokens = tokenizer.tokenize(text)
print(tokens)

Encoding Text: To encode text, use the encode method. This converts a text string into a sequence of token ids:

text = "مثال على النص العربي"  # Example Arabic text
encoded_output = tokenizer.encode(text, add_special_tokens=True)
print(encoded_output)

Decoding Text: To convert token ids back to text, use the decode method:

decoded_text = tokenizer.decode(encoded_output)
print(decoded_text)
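
Batch Encoding: Since get_tokenizer returns a transformers-compatible tokenizer, the standard callable interface should also work for encoding several texts at once. This is a minimal sketch, assuming the returned object follows the usual transformers tokenizer API (padding, truncation, and return_tensors are standard transformers arguments, not AraNizer-specific):

texts = ["مثال على النص العربي", "جملة عربية أخرى"]  # a small batch of Arabic sentences
# return_tensors="pt" requires PyTorch; drop it to get plain Python lists instead
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
print(batch["input_ids"].shape)   # (batch_size, max_sequence_length)
print(batch["attention_mask"])    # 1 for real tokens, 0 for padding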

AraNizers

- aranizer_bpe32k: BPE tokenizer with a vocabulary size of 32k
- aranizer_bpe50k: BPE tokenizer with a vocabulary size of 50k
- aranizer_bpe64k: BPE tokenizer with a vocabulary size of 64k
- aranizer_bpe86k: BPE tokenizer with a vocabulary size of 86k
- aranizer_sp32k: SentencePiece tokenizer with a vocabulary size of 32k
- aranizer_sp50k: SentencePiece tokenizer with a vocabulary size of 50k
- aranizer_sp64k: SentencePiece tokenizer with a vocabulary size of 64k
- aranizer_sp86k: SentencePiece tokenizer with a vocabulary size of 86k
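
If you are unsure which variant to use, a quick way to compare them is to tokenize a sample of your own text with each one and look at the token counts (shorter sequences generally mean the vocabulary compresses your domain better). This is a minimal sketch, assuming each module exposes the same get_tokenizer() function shown above:

from aranizer import aranizer_bpe32k, aranizer_bpe64k, aranizer_sp32k, aranizer_sp64k

sample = "مثال على النص العربي"  # replace with text from your own corpus
variants = {
    "bpe32k": aranizer_bpe32k,
    "bpe64k": aranizer_bpe64k,
    "sp32k": aranizer_sp32k,
    "sp64k": aranizer_sp64k,
}

for name, module in variants.items():
    tokenizer = module.get_tokenizer()
    tokens = tokenizer.tokenize(sample)
    print(f"{name}: {len(tokens)} tokens")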

Requirements:

  • transformers

Contact:

For queries or assistance, please contact onajar@psu.edu.sa.

Acknowledgments:

Special thanks to Prince Sultan University and Riotu Lab, under the guidance of Dr. Lahouari Ghouti and Dr. Anis Koubaa, for their invaluable support.

Version:

0.1.7

Citations:

If AraNizer benefits your research, please cite us:

@misc{AraNizer_2023,
  title={Aranizer: A Custom Tokenizer for Enhanced Arabic Language Processing},
  author={Najar, Omar and Sibaee, Serry and Ghouti, Lahouari and Koubaa, Anis},
  affiliation={Prince Sultan University, Riyadh, Saudi Arabia},
  year={2023},
  howpublished={\url{https://github.com/omarnj-lab/aranizer}}
}