
Aranizer: A Custom Tokenizer for Enhanced Arabic Language Processing

Project description

AraNizer

Description

AraNizer is a collection of custom tokenizers designed specifically for Arabic language processing. Built with SentencePiece and Byte Pair Encoding (BPE) methodologies, the tokenizers are engineered to be compatible with the transformers and sentence_transformers libraries. Each tokenizer in the AraNizer collection is optimized for different NLP tasks and covers a range of vocabulary sizes to suit various linguistic scenarios.

Installation

Install AraNizer effortlessly with pip:

pip install aranizer

Usage

Start by importing the desired tokenizer from AraNizer:

from aranizer import aranizer_bpe32k
# Other available tokenizers: aranizer_bpe50k, aranizer_bpe64k, aranizer_bpe86k, aranizer_sp32k, aranizer_sp50k, aranizer_sp64k, aranizer_sp86k

Load your tokenizer:

tokenizer = aranizer_bpe32k.get_tokenizer()  # Replace aranizer_bpe32k with your chosen tokenizer
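The returned object is designed to be compatible with the transformers library. Assuming it behaves like a standard transformers tokenizer (an assumption, not documented here), you can inspect it before use:

# Minimal sanity check, assuming a transformers-compatible tokenizer object
print(tokenizer.vocab_size)          # should roughly match the chosen variant, e.g. ~32k for aranizer_bpe32k
print(tokenizer.special_tokens_map)  # special tokens applied when add_special_tokens=True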

Example of tokenizing a text:

text = "مثال على النص العربي"  # Example Arabic text
tokens = tokenizer.tokenize(text)
print(tokens)

Encoding Text: To encode text, use the encode method. This converts a text string into a sequence of token ids:

text = "مثال على النص العربي"  # Example Arabic text
encoded_output = tokenizer.encode(text, add_special_tokens=True)
print(encoded_output)
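Since the tokenizers target transformers compatibility, calling the tokenizer object directly should also work for batches; this is a minimal sketch under that assumption (the second example sentence is made up for illustration):

# Sketch, assuming the tokenizer supports the standard transformers callable interface
texts = ["مثال على النص العربي", "جملة أخرى للتجربة"]  # second sentence: "another test sentence"
batch = tokenizer(texts, add_special_tokens=True)
print(batch["input_ids"])  # one list of token ids per input text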

Decoding Text: To convert token ids back to text, use the decode method:

decoded_text = tokenizer.decode(encoded_output)
print(decoded_text)
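If the ids were produced with add_special_tokens=True, the decoded string may include those special tokens. Assuming the standard transformers decode signature, they can be stripped:

# Assuming the standard transformers decode signature
decoded_text = tokenizer.decode(encoded_output, skip_special_tokens=True)
print(decoded_text)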

AraNizers

aranizer_bpe32k: Tailored for general language modeling with a 32k vocab size.
aranizer_bpe50k: Ideal for technical or scientific texts, featuring a 50k vocab size.
aranizer_bpe64k: Provides comprehensive language coverage with a 64k vocab size.
aranizer_bpe86k: Suitable for extensive vocabularies in large-scale NLP tasks with an 86k vocab size.
aranizer_sp32k: Efficiently segments Arabic dialects with a 32k vocab size.
aranizer_sp50k: Designed for complex text analysis, equipped with a 50k vocab size.
aranizer_sp64k: Balances performance and breadth in NLP applications with a 64k vocab size.
aranizer_sp86k: Supports multilingual and cross-lingual tasks with an 86k vocab size.
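The variants differ mainly in segmentation method (BPE vs. SentencePiece) and vocabulary size, so a quick way to choose is to compare how they split the same text. A rough sketch, assuming each listed module exposes get_tokenizer() as shown above:

# Rough comparison sketch; assumes every listed module provides get_tokenizer()
from aranizer import aranizer_bpe32k, aranizer_sp32k

text = "مثال على النص العربي"
for name, module in [("bpe32k", aranizer_bpe32k), ("sp32k", aranizer_sp32k)]:
    tok = module.get_tokenizer()
    tokens = tok.tokenize(text)
    print(name, len(tokens), tokens)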

Requirements:

  • transformers

Contact:

For queries or assistance, please contact onajar@psu.edu.sa.

Acknowledgments:

Special thanks to Prince Sultan University and Riotu Lab, under the guidance of Dr. Lahouari Ghouti and Dr. Anis Koubaa, for their invaluable support.

Version:

0.1.4

Citations:

If AraNizer benefits your research, please cite us:

@misc{AraNizer_2023,
  title={Aranizer: A Custom Tokenizer for Enhanced Arabic Language Processing},
  author={Najar, Omar and Sibaee, Serry and Ghouti, Lahouari and Koubaa, Anis},
  affiliation={Prince Sultan University, Riyadh, Saudi Arabia},
  year={2023},
  howpublished={\url{https://github.com/omarnj-lab/aranizer}}
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

aranizer-0.1.6.tar.gz (5.7 MB)


File details

Details for the file aranizer-0.1.6.tar.gz.

File metadata

  • Download URL: aranizer-0.1.6.tar.gz
  • Upload date:
  • Size: 5.7 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.17

File hashes

Hashes for aranizer-0.1.6.tar.gz

  • SHA256: eeefccc8ee474b8402394275018f3fa990670c1b8cec20e79a8a5c9c9518d8bc
  • MD5: 97c75aad1ab4319d0b253f78ecf4f26b
  • BLAKE2b-256: cd7d520acde26c0d705f3f9340e35f0eddc2825f2dbeefb6731ef9a650bdbded

