
Enterprise-grade Amharic & Ge'ez Biblical Text Tokenizer with Byte Pair Encoding

Project description

EthioBBPE


📖 Overview

EthioBBPE is a production-ready Byte Pair Encoding (BPE) tokenizer designed specifically for Amharic, Ge'ez, and biblical texts. It provides fast, reversible tokenization for Ethiopian languages, with support for ancient scripts and special characters.

✨ Key Features

  • 🚀 One-line Installation: pip install EthioBBPE
  • 🤖 Auto-download: Models automatically downloaded from Hugging Face Hub
  • 📦 Compressed Storage: Gzip compression for efficient storage (65%+ size reduction)
  • ⚡ High Performance: Optimized for speed with batch processing
  • 🎯 Perfect Reconstruction: 100% accuracy on Amharic biblical texts
  • 🔧 Professional API: Clean, intuitive interface inspired by Hugging Face
  • 📊 Production Ready: Checkpointing, quantization, and comprehensive metrics

🚀 Quick Start

Installation

pip install EthioBBPE

Basic Usage

from ethiobbpe import EthioBBPETokenizer

# Load pretrained tokenizer (auto-downloads from Hugging Face)
tokenizer = EthioBBPETokenizer.from_pretrained()

# Encode text
text = "ሰላም ለኢዮብ ዘኢነበበ ከንቶ ።"
encoded = tokenizer.encode(text)

print(f"Tokens: {encoded.tokens}")
print(f"IDs: {encoded.ids}")

# Decode back to text
decoded = tokenizer.decode(encoded.ids)
print(f"Decoded: {decoded}")
# Output: ሰላም ለኢዮብ ዘኢነበበ ከንቶ ። (100% accurate!)

Advanced Usage

from ethiobbpe import AutoTokenizer

# Alternative: Use AutoTokenizer factory
tokenizer = AutoTokenizer.from_pretrained("Nexuss0781/Ethio-BBPE")

# Batch encoding
texts = [
    "በመዠመሪያ፡እግዚአብሔር፡ሰማይንና፡ምድርን፡ፈጠረ።",
    "ወደ ቍስጥንጥንያ አገርም በደረሰች ጊዜ",
    "ሐዋርያ መንፈስ ይቤ እንዘ ያነክር ሕይወቶ"
]

encodings = tokenizer.encode_batch(texts)
for i, enc in enumerate(encodings):
    print(f"Text {i}: {len(enc.tokens)} tokens")

# Callable interface
result = tokenizer("ሰላም ዓለም")
print(result.tokens)

# Get vocabulary
vocab = tokenizer.get_vocab()
print(f"Vocabulary size: {tokenizer.get_vocab_size()}")

📊 Model Details

  • Model Name: EthioBBPE_AmharicBible
  • Vocabulary Size: 16,000 tokens
  • Training Data: 61,769 lines (Synaxarium + Canon Biblical)
  • Compression: Gzip Level 9 (65%+ size reduction)
  • Reconstruction Accuracy: 100%
  • License: Apache 2.0

Training Datasets

  1. Synaxarium Dataset: 366 religious texts
  2. Canon Biblical Dataset: 61,403 Amharic-English parallel texts

Total corpus: ~27.5 MB of high-quality Amharic biblical text.

🛠️ API Reference

EthioBBPETokenizer

Main tokenizer class with full encoding/decoding capabilities; a save/load round trip is sketched after the method list.

Methods

  • from_pretrained(model_name, cache_dir, force_download): Load pretrained model
  • from_file(file_path): Load from local file
  • encode(text, add_special_tokens, truncation, max_length): Encode single text
  • encode_batch(texts, ...): Encode multiple texts
  • decode(ids, skip_special_tokens): Decode token IDs
  • decode_batch(batch_ids, ...): Decode multiple sequences
  • get_vocab(): Get vocabulary dictionary
  • get_vocab_size(): Get vocabulary size
  • save(path): Save tokenizer to file
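
A minimal save/load round trip using the methods above (the file path is illustrative):

from ethiobbpe import EthioBBPETokenizer

# Download (or load from cache), save locally, then reload from disk.
tokenizer = EthioBBPETokenizer.from_pretrained()
tokenizer.save("ethiobbpe_local.json")  # illustrative path

local = EthioBBPETokenizer.from_file("ethiobbpe_local.json")
text = "ሰላም ዓለም"
assert local.decode(local.encode(text).ids) == text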

AutoTokenizer

Factory class for automatic tokenizer loading.

from ethiobbpe import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Nexuss0781/Ethio-BBPE")

Encoding

Wrapper object for encoding results, with the following properties (a brief inspection sketch follows the list):

  • ids: Token IDs (List[int])
  • tokens: Token strings (List[str])
  • attention_mask: Attention mask (List[int])
  • type_ids: Token type IDs (List[int])
  • offsets: Character offsets (List[tuple])
  • special_tokens_mask: Special tokens mask (List[int])
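
Each field is a plain Python list aligned token-by-token, so a result can be inspected directly:

encoding = tokenizer.encode("ሰላም ዓለም")

# Walk the parallel lists: token string, its ID, and its character span.
for token, token_id, (start, end) in zip(encoding.tokens, encoding.ids, encoding.offsets):
    print(f"{token!r} -> id {token_id}, source span [{start}:{end}]")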

📈 Performance

Compression Benefits

Format                            Size    Compression
tokenizer.json                    1.3 MB  Baseline
vocab.json.gz                     136 KB  89.8%
tokenizer_quantized_8bit.json.gz  56 KB   95.7%
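
Assuming vocab.json.gz is a gzip-compressed JSON file mapping token strings to IDs, it can be read with the standard library alone:

import gzip
import json

# Assumption: vocab.json.gz is gzip-compressed JSON mapping tokens to IDs.
with gzip.open("vocab.json.gz", "rt", encoding="utf-8") as f:
    vocab = json.load(f)

print(f"Loaded {len(vocab)} entries")  # expected: 16,000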

Speed

  • Single encoding: < 1ms
  • Batch encoding (32 texts): < 10ms
  • Model download: Automatic caching for subsequent uses
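
Numbers like these can be checked on your own hardware with a rough measurement (a sketch; results vary by machine):

import time
from ethiobbpe import EthioBBPETokenizer

tokenizer = EthioBBPETokenizer.from_pretrained()
texts = ["ሰላም ለኢዮብ ዘኢነበበ ከንቶ ።"] * 32

# Time one batch-encoding pass over 32 texts.
start = time.perf_counter()
tokenizer.encode_batch(texts)
print(f"Batch of 32: {(time.perf_counter() - start) * 1000:.2f} ms")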

🔬 Technical Details

Architecture

  • Algorithm: Byte Pair Encoding (BPE)
  • Pre-tokenization: Whitespace splitting
  • Special Tokens: [PAD], [CLS], [SEP], [MASK], [UNK]
  • Vocabulary: 16,000 tokens optimized for Amharic/Ge'ez
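
The effect of special tokens can be checked directly on an encoding; the exact wrapping shown in the comments is an assumption based on the token inventory above, not a documented guarantee:

encoded = tokenizer.encode("ሰላም ዓለም", add_special_tokens=True)
print(encoded.tokens)               # assumed layout: ['[CLS]', ..., '[SEP]']
print(encoded.special_tokens_mask)  # 1 marks positions holding special tokens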

Production Features

  1. Checkpointing: SHA256 integrity verification for fault-tolerant training
  2. Quantization: 8-bit and 4-bit options for deployment optimization
  3. Metrics Tracking: Comprehensive training statistics in JSON format
  4. Multiple Export Formats: tokenizer.json, vocab.json.gz, quantized versions
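
As an illustration of the integrity check, a file's SHA256 digest can be verified with the standard library (the helper name and arguments are hypothetical, not part of the EthioBBPE API):

import hashlib

def verify_checkpoint(path: str, expected_sha256: str) -> bool:
    """Hypothetical helper: True if the file's SHA256 matches the expected digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Stream the file in chunks so large checkpoints don't load into memory.
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256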

📝 Examples

Ge'ez Punctuation Handling

tokenizer = EthioBBPETokenizer.from_pretrained()

# Ancient Ge'ez punctuation marks
geez_text = "፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠"
encoded = tokenizer.encode(geez_text)
print(f"Ge'ez punctuation: {encoded.tokens}")
# Single token for repeated marks!

decoded = tokenizer.decode(encoded.ids)
assert decoded == geez_text  # Perfect reconstruction

Biblical Text Processing

biblical_text = """
ሰላም ለኢዮብ ዘኢነበበ ከንቶ ። 
አመ አኀዞ አበቅ ወአመ አህጎለ ጥሪቶ ። 
ሐዋርያ መንፈስ ይቤ እንዘ ያነክር ሕይወቶ ።
"""

encoded = tokenizer.encode(biblical_text)
print(f"Token count: {len(encoded.tokens)}")
print(f"Perfect reconstruction: {tokenizer.decode(encoded.ids) == biblical_text}")

🤝 Integration

With Hugging Face Transformers

from ethiobbpe import EthioBBPETokenizer

# Use EthioBBPE as the tokenizer
ethio_tokenizer = EthioBBPETokenizer.from_pretrained()

# Integrate with your pipeline
def tokenize_for_model(text):
    encoding = ethio_tokenizer.encode(text)
    return {
        "input_ids": encoding.ids,
        "attention_mask": encoding.attention_mask
    }
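
The helper can then be called per example; converting the lists to tensors is left to the surrounding pipeline:

batch = tokenize_for_model("ሰላም ዓለም")
print(batch["input_ids"][:5])
print(batch["attention_mask"][:5])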

With PyTorch DataLoader

import torch
from torch.utils.data import Dataset
from ethiobbpe import EthioBBPETokenizer

class AmharicDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length=512):
        self.texts = texts
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        encoding = self.tokenizer.encode(
            self.texts[idx],
            truncation=True,
            max_length=self.max_length
        )
        return {
            "input_ids": torch.tensor(encoding.ids),
            "attention_mask": torch.tensor(encoding.attention_mask)
        }
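
Since encoded sequences vary in length, batching through a DataLoader needs a padding collate function. A minimal sketch (using pad ID 0 for [PAD] is an assumption):

from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def collate(batch, pad_id=0):  # pad_id=0 for [PAD] is an assumption
    # Right-pad every sequence in the batch to the longest one.
    return {
        "input_ids": pad_sequence(
            [item["input_ids"] for item in batch],
            batch_first=True, padding_value=pad_id),
        "attention_mask": pad_sequence(
            [item["attention_mask"] for item in batch],
            batch_first=True, padding_value=0),
    }

loader = DataLoader(AmharicDataset(texts, tokenizer), batch_size=8, collate_fn=collate)
for batch in loader:
    print(batch["input_ids"].shape)
    break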

📦 Installation Options

Basic Installation

pip install EthioBBPE

Development Installation

pip install EthioBBPE[dev]

From Source

git clone https://github.com/nexuss0781/Ethio_BBPE.git
cd Ethio_BBPE
pip install -e .

🧪 Testing

# Run tests
pytest tests/

# With coverage
pytest tests/ --cov=ethiobbpe


📄 License

Apache License 2.0 - See LICENSE for details.

👤 Author

Nexuss0781
Email: nexuss0781@gmail.com


Made with ❤️ for Ethiopian Language NLP

