
Professional Amharic Biblical Text Tokenizer with Byte Pair Encoding

Reason this release was yanked:

This is an older version that lacks some improvements; please upgrade to the latest version.

Project description

EthioBBPE


📖 Overview

EthioBBPE is a professional, production-ready Byte Pair Encoding (BPE) tokenizer specifically designed for Amharic, Ge'ez, and biblical texts. It provides state-of-the-art tokenization for Ethiopian languages with support for ancient scripts and special characters.

✨ Key Features

  • 🚀 One-line Installation: pip install EthioBBPE
  • 🤖 Auto-download: Models automatically downloaded from Hugging Face Hub
  • 📦 Compressed Storage: Gzip compression for efficient storage (65%+ size reduction)
  • ⚡ High Performance: Optimized for speed with batch processing
  • 🎯 Perfect Reconstruction: 100% accuracy on Amharic biblical texts
  • 🔧 Professional API: Clean, intuitive interface inspired by Hugging Face
  • 📊 Production Ready: Checkpointing, quantization, and comprehensive metrics

🚀 Quick Start

Installation

pip install EthioBBPE

Basic Usage

from ethiobbpe import EthioBBPETokenizer

# Load pretrained tokenizer (auto-downloads from Hugging Face)
tokenizer = EthioBBPETokenizer.from_pretrained()

# Encode text
text = "ሰላም ለኢዮብ ዘኢነበበ ከንቶ ።"
encoded = tokenizer.encode(text)

print(f"Tokens: {encoded.tokens}")
print(f"IDs: {encoded.ids}")

# Decode back to text
decoded = tokenizer.decode(encoded.ids)
print(f"Decoded: {decoded}")
# Output: ሰላም ለኢዮብ ዘኢነበበ ከንቶ ። (100% accurate!)

Advanced Usage

from ethiobbpe import AutoTokenizer

# Alternative: Use AutoTokenizer factory
tokenizer = AutoTokenizer.from_pretrained("Nexuss0781/Ethio-BBPE")

# Batch encoding
texts = [
    "በመዠመሪያ፡እግዚአብሔር፡ሰማይንና፡ምድርን፡ፈጠረ።",
    "ወደ ቍስጥንጥንያ አገርም በደረሰች ጊዜ",
    "ሐዋርያ መንፈስ ይቤ እንዘ ያነክር ሕይወቶ"
]

encodings = tokenizer.encode_batch(texts)
for i, enc in enumerate(encodings):
    print(f"Text {i}: {len(enc.tokens)} tokens")

# Callable interface
result = tokenizer("ሰላም ዓለም")
print(result.tokens)

# Get vocabulary
vocab = tokenizer.get_vocab()
print(f"Vocabulary size: {tokenizer.get_vocab_size()}")

📊 Model Details

Property                  Value
Model Name                EthioBBPE_AmharicBible
Vocabulary Size           16,000 tokens
Training Data             61,769 lines (Synaxarium + Canon Biblical)
Compression               Gzip level 9 (65%+ reduction)
Reconstruction Accuracy   100%
License                   Apache 2.0

Training Datasets

  1. Synaxarium Dataset: 366 religious texts
  2. Canon Biblical Dataset: 61,403 Amharic-English parallel texts

Total corpus: ~27.5 MB of high-quality Amharic biblical text.

🛠️ API Reference

EthioBBPETokenizer

Main tokenizer class with full encoding/decoding capabilities.

Methods

  • from_pretrained(model_name, cache_dir, force_download): Load pretrained model
  • from_file(file_path): Load from local file
  • encode(text, add_special_tokens, truncation, max_length): Encode single text
  • encode_batch(texts, ...): Encode multiple texts
  • decode(ids, skip_special_tokens): Decode token IDs
  • decode_batch(batch_ids, ...): Decode multiple sequences
  • get_vocab(): Get vocabulary dictionary
  • get_vocab_size(): Get vocabulary size
  • save(path): Save tokenizer to file

AutoTokenizer

Factory class for automatic tokenizer loading.

from ethiobbpe import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Nexuss0781/Ethio-BBPE")

Encoding

Wrapper object for encoding results with properties:

  • ids: Token IDs (List[int])
  • tokens: Token strings (List[str])
  • attention_mask: Attention mask (List[int])
  • type_ids: Token type IDs (List[int])
  • offsets: Character offsets (List[tuple])
  • special_tokens_mask: Special tokens mask (List[int])
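For orientation, the parallel per-token fields can be pictured with a small stand-in class. This is a hypothetical sketch, not the library's actual Encoding implementation; the field names simply mirror the list above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class EncodingSketch:
    """Stand-in for an encoding result: every field is per-token and
    therefore has the same length as `ids`."""
    ids: List[int]
    tokens: List[str]
    attention_mask: List[int] = field(default_factory=list)
    type_ids: List[int] = field(default_factory=list)
    offsets: List[Tuple[int, int]] = field(default_factory=list)
    special_tokens_mask: List[int] = field(default_factory=list)

    def __post_init__(self):
        n = len(self.ids)
        # Fill omitted per-token fields with conventional defaults.
        self.attention_mask = self.attention_mask or [1] * n
        self.type_ids = self.type_ids or [0] * n
        self.special_tokens_mask = self.special_tokens_mask or [0] * n

enc = EncodingSketch(ids=[2, 15, 3], tokens=["[CLS]", "ሰላም", "[SEP]"],
                     special_tokens_mask=[1, 0, 1])
print(enc.attention_mask)  # [1, 1, 1]
```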

📈 Performance

Compression Benefits

Format                            Size     Compression
tokenizer.json                    1.3 MB   Baseline
vocab.json.gz                     136 KB   89.8%
tokenizer_quantized_8bit.json.gz  56 KB    95.7%
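The exact ratios above depend on the vocabulary's contents, but the mechanism is plain gzip at level 9 from Python's standard library. A minimal sketch with a made-up vocabulary (real ratios will differ):

```python
import gzip
import json

# Hypothetical vocabulary: a token -> id mapping, as vocab.json would hold.
vocab = {f"token_{i}": i for i in range(16000)}
raw = json.dumps(vocab, ensure_ascii=False).encode("utf-8")

# Level 9 is the maximum gzip compression level (slowest, smallest).
packed = gzip.compress(raw, compresslevel=9)

reduction = 100 * (1 - len(packed) / len(raw))
print(f"raw: {len(raw)} B, gzipped: {len(packed)} B, reduction: {reduction:.1f}%")

# Loading it back is symmetric:
restored = json.loads(gzip.decompress(packed))
assert restored == vocab
```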

Speed

  • Single encoding: < 1ms
  • Batch encoding (32 texts): < 10ms
  • Model download: Automatic caching for subsequent uses

🔬 Technical Details

Architecture

  • Algorithm: Byte Pair Encoding (BPE)
  • Pre-tokenization: Whitespace splitting
  • Special Tokens: [PAD], [CLS], [SEP], [MASK], [UNK]
  • Vocabulary: 16,000 tokens optimized for Amharic/Ge'ez
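To make the pipeline above concrete, here is a minimal, pure-Python sketch of BPE training over whitespace-split words: count adjacent symbol pairs, merge the most frequent pair, repeat. The actual trainer adds byte-level handling, special tokens, and performance optimizations on top of this loop.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def bpe_train(corpus, num_merges):
    """Learn a list of BPE merges from whitespace pre-tokenized text."""
    # Start with each word as a tuple of single characters.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        merged = pair[0] + pair[1]
        # Rewrite every word, replacing occurrences of the chosen pair.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

merges = bpe_train("low low lower lowest", 2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```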

Production Features

  1. Checkpointing: SHA256 integrity verification for fault-tolerant training
  2. Quantization: 8-bit and 4-bit options for deployment optimization
  3. Metrics Tracking: Comprehensive training statistics in JSON format
  4. Multiple Export Formats: tokenizer.json, vocab.json.gz, quantized versions
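Item 1 (SHA256-verified checkpointing) can be sketched with the standard library alone; the file layout and field names below are illustrative, not the package's actual checkpoint format.

```python
import hashlib
import json

def save_checkpoint(state: dict, path: str) -> str:
    """Write a checkpoint and return its SHA256 digest for later verification."""
    blob = json.dumps(state, sort_keys=True).encode("utf-8")
    with open(path, "wb") as f:
        f.write(blob)
    return hashlib.sha256(blob).hexdigest()

def load_checkpoint(path: str, expected_sha256: str) -> dict:
    """Reload a checkpoint, refusing to resume from a corrupted file."""
    with open(path, "rb") as f:
        blob = f.read()
    digest = hashlib.sha256(blob).hexdigest()
    if digest != expected_sha256:
        raise ValueError(f"checkpoint corrupted: {digest} != {expected_sha256}")
    return json.loads(blob)

# Hypothetical partial training state.
state = {"merges_done": 1200, "vocab_size": 4096}
digest = save_checkpoint(state, "ckpt.json")
assert load_checkpoint("ckpt.json", digest) == state
```

Recording the digest next to the file lets a resumed training run detect truncated or tampered checkpoints before trusting them.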

📝 Examples

Ge'ez Punctuation Handling

tokenizer = EthioBBPETokenizer.from_pretrained()

# Ancient Ge'ez punctuation marks
geez_text = "፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠"
encoded = tokenizer.encode(geez_text)
print(f"Ge'ez punctuation: {encoded.tokens}")
# Single token for repeated marks!

decoded = tokenizer.decode(encoded.ids)
assert decoded == geez_text  # Perfect reconstruction

Biblical Text Processing

biblical_text = """
ሰላም ለኢዮብ ዘኢነበበ ከንቶ ። 
አመ አኀዞ አበቅ ወአመ አህጎለ ጥሪቶ ። 
ሐዋርያ መንፈስ ይቤ እንዘ ያነክር ሕይወቶ ።
"""

encoded = tokenizer.encode(biblical_text)
print(f"Token count: {len(encoded.tokens)}")
print(f"Perfect reconstruction: {tokenizer.decode(encoded.ids) == biblical_text}")

🤝 Integration

With Hugging Face Transformers

from transformers import AutoModel, AutoTokenizer
from ethiobbpe import EthioBBPETokenizer

# Use EthioBBPE as the tokenizer
ethio_tokenizer = EthioBBPETokenizer.from_pretrained()

# Integrate with your pipeline
def tokenize_for_model(text):
    encoding = ethio_tokenizer.encode(text)
    return {
        "input_ids": encoding.ids,
        "attention_mask": encoding.attention_mask
    }

With PyTorch DataLoader

import torch
from torch.utils.data import Dataset
from ethiobbpe import EthioBBPETokenizer

class AmharicDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length=512):
        self.texts = texts
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        encoding = self.tokenizer.encode(
            self.texts[idx],
            truncation=True,
            max_length=self.max_length
        )
        return {
            "input_ids": torch.tensor(encoding.ids),
            "attention_mask": torch.tensor(encoding.attention_mask)
        }

📦 Installation Options

Basic Installation

pip install EthioBBPE

Development Installation

pip install EthioBBPE[dev]

From Source

git clone https://github.com/nexuss0781/Ethio_BBPE.git
cd Ethio_BBPE
pip install -e .

🧪 Testing

# Run tests
pytest tests/

# With coverage
pytest tests/ --cov=ethiobbpe

📚 Resources

🙏 Acknowledgments

📄 License

Apache License 2.0 - See LICENSE for details.

👤 Author

Nexuss0781
Email: nexuss0781@gmail.com


Made with ❤️ for Ethiopian Language NLP




Download files

Download the file for your platform.

Source Distribution

ethiobbpe-1.0.0.tar.gz (11.2 kB)


Built Distribution


ethiobbpe-1.0.0-py3-none-any.whl (9.5 kB)


File details

Details for the file ethiobbpe-1.0.0.tar.gz.

File metadata

  • Download URL: ethiobbpe-1.0.0.tar.gz
  • Upload date:
  • Size: 11.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for ethiobbpe-1.0.0.tar.gz
Algorithm Hash digest
SHA256 fcaf66e2c3da43eba8236df39bccc5c3f6aea33518697397b97e674fb4a64d4b
MD5 aa02b679e7a5838f13af4dd269864b2d
BLAKE2b-256 d50f0df8f0cbe299d8079e102867e00c29ec1388cfa16eb0a7ce6096ddbd4c57


File details

Details for the file ethiobbpe-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: ethiobbpe-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 9.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for ethiobbpe-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 c8a3ec99fa6ee025f654f592708df3d2b8537d8c4231ebe06e1ec5fc6166ce5d
MD5 d5e773564938c9794747e60e53484d36
BLAKE2b-256 11e1e27d9dd1392ef0fbb6665b5167a1b973368e7e3492e6f0368940283cf875

