
Advanced Byte Pair Encoding Tokenizer for Ethiopian Languages (Amharic, Tigrinya, Ge'ez)

Reason this release was yanked:

This older version may have dependency mismatches with your project's dependencies.

Project description

EthioBBPE

Advanced Byte Pair Encoding Tokenizer for Ethiopian Languages

PyPI version · License: Apache 2.0 · Python 3.8+ · Hugging Face

EthioBBPE is a production-ready, high-performance tokenizer optimized for Amharic, Tigrinya, and Ge'ez scripts. Built with advanced features including checkpointing, multi-format compression, and model quantization, it delivers efficient text processing for Ethiopian languages.

✨ Features

🚀 Production-Ready

  • Automatic Model Download: Seamlessly downloads pretrained models from Hugging Face Hub on first use
  • Embedded Models: Includes pretrained weights in the package for offline usage
  • Zero Configuration: Works out of the box with sensible defaults

🔧 Advanced Capabilities

  • Checkpointing: SHA256 integrity verification for fault-tolerant training
  • Multi-Format Compression: support for gzip, bz2, and lzma/xz (up to 90% size reduction); see the sketch after this list
  • Model Quantization: 8-bit and 4-bit quantization for efficient deployment
  • Batch Processing: Efficient encoding/decoding of text batches
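
The checkpoint-integrity and compression features can be approximated with the standard library alone. A minimal sketch, assuming a gzip-compressed JSON vocabulary with a sidecar .sha256 file; both file conventions are illustrative, not the package's actual on-disk format:

import gzip
import hashlib
import json

def write_compressed_vocab(vocab, path="vocab.json.gz"):
    # Serialize the vocabulary and compress it with gzip level 9
    # (bz2 or lzma could be swapped in via their identical open() APIs).
    data = json.dumps(vocab, ensure_ascii=False).encode("utf-8")
    with gzip.open(path, "wb", compresslevel=9) as f:
        f.write(data)
    # Record a SHA256 digest of the uncompressed payload for integrity checks.
    with open(path + ".sha256", "w") as f:
        f.write(hashlib.sha256(data).hexdigest())

def verify_compressed_vocab(path="vocab.json.gz"):
    # Recompute the digest and compare it against the recorded one.
    with gzip.open(path, "rb") as f:
        data = f.read()
    with open(path + ".sha256") as f:
        expected = f.read().strip()
    return hashlib.sha256(data).hexdigest() == expected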

📊 Performance

  • Vocabulary Size: 16,000 tokens optimized for Ethiopic scripts
  • Compression Ratio: ~90% size reduction with gzip level 9
  • Perfect Reconstruction: 100% accuracy on Amharic biblical texts
  • Fast Inference: Optimized for production workloads

📦 Installation

Basic Installation

pip install EthioBBPE

With Training Dependencies

pip install "EthioBBPE[training]"

Development Installation

pip install "EthioBBPE[dev]"

🎯 Quick Start

Simple Usage

from ethiobbpe import EthioBBPE

# Initialize tokenizer (auto-downloads model on first use)
tokenizer = EthioBBPE()

# Encode text
text = "ሰላም ለኢዮብ ዘኢነበበ"
encoded = tokenizer.encode(text)
print(encoded['ids'])      # Token IDs
print(encoded['tokens'])   # Token strings

# Decode back to text
decoded = tokenizer.decode(encoded['ids'])
print(decoded)  # Perfect reconstruction: "ሰላም ለኢዮብ ዘኢነበበ"

Batch Processing

texts = [
    "በመዠመሪያ፡እግዚአብሔር፡ሰማይንና፡ምድርን፡ፈጠረ።",
    "ሰላም ለኢዮብ ዘኢነበበ ከንቶ ።"
]

# Encode batch
encoded_batch = tokenizer.encode_batch(texts)

# Decode batch
decoded_batch = tokenizer.decode_batch([e['ids'] for e in encoded_batch])

Load from Hugging Face Hub

from ethiobbpe import EthioBBPE

# Load specific model from Hugging Face
tokenizer = EthioBBPE.from_pretrained("nexuss0781/Ethio-BBPE")

🔬 Advanced Usage

Custom Configuration

from ethiobbpe import Config, EthioBBPE

# Create custom configuration
config = Config(
    model_name="MyTokenizer",
    vocab_size=32000,
    compression_format="bz2",
    compression_level=9,
    enable_quantization=True,
    quantization_bits=8
)

# Use configuration
tokenizer = EthioBBPE(model_name=config.model_name)

Utility Functions

from ethiobbpe import (
    load_compressed_vocab,
    validate_checkpoint,
    list_checkpoints,
    get_model_info
)

# Load compressed vocabulary
vocab = load_compressed_vocab("vocab.json.gz")

# Validate checkpoint
is_valid = validate_checkpoint("checkpoint_ckpt_1.json")

# List available checkpoints
checkpoints = list_checkpoints("./checkpoints")

# Get model information
info = get_model_info("./models/EthioBBPE_AmharicBible")
print(f"Total size: {info['total_size_mb']} MB")

Training Your Own Tokenizer

from ethiobbpe.trainer import BBPETrainer

# Initialize trainer with advanced features
trainer = BBPETrainer(
    vocab_size=16000,
    min_frequency=2,
    compression_format="gzip",
    compression_level=9,
    enable_quantization=True,
    quantization_bits=8,
    max_checkpoints=5
)

# Train on your data
texts = ["Your Amharic text 1", "Your Amharic text 2", ...]
metrics = trainer.train(
    texts=texts,
    output_dir="./my_tokenizer",
    model_name="MyAmharicTokenizer"
)

print(f"Training completed in {metrics['training_duration_seconds']:.2f}s")
print(f"Final vocab size: {metrics['final_vocab_size']}")
print(f"Compression ratio: {metrics['compression_ratio']:.2%}")

📈 Model Details

Training Data

  • Synaxarium Dataset: 366 Ethiopian Orthodox Church texts
  • Canon Biblical Dataset: 61,403 Amharic-English parallel biblical texts
  • Total Corpus: 27.5 MB, 61,769 lines

Performance Metrics

Metric                     Value
Vocabulary Size            16,000 tokens
Training Time              ~17 seconds
Original Size              1.34 MB
Compressed Size            136 KB (gzip)
Compression Ratio          89.8%
Reconstruction Accuracy    100%
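
The compression ratio follows directly from the sizes above; a quick sanity check (assuming 1 MB = 1,000 KB, so small rounding differences from the table are expected):

original_kb = 1.34 * 1000        # 1.34 MB original vocabulary
compressed_kb = 136              # gzip level 9 output
reduction = 1 - compressed_kb / original_kb
print(f"{reduction:.1%}")        # ~89.9%, consistent with the 89.8% / ~90% figures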

Supported Scripts

  • ✅ Amharic (አማርኛ)
  • ✅ Tigrinya (ትግርኛ)
  • ✅ Ge'ez (ግዕዝ)
  • ✅ Mixed-language texts (see the example below)
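
Because all three languages share the Ethiopic script, a single vocabulary covers them, and mixed-language input needs no special handling. A small illustration using the encode API shown earlier (the sample strings are arbitrary):

from ethiobbpe import EthioBBPE

tokenizer = EthioBBPE()

# The same tokenizer handles Amharic, Tigrinya, Ge'ez, and mixed text.
for sample in ["ሰላም ዓለም", "ሰላም world"]:
    print(tokenizer.encode(sample)["tokens"])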

🛠️ API Reference

EthioBBPE Class

__init__(model_name: str = "EthioBBPE_AmharicBible", model_dir: Optional[str] = None)

Initialize the tokenizer with optional custom model name or directory.

encode(text: str, add_special_tokens: bool = True, ...) -> Dict[str, Any]

Encode text into token IDs, tokens, and offsets.

decode(ids: List[int], skip_special_tokens: bool = True) -> str

Decode token IDs back to text.

encode_batch(texts: List[str], ...) -> List[Dict[str, Any]]

Encode multiple texts efficiently.

decode_batch(batch_ids: List[List[int]], ...) -> List[str]

Decode multiple sequences efficiently.

get_vocab_size() -> int

Get the vocabulary size.

get_vocab() -> Dict[str, int]

Get the full vocabulary mapping.

save(path: str) -> None

Save the tokenizer to a file.

from_pretrained(model_name: str = "nexuss0781/Ethio-BBPE") -> EthioBBPE

Class method to load a pretrained tokenizer from Hugging Face Hub.
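
Taken together, a typical round trip through this API looks like the sketch below (based on the signatures above; the printed values are illustrative):

from ethiobbpe import EthioBBPE

tokenizer = EthioBBPE()  # default model: EthioBBPE_AmharicBible

# Encode, then decode; per the reconstruction guarantee the text survives intact.
encoded = tokenizer.encode("ሰላም", add_special_tokens=True)
assert tokenizer.decode(encoded["ids"]) == "ሰላም"

print(tokenizer.get_vocab_size())      # e.g. 16000
vocab = tokenizer.get_vocab()          # {token: id, ...}

tokenizer.save("./my_tokenizer.json")  # persist for later reuse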

📁 Package Structure

ethiobbpe/
├── __init__.py          # Main exports
├── tokenizer.py         # Core EthioBBPE class
├── config.py            # Configuration management
├── utils.py             # Utility functions
├── trainer.py           # Advanced BBPE trainer
└── models/              # Pretrained models
    ├── tokenizer.json
    ├── vocab.json.gz
    ├── config.json
    └── training_metrics.json
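
Since the pretrained weights ship inside the package (see "Embedded Models" above), the bundled files can be listed with the standard library. A sketch assuming the models/ directory is installed as package data; importlib.resources.files requires Python 3.9+:

from importlib import resources

# Enumerate the model files bundled under ethiobbpe/models/.
model_dir = resources.files("ethiobbpe") / "models"
for entry in model_dir.iterdir():
    print(entry.name)  # tokenizer.json, vocab.json.gz, config.json, ...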

🤝 Integration

Hugging Face Transformers

from transformers import AutoTokenizer

# Use with Hugging Face Transformers
tokenizer = AutoTokenizer.from_pretrained("nexuss0781/Ethio-BBPE")

LangChain

from langchain.text_splitter import CharacterTextSplitter

# Use in LangChain pipelines
text_splitter = CharacterTextSplitter(
    separator="",
    chunk_size=512,
    chunk_overlap=50
)
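
The splitter above counts characters; to size chunks in EthioBBPE tokens instead, the splitter's standard length_function hook can delegate to the tokenizer. A sketch under that assumption (the Ethiopic full stop "።" as separator is just an illustrative choice):

from langchain.text_splitter import CharacterTextSplitter
from ethiobbpe import EthioBBPE

tokenizer = EthioBBPE()

# Measure chunk length in tokens rather than characters.
text_splitter = CharacterTextSplitter(
    separator="።",
    chunk_size=512,
    chunk_overlap=50,
    length_function=lambda text: len(tokenizer.encode(text)["ids"]),
)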

📄 License

Apache License 2.0 - See LICENSE for details.

🙏 Acknowledgments

📬 Contact

🗺️ Roadmap

  • Support for additional Ethiopian languages (Oromo, Somali, Sidama)
  • Pre-trained language models using EthioBBPE
  • WebAssembly build for browser-based inference
  • ONNX export for optimized deployment
  • Streaming tokenization for large documents

Made with ❤️ for the Ethiopian NLP community

Download files

Download the file for your platform.

Source Distribution

ethiobbpe-1.0.2.tar.gz (11.4 kB)


Built Distribution


ethiobbpe-1.0.2-py3-none-any.whl (13.5 kB)


File details

Details for the file ethiobbpe-1.0.2.tar.gz.

File metadata

  • Download URL: ethiobbpe-1.0.2.tar.gz
  • Size: 11.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for ethiobbpe-1.0.2.tar.gz:

  • SHA256: 9c2744ebeaae5fdce8ed40490fceb2b37b6977c3ca7e4e9d6252a3d5f6de61b9
  • MD5: ba10f0fea8fa88e574ceb10fdc3bd0a7
  • BLAKE2b-256: f4383408f4fe197ca718124439f2b05034db6e500a65ca71753d01069cb6785f

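A downloaded archive can be checked against the digests above with the standard library (the local filename is assumed to match the release artifact):

import hashlib

EXPECTED_SHA256 = "9c2744ebeaae5fdce8ed40490fceb2b37b6977c3ca7e4e9d6252a3d5f6de61b9"

# Hash the downloaded sdist and compare against the published digest.
with open("ethiobbpe-1.0.2.tar.gz", "rb") as f:
    actual = hashlib.sha256(f.read()).hexdigest()
assert actual == EXPECTED_SHA256, "hash mismatch: re-download the file"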

File details

Details for the file ethiobbpe-1.0.2-py3-none-any.whl.

File metadata

  • Download URL: ethiobbpe-1.0.2-py3-none-any.whl
  • Size: 13.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for ethiobbpe-1.0.2-py3-none-any.whl:

  • SHA256: 5ba0b02162b2e2c45d5cd4ff9bb4f3e15d757ce8036321e1a523be64bb26267f
  • MD5: 6dcb0c295db7694c287f1c88b09cc5d8
  • BLAKE2b-256: f05e66b88b4d0a32f5a55d4edc9095e63fadebd8a7732a8569016baa7b2f3f7d

