Turkish Diacritics Restoration with Neural Networks
nokta-ai
A lightweight PyTorch-based neural network package for restoring diacritics in Turkish text. nokta-ai can accurately restore Turkish special characters (ç, ğ, ı, ö, ş, ü) from text where they have been removed or replaced with ASCII equivalents.
Overview
This project implements a character-level sequence-to-sequence model using bidirectional LSTMs with multi-head self-attention to restore diacritics in Turkish text. Unlike rule-based or large language model approaches, this solution is:
- Lightweight: Small model size (~10-15MB depending on configuration)
- Fast: Processes text in real-time using GPU acceleration (MPS on Apple Silicon, CUDA on NVIDIA)
- Accurate: Achieves 95%+ character-level accuracy with proper training data
- Self-contained: No external API dependencies
Recent Improvements
v2.0 - Major Architecture Enhancements
- Case Normalization: 50% reduction in model complexity by processing lowercase text with case pattern restoration
- Turkish-Specific Case Handling: Proper i/İ and ı/I mappings (critical for Turkish orthography)
- CUDA Compatibility: Fixed character embedding bounds checking for stable GPU training
- Binary Classification with Loss Penalties: Model learns when NOT to add diacritics (reduces false positives)
- Improved Capital Letter Accuracy: Deterministic case handling instead of learned patterns
- Character Safety: All Unicode characters > 255 safely handled with fallback mechanisms
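The Turkish-specific case handling and case pattern restoration listed above can be sketched in a few lines. This is a minimal illustration, not the package's implementation: the function names are hypothetical, and it assumes the case pattern is stored as a per-character boolean mask.

```python
# Hypothetical helpers for illustration; these names are not part of nokta-ai's API.
_LOWER = str.maketrans({"I": "ı", "İ": "i"})
_UPPER = str.maketrans({"i": "İ", "ı": "I"})


def turkish_lower(text: str) -> str:
    """Lowercase with Turkish rules: I -> ı and İ -> i."""
    return text.translate(_LOWER).lower()


def turkish_upper(text: str) -> str:
    """Uppercase with Turkish rules: i -> İ and ı -> I."""
    return text.translate(_UPPER).upper()


def split_case(text: str) -> tuple[str, list[bool]]:
    """Return the lowercased text plus a mask of originally uppercase positions."""
    mask = [ch.isupper() for ch in text]
    return "".join(turkish_lower(ch) for ch in text), mask


def restore_case(lowered: str, mask: list[bool]) -> str:
    """Re-apply the saved case pattern to lowercase text."""
    return "".join(turkish_upper(ch) if up else ch for ch, up in zip(lowered, mask))


lowered, mask = split_case("İstanbul IŞIK")
print(lowered)                      # istanbul ışık
print(restore_case(lowered, mask))  # İstanbul IŞIK
```

Handling case deterministically like this means a model only ever has to see lowercase characters, which is the motivation given for the case normalization change.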
Architecture
Model Components
- Character-level Tokenization: Works at the character level for fine-grained control
- Case Normalization: Processes text in lowercase with case pattern restoration (50% complexity reduction)
- Bidirectional LSTM: Captures both forward and backward context (2-4 layers)
- Multi-Head Self-Attention: Focuses on relevant parts of the input sequence (optional, recommended)
- Binary Classification: 6 classifiers for Turkish diacritic pairs (c/ç, g/ğ, i/ı, o/ö, s/ş, u/ü)
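To make the binary classification idea concrete, the sketch below shows how per-character yes/no decisions could be turned back into text. It assumes the decisions arrive as one boolean per input character; how the actual model produces and thresholds them is not covered here, and `apply_decisions` is a hypothetical name.

```python
# The six pairs come from the component list above; the decoder itself is illustrative.
DIACRITIC_OF = {"c": "ç", "g": "ğ", "i": "ı", "o": "ö", "s": "ş", "u": "ü"}


def apply_decisions(text: str, add_diacritic: list[bool]) -> str:
    """Swap each flagged base letter for its partner; leave everything else alone."""
    if len(text) != len(add_diacritic):
        raise ValueError("need exactly one decision per character")
    return "".join(
        DIACRITIC_OF.get(ch, ch) if flag else ch
        for ch, flag in zip(text, add_diacritic)
    )


print(apply_decisions("guzel", [False, True, False, False, False]))  # güzel
print(apply_decisions("cok", [True, False, False]))                  # çok
```

Because the output can only differ from the input at the six candidate letters, a wrong decision never corrupts any other character.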
Technical Specifications
- Input: Turkish text with diacritics removed (mixed case supported)
- Output: Turkish text with diacritics restored (preserves original case)
- Context Window: Configurable (default: 96 characters, expert recommended)
- Vocabulary Size: 256 characters (covers Turkish alphabet + common punctuation)
- Hidden Size: Configurable (default: 256 dimensions, expert recommended)
- Attention Heads: 4 (when self-attention enabled)
- Turkish Case Handling: Proper i/İ and ı/I mappings (differs from English)
- CUDA Compatible: Safe character bounds checking for GPU training
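The bounds checking mentioned above matters because an embedding table with 256 rows cannot be indexed with a larger code point, and several Turkish letters (ğ, ı, ş, İ) sit above 255 in Unicode. A minimal sketch of a safe lookup follows. The fallback slot (`UNKNOWN_INDEX = 0`) and the function names are assumptions; the package's actual fallback scheme is not documented here.

```python
VOCAB_SIZE = 256   # from the specification above
UNKNOWN_INDEX = 0  # assumption: the real fallback slot may differ


def char_to_index(ch: str) -> int:
    """Map a character to an embedding row, falling back for out-of-range code points."""
    code = ord(ch)
    return code if code < VOCAB_SIZE else UNKNOWN_INDEX


def encode(text: str) -> list[int]:
    """Encode a string as a list of in-range embedding indices."""
    return [char_to_index(ch) for ch in text]


print(encode("aç"))        # [97, 231]
print(char_to_index("ş"))  # 0, because U+015F is above 255
```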
Installation
From PyPI (Recommended)
pip install nokta-ai
From Source
# Clone the repository
git clone https://github.com/armish/nokta-ai.git
cd nokta-ai
# Install in development mode
pip install -e .
# Or install with optional dependencies
pip install -e ".[dev,data]"
Prerequisites
- Python 3.9+
- PyTorch 2.0+
- Apple Silicon Mac (for MPS acceleration) or NVIDIA GPU (for CUDA)
Quick Start
1. Prepare Training Data
# Basic usage with the included Turkish deasciifier training data
python scripts/prepare_training_data.py data/aysnrgenc_turkishdeasciifier_train.txt
# Combine multiple corpus files for larger training set
python scripts/prepare_training_data.py data/corpus1.txt data/corpus2.txt data/corpus3.txt
# Use custom output location
python scripts/prepare_training_data.py data/*.txt --output data/my_cache.pkl
# Adjust train/validation split (90% train, 10% validation)
python scripts/prepare_training_data.py data/*.txt --train-ratio 0.9
This creates data/combined_cache.pkl containing prepared training and validation data.
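For readers curious what the train/validation split amounts to, here is a rough sketch. It is not the preparation script itself: `split_train_val` is a hypothetical helper, the shuffle and seed are assumptions, and the 0.9 default simply mirrors the example command.

```python
import random


def split_train_val(texts: list[str], train_ratio: float = 0.9, seed: int = 0):
    """Shuffle a copy of the texts and cut it at train_ratio."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be between 0 and 1")
    shuffled = texts[:]
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


train, val = split_train_val([f"sentence {i}" for i in range(100)], train_ratio=0.9)
print(len(train), len(val))  # 90 10
```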
2. Train a Model
Expert-Recommended Configuration (With Self-Attention)
# Best accuracy, ~15% better diacritic restoration
nokta-train --data-cache data/combined_cache.pkl \
--output models/my_model.pth \
--context-size 96 \
--hidden-size 256 \
--num-lstm-layers 2 \
--use-attention \
--epochs 50
Lightweight Configuration (Without Self-Attention)
# Smaller, faster model for resource-constrained environments
nokta-train --data-cache data/combined_cache.pkl \
--output models/my_model_light.pth \
--context-size 96 \
--hidden-size 256 \
--no-attention \
--epochs 50
3. Evaluate Your Model
# Basic evaluation
nokta evaluate --model models/my_model.pth \
--test-file data/test_datasets/vikipedi_test.txt
# Multi-pass restoration for words with multiple diacritics
nokta evaluate --model models/my_model.pth \
--test-file data/test_datasets/vikipedi_test.txt \
--num-passes 3
4. Use Your Model
# Interactive mode
nokta-restore --model models/my_model.pth
# Direct text input
nokta-restore --model models/my_model.pth --text "Bugun hava cok guzel"
# Process a file
nokta-restore --model models/my_model.pth --input input.txt --output output.txt
Training Parameters
Core Architecture Options
| Parameter | Default | Description |
|---|---|---|
| --context-size | 96 | Context window size (expert recommended: 96) |
| --hidden-size | 256 | LSTM hidden dimensions (expert recommended: 256) |
| --num-lstm-layers | 2 | Number of LSTM layers (expert recommends: 2-4) |
| --use-attention | True | Enable self-attention (recommended for accuracy) |
| --no-attention | False | Disable self-attention (smaller, faster models) |
Training Hyperparameters
| Parameter | Default | Description |
|---|---|---|
| --epochs | 20 | Number of training epochs |
| --batch-size | 16 | Training batch size |
| --learning-rate | 0.0003 | Learning rate (expert recommended: 3e-4) |
| --max-train-texts | 15000 | Maximum training texts to use |
| --max-val-texts | 1000 | Maximum validation texts to use |
Data Sampling Options
| Parameter | Default | Description |
|---|---|---|
| --balanced-sampling | False | Equal representation of each character type |
| --samples-per-char | Auto | Target samples per character (auto-detected if not specified) |
Platform-Specific Configurations
| Parameter | Description |
|---|---|
| --cuda-config | CUDA-optimized: batch=128, context=128, hidden=512, attention=on |
| --mps-config | MPS-optimized: batch=64, context=96, hidden=256, attention=on |
| --cpu-config | CPU-optimized: batch=4, context=20, hidden=128, attention=off |
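The presets listed above are easy to mirror in your own scripts. The sketch below copies the documented values into a dictionary. The names `PRESETS` and `resolve_config` are hypothetical, and the rule that explicit overrides win over preset values is an assumption rather than documented CLI behavior.

```python
# Values copied from the presets above; the structure and names are illustrative only.
PRESETS = {
    "cuda": {"batch_size": 128, "context_size": 128, "hidden_size": 512, "use_attention": True},
    "mps": {"batch_size": 64, "context_size": 96, "hidden_size": 256, "use_attention": True},
    "cpu": {"batch_size": 4, "context_size": 20, "hidden_size": 128, "use_attention": False},
}


def resolve_config(preset: str, **overrides) -> dict:
    """Start from a platform preset and apply explicit overrides on top (assumed precedence)."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset: {preset!r}")
    return {**PRESETS[preset], **overrides}


print(resolve_config("mps", batch_size=32))
```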
Example Training Commands
# Small model for quick experiments
nokta-train --data-cache data/combined_cache.pkl \
--output models/small.pth \
--context-size 20 \
--hidden-size 128 \
--epochs 10 \
--batch-size 32
# Balanced training with equal character representation
nokta-train --data-cache data/combined_cache.pkl \
--output models/balanced.pth \
--balanced-sampling \
--samples-per-char 1000 \
--epochs 30
# High-accuracy configuration
nokta-train --data-cache data/combined_cache.pkl \
--output models/high_accuracy.pth \
--context-size 96 \
--hidden-size 256 \
--num-lstm-layers 3 \
--use-attention \
--epochs 100 \
--batch-size 8
# CUDA GPU optimized (automatic configuration)
nokta-train --data-cache data/combined_cache.pkl \
--output models/cuda_optimized.pth \
--cuda-config \
--epochs 50
# MPS optimized (automatic configuration)
nokta-train --data-cache data/combined_cache.pkl \
--output models/mps_optimized.pth \
--mps-config \
--epochs 50
# CPU optimized (automatic configuration)
nokta-train --data-cache data/combined_cache.pkl \
--output models/cpu_optimized.pth \
--cpu-config \
--epochs 20
Evaluation Parameters
Evaluation Options
| Parameter | Default | Description |
|---|---|---|
| --model | Required | Path to trained model file |
| --test-file | Required | Path to test file with ground truth |
| --context-size | Auto | Override context size (use model default if not specified) |
| --num-passes | 1 | Number of restoration passes (improves multi-diacritic words) |
| --output | None | Save detailed results to file |
Example Evaluation Commands
# Basic evaluation
nokta evaluate --model models/my_model.pth \
--test-file data/test_datasets/vikipedi_test.txt
# Multi-pass evaluation with detailed output
nokta evaluate --model models/my_model.pth \
--test-file data/test_datasets/vikipedi_test.txt \
--num-passes 3 \
--output evaluation_results.txt
# Override context size if needed
nokta evaluate --model models/my_model.pth \
--test-file data/test_datasets/vikipedi_test.txt \
--context-size 50
The evaluation provides:
- Character-level accuracy: Precise diacritic restoration
- Word-level accuracy: Complete word correctness
- Diacritic-specific accuracy: How well diacritics are restored
- Multi-pass restoration: With convergence detection
- Per-sentence breakdown: Detailed analysis for debugging
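The three accuracy figures can be reproduced by hand on small examples. The sketch below shows one plausible set of definitions. It assumes diacritic-specific accuracy counts only positions where the ground truth holds a diacritic, which may differ from how the evaluation command computes it; all function names are hypothetical.

```python
DIACRITICS = set("çğıöşüÇĞİÖŞÜ")


def char_accuracy(pred: str, truth: str) -> float:
    """Share of positions where prediction and ground truth agree."""
    if len(pred) != len(truth):
        raise ValueError("restoration must preserve length")
    return sum(p == t for p, t in zip(pred, truth)) / len(truth) if truth else 1.0


def word_accuracy(pred: str, truth: str) -> float:
    """Share of whitespace-separated words reproduced exactly."""
    pairs = list(zip(pred.split(), truth.split()))
    return sum(p == t for p, t in pairs) / len(pairs) if pairs else 1.0


def diacritic_accuracy(pred: str, truth: str) -> float:
    """Accuracy restricted to positions where the ground truth has a diacritic (assumed definition)."""
    targets = [(p, t) for p, t in zip(pred, truth) if t in DIACRITICS]
    return sum(p == t for p, t in targets) / len(targets) if targets else 1.0


truth = "çok güzel gün"
pred = "çok guzel gün"  # one missed ü
print(char_accuracy(pred, truth))       # 12 of 13 characters
print(word_accuracy(pred, truth))       # 2 of 3 words
print(diacritic_accuracy(pred, truth))  # 2 of 3 diacritics
```

The example shows why the diacritic-specific number is the strictest of the three: a single missed character barely moves character accuracy but costs a third of the diacritic score.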
Model Architecture Comparison
With Self-Attention (Recommended)
Benefits:
- ~15 percentage points better diacritic accuracy
- Better long-range context understanding
- Improved Turkish vowel harmony detection
- Expert validated architecture
- Model size: ~14MB
Use for: Production deployments, maximum accuracy requirements
Without Self-Attention (Lightweight)
Benefits:
- ~28% smaller model size (10.2MB vs. 14.2MB)
- Faster training and inference
- Lower memory requirements
- Still effective for most text
- Model size: ~10MB
Use for: Resource-constrained environments, mobile deployment
Performance Comparison
| Metric | Without Attention | With Attention | Improvement |
|---|---|---|---|
| Diacritic accuracy | ~6% | ~21% | +15 points |
| Character accuracy | ~92% | ~93% | +1 point |
| Model size | 10.2MB | 14.2MB | +4MB |
Python API
import nokta_ai
# Load trained model
restorer = nokta_ai.DiacriticsRestorer(model_path='models/my_model.pth')
# Restore diacritics
text_without = "Turkiye'nin baskenti Ankara'dir"
text_restored = restorer.restore_diacritics(text_without)
print(text_restored) # "Türkiye'nin başkenti Ankara'dır"
# Process multiple texts
texts = ["Bugun hava guzel", "Cocuklar oynuyor"]
for text in texts:
    restored = restorer.restore_diacritics(text)
    print(f"{text} -> {restored}")
# Use mapper utilities
mapper = nokta_ai.TurkishDiacriticsMapper()
stripped = mapper.remove_diacritics("Günaydın dünya")
normalized = mapper.normalize_text(" MERHABA!!! ")
Test Datasets
Included Test Files
data/test_datasets/vikipedi_test.txt: 290 lines of high-quality Turkish text from Wikipedia
- Content: Turkish constitutional history with proper diacritics
- Purpose: Benchmark model performance on real-world text
Adding Custom Test Files
- Place files in: data/test_datasets/
- Format: One sentence per line with correct diacritics
- Encoding: UTF-8 text files
- Example structure:
Türkiye'nin başkenti Ankara'dır.
Öğrenciler sınıfta ders çalışıyor.
Çocuklar bahçede futbol oynuyorlar.
Advanced Usage
Context Size Experiments
Different context sizes offer different trade-offs:
Small Context (Fast Training)
# 20-character context - sees immediate neighbors only
nokta-train --data-cache data/combined_cache.pkl \
--output models/small_ctx.pth \
--context-size 20 \
--hidden-size 128 \
--epochs 15 \
--batch-size 32
- Good for: Quick experiments, limited resources
- Limitations: May miss word-level patterns
Medium Context (Balanced)
# 50-character context - sees 1-2 full words
nokta-train --data-cache data/combined_cache.pkl \
--output models/medium_ctx.pth \
--context-size 50 \
--hidden-size 256 \
--epochs 20 \
--batch-size 16
- Good for: General purpose, balanced accuracy/speed
- Best for: Most use cases
Large Context (Expert Recommended)
# 96-character context - sees full sentences
nokta-train --data-cache data/combined_cache.pkl \
--output models/large_ctx.pth \
--context-size 96 \
--hidden-size 256 \
--num-lstm-layers 2 \
--use-attention \
--epochs 25 \
--batch-size 8
- Good for: Maximum accuracy, Turkish grammar patterns
- Expert validated: Optimal balance of accuracy and efficiency
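To see what a context size means in practice, the sketch below cuts a fixed-width window around one character. It assumes the window is centred on the target position and padded with spaces at the text edges; the package may lay out its windows differently, and `context_window` is a hypothetical helper.

```python
def context_window(text: str, pos: int, size: int = 96, pad: str = " ") -> str:
    """Return `size` characters centred on `pos`, padding past the text edges."""
    if not 0 <= pos < len(text):
        raise IndexError("pos is outside the text")
    left = size // 2
    padded = pad * left + text + pad * (size - left)
    # text[pos] now sits at padded[pos + left], i.e. at index `left` of the window
    return padded[pos : pos + size]


window = context_window("bugun hava cok guzel", pos=11, size=20)
print(repr(window))  # the 'c' of "cok" sits at index 10 of the window
```

With a 20-character window the model sees roughly one neighbouring word on each side, which matches the "immediate neighbors only" description of the small configuration.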
Multi-Pass Restoration
For words with multiple diacritics, such as "Ucuncu" → "Üçüncü":
# Single pass (default)
nokta evaluate --model models/my_model.pth \
--test-file data/test_datasets/vikipedi_test.txt \
--num-passes 1
# Multi-pass with automatic convergence detection
nokta evaluate --model models/my_model.pth \
--test-file data/test_datasets/vikipedi_test.txt \
--num-passes 3
The model automatically stops when output converges between passes.
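The convergence check can be pictured as a simple loop that re-feeds the output until nothing changes. The sketch below uses a toy stand-in for the model (`one_fix_per_pass`, which corrects one character per call) purely to make the loop runnable; both function names are hypothetical and the real restorer is not involved.

```python
from typing import Callable


def restore_multi_pass(restore: Callable[[str], str], text: str, num_passes: int = 3):
    """Apply `restore` up to num_passes times, stopping once the output stops changing."""
    passes_used = 0
    for _ in range(num_passes):
        updated = restore(text)
        passes_used += 1
        if updated == text:  # converged: another pass would change nothing
            break
        text = updated
    return text, passes_used


def one_fix_per_pass(text: str, target: str = "üçüncü") -> str:
    """Toy stand-in for a model: corrects only the first wrong character."""
    for i, (a, b) in enumerate(zip(text, target)):
        if a != b:
            return text[:i] + b + text[i + 1:]
    return text


print(restore_multi_pass(one_fix_per_pass, "ucuncu", num_passes=3))   # ('üçüncu', 3)
print(restore_multi_pass(one_fix_per_pass, "ucuncu", num_passes=10))  # ('üçüncü', 5)
```

In the second call the loop stops after five passes even though ten were allowed: four passes make corrections and the fifth confirms that the output no longer changes.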
Performance
Hardware Acceleration
The system automatically detects and uses optimal hardware acceleration with intelligent warnings and recommendations:
1. NVIDIA CUDA (Recommended for Large-Scale Training)
- Professional GPUs: Tesla, V100, A100 - excellent for large models and batch sizes
- Consumer GPUs: RTX/GTX - good performance for most training tasks
- Automatic optimization tips: Batch size and model size recommendations based on GPU memory
- Memory detection: Automatically suggests configurations for high-memory GPUs
# CUDA-optimized training (automatic configuration)
nokta-train --data-cache data/combined_cache.pkl \
--output models/cuda_model.pth \
--cuda-config \
--epochs 50
# Manual CUDA optimization for high-memory GPUs
nokta-train --data-cache data/combined_cache.pkl \
--output models/cuda_manual.pth \
--batch-size 128 \
--context-size 128 \
--hidden-size 512 \
--epochs 50
2. Apple Silicon MPS (M1/M2/M3)
- Excellent performance: Optimized for Apple Silicon processors
- Energy efficient: Lower power consumption than NVIDIA GPUs
- Automatic detection: Works seamlessly on Mac systems
- Memory efficient: Good balance of performance and resource usage
# MPS-optimized training (automatic configuration)
nokta-train --data-cache data/combined_cache.pkl \
--output models/mps_model.pth \
--mps-config \
--epochs 50
3. CPU Fallback (With Warnings)
- Performance warnings: Alerts about slow training speed
- Optimization suggestions: Smaller models and configuration tips
- CPU-optimized presets: Automatic small model configurations
# CPU-optimized training (automatic configuration)
nokta-train --data-cache data/combined_cache.pkl \
--output models/cpu_model.pth \
--cpu-config \
--epochs 30
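If you want to check which backend would be chosen before starting a long run, a small probe along these lines may help. The priority order (CUDA, then MPS, then CPU) is an assumption based on the ordering in this section, and `pick_device` is a hypothetical helper. It relies only on `torch.cuda.is_available()` and `torch.backends.mps.is_available()`, and falls back to CPU if PyTorch is not installed.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu", preferring GPU backends when present."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is missing, so no accelerator can be used
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # absent on very old PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(f"Selected device: {pick_device()}")
```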
Benchmarks
NVIDIA CUDA GPUs
- High-memory GPUs (A100, V100, Tesla): ~5,000-8,000 samples/second training, ~50,000+ characters/second inference
- Consumer GPUs (RTX/GTX): ~2,000-4,000 samples/second training, ~20,000-30,000 characters/second inference
- Recommended: Use --cuda-config for optimal settings
- Can handle large models with sufficient GPU memory
Apple M1/M2/M3 with MPS
- Training speed: ~1,000-2,000 samples/second
- Inference speed: ~10,000-20,000 characters/second
- Memory efficient: Good for models up to 512 hidden size
- Energy efficient: Lower power consumption
- Recommended: Use --mps-config for optimal settings
CPU Fallback
- Training speed: ~50-100 samples/second ⚠️
- Inference speed: ~1,000 characters/second ⚠️
- Use only for small models and testing
Expected Accuracy
With proper training (50+ epochs on quality Turkish data):
- Character-level accuracy: 95%+
- Word-level accuracy: 90%+
- Diacritic-specific accuracy: 85%+
Training Data Requirements
Data Sources
The system can be trained on any Turkish text corpus. Best results are achieved with:
- Wikipedia dumps: Comprehensive vocabulary and proper spelling
- News articles: Contemporary language usage
- Books and literature: Diverse writing styles
- Web crawls: Informal language patterns
Included Training Data
data/aysnrgenc_turkishdeasciifier_train.txt: High-quality Turkish text with proper diacritics
- Ready to use for training without additional data collection
Adding Your Own Data
- Format: Plain text files (UTF-8) with proper Turkish diacritics
- Structure: One or more paragraphs per file
- Quality: Ensure text has correct spelling and diacritics
- Combine: Use the data preparation script to merge multiple files:
# Combine your data with the included training data
python scripts/prepare_training_data.py \
data/aysnrgenc_turkishdeasciifier_train.txt \
your_data/*.txt \
--output data/custom_cache.pkl
Data Requirements
- Minimum: 1MB of Turkish text (included data meets this)
- Recommended: 10MB+ for good coverage
- Optimal: 100MB+ for production use
Troubleshooting
Common Issues
- Low accuracy after training
  - Train for more epochs (50+)
  - Use more training data
  - Enable self-attention (--use-attention)
  - Increase model capacity (--hidden-size 256)
- MPS not detected on Mac
  - Ensure PyTorch 2.0+ is installed
  - Check MPS availability:
    python -c "import torch; print(torch.backends.mps.is_available())"
- Out of memory errors
  - Reduce --batch-size
  - Reduce --context-size
  - Disable attention (--no-attention)
  - Use smaller --hidden-size
- Model predictions seem random
  - Check if you're using the correct --context-size during evaluation
  - Ensure model was trained for enough epochs
  - Verify training data quality
Implementation Details
Character Mapping
The system handles these Turkish-specific transformations:
- ç ↔ c, Ç ↔ C
- ğ ↔ g, Ğ ↔ G
- ı ↔ i, İ ↔ I
- ö ↔ o, Ö ↔ O
- ş ↔ s, Ş ↔ S
- ü ↔ u, Ü ↔ U
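The forward direction of this mapping (removing diacritics) is deterministic and fits in a translation table. The sketch below is a standalone illustration using only the pairs listed above. It is not the package's `TurkishDiacriticsMapper`, and `strip_diacritics` is a hypothetical name.

```python
# One-to-one table built from the twelve pairs listed above.
_TO_ASCII = str.maketrans("çğıöşüÇĞİÖŞÜ", "cgiosuCGIOSU")


def strip_diacritics(text: str) -> str:
    """Replace each Turkish-specific letter with its ASCII counterpart."""
    return text.translate(_TO_ASCII)


print(strip_diacritics("Günaydın dünya"))                   # Gunaydin dunya
print(strip_diacritics("Türkiye'nin başkenti Ankara'dır"))  # Turkiye'nin baskenti Ankara'dir
```

The reverse direction is the hard part: every c, g, i, o, s, and u in the stripped text is ambiguous, which is why restoration needs a model while removal needs only a lookup.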
Multi-Head Self-Attention
When enabled, the attention mechanism helps the model:
- Focus on relevant context for ambiguous cases
- Learn long-range dependencies
- Handle Turkish vowel harmony rules
- Improve accuracy on compound words
Balanced Sampling
The --balanced-sampling option ensures equal representation:
- Groups training samples by diacritic character type
- Balances representation across all 6 Turkish character pairs
- Prevents model bias toward common characters like 'i/ı'
- Use --samples-per-char to control target samples per character
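One way to picture balanced sampling is as grouping followed by capped random draws. The sketch below assumes training samples are (window, target character) tuples and that balancing is done by undersampling large groups; the actual sample format and strategy used by the training script may differ, and `balanced_sample` is a hypothetical helper.

```python
import random
from collections import defaultdict

# Maps each diacritic letter to the base letter that names its pair.
BASE_LETTERS = {"ç": "c", "ğ": "g", "ı": "i", "ö": "o", "ş": "s", "ü": "u"}


def balanced_sample(samples, samples_per_char: int, seed: int = 0):
    """Draw at most samples_per_char samples from each character pair."""
    groups = defaultdict(list)
    for window, target in samples:
        groups[BASE_LETTERS.get(target, target)].append((window, target))
    rng = random.Random(seed)
    picked = []
    for base in sorted(groups):  # sorted for reproducible output
        group = groups[base]
        picked.extend(rng.sample(group, min(samples_per_char, len(group))))
    rng.shuffle(picked)
    return picked


samples = [("w", "i")] * 50 + [("w", "ı")] * 50 + [("w", "ş")] * 5 + [("w", "c")] * 20
picked = balanced_sample(samples, samples_per_char=10)
print(len(picked))  # 25: ten from i/ı, ten from c/ç, and all five from s/ş
```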
Package Structure
nokta-ai/
├── nokta_ai/
│ ├── __init__.py # Package initialization
│ ├── core.py # Main restoration classes
│ ├── models/ # Neural network architectures
│ │ └── constrained.py # Constrained diacritics model
│ └── cli/ # Command line interfaces
│ ├── main.py # Main CLI (nokta command)
│ ├── train_constrained.py # Training script
│ └── evaluate_constrained.py # Evaluation script
├── data/
│ ├── aysnrgenc_turkishdeasciifier_train.txt # High-quality Turkish training data
│ ├── combined_cache.pkl # Preprocessed training data (generated)
│ └── test_datasets/ # Test datasets
│ └── vikipedi_test.txt # Wikipedia test data
├── models/ # Trained model weights (generated)
├── scripts/
│ └── prepare_training_data.py # Data preparation script
├── ARCHITECTURE.md # Detailed neural network documentation
└── README.md # This file
CLI Command Reference
Training: nokta-train
nokta-train --data-cache DATA --output MODEL [options]
Evaluation: nokta evaluate / nokta-evaluate
nokta evaluate --model MODEL --test-file FILE [options]
Restoration: nokta restore
nokta restore --model MODEL [--text TEXT | --input FILE] [options]
Benchmarking: nokta benchmark
nokta benchmark --model MODEL [options]
License
MIT License
Citation
If you use this work in your research, please cite:
@software{turkish_diacritics_nn,
title = {Turkish Diacritics Restoration with Neural Networks},
year = {2024},
url = {https://github.com/armish/nokta-ai}
}
Contact
For questions and contributions, please open an issue on GitHub.