Turkish Diacritics Restoration with Neural Networks

These details have not been verified by PyPI

Project links

Project description

nokta-ai

Turkish Diacritics Restoration with Neural Networks

A lightweight PyTorch-based neural network package for restoring diacritics in Turkish text. nokta-ai can accurately restore Turkish special characters (ç, ğ, ı, ö, ş, ü) from text where they have been removed or replaced with ASCII equivalents.

Overview

This project implements a character-level sequence-to-sequence model using bidirectional LSTMs with multi-head self-attention to restore diacritics in Turkish text. Unlike rule-based or large language model approaches, this solution is:

Lightweight: Small model size (~10-15MB depending on configuration)
Fast: Processes text in real-time using GPU acceleration (MPS on Apple Silicon, CUDA on NVIDIA)
Accurate: Achieves 95%+ accuracy with proper training data
Self-contained: No external API dependencies

Recent Improvements

v2.0 - Major Architecture Enhancements

Case Normalization: 50% reduction in model complexity by processing lowercase text with case pattern restoration
Turkish-Specific Case Handling: Proper i/İ and ı/I mappings (critical for Turkish orthography)
CUDA Compatibility: Fixed character embedding bounds checking for stable GPU training
Binary Classification with Loss Penalties: Model learns when NOT to add diacritics (reduces false positives)
Improved Capital Letter Accuracy: Deterministic case handling instead of learned patterns
Character Safety: All Unicode characters > 255 safely handled with fallback mechanisms

Architecture

Model Components

Character-level Tokenization: Works at the character level for fine-grained control
Case Normalization: Processes text in lowercase with case pattern restoration (50% complexity reduction)
Bidirectional LSTM: Captures both forward and backward context (2-4 layers)
Multi-Head Self-Attention: Focuses on relevant parts of the input sequence (optional, recommended)
Binary Classification: 6 classifiers for Turkish diacritic pairs (c/ç, g/ğ, i/ı, o/ö, s/ş, u/ü)

Technical Specifications

Input: Turkish text with diacritics removed (mixed case supported)
Output: Turkish text with diacritics restored (preserves original case)
Context Window: Configurable (default: 96 characters, expert recommended)
Vocabulary Size: 256 characters (covers Turkish alphabet + common punctuation)
Hidden Size: Configurable (default: 256 dimensions, expert recommended)
Attention Heads: 4 (when self-attention enabled)
Turkish Case Handling: Proper i/İ and ı/I mappings (differs from English)
CUDA Compatible: Safe character bounds checking for GPU training

Installation

From PyPI (Recommended)

pip install nokta-ai

From Source

# Clone the repository
git clone https://github.com/armish/nokta-ai.git
cd nokta-ai

# Install in development mode
pip install -e .

# Or install with optional dependencies
pip install -e ".[dev,data]"

Prerequisites

Python 3.9+
PyTorch 2.0+
Apple Silicon Mac (for MPS acceleration) or NVIDIA GPU (for CUDA)

Quick Start

1. Prepare Training Data

# Basic usage with the included Turkish deasciifier training data
python scripts/prepare_training_data.py data/aysnrgenc_turkishdeasciifier_train.txt

# Combine multiple corpus files for larger training set
python scripts/prepare_training_data.py data/corpus1.txt data/corpus2.txt data/corpus3.txt

# Use custom output location
python scripts/prepare_training_data.py data/*.txt --output data/my_cache.pkl

# Adjust train/validation split (90% train, 10% validation)
python scripts/prepare_training_data.py data/*.txt --train-ratio 0.9

This creates data/combined_cache.pkl containing prepared training and validation data.

2. Train a Model

Expert-Recommended Configuration (With Self-Attention)

# Best accuracy, ~15% better diacritic restoration
nokta-train --data-cache data/combined_cache.pkl \
            --output models/my_model.pth \
            --context-size 96 \
            --hidden-size 256 \
            --num-lstm-layers 2 \
            --use-attention \
            --epochs 50

Lightweight Configuration (Without Self-Attention)

# Smaller, faster model for resource-constrained environments
nokta-train --data-cache data/combined_cache.pkl \
            --output models/my_model_light.pth \
            --context-size 96 \
            --hidden-size 256 \
            --no-attention \
            --epochs 50

3. Evaluate Your Model

# Basic evaluation
nokta evaluate --model models/my_model.pth \
               --test-file data/test_datasets/vikipedi_test.txt

# Multi-pass restoration for words with multiple diacritics
nokta evaluate --model models/my_model.pth \
               --test-file data/test_datasets/vikipedi_test.txt \
               --num-passes 3

4. Use Your Model

# Interactive mode
nokta-restore --model models/my_model.pth

# Direct text input
nokta-restore --model models/my_model.pth --text "Bugun hava cok guzel"

# Process a file
nokta-restore --model models/my_model.pth --input input.txt --output output.txt

Training Parameters

Core Architecture Options

Parameter	Default	Description
`--context-size`	96	Context window size (expert recommended: 96)
`--hidden-size`	256	LSTM hidden dimensions (expert recommended: 256)
`--num-lstm-layers`	2	Number of LSTM layers (expert recommends: 2-4)
`--use-attention`	True	Enable self-attention (recommended for accuracy)
`--no-attention`	False	Disable self-attention (smaller, faster models)

Training Hyperparameters

Parameter	Default	Description
`--epochs`	20	Number of training epochs
`--batch-size`	16	Training batch size
`--learning-rate`	0.0003	Learning rate (expert recommended: 3e-4)
`--max-train-texts`	15000	Maximum training texts to use
`--max-val-texts`	1000	Maximum validation texts to use

Data Sampling Options

Parameter	Default	Description
`--balanced-sampling`	False	Equal representation of each character type
`--samples-per-char`	Auto	Target samples per character (auto-detected if not specified)

Platform-Specific Configurations

Parameter	Description
`--cuda-config`	CUDA-optimized: batch=128, context=128, hidden=512, attention=on
`--mps-config`	MPS-optimized: batch=64, context=96, hidden=256, attention=on
`--cpu-config`	CPU-optimized: batch=4, context=20, hidden=128, attention=off

Example Training Commands

# Small model for quick experiments
nokta-train --data-cache data/combined_cache.pkl \
            --output models/small.pth \
            --context-size 20 \
            --hidden-size 128 \
            --epochs 10 \
            --batch-size 32

# Balanced training with equal character representation
nokta-train --data-cache data/combined_cache.pkl \
            --output models/balanced.pth \
            --balanced-sampling \
            --samples-per-char 1000 \
            --epochs 30

# High-accuracy configuration
nokta-train --data-cache data/combined_cache.pkl \
            --output models/high_accuracy.pth \
            --context-size 96 \
            --hidden-size 256 \
            --num-lstm-layers 3 \
            --use-attention \
            --epochs 100 \
            --batch-size 8

# CUDA GPU optimized (automatic configuration)
nokta-train --data-cache data/combined_cache.pkl \
            --output models/cuda_optimized.pth \
            --cuda-config \
            --epochs 50

# MPS optimized (automatic configuration)
nokta-train --data-cache data/combined_cache.pkl \
            --output models/mps_optimized.pth \
            --mps-config \
            --epochs 50

# CPU optimized (automatic configuration)
nokta-train --data-cache data/combined_cache.pkl \
            --output models/cpu_optimized.pth \
            --cpu-config \
            --epochs 20

Evaluation Parameters

Evaluation Options

Parameter	Default	Description
`--model`	Required	Path to trained model file
`--test-file`	Required	Path to test file with ground truth
`--context-size`	Auto	Override context size (use model default if not specified)
`--num-passes`	1	Number of restoration passes (improves multi-diacritic words)
`--output`	None	Save detailed results to file

Example Evaluation Commands

# Basic evaluation
nokta evaluate --model models/my_model.pth \
               --test-file data/test_datasets/vikipedi_test.txt

# Multi-pass evaluation with detailed output
nokta evaluate --model models/my_model.pth \
               --test-file data/test_datasets/vikipedi_test.txt \
               --num-passes 3 \
               --output evaluation_results.txt

# Override context size if needed
nokta evaluate --model models/my_model.pth \
               --test-file data/test_datasets/vikipedi_test.txt \
               --context-size 50

The evaluation provides:

Character-level accuracy: Precise diacritic restoration
Word-level accuracy: Complete word correctness
Diacritic-specific accuracy: How well diacritics are restored
Multi-pass restoration: With convergence detection
Per-sentence breakdown: Detailed analysis for debugging

Model Architecture Comparison

With Self-Attention (Recommended)

Benefits:

~15% better diacritic accuracy
Better long-range context understanding
Improved Turkish vowel harmony detection
Expert validated architecture
Model size: ~14MB

Use for: Production deployments, maximum accuracy requirements

Without Self-Attention (Lightweight)

Benefits:

~20% smaller model size
Faster training and inference
Lower memory requirements
Still effective for most text
Model size: ~10MB

Use for: Resource-constrained environments, mobile deployment

Performance Comparison

Metric	Without Attention	With Attention	Improvement
Diacritic accuracy	~6%	~21%	+15%
Character accuracy	~92%	~93%	+1%
Model size	10.2MB	14.2MB	+4MB

Python API

import nokta_ai

# Load trained model
restorer = nokta_ai.DiacriticsRestorer(model_path='models/my_model.pth')

# Restore diacritics
text_without = "Turkiye'nin baskenti Ankara'dir"
text_restored = restorer.restore_diacritics(text_without)
print(text_restored)  # "Türkiye'nin başkenti Ankara'dır"

# Process multiple texts
texts = ["Bugun hava guzel", "Cocuklar oynuyor"]
for text in texts:
    restored = restorer.restore_diacritics(text)
    print(f"{text} -> {restored}")

# Use mapper utilities
mapper = nokta_ai.TurkishDiacriticsMapper()
stripped = mapper.remove_diacritics("Günaydın dünya")
normalized = mapper.normalize_text("  MERHABA!!!  ")

Test Datasets

Included Test Files

data/test_datasets/vikipedi_test.txt: 290 lines of high-quality Turkish text from Wikipedia
Content: Turkish constitutional history with proper diacritics
Purpose: Benchmark model performance on real-world text

Adding Custom Test Files

Place files in: data/test_datasets/
Format: One sentence per line with correct diacritics
Encoding: UTF-8 text files

Example structure:

Türkiye'nin başkenti Ankara'dır.
Öğrenciler sınıfta ders çalışıyor.
Çocuklar bahçede futbol oynuyorlar.

Advanced Usage

Context Size Experiments

Different context sizes offer different trade-offs:

Small Context (Fast Training)

# 20-character context - sees immediate neighbors only
nokta-train --data-cache data/combined_cache.pkl \
            --output models/small_ctx.pth \
            --context-size 20 \
            --hidden-size 128 \
            --epochs 15 \
            --batch-size 32

Good for: Quick experiments, limited resources
Limitations: May miss word-level patterns

Medium Context (Balanced)

# 50-character context - sees 1-2 full words
nokta-train --data-cache data/combined_cache.pkl \
            --output models/medium_ctx.pth \
            --context-size 50 \
            --hidden-size 256 \
            --epochs 20 \
            --batch-size 16

Good for: General purpose, balanced accuracy/speed
Best for: Most use cases

Large Context (Expert Recommended)

# 96-character context - sees full sentences
nokta-train --data-cache data/combined_cache.pkl \
            --output models/large_ctx.pth \
            --context-size 96 \
            --hidden-size 256 \
            --num-lstm-layers 2 \
            --use-attention \
            --epochs 25 \
            --batch-size 8

Good for: Maximum accuracy, Turkish grammar patterns
Expert validated: Optimal balance of accuracy and efficiency

Multi-Pass Restoration

For words with multiple diacritics like "Üçüncü" → "Ucuncu":

# Single pass (default)
nokta evaluate --model models/my_model.pth \
    --test-file data/test_datasets/vikipedi_test.txt \
    --num-passes 1

# Multi-pass with automatic convergence detection
nokta evaluate --model models/my_model.pth \
    --test-file data/test_datasets/vikipedi_test.txt \
    --num-passes 3

The model automatically stops when output converges between passes.

Performance

Hardware Acceleration

The system automatically detects and uses optimal hardware acceleration with intelligent warnings and recommendations:

1. NVIDIA CUDA (Recommended for Large-Scale Training)

Professional GPUs: Tesla, V100, A100 - excellent for large models and batch sizes
Consumer GPUs: RTX/GTX - good performance for most training tasks
Automatic optimization tips: Batch size and model size recommendations based on GPU memory
Memory detection: Automatically suggests configurations for high-memory GPUs

# CUDA-optimized training (automatic configuration)
nokta-train --data-cache data/combined_cache.pkl \
            --output models/cuda_model.pth \
            --cuda-config \
            --epochs 50

# Manual CUDA optimization for high-memory GPUs
nokta-train --data-cache data/combined_cache.pkl \
            --output models/cuda_manual.pth \
            --batch-size 128 \
            --context-size 128 \
            --hidden-size 512 \
            --epochs 50

2. Apple Silicon MPS (M1/M2/M3)

Excellent performance: Optimized for Apple Silicon processors
Energy efficient: Lower power consumption than NVIDIA GPUs
Automatic detection: Works seamlessly on Mac systems
Memory efficient: Good balance of performance and resource usage

# MPS-optimized training (automatic configuration)
nokta-train --data-cache data/combined_cache.pkl \
            --output models/mps_model.pth \
            --mps-config \
            --epochs 50

3. CPU Fallback (With Warnings)

Performance warnings: Alerts about slow training speed
Optimization suggestions: Smaller models and configuration tips
CPU-optimized presets: Automatic small model configurations

# CPU-optimized training (automatic configuration)
nokta-train --data-cache data/combined_cache.pkl \
            --output models/cpu_model.pth \
            --cpu-config \
            --epochs 30

Benchmarks

NVIDIA CUDA GPUs

High-memory GPUs (A100, V100, Tesla): ~5,000-8,000 samples/second training, ~50,000+ characters/second inference
Consumer GPUs (RTX/GTX): ~2,000-4,000 samples/second training, ~20,000-30,000 characters/second inference
Recommended: Use --cuda-config for optimal settings
Can handle large models with sufficient GPU memory

Apple M1/M2/M3 with MPS

Training speed: ~1,000-2,000 samples/second
Inference speed: ~10,000-20,000 characters/second
Memory efficient: Good for models up to 512 hidden size
Energy efficient: Lower power consumption
Recommended: Use --mps-config for optimal settings

CPU Fallback

Training speed: ~50-100 samples/second ⚠️
Inference speed: ~1,000 characters/second ⚠️
Use only for small models and testing

Expected Accuracy

With proper training (50+ epochs on quality Turkish data):

Character-level accuracy: 95%+
Word-level accuracy: 90%+
Diacritic-specific accuracy: 85%+

Training Data Requirements

Data Sources

The system can be trained on any Turkish text corpus. Best results are achieved with:

Wikipedia dumps: Comprehensive vocabulary and proper spelling
News articles: Contemporary language usage
Books and literature: Diverse writing styles
Web crawls: Informal language patterns

Included Training Data

data/aysnrgenc_turkishdeasciifier_train.txt: High-quality Turkish text with proper diacritics
Ready to use for training without additional data collection

Adding Your Own Data

Format: Plain text files (UTF-8) with proper Turkish diacritics
Structure: One or more paragraphs per file
Quality: Ensure text has correct spelling and diacritics
Combine: Use the data preparation script to merge multiple files:

# Combine your data with the included training data
python scripts/prepare_training_data.py \
    data/aysnrgenc_turkishdeasciifier_train.txt \
    your_data/*.txt \
    --output data/custom_cache.pkl

Data Requirements

Minimum: 1MB of Turkish text (included data meets this)
Recommended: 10MB+ for good coverage
Optimal: 100MB+ for production use

Troubleshooting

Common Issues

Low accuracy after training
- Solution: Train for more epochs (50+)
- Use more training data
- Enable self-attention (--use-attention)
- Increase model capacity (--hidden-size 256)
MPS not detected on Mac
- Ensure PyTorch 2.0+ is installed
- Check MPS availability: python -c "import torch; print(torch.backends.mps.is_available())"
Out of memory errors
- Reduce --batch-size
- Reduce --context-size
- Disable attention (--no-attention)
- Use smaller --hidden-size
Model predictions seem random
- Check if you're using the correct --context-size during evaluation
- Ensure model was trained for enough epochs
- Verify training data quality

Implementation Details

Character Mapping

The system handles these Turkish-specific transformations:

ç ↔ c, Ç ↔ C
ğ ↔ g, Ğ ↔ G
ı ↔ i, İ ↔ I
ö ↔ o, Ö ↔ O
ş ↔ s, Ş ↔ S
ü ↔ u, Ü ↔ U

Multi-Head Self-Attention

When enabled, the attention mechanism helps the model:

Focus on relevant context for ambiguous cases
Learn long-range dependencies
Handle Turkish vowel harmony rules
Improve accuracy on compound words

Balanced Sampling

The --balanced-sampling option ensures equal representation:

Groups training samples by diacritic character type
Balances representation across all 6 Turkish character pairs
Prevents model bias toward common characters like 'i/ı'
Use --samples-per-char to control target samples per character

Package Structure

nokta-ai/
├── nokta_ai/
│   ├── __init__.py           # Package initialization
│   ├── core.py               # Main restoration classes
│   ├── models/               # Neural network architectures
│   │   └── constrained.py    # Constrained diacritics model
│   └── cli/                  # Command line interfaces
│       ├── main.py           # Main CLI (nokta command)
│       ├── train_constrained.py  # Training script
│       └── evaluate_constrained.py  # Evaluation script
├── data/
│   ├── aysnrgenc_turkishdeasciifier_train.txt  # High-quality Turkish training data  # Additional training data
│   ├── combined_cache.pkl    # Preprocessed training data (generated)
│   └── test_datasets/        # Test datasets
│       └── vikipedi_test.txt # Wikipedia test data
├── models/                   # Trained model weights (generated)
├── scripts/
│   └── prepare_training_data.py  # Data preparation script
├── ARCHITECTURE.md           # Detailed neural network documentation
└── README.md                # This file

CLI Command Reference

Training: `nokta-train`

nokta-train --data-cache DATA --output MODEL [options]

Evaluation: `nokta evaluate` / `nokta-evaluate`

nokta evaluate --model MODEL --test-file FILE [options]

Restoration: `nokta restore`

nokta restore --model MODEL [--text TEXT | --input FILE] [options]

Benchmarking: `nokta benchmark`

nokta benchmark --model MODEL [options]

License

MIT License

Citation

If you use this work in your research, please cite:

@software{turkish_diacritics_nn,
  title = {Turkish Diacritics Restoration with Neural Networks},
  year = {2024},
  url = {https://github.com/armish/nokta-ai}
}

Contact

For questions and contributions, please open an issue on GitHub.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.1.0

Sep 17, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

nokta_ai-0.1.0.tar.gz (31.4 kB view details)

Uploaded Sep 17, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

nokta_ai-0.1.0-py3-none-any.whl (27.9 kB view details)

Uploaded Sep 17, 2025 Python 3

File details

Details for the file nokta_ai-0.1.0.tar.gz.

File metadata

Download URL: nokta_ai-0.1.0.tar.gz
Upload date: Sep 17, 2025
Size: 31.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.9

File hashes

Hashes for nokta_ai-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`585eb8645933588f6480ba106589610d8d2fd141c012e59cd77680bd2029d916`
MD5	`4f10edeec56cc31b82419ad807d183b3`
BLAKE2b-256	`1d958679189ac0a68c6acab1ba25e97447a1354c2b0d7f27641486810b83c374`

See more details on using hashes here.

File details

Details for the file nokta_ai-0.1.0-py3-none-any.whl.

File metadata

Download URL: nokta_ai-0.1.0-py3-none-any.whl
Upload date: Sep 17, 2025
Size: 27.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.9

File hashes

Hashes for nokta_ai-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`07573724130a5d0e1978bab2b97739159b085bc457b3ba829237cfa522923dd2`
MD5	`f2f4e471285748a3caaf03bffe42ca57`
BLAKE2b-256	`b391f4fb09fffafb8e733dccfb8f277e74ab7648b5c8c6c70bbb40da610ddd32`

See more details on using hashes here.

nokta-ai 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

nokta-ai

Overview

Recent Improvements

v2.0 - Major Architecture Enhancements

Architecture

Model Components

Technical Specifications

Installation

From PyPI (Recommended)

From Source

Prerequisites

Quick Start

1. Prepare Training Data

2. Train a Model

Expert-Recommended Configuration (With Self-Attention)

Lightweight Configuration (Without Self-Attention)

3. Evaluate Your Model

4. Use Your Model

Training Parameters

Core Architecture Options

Training Hyperparameters

Data Sampling Options

Platform-Specific Configurations

Example Training Commands

Evaluation Parameters

Evaluation Options

Example Evaluation Commands

Model Architecture Comparison

With Self-Attention (Recommended)

Without Self-Attention (Lightweight)

Performance Comparison

Python API

Test Datasets

Included Test Files

Adding Custom Test Files

Advanced Usage

Context Size Experiments

Small Context (Fast Training)

Medium Context (Balanced)

Large Context (Expert Recommended)

Multi-Pass Restoration

Performance

Hardware Acceleration

1. NVIDIA CUDA (Recommended for Large-Scale Training)

2. Apple Silicon MPS (M1/M2/M3)

3. CPU Fallback (With Warnings)

Benchmarks

NVIDIA CUDA GPUs

Apple M1/M2/M3 with MPS

CPU Fallback

Expected Accuracy

Training Data Requirements

Data Sources

Included Training Data

Adding Your Own Data

Data Requirements

Troubleshooting

Common Issues

Implementation Details

Character Mapping

Multi-Head Self-Attention

Balanced Sampling

Package Structure

CLI Command Reference

Training: nokta-train

Evaluation: nokta evaluate / nokta-evaluate

Restoration: nokta restore

Benchmarking: nokta benchmark

License

Citation

Contact

Training: `nokta-train`

Evaluation: `nokta evaluate` / `nokta-evaluate`

Restoration: `nokta restore`

Benchmarking: `nokta benchmark`