Aksu

Neural Turkish morphological atomizer — root + ordered tags, no GPU required.

Turkish is agglutinative: a single word can carry the meaning of a full English phrase. Aksu decomposes it into root + morphological atoms — the building block every downstream NLP task needs.

from aksu import Atomizer

atomizer = Atomizer(backend="zeyrek")
atomizer.to_canonical("evlerinden")
# → ev +Noun +POSS.3PL +ABL

  • 🎯 98.3% Exact Match: SOTA-competitive disambiguation (em_argmax, 5-seed ensemble)
  • 16.71 min CPU training: frozen BERTurk encoder + 1M-param reranker, no GPU needed
  • 📚 80,537 annotated entries: TR-Gold-Morph v1, largest public Turkish morphological corpus

Why Aksu?

Turkish is one of the world's most morphologically productive languages. A single root generates thousands of legal surface forms through agglutination — the verb gitmek (to go) alone yields gidiyordum, gidemeyebilirdiniz, gidildiğinde, and thousands more. Standard NLP pipelines treat each surface form as an unrelated token, erasing the shared root and the grammatical information the suffixes encode.

Subword tokenizers (BPE, WordPiece) split Turkish words into character fragments that happen to repeat in the training corpus. The fragments are linguistically arbitrary and over-split rare forms that a morphological analyzer handles correctly:

| Input | BPE (BERTurk) | Aksu |
|---|---|---|
| evlerinden | ev ##ler ##inden | ev +Noun +POSS.3PL +ABL |
| gidiyordum | gidi ##yor ##dum | gitmek +Verb +PROG +PAST |
| kitapçılardan | kitap ##çı ##lar ##dan | kitap +Noun +AGT +Noun +PLU +ABL |

Aksu replaces the BPE step with a neural-symbolic pipeline: Zeyrek generates morphologically legal candidates; a frozen BERTurk encoder scores them in context; a 1M-parameter reranker selects the best parse. Out-of-vocabulary words fall back to a Dual-Head sequence decoder. The result is a linguistically transparent representation every downstream task can exploit.
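As a rough illustration of how a downstream task might consume this representation, the sketch below splits a canonical string such as `ev +Noun +POSS.3PL +ABL` into a root and an ordered tag list. `Atoms` and `parse_canonical` are hypothetical names, not part of the Aksu API; the only assumption taken from the examples above is that the root comes first and every tag starts with `+`.

```python
from typing import NamedTuple


class Atoms(NamedTuple):
    """Hypothetical container: a root plus its ordered tags."""
    root: str
    tags: tuple


def parse_canonical(canonical: str) -> Atoms:
    """Split 'root +TAG1 +TAG2 ...' into root and tags (assumed format)."""
    parts = canonical.split()
    if not parts or parts[0].startswith("+"):
        raise ValueError(f"no root found in {canonical!r}")
    tags = parts[1:]
    unexpected = [t for t in tags if not t.startswith("+")]
    if unexpected:
        raise ValueError(f"unexpected tokens after root: {unexpected}")
    # Strip the leading '+' so tags can be used as plain feature names.
    return Atoms(parts[0], tuple(t[1:] for t in tags))


atoms = parse_canonical("ev +Noun +POSS.3PL +ABL")
print(atoms.root, atoms.tags)
```

Keeping the tags ordered matters: `kitap +Noun +AGT +Noun +PLU +ABL` repeats `+Noun`, so a set would lose information that a tuple preserves.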

What's New in v1.1

GPU-accelerated NeuralBackend

The NeuralBackend now auto-detects CUDA/MPS/XPU and applies bf16 mixed precision + torch.compile:

>>> from aksu.kokturk.core.analyzer import NeuralBackend
>>> # Auto-detects GPU; bf16 on CUDA, fp32 on CPU
>>> backend = NeuralBackend("models/atomizer_v2/best_model.pt", "models/vocabs")
>>> result = backend.analyze("evlerinden")
>>> results = backend.predict_batch(["evlerinden", "kitaplarımdan"], batch_size=32)

Override device or precision explicitly:

>>> backend = NeuralBackend(
...     "models/atomizer_v2/best_model.pt",
...     "models/vocabs",
...     device="cuda",           # or "cpu", "mps", "xpu"
...     precision="bf16",        # or "fp32", "auto"
...     compile_mode="reduce-overhead",  # or None to disable
...     batch_size=32,
... )
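
The interaction between `device` and `precision="auto"` can be pictured with the small sketch below. It is not the library's implementation: `resolve_precision` is a hypothetical helper, and it only encodes the behaviour stated above (bf16 on CUDA, fp32 on CPU). Treating MPS and XPU as fp32 under `"auto"` is an assumption made for the sake of the example.

```python
VALID_DEVICES = {"cpu", "cuda", "mps", "xpu"}
VALID_PRECISIONS = {"auto", "bf16", "fp32"}


def resolve_precision(device: str, precision: str = "auto") -> str:
    """Hypothetical helper: pick a concrete precision for a device."""
    if device not in VALID_DEVICES:
        raise ValueError(f"unknown device: {device!r}")
    if precision not in VALID_PRECISIONS:
        raise ValueError(f"unknown precision: {precision!r}")
    if precision != "auto":
        # An explicit choice always wins over auto-detection.
        return precision
    # Documented above: bf16 on CUDA, fp32 on CPU.
    # Assumption: other accelerators fall back to fp32 in this sketch.
    return "bf16" if device == "cuda" else "fp32"


print(resolve_precision("cuda"), resolve_precision("cpu"))
```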

PDF Text Cleaning

from aksu.ariturk import reconstruct_line_breaks, fix_pdf_artifacts

# Re-join PDF line-break hyphens using Zemberek lexicon + vowel harmony
text = reconstruct_line_breaks("kitap-\nlar ve Türk-\nçe")

# Clean mojibake, zero-width chars, ligatures, repeated chars
clean = fix_pdf_artifacts("Türk​çe firma \xe7oooook")
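
To give a feel for the vowel-harmony part of line-break reconstruction, here is a minimal standalone sketch. It is not the Aksu implementation, which also consults a Zemberek lexicon; `rejoin_line_breaks_sketch` and its helpers are hypothetical names, and the rule shown (join only when the last vowel before the hyphen and the first vowel after it are both front or both back) is a deliberately simplified heuristic.

```python
import re

FRONT_VOWELS = set("eiöüEİÖÜ")
BACK_VOWELS = set("aıouAIOU")
VOWELS = FRONT_VOWELS | BACK_VOWELS


def _first_vowel(chars):
    """Return the first vowel in an iterable of characters, or None."""
    return next((ch for ch in chars if ch in VOWELS), None)


def is_harmonic(left: str, right: str) -> bool:
    """True if the fragments agree in front/back vowel harmony."""
    last = _first_vowel(reversed(left))
    first = _first_vowel(right)
    if last is None or first is None:
        return True  # nothing to compare; do not block the join
    return (last in FRONT_VOWELS) == (first in FRONT_VOWELS)


def rejoin_line_breaks_sketch(text: str) -> str:
    """Join 'frag-\\nment' pairs; keep the hyphen when harmony fails."""
    def repl(match):
        left, right = match.group(1), match.group(2)
        if is_harmonic(left, right):
            return left + right
        return f"{left}-{right}"  # likely a real hyphen, only drop the newline

    return re.sub(r"(\w+)-\n(\w+)", repl, text)


print(rejoin_line_breaks_sketch("kitap-\nlar ve Türk-\nçe"))
```

With this heuristic, `e-\nposta` keeps its hyphen because `e` is a front vowel and `o` is a back vowel. Harmony alone misclassifies loanwords and compounds, which is presumably why a lexicon check is part of the real pipeline.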

New TextCleaner Methods

from aksu.ariturk import TextCleaner

cleaner = TextCleaner()
text1 = cleaner.fix_line_breaks("kitap-\nlar")
text2 = cleaner.fix_artifacts("Türk​\xe7e")

Optional LM-scored hyphenation

pip install "aksu[full]"

Features

  • SOTA-competitive disambiguation: 98.3% Exact Match on the Aksu held-out test set (5-seed ensemble, em_string). em_string is the cross-system comparable metric; see Performance for prior-work numbers.
  • GPU-accelerated NeuralBackend (v1.1): CUDA/MPS/XPU auto-detection, bf16 mixed precision, torch.compile, batched inference. CPU remains fully supported.
  • PDF text cleaning (v1.1): reconstruct_line_breaks + fix_pdf_artifacts — ftfy mojibake repair, Zemberek lexicon hyphenation decoder, vowel harmony validation.
  • CPU-only training: 16.71 minutes on TRUBA Orfoz (Intel Xeon Platinum 8362). No GPU required for training or inference.
  • Hybrid neural-symbolic: Zeyrek symbolic candidates → frozen BERTurk 768-dim encoder → 1M-parameter reranker. Best-parse selection without fine-tuning the language model.
  • OOV fallback: for words Zeyrek cannot parse (~4% of web-crawled Turkish), the Dual-Head Decoder reads the word character by character, predicts the root, and emits the tag sequence one tag at a time.
  • sklearn-compatible: Drop-in MorphoTransformer for use in sklearn.pipeline.Pipeline.
  • TR-Gold-Morph corpus: 80,537 annotations across two tiers (2,496 linguist-verified gold entries and 78,041 ensemble-confident silver entries); the largest public Turkish morphological resource.
  • Honest benchmarks: Every number traces to a runnable script and a JSON artifact in audit/benchmark_results/.

Installation

From source (recommended until PyPI publish — see Roadmap):

git clone https://github.com/melikkul/Aksu.git
cd Aksu
pip install -e ".[dev,benchmark,train,data]"

Once published on PyPI:

pip install aksu                    # core only
pip install "aksu[train]"           # + MLflow, Optuna, Transformers
pip install "aksu[benchmark]"       # + SciPy (significance tests)
pip install "aksu[data]"            # + HuggingFace Datasets, diskcache

Quick Start

1. Single word:

from aksu import Atomizer

atomizer = Atomizer(backend="zeyrek")
atomizer.to_canonical("evlerinden")
# → ev +Noun +POSS.3PL +ABL

2. Sentence disambiguation:

from aksu import MorphoAnalyzer

analyzer = MorphoAnalyzer(backends=["disambiguator"])
results = analyzer.analyze_sentence("Çocuklar evlerinden çıktı")
# → list of TokenAnalysis objects, one per word

3. sklearn pipeline (compat import; see Migration):

from aksu.kokturk.sklearn_ext import MorphoTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("morph", MorphoTransformer(output="atomized")),
    ("tfidf", TfidfVectorizer()),
    ("clf",   LogisticRegression()),
])

4. CLI:

aksu analyze "evlerinden"
# → evlerinden           → ev +Noun +POSS.3PL +ABL

Text cleaning (arı-türk):

from aksu import TextCleaner

TextCleaner().clean("  TÜRKÇE   metİn  ")
# → türkçe metin

from aksu import turkish_lower

turkish_lower("I")
# → ı
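
The reason a dedicated `turkish_lower` is needed can be shown with plain Python. The sketch below is not Aksu's implementation; `turkish_lower_sketch` and `clean_sketch` are hypothetical names. It maps the two Turkish-specific capitals (dotless `I` to `ı`, dotted `İ` to `i`) before calling `str.lower()`, which otherwise applies the language-neutral Unicode mapping.

```python
import re

# Turkish-specific pairs: 'I' -> dotless 'ı' (U+0131), dotted 'İ' (U+0130) -> 'i'.
_TR_CASE_MAP = str.maketrans({"I": "\u0131", "\u0130": "i"})


def turkish_lower_sketch(text: str) -> str:
    """Lowercase with Turkish I/İ handling, then defer to str.lower()."""
    return text.translate(_TR_CASE_MAP).lower()


def clean_sketch(text: str) -> str:
    """Lowercase and collapse runs of whitespace (assumed cleaning steps)."""
    return re.sub(r"\s+", " ", turkish_lower_sketch(text)).strip()


print(turkish_lower_sketch("IŞIK"))           # Turkish-aware result
print("IŞIK".lower())                         # default mapping keeps a dotted 'i'
print(clean_sketch("  TÜRKÇE   metİn  "))
```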

How It Works

Aksu operates in two modes depending on whether a word is in Zeyrek's lexicon.

Architecture

flowchart LR
    S["Çocuklar evlerinden çıktı"] --> T[Tokenize]
    T --> Z["Zeyrek<br/>candidate gen"]
    Z --> C{Candidates?}
    C -->|0 candidates| D["Dual-Head Decoder<br/>5.2M params"]
    C -->|1 candidate| ACC[Accept]
    C -->|2+ candidates| B["BERTurk encoder<br/>768-dim · frozen"]
    B --> R["Reranker<br/>1M params · scalar score"]
    R --> ACC
    D --> ACC
    ACC --> O["çocuk +Noun +PLU<br/>ev +Noun +POSS.3PL +ABL<br/>çıkmak +Verb +PAST"]

Disambiguation (primary path, ~96% of tokens): Zeyrek generates morphologically legal candidates for each token. BERTurk encodes the full sentence in context; a lightweight reranker scores each candidate and selects the highest-scoring parse. BERTurk is used frozen — no fine-tuning, no GPU, just a 768-dimensional sentence representation fed to a 1M-parameter scoring head.

Generation (OOV fallback, ~4% of tokens): Words Zeyrek cannot parse go to the Dual-Head Decoder: a character-level encoder reads the input, a root classifier predicts the lemma, and a conditional tag decoder emits the suffix sequence one tag at a time.
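
The routing described above (zero candidates go to the decoder, a single candidate is accepted, two or more are reranked) can be sketched in a few lines. Everything here is a stand-in: `select_parse`, the toy candidate table, the toy scores, and the `+UNKNOWN` placeholder are hypothetical and only mirror the control flow in the flowchart, not the actual Zeyrek, BERTurk, or reranker code. The second parse for `evlerinden` is illustrative.

```python
from typing import Callable, Sequence


def select_parse(
    token: str,
    candidates: Sequence[str],
    score: Callable[[str], float],
    fallback: Callable[[str], str],
) -> str:
    """Route one token through the three branches of the pipeline."""
    if not candidates:
        return fallback(token)        # 0 candidates: OOV generation path
    if len(candidates) == 1:
        return candidates[0]          # 1 candidate: accept as-is
    return max(candidates, key=score) # 2+ candidates: keep the best-scoring parse


# Toy stand-ins for candidate generation, contextual scoring, and the decoder.
TOY_CANDIDATES = {
    "evlerinden": ["ev +Noun +POSS.3PL +ABL", "ev +Noun +PLU +POSS.3SG +ABL"],
    "çıktı": ["çıkmak +Verb +PAST"],
}
TOY_SCORES = {"ev +Noun +POSS.3PL +ABL": 0.9, "ev +Noun +PLU +POSS.3SG +ABL": 0.1}


def toy_score(candidate: str) -> float:
    return TOY_SCORES.get(candidate, 0.0)


def toy_fallback(token: str) -> str:
    return f"{token} +UNKNOWN"  # placeholder, not a real Aksu tag


def analyze_token(token: str) -> str:
    return select_parse(token, TOY_CANDIDATES.get(token, []), toy_score, toy_fallback)


for word in ["evlerinden", "çıktı", "zoomlamak"]:
    print(word, "->", analyze_token(word))
```

In the real system the score depends on the sentence context, so the scoring callable would close over the encoded sentence rather than a fixed table.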

Performance

Measured in This Repository

Every row is generated by a committed script; the commands are listed under Reproducing the Results.

| System | em_string [^string-note] | em_argmax [^argmax-note] | Approach | Encoder | Params | Training |
|---|---|---|---|---|---|---|
| Aksu disambiguator (5-seed ensemble) | 98.3% | 98.3% [^ensemble] | Candidate selection | BERTurk (frozen) | 1M | 16.71 min CPU |
| Aksu disambiguator (single-seed range) | 97.98–98.28% | std 0.11pp [^ensemble] | Candidate selection | BERTurk (frozen) | 1M | CPU |
| Aksu generation (DualHead) | N/A [^dualhead-cpu] | | Dual-Head Decoder | | 3M | CPU (v1.1) |

[^string-note]: em_string = canonical-string equality. Cross-system comparable. In this repository's eval, no two candidates produce the same canonical string, so em_string ≡ em_argmax. On other test sets with candidate-string collisions, em_string would be lower.

[^argmax-note]: em_argmax = candidate-index accuracy. Within-system metric; not directly comparable across systems.

[^ensemble]: 5-seed ensemble; em_argmax_std=0.11pp; see models/v6/ensemble_results.json.

[^dualhead-cpu]: The v1 CPU baseline checkpoint (models/dualhead_v1_cpu/best_model.pt, 3M params, 28/50 epochs) is undertrained: the tag decoder generates repetitive sequences and fails to predict EOS reliably in free-running mode (exposure bias). CPU training reached epoch 28 before the wall-clock budget was exhausted. GPU retraining (akya-cuda, ~3 h) is planned for v1.1. Throughput of the v1 CPU model is 23.2 tok/s (measured; see audit/benchmark_results/inference_throughput.json).
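
For readers unfamiliar with seed ensembling, the sketch below shows one common way to combine several independently trained rerankers: average each candidate's score across seeds and take the argmax. This is an assumption for illustration only; the source does not state how the five seeds are combined, and `ensemble_argmax` and the scores are made up.

```python
from statistics import mean


def ensemble_argmax(per_seed_scores):
    """Average candidate scores over seeds; return (best_index, averages).

    per_seed_scores: one list of candidate scores per seed, all the same length.
    """
    n_candidates = len(per_seed_scores[0])
    if any(len(scores) != n_candidates for scores in per_seed_scores):
        raise ValueError("every seed must score the same candidates")
    averaged = [mean(seed[i] for seed in per_seed_scores) for i in range(n_candidates)]
    best = max(range(n_candidates), key=averaged.__getitem__)
    return best, averaged


# Toy scores for two candidates from five seeds; seeds 2 and 5 disagree with the rest.
toy = [[0.60, 0.40], [0.40, 0.60], [0.70, 0.30], [0.55, 0.45], [0.45, 0.55]]
best, averaged = ensemble_argmax(toy)
print(best, [round(a, 2) for a in averaged])
```

Averaging scores rather than voting on indices lets a confident seed outweigh a marginal one; majority voting would be an equally plausible alternative.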

Reported in Prior Work

Numbers below are taken verbatim from their publications. Not reproduced in this repository. The column "EM (reported)" is not directly comparable to our em_string unless the source specifies the same metric.

| System | EM (reported) | Source |
|---|---|---|
| MorseDisamb | 98.59% | Şeker & Eryiğit (2017) ACL 2017.semeval-1.28 [^morsedisambcite] |
| Sak et al. | 97.81% | Sak et al. (2009) NAACL 2009.naacl-main.19 [^sakcite] |
| Morse (generation) | 97.67% | Şeker & Eryiğit (2017) [^morsedisambcite] |
| TransMorph | 96.25% | Akyürek et al. (2022) ACL 2022.sigmorphon-1.13 [^transmorphcite] |
| SIGMORPHON 2019 baseline | 92.27% | McCarthy et al. (2019) 2019.sigmorphon-1 [^sigmorphoncite] |
| Yıldız et al. | 84.12% | Yıldız et al. (2016) SIU [^yildiz2016cite] |

[^morsedisambcite]: Şeker, G. & Eryiğit, G. (2017). SemEval / ACL Anthology 2017.semeval-1.28.

[^sakcite]: Sak, H., Güngör, T. & Saraçlar, M. (2009). NAACL 2009.naacl-main.19.

[^transmorphcite]: Akyürek, A.F., Akyürek, E. & Goldwater, S. (2022). ACL 2022.sigmorphon-1.13.

[^sigmorphoncite]: McCarthy, A.D. et al. (2019). SIGMORPHON 2019.sigmorphon-1.

[^yildiz2016cite]: Yıldız, E. et al. (2016). SIU.

Text classification deferred to v1.1 — TTC-3600 corpus requires email-request acquisition. Pipeline code ready at src/aksu/benchmark/run_all_benchmarks.py.

Inference Throughput

CPU throughput on shared SLURM nodes (Orfoz) varies ±30–50% by co-scheduling. Treat as order-of-magnitude guidance; see audit/halt_reports/2026-05-16-berturk-measurement-drift.md.

| Component | Speed | Peak RSS | Hardware |
|---|---|---|---|
| Zeyrek candidate generation | 1537.3 tok/s | ~267 MB | Intel(R) Xeon(R) Platinum 8480+ (Orfoz) |
| BERTurk embedding | 67.9 sent/s [^cpu-variability] | ~1.5 GB | Orfoz CPU |
| Reranker scoring | 474.4 tok/s | ~10 MB | Orfoz CPU |
| Dual-Head generation (v1 CPU) | 23.2 tok/s [^dualhead-cpu] | ~492 MB | Orfoz CPU |

[^cpu-variability]: CPU throughput varies ±30–50% on shared Orfoz nodes; treat as order-of-magnitude guidance only.
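
To turn these figures into a planning estimate, the short calculation below converts a throughput and a variability band into best, nominal, and worst wall-clock times. `estimate_seconds` is a hypothetical helper written for this README, and the one-million-token workload is an arbitrary example; only the 1537.3 tok/s figure and the ±50% upper bound come from the measurements above.

```python
def estimate_seconds(n_items: int, items_per_second: float, variability: float = 0.5):
    """Return (best, nominal, worst) seconds if throughput deviates by ±variability."""
    if items_per_second <= 0:
        raise ValueError("throughput must be positive")
    if not 0 <= variability < 1:
        raise ValueError("variability must be in [0, 1)")
    nominal = n_items / items_per_second
    best = n_items / (items_per_second * (1 + variability))   # faster node
    worst = n_items / (items_per_second * (1 - variability))  # slower node
    return best, nominal, worst


# Example: one million tokens through Zeyrek candidate generation.
best, nominal, worst = estimate_seconds(1_000_000, 1537.3)
print(f"best {best:.0f} s, nominal {nominal:.0f} s, worst {worst:.0f} s")
```

At ±50% the worst case is twice the nominal time, which is why the figures are best read as order-of-magnitude guidance.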

Reproducing the Results

All numbers are stored in audit/benchmark_results/metrics.json and generated by committed scripts.

Disambiguation EM (98.3% em_string ensemble):

sbatch scripts/truba/submit_v6_eval_aksu.sh    # akya-cuda GPU, ~2 h
python scripts/ingest_metrics.py
# → models/v6/eval_results.json, models/v6/ensemble_results.json

Training wall-clock (16.71 min):

sbatch scripts/truba/submit_train_disambiguator.sh    # Orfoz CPU, ~17 min
# → models/v6_retimed/training_log.json

Zeyrek throughput (1537.3 tok/s):

sbatch scripts/truba/submit_zeyrek_benchmark.sh    # Orfoz compute node
# → audit/benchmark_results/zeyrek_throughput.json

Corpus entry count (80,537 entries):

python scripts/data/validate_dataset.py
# → data/gold/tr_gold_morph_v1_stats.json

Hardware used:

| Task | Hardware | Approx. time or throughput |
|---|---|---|
| Disambiguation training (1 seed) | Orfoz CPU (Intel Xeon Platinum 8362) | 16.71 min |
| Disambiguation eval (5-seed ensemble) | akya-cuda GPU | ~2 h |
| DualHead training | akya-cuda GPU | ~3 h |
| Zeyrek candidate generation | Any CPU laptop | 1537.3 tok/s |

Dataset — TR-Gold-Morph

| Version | Entries | Tiers | Status | License |
|---|---|---|---|---|
| TR-Gold-Morph v1 | 80,537 | gold 2,496 / silver 78,041 | ✅ Released | CC BY 4.0 |
| TR-Gold-Morph v2 | ~2.5M target (pipeline ready, harvest pending v1.1) | gold / silver / bronze | 🚧 v1.1 roadmap | CC BY 4.0 + per-shard CC BY-SA |

Comparison with other public resources:

| Resource | Entries | Annotation | License |
|---|---|---|---|
| TR-Gold-Morph v1 | 80,537 | Auto + manual | CC BY 4.0 |
| UniMorph Turkish | 275,460 | Rule-generated | CC BY-SA |
| BOUN Treebank | ~121,000 | Manual | CC BY-SA 4.0 |
| IMST Treebank | ~56,000 | Manual | CC BY-NC-SA 3.0 |

Provenance: TR-Gold-Morph v1 derives from BOUN Treebank and Zeyrek candidate sets. Gold tier: 2,496 linguist-verified entries. Silver tier: 78,041 ensemble-confident entries.

TR-Gold-Morph v2 (2.5M target): autolabel pipeline is ready in this repository. Sources: OSCAR-tr (CC0/CC-BY-4.0), mC4-tr (ODC-BY), Wikipedia-tr (CC-BY-SA-3.0), BOUN Treebank (CC-BY-SA-4.0). IMST-UD used only for internal evaluation (NC clause). Source manifest: data/external/manifest.json.

HuggingFace: melikkul/tr-gold-morph (v2 upload pending). License: CC BY 4.0 (main corpus). BOUN/Wikipedia-derived shards carry CC BY-SA — see LICENSE-DATA.

Limitations and Known Gaps

  • em_argmax vs em_string: In this repository's eval, em_string ≡ em_argmax (no two candidates produce the same canonical string). On other test sets with candidate-string collisions, em_string would be lower — the metrics are not interchangeable in general.
  • Gold annotation size: Gold tier is 2,496 entries (single annotator). Cohen's κ requires a 200-entry double-annotated overlap — planned for v1.1.
  • OOV handling: Zeyrek fails on ~4% of web-crawled Turkish (neologisms, foreign proper nouns). The DualHead is designed as a fallback for these cases; the v1 CPU model is undertrained (see [^dualhead-cpu]) and OOV-handling accuracy will be reported in v1.1 after GPU retraining.
  • Text classification deferred: TTC-3600 benchmarks move to v1.1 — dataset requires email-request acquisition (Akın & Akın, 2007). Pipeline code is ready.
  • Throughput variability: CPU throughput on shared SLURM nodes (Orfoz) varies ±30–50%. All inference figures are order-of-magnitude guidance; see audit/halt_reports/2026-05-16-berturk-measurement-drift.md.
  • BOUN ShareAlike propagation: Model weights trained on BOUN-derived data should be distributed under CC-BY-SA-4.0. Shards are tracked in data/external/manifest.json.

Roadmap

v1.1 (planned):

  • TR-Gold-Morph v2 (2.5M auto-labeled entries) — autolabel pipeline already in repository
  • TTC-3600 text classification benchmark — pending dataset acquisition
  • DualHead GPU retraining — v1 CPU baseline (28/50 epochs) is undertrained; GPU retraining (akya-cuda, ~3 h) will produce a valid EM figure; see audit/halt_reports/2026-05-16-dualhead-em.md
  • Cohen's κ on 200-entry double-annotated gold subset

v2.0:

  • Remove deprecated top-level kokturk / ariturk shim packages
  • PyPI stable release of aksu (pending Trusted Publishing pre-flight)

Project Structure

src/aksu/
├── kokturk/          # Core morphological atomizer
│   ├── core/         # MorphoAnalyzer, datatypes, cache, phonology
│   ├── models/       # Disambiguator, Dual-Head Decoder, context encoders
│   ├── sklearn_ext/  # sklearn integration
│   └── cli/          # Command-line interface
├── ariturk/          # Turkish text cleaning & normalization
├── train/            # Training scripts, curriculum, losses
├── benchmark/        # Evaluation suite
├── resource/         # TR-Gold-Morph pipeline
└── classify/         # TTC-3600 text classification experiments (deferred to v1.1)

Migration from kokturk

Migrating from the legacy kokturk / ariturk packages? The shim packages re-export under aksu.kokturk and aksu.ariturk with DeprecationWarning — your existing imports keep working through all v1.x releases. See docs/MIGRATION.md for the full symbol rename table and the v2.0 sunset schedule.
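
Because Python hides most `DeprecationWarning`s by default, legacy imports can go unnoticed. The sketch below shows one way to surface them with the standard `warnings` module. `legacy_import_stub` is a hypothetical stand-in that merely emits a warning like the shims are described to do; it does not import anything from Aksu, and the message text is invented.

```python
import warnings


def legacy_import_stub() -> str:
    """Hypothetical stand-in for code that triggers a deprecated shim import."""
    warnings.warn(
        "kokturk is deprecated; import from aksu.kokturk instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return "ok"


def collect_deprecations(fn):
    """Run fn and return (result, list of DeprecationWarning messages)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always", DeprecationWarning)
        result = fn()
    messages = [str(w.message) for w in caught if issubclass(w.category, DeprecationWarning)]
    return result, messages


result, messages = collect_deprecations(legacy_import_stub)
print(result, messages)
```

Running a test suite with `python -W error::DeprecationWarning` is a stricter alternative: any remaining legacy import then fails loudly instead of being logged.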

Citation

@thesis{kul2026aksu,
  title={Aksu: Neural Morphological Atomization for Turkish: Gold Standard Corpus Construction and Hybrid Text Classification},
  author={Kul, Melik},
  year={2026},
  school={Ostim Technical University},
}

License

Code: MIT. Dataset (TR-Gold-Morph): CC BY 4.0. BOUN/Wikipedia-derived shards carry CC BY-SA 4.0 — see LICENSE-DATA.

Contributing

Contributions welcome. See CONTRIBUTING.md for the dev setup and PR workflow. The README is rendered from docs/README.md.j2 — edit the template, not README.md (a CI gate in tests/test_readme_render.py enforces this).

Acknowledgments
