Neural Turkish Morphological Atomizer

Aksu

Neural Turkish morphological atomizer — root + ordered tags, no GPU required.

Turkish is agglutinative: a single word can carry the meaning of a full English phrase. Aksu decomposes each word into a root plus ordered morphological atoms, the building block every downstream NLP task needs.

from aksu import Atomizer

atomizer = Atomizer(backend="zeyrek")
atomizer.to_canonical("evlerinden")
# → ev +Noun +POSS.3PL +ABL
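To work with the root and the tags separately, the canonical string can be split on whitespace. The sketch below assumes `to_canonical` returns a plain string in the `root +TAG +TAG` layout shown above; `Atoms` and `parse_canonical` are hypothetical helper names, not part of the aksu API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atoms:
    """A root plus its ordered morphological tags (hypothetical container)."""
    root: str
    tags: tuple[str, ...]


def parse_canonical(canonical: str) -> Atoms:
    """Split 'root +TAG1 +TAG2' into the root and the ordered tag tuple."""
    root, *rest = canonical.split()
    # Drop the leading '+' marker from every tag, keeping the order intact.
    tags = tuple(part.lstrip("+") for part in rest)
    return Atoms(root=root, tags=tags)


atoms = parse_canonical("ev +Noun +POSS.3PL +ABL")
print(atoms.root)  # ev
print(atoms.tags)  # ('Noun', 'POSS.3PL', 'ABL')
```

Keeping the tags in a tuple preserves their order, which matters because the same tags in a different order describe a different word form.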


🎯 98.3% Exact Match	SOTA-competitive disambiguation (em_argmax, 5-seed ensemble)
⚡ 16.71 min CPU training	Frozen BERTurk encoder + 1M-param reranker — no GPU needed
📚 80,537 annotated entries	TR-Gold-Morph v1, gold and silver tiers, released under CC BY 4.0

Why Aksu?

Turkish is one of the world's most morphologically productive languages. A single root generates thousands of legal surface forms through agglutination — the verb gitmek (to go) alone yields gidiyordum, gidemeyebilirdiniz, gidildiğinde, and thousands more. Standard NLP pipelines treat each surface form as an unrelated token, erasing the shared root and the grammatical information the suffixes encode.

Subword tokenizers (BPE, WordPiece) split Turkish words into character fragments that happen to repeat in the training corpus. The fragments are linguistically arbitrary and over-split rare forms that a morphological analyzer handles correctly:

Input	WordPiece (BERTurk)	Aksu
evlerinden	ev ##ler ##inden	ev +Noun +POSS.3PL +ABL
gidiyordum	gidi ##yor ##dum	gitmek +Verb +PROG +PAST
kitapçılardan	kitap ##çı ##lar ##dan	kitap +Noun +AGT +Noun +PLU +ABL
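One practical consequence of the atomized column is that surface forms can be grouped by root. The sketch below uses the canonical strings from the table plus one hand-written entry (`evler`, added here purely for illustration in the same tag style); `group_by_root` is a hypothetical helper, not an aksu function.

```python
from collections import defaultdict

# Canonical strings copied from the table above; "evler" is an
# illustrative, hand-written entry and not actual Aksu output.
ATOMIZED = {
    "evlerinden": "ev +Noun +POSS.3PL +ABL",
    "evler": "ev +Noun +PLU",
    "gidiyordum": "gitmek +Verb +PROG +PAST",
    "kitapçılardan": "kitap +Noun +AGT +Noun +PLU +ABL",
}


def group_by_root(analyses: dict[str, str]) -> dict[str, list[str]]:
    """Map each root to the surface forms that share it."""
    groups: dict[str, list[str]] = defaultdict(list)
    for surface, canonical in analyses.items():
        root = canonical.split()[0]  # the root is the first field
        groups[root].append(surface)
    return dict(groups)


print(group_by_root(ATOMIZED)["ev"])  # ['evlerinden', 'evler']
```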

Aksu replaces the subword-tokenization step with a neural-symbolic pipeline: Zeyrek generates morphologically legal candidates; a frozen BERTurk encoder embeds the sentence in context; a 1M-parameter reranker scores the candidates and selects the best parse. Out-of-vocabulary words fall back to a Dual-Head sequence decoder. The result is a linguistically transparent representation every downstream task can exploit.

What's New in v1.1

GPU-accelerated NeuralBackend

The NeuralBackend now auto-detects CUDA/MPS/XPU and applies bf16 mixed precision + torch.compile:

>>> from aksu.kokturk.core.analyzer import NeuralBackend
>>> # Auto-detects GPU; bf16 on CUDA, fp32 on CPU
>>> backend = NeuralBackend("models/atomizer_v2/best_model.pt", "models/vocabs")
>>> result = backend.analyze("evlerinden")
>>> results = backend.predict_batch(["evlerinden", "kitaplarımdan"], batch_size=32)

Override device or precision explicitly:

>>> backend = NeuralBackend(
...     "models/atomizer_v2/best_model.pt",
...     "models/vocabs",
...     device="cuda",           # or "cpu", "mps", "xpu"
...     precision="bf16",        # or "fp32", "auto"
...     compile_mode="reduce-overhead",  # or None to disable
...     batch_size=32,
... )
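The interaction between `device` and `precision="auto"` can be pictured with a small stand-alone sketch. This is not Aksu's implementation: the helper names, the CUDA > MPS > XPU precedence, and the choice of fp32 for MPS and XPU are assumptions; only "bf16 on CUDA, fp32 on CPU" comes from the example above.

```python
# Hypothetical helpers; names and precedence order are assumptions,
# not part of the aksu API.
DEVICE_PRIORITY = ("cuda", "mps", "xpu")


def pick_device(available: set[str]) -> str:
    """Return the first accelerator found in priority order, else 'cpu'."""
    for name in DEVICE_PRIORITY:
        if name in available:
            return name
    return "cpu"


def resolve_precision(device: str, precision: str = "auto") -> str:
    """An explicit precision wins; 'auto' maps CUDA to bf16, the rest to fp32."""
    if precision != "auto":
        return precision
    return "bf16" if device == "cuda" else "fp32"


device = pick_device({"mps"})
print(device, resolve_precision(device))  # mps fp32
```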

PDF Text Cleaning

from aksu.ariturk import reconstruct_line_breaks, fix_pdf_artifacts

# Re-join PDF line-break hyphens using Zemberek lexicon + vowel harmony
text = reconstruct_line_breaks("kitap-\nlar ve Türk-\nçe")

# Clean mojibake, zero-width chars, ligatures, repeated chars
clean = fix_pdf_artifacts("Türkçe ﬁrma \xe7oooook")
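To illustrate the idea behind `reconstruct_line_breaks`, here is a simplified stand-in: it rejoins a word split by a line-break hyphen when the joined form is in a lexicon or when the two halves agree in front/back vowel harmony. The tiny lexicon, the two-way harmony check, and every name in the block are assumptions for illustration; the real function uses the Zemberek lexicon and may apply different rules.

```python
import re

# Turkish back and front vowels, upper and lower case.
BACK = set("aıouAIOU")
FRONT = set("eiöüEİÖÜ")

_HYPHEN_BREAK = re.compile(r"(\w+)-\n(\w+)")


def is_harmonic(stem: str, suffix: str) -> bool:
    """True if the stem's last vowel and the suffix's first vowel agree in backness."""
    stem_vowels = [c for c in stem if c in BACK or c in FRONT]
    suffix_vowels = [c for c in suffix if c in BACK or c in FRONT]
    if not stem_vowels or not suffix_vowels:
        return False
    return (stem_vowels[-1] in BACK) == (suffix_vowels[0] in BACK)


def rejoin_hyphens(text: str, lexicon: set[str]) -> str:
    """Join 'stem-\\nsuffix' when plausible; otherwise keep the hyphen."""
    def repl(match: re.Match) -> str:
        left, right = match.group(1), match.group(2)
        joined = left + right
        if joined.lower() in lexicon or is_harmonic(left, right):
            return joined
        return f"{left}-{right}"  # likely a real hyphen: drop only the newline
    return _HYPHEN_BREAK.sub(repl, text)


print(rejoin_hyphens("kitap-\nlar ve Türk-\nçe", {"kitaplar"}))
# kitaplar ve Türkçe
```

A hyphen that fails both checks, such as the one in `e-\nposta`, is kept, so genuine hyphenated compounds survive the cleanup.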

New TextCleaner Methods

from aksu.ariturk import TextCleaner

cleaner = TextCleaner()
text1 = cleaner.fix_line_breaks("kitap-\nlar")
text2 = cleaner.fix_artifacts("Türk\xe7e")
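Part of what artifact cleaning involves can be approximated with the standard library alone. The sketch below expands ligatures via NFKC normalization, deletes zero-width characters, and collapses runs of three or more identical characters; it does not attempt mojibake repair. The function name and the run-length threshold are assumptions, and NFKC also rewrites other compatibility characters, which may or may not match what `fix_artifacts` does.

```python
import re
import unicodedata

# Zero-width space, non-joiner, joiner, and BOM mapped to None (deleted).
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))

# Three or more repeats of the same character.
_LONG_RUN = re.compile(r"(.)\1{2,}")


def clean_artifacts_sketch(text: str) -> str:
    """Expand ligatures, drop zero-width chars, collapse long character runs."""
    text = unicodedata.normalize("NFKC", text)  # e.g. U+FB01 becomes 'fi'
    text = text.translate(ZERO_WIDTH)
    return _LONG_RUN.sub(r"\1", text)


print(clean_artifacts_sketch("Türkçe \ufb01rma çoooook"))  # Türkçe firma çok
```

Runs of exactly two characters are left alone so that legitimate doubles such as `saat` are not damaged.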

Optional LM-scored hyphenation

pip install "aksu[full]"

Features

SOTA-competitive disambiguation: 98.3% Exact Match on the Aksu held-out test set (5-seed ensemble, em_string), a metric that is comparable across systems.
GPU-accelerated NeuralBackend (v1.1): CUDA/MPS/XPU auto-detection, bf16 mixed precision, torch.compile, batched inference. CPU remains fully supported.
PDF text cleaning (v1.1): reconstruct_line_breaks + fix_pdf_artifacts — ftfy mojibake repair, Zemberek lexicon hyphenation decoder, vowel harmony validation.
CPU-only training: the disambiguator trains in 16.71 minutes on TRUBA Orfoz (Intel Xeon Platinum 8362). No GPU is required for disambiguator training or inference.
Hybrid neural-symbolic: Zeyrek symbolic candidates → frozen BERTurk 768-dim encoder → 1M-parameter reranker. Best-parse selection without fine-tuning the language model.
OOV fallback: Dual-Head Decoder generates tag sequences character-by-character for words Zeyrek cannot parse (~4% of web-crawled Turkish).
sklearn-compatible: Drop-in MorphoTransformer for use in sklearn.pipeline.Pipeline.
TR-Gold-Morph corpus: 80,537 annotations across a linguist-verified gold tier (2,496 entries) and an ensemble-confident silver tier (78,041 entries); see Dataset for a size comparison with other public resources.
Honest benchmarks: Every number traces to a runnable script and a JSON artifact in audit/benchmark_results/.

Installation

From source (recommended until PyPI publish — see Roadmap):

git clone https://github.com/melikkul/Aksu.git
cd Aksu
pip install -e ".[dev,benchmark,train,data]"

Once published on PyPI:

pip install aksu                    # core only
pip install "aksu[train]"           # + MLflow, Optuna, Transformers
pip install "aksu[benchmark]"       # + SciPy (significance tests)
pip install "aksu[data]"            # + HuggingFace Datasets, diskcache

Quick Start

1. Single word:

from aksu import Atomizer

atomizer = Atomizer(backend="zeyrek")
atomizer.to_canonical("evlerinden")
# → ev +Noun +POSS.3PL +ABL

2. Sentence disambiguation:

from aksu import MorphoAnalyzer

analyzer = MorphoAnalyzer(backends=["disambiguator"])
results = analyzer.analyze_sentence("Çocuklar evlerinden çıktı")
# → list of TokenAnalysis objects, one per word

3. sklearn pipeline (compat import; see Migration):

from aksu.kokturk.sklearn_ext import MorphoTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("morph", MorphoTransformer(output="atomized")),
    ("tfidf", TfidfVectorizer()),
    ("clf",   LogisticRegression()),
])

4. CLI:

aksu analyze "evlerinden"
# → evlerinden           → ev +Noun +POSS.3PL +ABL

5. Text cleaning (arı-türk):

from aksu import TextCleaner

TextCleaner().clean("  TÜRKÇE   metİn  ")
# → türkçe metin

from aksu import turkish_lower

turkish_lower("I")
# → ı
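The reason a dedicated `turkish_lower` exists is that Python's built-in `str.lower()` follows the default Unicode casing rules, which map `I` to `i` rather than to dotless `ı`. A minimal sketch of Turkish-aware lowercasing and whitespace cleanup follows; the `_sketch` functions are illustrative stand-ins and are not claimed to match Aksu's implementation.

```python
def turkish_lower_sketch(text: str) -> str:
    """Lowercase with Turkish casing: 'I' -> 'ı' and 'İ' -> 'i'."""
    # Handle the two Turkish-specific letters before the generic lower().
    return text.replace("I", "ı").replace("İ", "i").lower()


def clean_sketch(text: str) -> str:
    """Lowercase, trim, and collapse internal whitespace runs."""
    return " ".join(turkish_lower_sketch(text).split())


print("I".lower())                           # i  (default Unicode rule)
print(turkish_lower_sketch("I"))             # ı
print(clean_sketch("  TÜRKÇE   metİn  "))    # türkçe metin
```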

How It Works

Aksu operates in two modes depending on whether a word is in Zeyrek's lexicon.

Architecture

flowchart LR
    S["Çocuklar evlerinden çıktı"] --> T[Tokenize]
    T --> Z["Zeyrek<br/>candidate gen"]
    Z --> C{Candidates?}
    C -->|0 candidates| D["Dual-Head Decoder<br/>5.2M params"]
    C -->|1 candidate| ACC[Accept]
    C -->|2+ candidates| B["BERTurk encoder<br/>768-dim · frozen"]
    B --> R["Reranker<br/>1M params · scalar score"]
    R --> ACC
    D --> ACC
    ACC --> O["çocuk +Noun +PLU<br/>ev +Noun +POSS.3PL +ABL<br/>çıkmak +Verb +PAST"]

Disambiguation (primary path, ~96% of tokens): Zeyrek generates morphologically legal candidates for each token. BERTurk encodes the full sentence in context; a lightweight reranker scores each candidate and selects the highest-scoring parse. BERTurk is used frozen — no fine-tuning, no GPU, just a 768-dimensional sentence representation fed to a 1M-parameter scoring head.
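The candidate-count branching in the flowchart, together with the reranking step described above, can be sketched as a small dispatch function. Everything here is illustrative: `select_parse`, the toy scores, and the second candidate string are hand-written assumptions, with plain callables standing in for the reranker and the Dual-Head Decoder.

```python
from typing import Callable, Sequence


def select_parse(
    token: str,
    candidates: Sequence[str],
    score: Callable[[str], float],
    fallback: Callable[[str], str],
) -> str:
    """Pick one parse per token following the 0 / 1 / 2+ candidate branches."""
    if not candidates:           # no symbolic parse: use the generator
        return fallback(token)
    if len(candidates) == 1:     # unambiguous: accept as is
        return candidates[0]
    return max(candidates, key=score)  # ambiguous: highest score wins


# Toy stand-ins for the reranker scores and the OOV decoder.
toy_scores = {
    "ev +Noun +POSS.3PL +ABL": 0.9,
    "ev +Noun +PLU +POSS.3SG +ABL": 0.4,
}
best = select_parse(
    "evlerinden",
    list(toy_scores),
    score=toy_scores.__getitem__,
    fallback=lambda tok: f"{tok} +Unknown",
)
print(best)  # ev +Noun +POSS.3PL +ABL
```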

Generation (OOV fallback, ~4% of tokens): Words Zeyrek cannot parse go to the Dual-Head Decoder: a character-level encoder reads the input, a root classifier predicts the lemma, and a conditional tag decoder emits the suffix sequence one tag at a time.
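Emitting tags "one tag at a time" amounts to a decoding loop that stops at an end-of-sequence marker. The sketch below shows greedy decoding with a hard length cap, using a lookup table in place of a trained decoder; the `EOS` symbol, the `max_tags` default, and the function names are assumptions rather than Aksu internals.

```python
from typing import Callable

EOS = "<eos>"  # hypothetical end-of-sequence symbol


def greedy_decode(
    next_tag: Callable[[str, tuple[str, ...]], str],
    root: str,
    max_tags: int = 12,
) -> list[str]:
    """Ask the model for one tag at a time until EOS or the length cap."""
    tags: list[str] = []
    while len(tags) < max_tags:
        tag = next_tag(root, tuple(tags))  # conditioned on root and prefix
        if tag == EOS:
            break
        tags.append(tag)
    return tags


# Toy "model": a lookup table keyed by (root, tags emitted so far).
TABLE = {
    ("ev", ()): "Noun",
    ("ev", ("Noun",)): "ABL",
    ("ev", ("Noun", "ABL")): EOS,
}
print(greedy_decode(lambda r, p: TABLE[(r, p)], "ev"))  # ['Noun', 'ABL']
```

The length cap guards against a decoder that never emits EOS: the loop then returns a truncated sequence instead of running forever.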

Performance

Measured in This Repository

Every row is generated by a committed script. The run commands are listed under Reproducing the Results.

System	em_string ¹	em_argmax ²	Approach	Encoder	Params	Training
Aksu disambiguator (5-seed ensemble)	98.3%	98.3% ³	Candidate selection	BERTurk (frozen)	1M	16.71 min CPU
Aksu disambiguator (single-seed range)	97.98–98.28%	std 0.11pp ³	Candidate selection	BERTurk (frozen)	1M	CPU
Aksu generation (DualHead)	N/A ⁴	—	Dual-Head Decoder	—	3M	CPU (v1.1)
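The two metric columns can be computed side by side. This sketch follows the definitions in the footnotes (index accuracy versus canonical-string equality) and assumes the gold label is stored as an index into each token's candidate list; the data and the function names are made up for illustration.

```python
def em_argmax(pred_idx: list[int], gold_idx: list[int]) -> float:
    """Share of tokens where the chosen candidate index equals the gold index."""
    hits = sum(p == g for p, g in zip(pred_idx, gold_idx))
    return hits / len(gold_idx)


def em_string(
    candidates: list[list[str]], pred_idx: list[int], gold_idx: list[int]
) -> float:
    """Share of tokens where the chosen canonical string equals the gold string."""
    hits = sum(c[p] == c[g] for c, p, g in zip(candidates, pred_idx, gold_idx))
    return hits / len(gold_idx)


# Toy data: four tokens, one of them ambiguous and predicted wrongly.
candidates = [
    ["ev +Noun +POSS.3PL +ABL", "ev +Noun +PLU +POSS.3SG +ABL"],
    ["çocuk +Noun +PLU"],
    ["çıkmak +Verb +PAST"],
    ["kitap +Noun +AGT +Noun +PLU +ABL"],
]
gold = [0, 0, 0, 0]
pred = [1, 0, 0, 0]

print(em_argmax(pred, gold))              # 0.75
print(em_string(candidates, pred, gold))  # 0.75
```

With no duplicate strings inside any candidate list, the two functions necessarily return the same value, which is the situation reported for this repository's evaluation.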

Reported in Prior Work

Numbers below are taken verbatim from their publications. Not reproduced in this repository. The column "EM (reported)" is not directly comparable to our em_string unless the source specifies the same metric.

System	EM (reported)	Source
MorseDisamb	98.59%	Şeker & Eryiğit (2017) ACL `2017.semeval-1.28` ⁵
Sak et al.	97.81%	Sak et al. (2009) NAACL `2009.naacl-main.19` ⁶
Morse (generation)	97.67%	Şeker & Eryiğit (2017) ⁵
TransMorph	96.25%	Akyürek et al. (2022) ACL `2022.sigmorphon-1.13` ⁷
SIGMORPHON 2019 baseline	92.27%	McCarthy et al. (2019) `2019.sigmorphon-1` ⁸
Yıldız et al.	84.12%	Yıldız et al. (2016) SIU ⁹

Text classification deferred to v1.1 — TTC-3600 corpus requires email-request acquisition. Pipeline code ready at src/aksu/benchmark/run_all_benchmarks.py.

Inference Throughput

CPU throughput on shared SLURM nodes (Orfoz) varies ±30–50% by co-scheduling. Treat as order-of-magnitude guidance; see audit/halt_reports/2026-05-16-berturk-measurement-drift.md.

Component	Speed	Peak RSS	Hardware
Zeyrek candidate generation	1537.3 tok/s	~267 MB	Intel(R) Xeon(R) Platinum 8480+ (Orfoz)
BERTurk embedding	67.9 sent/s ¹⁰	~1.5 GB	Orfoz CPU
Reranker scoring	474.4 tok/s	~10 MB	Orfoz CPU
Dual-Head generation (v1 CPU)	23.2 tok/s ⁴	~492 MB	Orfoz CPU
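For rough capacity planning, the table's rates can be turned into a time range. The sketch below assumes the stated variability applies multiplicatively to the throughput figure and uses the upper bound (50%) by default; `estimate_seconds` is a hypothetical helper and the result is order-of-magnitude guidance only.

```python
def estimate_seconds(
    n_items: int, rate_per_s: float, spread: float = 0.5
) -> tuple[float, float, float]:
    """Return (fastest, nominal, slowest) wall-clock estimates in seconds."""
    nominal = n_items / rate_per_s
    fastest = n_items / (rate_per_s * (1 + spread))  # rate higher than measured
    slowest = n_items / (rate_per_s * (1 - spread))  # rate lower than measured
    return fastest, nominal, slowest


# One million tokens through Zeyrek candidate generation (1537.3 tok/s).
fast, nominal, slow = estimate_seconds(1_000_000, 1537.3)
print(f"{fast:.0f} s to {slow:.0f} s (nominal {nominal:.0f} s)")
# 434 s to 1301 s (nominal 650 s)
```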

Reproducing the Results

All numbers are stored in audit/benchmark_results/metrics.json and generated by committed scripts.

Disambiguation EM (98.3% em_string ensemble):

sbatch scripts/truba/submit_v6_eval_aksu.sh    # akya-cuda GPU, ~2 h
python scripts/ingest_metrics.py
# → models/v6/eval_results.json, models/v6/ensemble_results.json

Training wall-clock (16.71 min):

sbatch scripts/truba/submit_train_disambiguator.sh    # Orfoz CPU, ~17 min
# → models/v6_retimed/training_log.json

Zeyrek throughput (1537.3 tok/s):

sbatch scripts/truba/submit_zeyrek_benchmark.sh    # Orfoz compute node
# → audit/benchmark_results/zeyrek_throughput.json

Corpus entry count (80,537 entries):

python scripts/data/validate_dataset.py
# → data/gold/tr_gold_morph_v1_stats.json

Hardware used:

Task	Hardware	Approx. time
Disambiguation training (1 seed)	Orfoz CPU (Intel Xeon Platinum 8362)	16.71 min
Disambiguation eval (5-seed ensemble)	akya-cuda GPU	~2 h
DualHead training	akya-cuda GPU	~3 h
Zeyrek candidate generation	Any CPU (figure measured on an Orfoz compute node)	1537.3 tok/s

Dataset — TR-Gold-Morph

Version	Entries	Tiers	Status	License
TR-Gold-Morph v1	80,537	gold 2,496 / silver 78,041	✅ Released	CC BY 4.0
TR-Gold-Morph v2	~2.5M target (pipeline ready, harvest pending v1.1)	gold / silver / bronze	🚧 v1.1 roadmap	CC BY 4.0 + per-shard CC BY-SA

Comparison with other public resources:

Resource	Entries	Annotation	License
TR-Gold-Morph v1	80,537	Auto + manual	CC BY 4.0
UniMorph Turkish	275,460	Rule-generated	CC BY-SA
BOUN Treebank	~121,000	Manual	CC BY-SA 4.0
IMST Treebank	~56,000	Manual	CC BY-NC-SA 3.0

Provenance: TR-Gold-Morph v1 derives from BOUN Treebank and Zeyrek candidate sets. Gold tier: 2,496 linguist-verified entries. Silver tier: 78,041 ensemble-confident entries.

TR-Gold-Morph v2 (2.5M target): autolabel pipeline is ready in this repository. Sources: OSCAR-tr (CC0/CC-BY-4.0), mC4-tr (ODC-BY), Wikipedia-tr (CC-BY-SA-3.0), BOUN Treebank (CC-BY-SA-4.0). IMST-UD used only for internal evaluation (NC clause). Source manifest: data/external/manifest.json.

HuggingFace: melikkul/tr-gold-morph (v2 upload pending). License: CC BY 4.0 (main corpus). BOUN/Wikipedia-derived shards carry CC BY-SA — see LICENSE-DATA.

Limitations and Known Gaps

em_argmax vs em_string: In this repository's eval, em_string ≡ em_argmax (no two candidates produce the same canonical string). On other test sets with candidate-string collisions, the two metrics can diverge, so they are not interchangeable in general.
Gold annotation size: Gold tier is 2,496 entries (single annotator). Cohen's κ requires a 200-entry double-annotated overlap — planned for v1.1.
OOV handling: Zeyrek fails on ~4% of web-crawled Turkish (neologisms, foreign proper nouns). The DualHead is designed as a fallback for these cases; the v1 CPU model is undertrained (see footnote ⁴), and OOV-handling accuracy will be reported in v1.1 after GPU retraining.
Text classification deferred: TTC-3600 benchmarks move to v1.1 — dataset requires email-request acquisition (Akın & Akın, 2007). Pipeline code is ready.
Throughput variability: CPU throughput on shared SLURM nodes (Orfoz) varies ±30–50%. All inference figures are order-of-magnitude guidance; see audit/halt_reports/2026-05-16-berturk-measurement-drift.md.
BOUN ShareAlike propagation: Model weights trained on BOUN-derived data should be distributed under CC-BY-SA-4.0. Shards are tracked in data/external/manifest.json.

Roadmap

v1.1 (planned):

TR-Gold-Morph v2 (2.5M auto-labeled entries) — autolabel pipeline already in repository
TTC-3600 text classification benchmark — pending dataset acquisition
DualHead GPU retraining — v1 CPU baseline (28/50 epochs) is undertrained; GPU retraining (akya-cuda, ~3 h) will produce a valid EM figure; see audit/halt_reports/2026-05-16-dualhead-em.md
Cohen's κ on 200-entry double-annotated gold subset

v2.0:

Remove deprecated top-level kokturk / ariturk shim packages
PyPI stable release of aksu (pending Trusted Publishing pre-flight)

Project Structure

src/aksu/
├── kokturk/          # Core morphological atomizer
│   ├── core/         # MorphoAnalyzer, datatypes, cache, phonology
│   ├── models/       # Disambiguator, Dual-Head Decoder, context encoders
│   ├── sklearn_ext/  # sklearn integration
│   └── cli/          # Command-line interface
├── ariturk/          # Turkish text cleaning & normalization
├── train/            # Training scripts, curriculum, losses
├── benchmark/        # Evaluation suite
├── resource/         # TR-Gold-Morph pipeline
└── classify/         # TTC-3600 text classification experiments (deferred to v1.1)

Migration from `kokturk`

Migrating from the legacy kokturk / ariturk packages? The top-level kokturk and ariturk shim packages re-export the symbols that now live under aksu.kokturk and aksu.ariturk and emit a DeprecationWarning, so your existing imports keep working through all v1.x releases. See docs/MIGRATION.md for the full symbol rename table and the v2.0 sunset schedule.
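A re-exporting shim of this kind is commonly built with a module-level `__getattr__` (PEP 562). The sketch below demonstrates the pattern on an in-memory module so that it runs on its own; `make_shim`, the module names, and the warning text are hypothetical, and Aksu's actual shims may be written differently.

```python
import math
import types
import warnings


def make_shim(old_name: str, target: types.ModuleType) -> types.ModuleType:
    """Build a module that forwards attribute access to `target` with a warning."""
    shim = types.ModuleType(old_name)

    def __getattr__(name: str):
        # Called only for attributes the shim does not define itself.
        warnings.warn(
            f"{old_name}.{name} is deprecated; use {target.__name__}.{name}",
            DeprecationWarning,
            stacklevel=2,
        )
        return getattr(target, name)

    shim.__getattr__ = __getattr__
    return shim


# Stand-in demo: 'legacy_math' forwards to the stdlib math module.
legacy = make_shim("legacy_math", math)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = legacy.sqrt(9.0)

print(value)                        # 3.0
print(caught[0].category.__name__)  # DeprecationWarning
```

To surface such warnings while migrating, tests can be run with `python -W error::DeprecationWarning`, which turns each deprecated access into an exception.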

Citation

@thesis{kul2026aksu,
  title={Aksu: Neural Morphological Atomization for Turkish: Gold Standard Corpus Construction and Hybrid Text Classification},
  author={Kul, Melik},
  year={2026},
  school={Ostim Technical University},
}

License

Code: MIT. Dataset (TR-Gold-Morph): CC BY 4.0. BOUN/Wikipedia-derived shards carry CC BY-SA 4.0 — see LICENSE-DATA.

Contributing

Contributions welcome. See CONTRIBUTING.md for the dev setup and PR workflow. The README is rendered from docs/README.md.j2 — edit the template, not README.md (a CI gate in tests/test_readme_render.py enforces this).

Acknowledgments

Zeyrek — Python port of Zemberek
BOUN Treebank
UniMorph
BERTurk by Stefan Schweter

¹ em_string = canonical-string equality. Cross-system comparable. In this repository's eval, no two candidates produce the same canonical string, so em_string ≡ em_argmax. On other test sets with candidate-string collisions, the two metrics can diverge.
² em_argmax = candidate-index accuracy. Within-system metric; not directly comparable across systems.
³ 5-seed ensemble; em_argmax_std=0.11pp; see models/v6/ensemble_results.json.
⁴ The v1 CPU baseline checkpoint (models/dualhead_v1_cpu/best_model.pt, 3M params, 28/50 epochs) is undertrained: the tag decoder generates repetitive sequences and fails to predict EOS reliably in free-running mode (exposure bias). CPU training reached epoch 28 before the wall-clock budget was exhausted. GPU retraining (akya-cuda, ~3 h) is planned for v1.1. Throughput of the v1 CPU model is 23.2 tok/s (measured; see audit/benchmark_results/inference_throughput.json).
⁵ Şeker, G. & Eryiğit, G. (2017). SemEval / ACL Anthology 2017.semeval-1.28.
⁶ Sak, H., Güngör, T. & Saraçlar, M. (2009). NAACL 2009.naacl-main.19.
⁷ Akyürek, A.F., Akyürek, E. & Goldwater, S. (2022). ACL 2022.sigmorphon-1.13.
⁸ McCarthy, A.D. et al. (2019). SIGMORPHON 2019.sigmorphon-1.
⁹ Yıldız, E. et al. (2016). SIU.
¹⁰ CPU throughput varies ±30–50% on shared Orfoz nodes; treat as order-of-magnitude guidance only.

Release history

1.1.0a0 (pre-release, this version): May 17, 2026
1.0.0a0 (pre-release): May 17, 2026
