A Turkish text normalization library (numbers, dates, times, abbreviations, currency, and more)
turkificate
A Turkish text normalization library. It converts numbers, dates, times, phone numbers, Turkish identity numbers, abbreviations, currencies, percentages, ordinals and symbols into their written Turkish form, following Turkish grammar. Built for TTS pre-processing, search indexing and text cleaning.
"The language is the core of our being - Noam Chomsky"
No external dependencies — pure standard library.
Installation
pip install turkificate
Or from a local checkout:
pip install -e .
Quick start
import turkificate
turkificate.turkificate("Dr. Ahmet 15.03.2024'te %25 indirimle 1.250 TL ödedi.")
# "doktor Ahmet on beş Mart iki bin yirmi dört'te yüzde yirmi beş
# indirimle bin iki yüz elli lira ödedi."
turkificate() is the main function; normalize() is kept as an alias for the same call.
The output is intentionally Turkish (e.g. yüz, bin, Mart); that is the whole point of the library. Only the code, API and docs are in English.
Selecting concepts
from turkificate import TurkishNormalizer
tn = TurkishNormalizer(features=["numbers", "dates"])
tn.normalize("Saat 14:30, fiyat 99,90 TL")
# "Saat 14:30, fiyat doksan dokuz virgül doksan TL" (times & currency untouched)
Normalize everything
Pass nothing (the default), or the explicit "all" keyword:
TurkishNormalizer() # all concepts (default)
TurkishNormalizer(features="all") # all concepts
TurkishNormalizer(features=["all"]) # all concepts
TurkishNormalizer(features=turkificate.ALL) # all concepts
List available concepts with turkificate.available_features().
| Concept | Description | Example |
|---|---|---|
| emails | e-mail addresses | info@firma.com → info et firma nokta com |
| urls | URLs | https://firma.com/detay → firma nokta com bölü detay |
| phones | Turkish phone numbers | 0532 123 45 67 → sıfır beş yüz otuz iki yüz yirmi üç kırk beş altmış yedi |
| turkish_ids | valid Turkish identity numbers | 10000000146 → bir sıfır sıfır sıfır sıfır sıfır sıfır sıfır bir dört altı |
| numbers | integer / decimal / signed | 3,5 → üç virgül beş |
| dates | DD.MM.YYYY | 15.03.2024 → on beş Mart iki bin yirmi dört |
| times | HH:MM(:SS) | 14:30 → on dört otuz |
| percent | percent sign | %50 → yüzde elli |
| currency | currencies | 100 TL → yüz lira |
| ordinals | ordinal numbers | 5'inci → beşinci, 3. kat → üçüncü kat |
| units | units of measure | 42 km → kırk iki kilometre, -3 °C → eksi üç derece |
| abbreviations | lexical abbreviations | Dr. → doktor |
| symbols | single-character symbols | & → ve, × → çarpı, ÷ → bölü |
| whitespace | whitespace cleanup | (always the final step) |
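The turkish_ids row only applies to valid identity numbers. The source does not show how validity is checked; the sketch below assumes the commonly documented checksum for 11-digit Turkish identity numbers, and the names is_valid_tc_id and read_digits are hypothetical, not part of the library's API.

```python
DIGIT_WORDS = ("sıfır", "bir", "iki", "üç", "dört", "beş",
               "altı", "yedi", "sekiz", "dokuz")


def is_valid_tc_id(value: str) -> bool:
    """Check length, leading digit and both check digits (assumed rule)."""
    if len(value) != 11 or not (value.isascii() and value.isdigit()):
        return False
    if value[0] == "0":
        return False
    digits = [int(ch) for ch in value]
    odd_sum = sum(digits[0:9:2])    # 1st, 3rd, 5th, 7th, 9th digits
    even_sum = sum(digits[1:8:2])   # 2nd, 4th, 6th, 8th digits
    tenth = (odd_sum * 7 - even_sum) % 10
    eleventh = sum(digits[:10]) % 10
    return digits[9] == tenth and digits[10] == eleventh


def read_digits(value: str) -> str:
    """Read a digit string one digit at a time, as in the table example."""
    return " ".join(DIGIT_WORDS[int(ch)] for ch in value)


if is_valid_tc_id("10000000146"):
    print(read_digits("10000000146"))
```

Reading only checksum-valid numbers keeps arbitrary 11-digit strings, such as order numbers, from being spelled out digit by digit.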
Per-concept options
tn = TurkishNormalizer(options={
    "times": {"prefix_hour": True},  # 09:05 → "saat dokuz beş"
    "ordinals": {"period_ordinals": False},  # disable "3. kat" → "üçüncü kat" (on by default)
})
Per-concept helpers
turkificate.normalize_numbers("3 elma") # "üç elma"
turkificate.normalize_emails("info@firma.com") # "info et firma nokta com"
turkificate.normalize_urls("https://firma.com/detay") # "firma nokta com bölü detay"
turkificate.normalize_phones("0532 123 45 67")
turkificate.normalize_turkish_ids("10000000146")
turkificate.normalize_dates(...)
turkificate.normalize_currency(...)
# normalize_times, normalize_percent, normalize_ordinals,
# normalize_units, normalize_abbreviations, normalize_symbols
Direct number engine:
from turkificate import integer_to_words, integer_to_ordinal, read_number
integer_to_words(1_000_000) # "bir milyon"
integer_to_ordinal(4) # "dördüncü"
read_number("1.234,5") # "bin iki yüz otuz dört virgül beş"
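To make the conversion rules concrete, here is a minimal, self-contained sketch of an integer-to-words routine for Turkish. It is not the library's implementation; to_words and the tables below are hypothetical names. It illustrates two rules visible in the examples above: 1000 is read "bin" (never "bir bin"), while 1 000 000 is read "bir milyon".

```python
ONES = ["", "bir", "iki", "üç", "dört", "beş", "altı", "yedi", "sekiz", "dokuz"]
TENS = ["", "on", "yirmi", "otuz", "kırk", "elli",
        "altmış", "yetmiş", "seksen", "doksan"]
SCALES = ["", "bin", "milyon", "milyar", "trilyon"]


def _below_thousand(n: int) -> list:
    """Words for 0 < n < 1000; 'yüz' never takes a leading 'bir'."""
    words = []
    hundreds, rest = divmod(n, 100)
    if hundreds == 1:
        words.append("yüz")
    elif hundreds > 1:
        words += [ONES[hundreds], "yüz"]
    tens, ones = divmod(rest, 10)
    if tens:
        words.append(TENS[tens])
    if ones:
        words.append(ONES[ones])
    return words


def to_words(n: int) -> str:
    """Spell an integer in Turkish (sketch; supports |n| < 10**15)."""
    if n == 0:
        return "sıfır"
    if n < 0:
        return "eksi " + to_words(-n)
    if n >= 1000 ** len(SCALES):
        raise ValueError("number too large for this sketch")
    groups = []  # three-digit groups, least significant first
    while n:
        n, group = divmod(n, 1000)
        groups.append(group)
    words = []
    for index in range(len(groups) - 1, -1, -1):
        group = groups[index]
        if group == 0:
            continue
        if index == 1 and group == 1:
            words.append("bin")  # 1000 is "bin", not "bir bin"
            continue
        words += _below_thousand(group)
        if SCALES[index]:
            words.append(SCALES[index])
    return " ".join(words)


print(to_words(1250))  # bin iki yüz elli
```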
Adding a new concept
Subclass Normalizer and register it with @register:
from turkificate import Normalizer, register
import re
@register
class EmojiNormalizer(Normalizer):
    name = "emoji"
    def configure(self, **opts):
        self._re = re.compile(r":\)")
    def apply(self, text):
        return self._re.sub("gülen yüz", text)
It is now usable via TurkishNormalizer(features=["emoji", ...]) or "all".
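For readers curious how a decorator-based registry of this kind can work, the following is a minimal sketch under stated assumptions. It does not import the library: REGISTRY, build and this stand-in Normalizer base class are hypothetical and only mirror the shape of the example above.

```python
import re

REGISTRY = {}  # hypothetical mapping: concept name -> normalizer class


def register(cls):
    """Class decorator: record the class under its name, return it unchanged."""
    REGISTRY[cls.name] = cls
    return cls


class Normalizer:
    """Stand-in base class; configure() runs once when an instance is built."""
    name = ""

    def __init__(self, **opts):
        self.configure(**opts)

    def configure(self, **opts):
        pass

    def apply(self, text):
        raise NotImplementedError


@register
class EmojiNormalizer(Normalizer):
    name = "emoji"

    def configure(self, **opts):
        self._re = re.compile(r":\)")

    def apply(self, text):
        return self._re.sub("gülen yüz", text)


def build(features):
    """Instantiate the selected concepts by name, in the order given."""
    return [REGISTRY[name]() for name in features]


steps = build(["emoji"])
print(steps[0].apply("Merhaba :)"))  # Merhaba gülen yüz
```

Because the decorator returns the class unchanged, registration has no effect on how the class behaves when used directly.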
Architecture
- Strategy: each concept is an independent class with a common Normalizer interface.
- Pipeline (Chain): normalizers run in order; number-bearing concepts run before the bare numbers concept to avoid double conversion.
- Registry + Facade: concepts are selected by name; TurkishNormalizer composes them.
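The ordering rule in the pipeline bullet can be illustrated with a small sketch. The step functions and the tiny WORDS lookup are hypothetical stand-ins, not the library's code; they only show why a specific concept such as currency needs to see the digits before the generic numbers step consumes them.

```python
import re

# Tiny lookup standing in for a real number-to-words engine (assumption).
WORDS = {"3": "üç", "100": "yüz"}


def currency(text):
    """Rewrite '<digits> TL' as '<words> lira'."""
    return re.sub(
        r"\b(\d+) TL\b",
        lambda m: WORDS.get(m.group(1), m.group(1)) + " lira",
        text,
    )


def numbers(text):
    """Rewrite any remaining bare digit run as words."""
    return re.sub(r"\b\d+\b", lambda m: WORDS.get(m.group(0), m.group(0)), text)


def run(text, steps):
    """Apply each step in order, feeding the output of one into the next."""
    for step in steps:
        text = step(text)
    return text


text = "3 elma 100 TL"
print(run(text, [currency, numbers]))  # üç elma yüz lira
print(run(text, [numbers, currency]))  # üç elma yüz TL
```

In the second ordering the digits are already gone by the time the currency step runs, so its pattern never matches and the unit is left unexpanded.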
Optimization: every regex is compiled once in the constructor; the abbreviation,
unit and currency dictionaries are compiled into a single alternation regex; the
number-to-words engine is cached with lru_cache; and because apply is pure, a single
instance can be reused across thousands of calls.
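Two of these techniques are easy to show in isolation. The sketch below is an assumption-laden illustration, not the library's code: ABBREVIATIONS is hypothetical sample data, and expand_abbreviations and spell_digits are hypothetical names. It builds one alternation regex from a dictionary and caches a pure function with functools.lru_cache.

```python
import re
from functools import lru_cache

# Hypothetical sample data; the library's real dictionaries are not shown.
ABBREVIATIONS = {"Dr.": "doktor", "Prof.": "profesör", "Doç.": "doçent"}

# Compiled once at import time: a single alternation over every key.
# Longer keys come first so a shorter key cannot shadow a longer one.
_ABBREVIATION_RE = re.compile(
    r"(?<!\w)(?:"
    + "|".join(re.escape(key) for key in sorted(ABBREVIATIONS, key=len, reverse=True))
    + r")"
)


def expand_abbreviations(text: str) -> str:
    """One pass over the text, one dictionary lookup per match."""
    return _ABBREVIATION_RE.sub(lambda m: ABBREVIATIONS[m.group(0)], text)


DIGIT_WORDS = ("sıfır", "bir", "iki", "üç", "dört", "beş",
               "altı", "yedi", "sekiz", "dokuz")


@lru_cache(maxsize=4096)
def spell_digits(number: str) -> str:
    """Pure function of its argument, so repeated inputs can be cached."""
    return " ".join(DIGIT_WORDS[int(ch)] for ch in number)


print(expand_abbreviations("Dr. Ahmet ve Prof. Ayşe"))
```

A single alternation means the text is scanned once regardless of how many entries the dictionary holds, instead of once per entry.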
Known limits (roadmap)
- The period ordinal form (3. kat → "üçüncü kat") is enabled by default. It requires whitespace + a non-whitespace character after the dot, so sentence-final periods are safe. Disable with period_ordinals=False.
- The number engine is one-way; the reverse direction (words → digits) is not yet implemented.
- Context-sensitive suffixes (5'te → "beşte") are not handled yet.
- Roman numerals and fractions (3/4) can be added.
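The period-ordinal rule in the first item can be expressed as a lookahead. This is a sketch of one way to write such a rule, not the library's actual pattern; PERIOD_ORDINAL, period_ordinals and the small ORDINALS lookup are hypothetical.

```python
import re

# Tiny lookup standing in for a real ordinal engine (assumption).
ORDINALS = {"3": "üçüncü", "5": "beşinci"}

# Digits followed by a dot, but only when the dot is followed by whitespace
# and then a non-whitespace character. The lookbehind keeps the pattern from
# starting in the middle of a number or a dotted date.
PERIOD_ORDINAL = re.compile(r"(?<![\d.,])(\d+)\.(?=\s+\S)")


def period_ordinals(text: str) -> str:
    """Replace '<digits>.' with the ordinal word; the dot is consumed."""
    return PERIOD_ORDINAL.sub(
        lambda m: ORDINALS.get(m.group(1), m.group(0)), text
    )


print(period_ordinals("3. kat"))   # üçüncü kat
print(period_ordinals("Saat 3."))  # Saat 3.
```

A number that ends one sentence and is followed by another would still match this sketch; the source does not say how the library treats that case.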
Development
pip install -e ".[dev]"
pytest
Publishing
This repo ships a GitHub Actions workflow (.github/workflows/publish.yml) that
publishes to PyPI via Trusted Publishing (no API tokens) when you create a
GitHub Release. See the project README section below or the PyPI docs on
trusted publishers.
License
MIT — see LICENSE.
Download files
File details
Details for the file turkificate-0.1.2.tar.gz.
File metadata
- Download URL: turkificate-0.1.2.tar.gz
- Upload date:
- Size: 17.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 303e22299b8810fb801504049dd7d467ddeb1346ad17301f05cfc9684e773057 |
| MD5 | f343d029b554cf5fe27bc1430f30abaf |
| BLAKE2b-256 | f371cf64293470916ad3cedd47c32f8ae920eb8ed5dd06aad696e8c59d916178 |
Provenance
The following attestation bundles were made for turkificate-0.1.2.tar.gz:
Publisher: publish.yml on eaysu/turkificate
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: turkificate-0.1.2.tar.gz
  - Subject digest: 303e22299b8810fb801504049dd7d467ddeb1346ad17301f05cfc9684e773057
- Sigstore transparency entry: 1766365564
- Sigstore integration time:
- Permalink: eaysu/turkificate@6a066b857edbd02c875f5ba0830131929b753782
- Branch / Tag: refs/tags/v0.1.2
- Owner: https://github.com/eaysu
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@6a066b857edbd02c875f5ba0830131929b753782
- Trigger Event: release
File details
Details for the file turkificate-0.1.2-py3-none-any.whl.
File metadata
- Download URL: turkificate-0.1.2-py3-none-any.whl
- Upload date:
- Size: 17.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 874ff67f87be915310e63ee835cb89cf78f5c49372c76738809269b9588a56fd |
| MD5 | 6a0d0b827e564727bc4c7b996fdd8671 |
| BLAKE2b-256 | 75661667811923c417e6a3aed6ff62cef20b09e3fb847db0b49e3340325b5ae3 |
Provenance
The following attestation bundles were made for turkificate-0.1.2-py3-none-any.whl:
Publisher: publish.yml on eaysu/turkificate
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: turkificate-0.1.2-py3-none-any.whl
  - Subject digest: 874ff67f87be915310e63ee835cb89cf78f5c49372c76738809269b9588a56fd
- Sigstore transparency entry: 1766365660
- Sigstore integration time:
- Permalink: eaysu/turkificate@6a066b857edbd02c875f5ba0830131929b753782
- Branch / Tag: refs/tags/v0.1.2
- Owner: https://github.com/eaysu
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@6a066b857edbd02c875f5ba0830131929b753782
- Trigger Event: release