A Turkish text normalization library (numbers, dates, times, abbreviations, currency, and more)

turkificate

A Turkish text normalization library. It converts numbers, dates, times, phone numbers, Turkish identity numbers, abbreviations, currencies, percentages, ordinals and symbols into their written Turkish form, following Turkish grammar. Built for TTS pre-processing, search indexing and text cleaning.

"The language is the core of our being - Noam Chomsky"

No external dependencies — pure standard library.

Installation

pip install turkificate

Or from a local checkout:

pip install -e .

Quick start

import turkificate

turkificate.turkificate("Dr. Ahmet 15.03.2024'te %25 indirimle 1.250 TL ödedi.")
# "doktor Ahmet on beş Mart iki bin yirmi dört'te yüzde yirmi beş
#  indirimle bin iki yüz elli lira ödedi."

turkificate() is the main entry point; normalize() is kept as an alias for the same call.

The output is intentionally Turkish (e.g. yüz, bin, Mart) — that is the whole point of the library. Only the code, API and docs are in English.

Selecting concepts

from turkificate import TurkishNormalizer

tn = TurkishNormalizer(features=["numbers", "dates"])
tn.normalize("Saat 14:30, fiyat 99,90 TL")
# "Saat 14:30, fiyat doksan dokuz virgül doksan TL"  (times & currency untouched)

Normalize everything

Pass nothing (the default), or the explicit "all" keyword:

TurkishNormalizer()                          # all concepts (default)
TurkishNormalizer(features="all")            # all concepts
TurkishNormalizer(features=["all"])          # all concepts
TurkishNormalizer(features=turkificate.ALL)  # all concepts

List available concepts with turkificate.available_features().

| Concept | Description | Example |
|---|---|---|
| emails | e-mail addresses | info@firma.com → info et firma nokta com |
| urls | URLs | https://firma.com/detay → firma nokta com bölü detay |
| phones | Turkish phone numbers | 0532 123 45 67 → sıfır beş yüz otuz iki yüz yirmi üç kırk beş altmış yedi |
| turkish_ids | valid Turkish identity numbers | 10000000146 → bir sıfır sıfır sıfır sıfır sıfır sıfır sıfır bir dört altı |
| numbers | integer / decimal / signed | 3,5 → üç virgül beş |
| dates | DD.MM.YYYY | 15.03.2024 → on beş Mart iki bin yirmi dört |
| times | HH:MM(:SS) | 14:30 → on dört otuz |
| percent | percent sign | %50 → yüzde elli |
| currency | currencies | 100 TL → yüz lira |
| ordinals | ordinal numbers | 5'inci → beşinci, 3. kat → üçüncü kat |
| units | units of measure | 42 km → kırk iki kilometre, -3 °C → eksi üç derece |
| abbreviations | lexical abbreviations | Dr. → doktor |
| symbols | single-character symbols | & → ve, × → çarpı, ÷ → bölü |
| whitespace | whitespace cleanup (always the final step) | |
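
The turkish_ids row only fires for valid identity numbers. The README does not show how the library checks validity, so the following is a minimal sketch of the commonly documented T.C. identity number checksum, not the library's own code. The names is_valid_turkish_id and spell_digits are hypothetical.

```python
DIGIT_WORDS = ["sıfır", "bir", "iki", "üç", "dört", "beş", "altı", "yedi", "sekiz", "dokuz"]


def is_valid_turkish_id(value: str) -> bool:
    """Check the widely published 11-digit checksum (assumed, not taken from the library)."""
    if len(value) != 11 or not (value.isascii() and value.isdigit()):
        return False
    if value[0] == "0":  # the first digit is never zero
        return False
    d = [int(ch) for ch in value]
    odd_sum = sum(d[0:9:2])   # 1st, 3rd, 5th, 7th and 9th digits
    even_sum = sum(d[1:8:2])  # 2nd, 4th, 6th and 8th digits
    tenth = (odd_sum * 7 - even_sum) % 10
    eleventh = sum(d[:10]) % 10
    return d[9] == tenth and d[10] == eleventh


def spell_digits(value: str) -> str:
    """Read a digit string one digit at a time."""
    return " ".join(DIGIT_WORDS[int(ch)] for ch in value)


if is_valid_turkish_id("10000000146"):
    print(spell_digits("10000000146"))
```

Gating on the checksum keeps arbitrary 11-digit numbers from being read digit by digit; those would fall through to whatever concept handles plain numbers.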

Per-concept options

tn = TurkishNormalizer(options={
    "times":    {"prefix_hour": True},       # 09:05 → "saat dokuz beş"
    "ordinals": {"period_ordinals": False},  # disable "3. kat" → "üçüncü kat" (on by default)
})

Per-concept helpers

turkificate.normalize_numbers("3 elma")               # "üç elma"
turkificate.normalize_emails("info@firma.com")         # "info et firma nokta com"
turkificate.normalize_urls("https://firma.com/detay")  # "firma nokta com bölü detay"
turkificate.normalize_phones("0532 123 45 67")
turkificate.normalize_turkish_ids("10000000146")
turkificate.normalize_dates(...)
turkificate.normalize_currency(...)
# normalize_times, normalize_percent, normalize_ordinals,
# normalize_units, normalize_abbreviations, normalize_symbols

Direct number engine:

from turkificate import integer_to_words, integer_to_ordinal, read_number
integer_to_words(1_000_000)   # "bir milyon"
integer_to_ordinal(4)         # "dördüncü"
read_number("1.234,5")        # "bin iki yüz otuz dört virgül beş"
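
For readers curious how such a number engine can work, here is a self-contained sketch of Turkish integer reading and ordinal formation. It is an illustration under stated assumptions, not the library's implementation: to_words, to_ordinal and the lookup tables are hypothetical names, and the scale list stops at trilyon.

```python
ONES = ["", "bir", "iki", "üç", "dört", "beş", "altı", "yedi", "sekiz", "dokuz"]
TENS = ["", "on", "yirmi", "otuz", "kırk", "elli", "altmış", "yetmiş", "seksen", "doksan"]
SCALES = ["", "bin", "milyon", "milyar", "trilyon"]

# Ordinal form of every word that can end a number.
ORDINAL_FORMS = {
    "bir": "birinci", "iki": "ikinci", "üç": "üçüncü", "dört": "dördüncü",
    "beş": "beşinci", "altı": "altıncı", "yedi": "yedinci", "sekiz": "sekizinci",
    "dokuz": "dokuzuncu", "on": "onuncu", "yirmi": "yirminci", "otuz": "otuzuncu",
    "kırk": "kırkıncı", "elli": "ellinci", "altmış": "altmışıncı",
    "yetmiş": "yetmişinci", "seksen": "sekseninci", "doksan": "doksanıncı",
    "yüz": "yüzüncü", "bin": "bininci", "milyon": "milyonuncu",
    "milyar": "milyarıncı", "trilyon": "trilyonuncu",
}


def _below_thousand(n: int) -> list:
    """Words for 1..999; 'yüz' alone means one hundred."""
    words = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        if hundreds > 1:
            words.append(ONES[hundreds])
        words.append("yüz")
    tens, ones = divmod(rest, 10)
    if tens:
        words.append(TENS[tens])
    if ones:
        words.append(ONES[ones])
    return words


def to_words(n: int) -> str:
    if n == 0:
        return "sıfır"
    if n < 0:
        return "eksi " + to_words(-n)
    groups = []  # three-digit groups, least significant first
    while n:
        n, group = divmod(n, 1000)
        groups.append(group)
    if len(groups) > len(SCALES):
        raise ValueError("number too large for this sketch")
    words = []
    for index in reversed(range(len(groups))):
        group = groups[index]
        if group == 0:
            continue
        if not (index == 1 and group == 1):  # "bin", never "bir bin"
            words.extend(_below_thousand(group))
        if SCALES[index]:
            words.append(SCALES[index])
    return " ".join(words)


def to_ordinal(n: int) -> str:
    """Only the last word takes the ordinal suffix."""
    if n < 1:
        raise ValueError("ordinals are defined here for n >= 1 only")
    words = to_words(n).split()
    words[-1] = ORDINAL_FORMS[words[-1]]
    return " ".join(words)
```

A lookup table for ordinal endings avoids encoding vowel harmony and consonant softening (dört becomes dördüncü) as rules, at the cost of listing every possible final word.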

Adding a new concept

Subclass Normalizer and register it with @register:

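import re

from turkificate import Normalizer, register

@register
class EmojiNormalizer(Normalizer):
    name = "emoji"  # the name used in features=[...]

    def configure(self, **opts):
        # Compile the pattern once instead of on every apply() call.
        self._re = re.compile(r":\)")

    def apply(self, text):
        # Replace each ":)" with its spoken Turkish form.
        return self._re.sub("gülen yüz", text)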

It is now usable via TurkishNormalizer(features=["emoji", ...]) or "all".

Architecture

  • Strategy — each concept is an independent class with a common Normalizer interface.
  • Pipeline (Chain) — normalizers run in order; number-bearing concepts run before the bare numbers concept to avoid double conversion.
  • Registry + Facade — concepts are selected by name; TurkishNormalizer composes them.
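
The three patterns above can be sketched in a few lines. This is a toy model and not the library's code: the register decorator here takes a name and a priority (unlike the bare @register shown earlier), and Pipeline, the priority values and the tiny number lookup are all assumptions made for illustration.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

Step = Callable[[str], str]

# Hypothetical registry: concept name -> (priority, step). Lower priority runs first.
_REGISTRY: Dict[str, Tuple[int, Step]] = {}


def register(name: str, priority: int) -> Callable[[Step], Step]:
    """Decorator that records a step under a concept name."""
    def decorator(step: Step) -> Step:
        _REGISTRY[name] = (priority, step)
        return step
    return decorator


@register("percent", priority=10)
def percent(text: str) -> str:
    # Replace the percent sign with "yüzde"; the digits are left for the numbers step.
    return re.sub(r"%(\d+)", r"yüzde \1", text)


_TOY_NUMBERS = {"5": "beş", "50": "elli"}  # toy lookup, not a real number engine


@register("numbers", priority=50)
def numbers(text: str) -> str:
    return re.sub(r"\d+", lambda m: _TOY_NUMBERS.get(m.group(), m.group()), text)


@register("whitespace", priority=100)
def whitespace(text: str) -> str:
    return " ".join(text.split())


class Pipeline:
    """Facade: pick steps by name and run them in priority order."""

    def __init__(self, features: Optional[List[str]] = None) -> None:
        names = list(_REGISTRY) if features is None else features
        unknown = [n for n in names if n not in _REGISTRY]
        if unknown:
            raise ValueError(f"unknown features: {unknown}")
        ordered = sorted((_REGISTRY[n] for n in names), key=lambda entry: entry[0])
        self._steps = [step for _, step in ordered]

    def normalize(self, text: str) -> str:
        for step in self._steps:
            text = step(text)
        return text


print(Pipeline().normalize("%50   indirim"))
```

If the numbers step ran first, "%50" would already be "%elli" by the time the percent step looked for digits, which is why the number-bearing step gets the lower priority value here.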

Optimization: every regex is compiled once in the constructor. The abbreviation, unit and currency dictionaries are compiled into a single alternation regex, and the number-to-words engine is wrapped in lru_cache. Because apply is pure, a single instance can be reused across thousands of calls.
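
Two of these techniques are easy to show in isolation. The sketch below builds one alternation regex from a dictionary and memoizes a pure helper with functools.lru_cache. Only the "Dr." entry comes from this README; the other entries and the function names are illustrative assumptions.

```python
import re
from functools import lru_cache

# Only "Dr." appears in the README; the other entries are illustrative.
ABBREVIATIONS = {"Dr.": "doktor", "Prof.": "profesör", "Doç.": "doçent"}

# One pattern for the whole dictionary. Longer keys go first so that a key
# that is a prefix of another cannot shadow it.
_ABBREVIATION_RE = re.compile(
    r"(?<!\w)(?:"
    + "|".join(re.escape(key) for key in sorted(ABBREVIATIONS, key=len, reverse=True))
    + r")"
)


def expand_abbreviations(text: str) -> str:
    """Single pass over the text, one dictionary lookup per match."""
    return _ABBREVIATION_RE.sub(lambda m: ABBREVIATIONS[m.group()], text)


_DIGIT_WORDS = ["sıfır", "bir", "iki", "üç", "dört", "beş", "altı", "yedi", "sekiz", "dokuz"]


@lru_cache(maxsize=4096)
def read_digits(value: str) -> str:
    """Pure function of its argument, so repeated inputs can be served from the cache."""
    return " ".join(_DIGIT_WORDS[int(ch)] for ch in value)


print(expand_abbreviations("Prof. Dr. Ayşe"))
```

The single pattern means the text is scanned once no matter how many dictionary entries exist, and the cache only helps because the helper has no side effects.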

Known limits (roadmap)

  • The period ordinal form (3. kat → "üçüncü kat") is enabled by default. It only fires when the dot is followed by whitespace and then a non-whitespace character, so a period at the end of the text is never read as an ordinal marker. Disable it with period_ordinals=False.
  • The number engine is one-way; the reverse direction (words → digits) is not yet implemented.
  • Context-sensitive suffixes (5'te → "beşte") are not handled yet.
  • Roman numerals and fractions (3/4) are not supported yet and could be added.
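
The period ordinal rule in the first item can be approximated with a lookahead. This is a sketch of the rule as described, not the library's actual pattern; the small ORDINALS table and the function name are hypothetical.

```python
import re

# Tiny illustrative table; a real engine would generate these forms.
ORDINALS = {1: "birinci", 2: "ikinci", 3: "üçüncü", 4: "dördüncü", 5: "beşinci"}

# Digits plus a dot, followed by whitespace and then a non-whitespace character.
# The lookbehind keeps the match from starting in the middle of a longer number.
_PERIOD_ORDINAL_RE = re.compile(r"(?<![\d.,])(\d+)\.(?=\s+\S)")


def expand_period_ordinals(text: str) -> str:
    def replace(match: "re.Match[str]") -> str:
        number = int(match.group(1))
        # Unknown numbers are left untouched in this sketch.
        return ORDINALS.get(number, match.group(0))
    return _PERIOD_ORDINAL_RE.sub(replace, text)


print(expand_period_ordinals("3. kat"))
```

Under this reading of the rule, dates such as 15.03.2024 and a final "3." are left alone, but a number that ends one sentence and is followed by another would still match, so the real implementation may apply further checks.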

Development

pip install -e ".[dev]"
pytest

Publishing

This repo ships a GitHub Actions workflow (.github/workflows/publish.yml) that publishes to PyPI via Trusted Publishing (no API tokens) whenever a GitHub Release is created. See the PyPI docs on trusted publishers for details.

License

MIT — see LICENSE.

Download files

Source Distribution

turkificate-0.1.2.tar.gz (17.1 kB)

Uploaded Source

Built Distribution

turkificate-0.1.2-py3-none-any.whl (17.6 kB)

Uploaded Python 3

File details

Details for the file turkificate-0.1.2.tar.gz.

File metadata

  • Download URL: turkificate-0.1.2.tar.gz
  • Upload date:
  • Size: 17.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for turkificate-0.1.2.tar.gz
Algorithm Hash digest
SHA256 303e22299b8810fb801504049dd7d467ddeb1346ad17301f05cfc9684e773057
MD5 f343d029b554cf5fe27bc1430f30abaf
BLAKE2b-256 f371cf64293470916ad3cedd47c32f8ae920eb8ed5dd06aad696e8c59d916178

Provenance

The following attestation bundles were made for turkificate-0.1.2.tar.gz:

Publisher: publish.yml on eaysu/turkificate

File details

Details for the file turkificate-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: turkificate-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 17.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for turkificate-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 874ff67f87be915310e63ee835cb89cf78f5c49372c76738809269b9588a56fd
MD5 6a0d0b827e564727bc4c7b996fdd8671
BLAKE2b-256 75661667811923c417e6a3aed6ff62cef20b09e3fb847db0b49e3340325b5ae3

Provenance

The following attestation bundles were made for turkificate-0.1.2-py3-none-any.whl:

Publisher: publish.yml on eaysu/turkificate
