A Turkish text normalization library (numbers, dates, times, abbreviations, currency, and more)

turkificate

A Turkish text normalization library. It converts numbers, dates, times, abbreviations, currencies, percentages, ordinals and symbols into their written Turkish form, following Turkish grammar. Built for TTS pre-processing, search indexing and text cleaning.

"The language is the core of our being - Noam Chomsky"

No external dependencies — pure standard library.

Installation

pip install turkificate

Or from a local checkout:

pip install -e .

Quick start

import turkificate

turkificate.turkificate("Dr. Ahmet 15.03.2024'te %25 indirimle 1.250 TL ödedi.")
# "doktor Ahmet on beş Mart iki bin yirmi dört'te yüzde yirmi beş
#  indirimle bin iki yüz elli lira ödedi."

turkificate() is the main function; normalize() is retained as an alias for the same call.

The output is intentionally Turkish (e.g. yüz, bin, Mart) — that is the whole point of the library. Only the code, API and docs are in English.

Selecting concepts

from turkificate import TurkishNormalizer

tn = TurkishNormalizer(features=["numbers", "dates"])
tn.normalize("Saat 14:30, fiyat 99,90 TL")
# "Saat 14:30, fiyat doksan dokuz virgül doksan TL"  (times & currency untouched)

Normalize everything

Pass nothing (the default), or the explicit "all" keyword:

TurkishNormalizer()                          # all concepts (default)
TurkishNormalizer(features="all")            # all concepts
TurkishNormalizer(features=["all"])          # all concepts
TurkishNormalizer(features=turkificate.ALL)  # all concepts

List available concepts with turkificate.available_features().

| Concept | Description | Example |
|---|---|---|
| emails | e-mail addresses | info@firma.com → info et firma nokta com |
| urls | URLs | https://firma.com/detay → firma nokta com bölü detay |
| numbers | integer / decimal / signed | 3,5 → üç virgül beş |
| dates | DD.MM.YYYY | 15.03.2024 → on beş Mart iki bin yirmi dört |
| times | HH:MM(:SS) | 14:30 → on dört otuz |
| percent | percent sign | %50 → yüzde elli |
| currency | currencies | 100 TL → yüz lira |
| ordinals | ordinal numbers | 5'inci → beşinci, 3. kat → üçüncü kat |
| units | units of measure | 42 km → kırk iki kilometre, -3 °C → eksi üç derece |
| abbreviations | lexical abbreviations | Dr. → doktor |
| symbols | single-character symbols | & → ve, × → çarpı, ÷ → bölü |
| whitespace | whitespace cleanup (always the final step) | |
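
The different ways of asking for every concept (no argument, "all", ["all"], or the ALL constant) could be resolved by a small helper like the one below. This is only a sketch: resolve_features, the value of ALL and the AVAILABLE tuple are assumptions made for illustration, not the library's actual internals.

```python
# Hypothetical stand-ins; the library's real constants may differ.
ALL = "all"
AVAILABLE = (
    "emails", "urls", "numbers", "dates", "times", "percent", "currency",
    "ordinals", "units", "abbreviations", "symbols", "whitespace",
)


def resolve_features(features=None):
    """Return the list of concept names selected by `features`."""
    if features is None or features == ALL:
        return list(AVAILABLE)
    if isinstance(features, str):
        features = [features]  # a single name such as "numbers"
    selected = []
    for name in features:
        if name == ALL:
            return list(AVAILABLE)
        if name not in AVAILABLE:
            raise ValueError(f"unknown concept: {name!r}")
        if name not in selected:
            selected.append(name)  # drop duplicates, keep first-seen order
    return selected
```

Rejecting unknown names early, as this sketch does, is one possible design choice; it turns a typo in a feature name into an immediate error rather than a silently skipped step.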

Per-concept options

tn = TurkishNormalizer(options={
    "times":    {"prefix_hour": True},       # 09:05 → "saat dokuz beş"
    "ordinals": {"period_ordinals": False},  # disable "3. kat" → "üçüncü kat" (on by default)
})

Per-concept helpers

turkificate.normalize_numbers("3 elma")               # "üç elma"
turkificate.normalize_emails("info@firma.com")         # "info et firma nokta com"
turkificate.normalize_urls("https://firma.com/detay")  # "firma nokta com bölü detay"
turkificate.normalize_dates(...)
turkificate.normalize_currency(...)
# normalize_times, normalize_percent, normalize_ordinals,
# normalize_units, normalize_abbreviations, normalize_symbols

Direct number engine:

from turkificate import integer_to_words, integer_to_ordinal, read_number
integer_to_words(1_000_000)   # "bir milyon"
integer_to_ordinal(4)         # "dördüncü"
read_number("1.234,5")        # "bin iki yüz otuz dört virgül beş"
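
To show the kind of rules a Turkish number-to-words engine has to follow, here is a minimal standalone sketch. It is not the library's implementation: sketch_integer_to_words and its helper are hypothetical names, and the sketch only covers integers up to the trillions.

```python
ONES = ["", "bir", "iki", "üç", "dört", "beş", "altı", "yedi", "sekiz", "dokuz"]
TENS = ["", "on", "yirmi", "otuz", "kırk", "elli", "altmış", "yetmiş", "seksen", "doksan"]
SCALES = ["", "bin", "milyon", "milyar", "trilyon"]


def _three_digits(n):
    """Words for 0 <= n <= 999 (an empty list for 0)."""
    words = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        # 100 is "yüz", not "bir yüz"; 200 is "iki yüz".
        if hundreds > 1:
            words.append(ONES[hundreds])
        words.append("yüz")
    tens, ones = divmod(rest, 10)
    if tens:
        words.append(TENS[tens])
    if ones:
        words.append(ONES[ones])
    return words


def sketch_integer_to_words(n):
    """Spell out an integer in Turkish (hypothetical helper, not the library API)."""
    if n == 0:
        return "sıfır"
    if n < 0:
        return "eksi " + sketch_integer_to_words(-n)
    groups = []  # three-digit groups, least significant first
    while n:
        n, group = divmod(n, 1000)
        groups.append(group)
    if len(groups) > len(SCALES):
        raise ValueError("number too large for this sketch")
    words = []
    for index in range(len(groups) - 1, -1, -1):
        group = groups[index]
        if group == 0:
            continue
        # 1000 is "bin", never "bir bin", but 1_000_000 is "bir milyon".
        if not (index == 1 and group == 1):
            words.extend(_three_digits(group))
        if SCALES[index]:
            words.append(SCALES[index])
    return " ".join(words)


print(sketch_integer_to_words(1250))       # bin iki yüz elli
print(sketch_integer_to_words(1_000_000))  # bir milyon
```

The two special cases in the comments (no "bir" before "yüz" or "bin") are the main places where a naive digit-group approach would go wrong.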

Adding a new concept

Subclass Normalizer and register it with @register:

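import re

from turkificate import Normalizer, register

@register
class EmojiNormalizer(Normalizer):
    name = "emoji"  # the name used in features=[...]

    def configure(self, **opts):
        # Compile the pattern once rather than on every apply() call.
        self._re = re.compile(r":\)")

    def apply(self, text):
        # Replace every ":)" with its Turkish reading.
        return self._re.sub("gülen yüz", text)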

It is now usable via TurkishNormalizer(features=["emoji", ...]) or "all".

Architecture

  • Strategy — each concept is an independent class with a common Normalizer interface.
  • Pipeline (Chain) — normalizers run in order; number-bearing concepts run before the bare numbers concept to avoid double conversion.
  • Registry + Facade — concepts are selected by name; TurkishNormalizer composes them.
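
The three patterns above can be sketched together in a few lines. This is an illustration under stated assumptions, not the library's code: StepBase, register_step, SketchPipeline and the order attribute are hypothetical names, and the word lookup is a toy.

```python
import re

REGISTRY = {}  # concept name -> step class (hypothetical registry)


def register_step(cls):
    """Class decorator: make a step selectable by its name."""
    REGISTRY[cls.name] = cls
    return cls


class StepBase:
    """Common interface shared by every step (the Strategy part)."""
    name = ""
    order = 100  # assumption: lower values run earlier

    def apply(self, text):
        raise NotImplementedError


@register_step
class PercentStep(StepBase):
    name = "percent"
    order = 10  # number-bearing concept: runs before bare numbers
    _re = re.compile(r"%(\d+)")

    def apply(self, text):
        # Move the percent sign in front of the number as a word.
        return self._re.sub(r"yüzde \1", text)


@register_step
class NumberStep(StepBase):
    name = "numbers"
    order = 50
    _words = {"5": "beş", "50": "elli"}  # toy lookup, not a real engine
    _re = re.compile(r"\d+")

    def apply(self, text):
        return self._re.sub(lambda m: self._words.get(m.group(0), m.group(0)), text)


class SketchPipeline:
    """Facade: pick steps by name and run them in a fixed order."""

    def __init__(self, features):
        chosen = [REGISTRY[name]() for name in features]
        self._steps = sorted(chosen, key=lambda step: step.order)

    def normalize(self, text):
        for step in self._steps:
            text = step.apply(text)
        return text


print(SketchPipeline(["numbers", "percent"]).normalize("%50 indirim"))
# yüzde elli indirim
```

Because the steps are sorted by order rather than by the caller's list, the percent step sees the digits before the bare numbers step rewrites them, whatever order the features are requested in.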

Optimization: every regex is compiled once in the constructor. The abbreviation, unit and currency dictionaries are compiled into a single alternation regex, and the number-to-words engine is memoized with lru_cache. Because apply is pure, a single instance can be reused across thousands of calls.
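
As a rough illustration of two of these optimizations, the sketch below builds one alternation regex from a dictionary and memoizes a lookup with functools.lru_cache. The dictionary entries and the function names are illustrative assumptions, not the library's actual data or API.

```python
import re
from functools import lru_cache

# Illustrative entries only, not the library's actual dictionary.
ABBREVIATIONS = {"Dr.": "doktor", "Prof.": "profesör"}

# One pattern for the whole dictionary. Longer keys go first so they win
# over shorter keys that share a prefix.
_keys = sorted(ABBREVIATIONS, key=len, reverse=True)
ABBREVIATION_RE = re.compile(r"(?<!\w)(?:" + "|".join(map(re.escape, _keys)) + r")")


def expand_abbreviations(text):
    """Replace every known abbreviation in a single pass over the text."""
    return ABBREVIATION_RE.sub(lambda m: ABBREVIATIONS[m.group(0)], text)


DIGIT_WORDS = ("sıfır", "bir", "iki", "üç", "dört", "beş", "altı", "yedi", "sekiz", "dokuz")


@lru_cache(maxsize=1024)
def digit_to_word(digit):
    """Stand-in for a costlier number-to-words call; results are memoized."""
    return DIGIT_WORDS[digit]


print(expand_abbreviations("Prof. Dr. Ayşe"))  # profesör doktor Ayşe
```

A single alternation means the text is scanned once regardless of how many entries the dictionary holds, instead of once per entry.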

Known limits (roadmap)

  • The period ordinal form (3. kat → "üçüncü kat") is enabled by default. It only applies when the dot is followed by whitespace and then a non-whitespace character, so a period that ends the text is left untouched. Disable it with period_ordinals=False.
  • The number engine is one-way; the reverse direction (words → digits) is not yet implemented.
  • Context-sensitive suffixes (5'te → "beşte") are not handled yet.
  • Roman numerals, phone numbers and fractions (3/4) are not handled yet; they are candidates for future concepts.
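
The period ordinal rule from the first item can be expressed as a single regex with a lookahead. The sketch below is an assumption about how such a rule might look, not the library's actual pattern; PERIOD_ORDINAL, expand_period_ordinals and the small ORDINALS table are hypothetical.

```python
import re

# Toy table; a real engine would derive ordinals for any integer.
ORDINALS = {1: "birinci", 2: "ikinci", 3: "üçüncü", 4: "dördüncü", 5: "beşinci"}

# Digits, a dot, then (without consuming it) whitespace followed by a
# non-whitespace character.
PERIOD_ORDINAL = re.compile(r"\b(\d+)\.(?=\s+\S)")


def expand_period_ordinals(text):
    """Rewrite "3. kat" style ordinals; leave everything else unchanged."""
    def replace(match):
        number = int(match.group(1))
        # Numbers outside the toy table are returned as they were.
        return ORDINALS.get(number, match.group(0))
    return PERIOD_ORDINAL.sub(replace, text)


print(expand_period_ordinals("3. kat"))      # üçüncü kat
print(expand_period_ordinals("Toplam 3."))   # Toplam 3.
```

Under this particular pattern, dates such as 15.03.2024 are not affected because the dot is followed by a digit rather than whitespace. A number that ends one sentence and is followed by another sentence would still match, which is a limitation of the sketch.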

Development

pip install -e ".[dev]"
pytest

Publishing

This repo ships a GitHub Actions workflow (.github/workflows/publish.yml) that publishes to PyPI via Trusted Publishing (no API tokens) when you create a GitHub Release. See the PyPI docs on trusted publishers for setup details.

License

MIT — see LICENSE.
