A Turkish text normalization library (numbers, dates, times, abbreviations, currency, and more)

turkificate

A Turkish text normalization library. It converts numbers, dates, times, abbreviations, currencies, percentages, ordinals and symbols into their written Turkish form, following Turkish grammar. Built for TTS pre-processing, search indexing and text cleaning.

"The language is the core of our being - Noam Chomsky"

No external dependencies — pure standard library.

Installation

pip install turkificate

Or from a local checkout:

pip install -e .

Quick start

import turkificate

turkificate.turkificate("Dr. Ahmet 15.03.2024'te %25 indirimle 1.250 TL ödedi.")
# "doktor Ahmet on beş Mart iki bin yirmi dört'te yüzde yirmi beş
#  indirimle bin iki yüz elli lira ödedi."

turkificate() is the main entry point; normalize() is retained as an alias for the same call.

The output is intentionally Turkish (e.g. yüz, bin, Mart) — that is the whole point of the library. Only the code, API and docs are in English.

Selecting concepts

from turkificate import TurkishNormalizer

tn = TurkishNormalizer(features=["numbers", "dates"])
tn.normalize("Saat 14:30, fiyat 99,90 TL")
# "Saat 14:30, fiyat doksan dokuz virgül doksan TL"  (times & currency untouched)

Normalize everything

Pass nothing (the default), or the explicit "all" keyword:

TurkishNormalizer()                          # all concepts (default)
TurkishNormalizer(features="all")            # all concepts
TurkishNormalizer(features=["all"])          # all concepts
TurkishNormalizer(features=turkificate.ALL)  # all concepts

List available concepts with turkificate.available_features().
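
The accepted spellings of "everything" above can be pictured with a small, self-contained sketch. It does not import the library: resolve_features, AVAILABLE and the local ALL sentinel are hypothetical stand-ins, and the concept names are simply copied from the table in this README. The real resolution logic may differ.

```python
ALL = "all"  # hypothetical sentinel, standing in for turkificate.ALL

# Concept names copied from the README table (table order, not run order).
AVAILABLE = (
    "numbers", "dates", "times", "percent", "currency",
    "ordinals", "units", "abbreviations", "symbols", "whitespace",
)


def resolve_features(features=None):
    """Return the tuple of concept names selected by `features`."""
    if features is None:
        return AVAILABLE
    if isinstance(features, str):
        # Treat a bare string as a one-element list ("all" or one concept).
        features = [features]
    names = list(features)
    if ALL in names:
        return AVAILABLE
    unknown = [n for n in names if n not in AVAILABLE]
    if unknown:
        raise ValueError(f"unknown concepts: {unknown}")
    # Return names in a stable order regardless of how the caller listed them.
    return tuple(n for n in AVAILABLE if n in names)


print(resolve_features(["dates", "numbers"]))
```

In this sketch an unknown name raises ValueError early; whether the library raises, warns or ignores unknown names is not stated in this README.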

Concept         Description                   Example
numbers         integer / decimal / signed    3,5 → üç virgül beş
dates           DD.MM.YYYY                    15.03.2024 → on beş Mart iki bin yirmi dört
times           HH:MM(:SS)                    14:30 → on dört otuz
percent         percent sign                  %50 → yüzde elli
currency        currencies                    100 TL → yüz lira
ordinals        ordinal numbers               5'inci → beşinci
units           units of measure              42 km → kırk iki kilometre
abbreviations   lexical abbreviations         Dr. → doktor
symbols         single-character symbols      & → ve
whitespace      whitespace cleanup            (always the final step)

Per-concept options

tn = TurkishNormalizer(options={
    "times": {"prefix_hour": True},        # 09:05 → "saat dokuz beş"
    "ordinals": {"period_ordinals": True}, # "3. kat" → "üçüncü kat"
})
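
To make the effect of an option such as prefix_hour concrete, here is a minimal standalone sketch of an HH:MM reader. It is not the library's implementation: normalize_times_sketch and two_digit_words are hypothetical names, seconds are not handled, and dropping a zero minute ("14:00" read as just the hour) is an assumption of this sketch.

```python
import re

ONES = ["", "bir", "iki", "üç", "dört", "beş", "altı", "yedi", "sekiz", "dokuz"]
TENS = ["", "on", "yirmi", "otuz", "kırk", "elli",
        "altmış", "yetmiş", "seksen", "doksan"]

# Hours 0-23 and minutes 00-59; HH:MM only, no seconds in this sketch.
_TIME_RE = re.compile(r"\b([01]?[0-9]|2[0-3]):([0-5][0-9])\b")


def two_digit_words(n):
    """Turkish words for 0 <= n <= 99."""
    if n == 0:
        return "sıfır"
    tens, ones = divmod(n, 10)
    return " ".join(w for w in (TENS[tens], ONES[ones]) if w)


def normalize_times_sketch(text, prefix_hour=False):
    """Replace HH:MM with words, optionally prefixed by 'saat'."""
    def repl(match):
        hour, minute = int(match.group(1)), int(match.group(2))
        parts = ["saat"] if prefix_hour else []
        parts.append(two_digit_words(hour))
        if minute:  # assumption: a zero minute is simply omitted
            parts.append(two_digit_words(minute))
        return " ".join(parts)

    return _TIME_RE.sub(repl, text)


print(normalize_times_sketch("09:05", prefix_hour=True))
print(normalize_times_sketch("Toplantı 14:30"))
```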

Per-concept helpers

turkificate.normalize_numbers("3 elma")     # "üç elma"
turkificate.normalize_dates(...)
turkificate.normalize_currency(...)
# normalize_times, normalize_percent, normalize_ordinals,
# normalize_units, normalize_abbreviations, normalize_symbols

Direct number engine:

from turkificate import integer_to_words, integer_to_ordinal, read_number
integer_to_words(1_000_000)   # "bir milyon"
integer_to_ordinal(4)         # "dördüncü"
read_number("1.234,5")        # "bin iki yüz otuz dört virgül beş"
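
For readers curious how such an engine can work, the following is a rough standalone sketch of the usual group-by-thousands approach. The names integer_to_words_sketch and read_number_sketch are hypothetical and the code is not taken from the library. It covers non-negative integers below 10^12 only, and reading the fractional part as a whole number is an assumption of the sketch.

```python
ONES = ["", "bir", "iki", "üç", "dört", "beş", "altı", "yedi", "sekiz", "dokuz"]
TENS = ["", "on", "yirmi", "otuz", "kırk", "elli",
        "altmış", "yetmiş", "seksen", "doksan"]
SCALES = ["", "bin", "milyon", "milyar"]


def _below_thousand(n):
    """Word list for 0 <= n < 1000 (empty list for 0)."""
    words = []
    hundreds, rest = divmod(n, 100)
    tens, ones = divmod(rest, 10)
    if hundreds:
        if hundreds > 1:  # 100 is "yüz", not "bir yüz"
            words.append(ONES[hundreds])
        words.append("yüz")
    if tens:
        words.append(TENS[tens])
    if ones:
        words.append(ONES[ones])
    return words


def integer_to_words_sketch(n):
    """Turkish words for a non-negative integer below 10**12."""
    if n < 0:
        raise ValueError("this sketch handles non-negative integers only")
    if n == 0:
        return "sıfır"
    groups = []  # three-digit groups, least significant first
    while n:
        n, group = divmod(n, 1000)
        groups.append(group)
    if len(groups) > len(SCALES):
        raise ValueError("number too large for this sketch")
    words = []
    for idx in range(len(groups) - 1, -1, -1):
        group = groups[idx]
        if group == 0:
            continue
        if not (idx == 1 and group == 1):  # 1000 is "bin", not "bir bin"
            words.extend(_below_thousand(group))
        if SCALES[idx]:
            words.append(SCALES[idx])
    return " ".join(words)


def read_number_sketch(text):
    """Read a Turkish-formatted number: '.' groups thousands, ',' is decimal."""
    integer_part, _, fraction = text.partition(",")
    words = integer_to_words_sketch(int(integer_part.replace(".", "")))
    if fraction:
        # Assumption: the fraction is read as a whole number ("90" -> doksan);
        # leading zeros in the fraction are lost by this simplification.
        words += " virgül " + integer_to_words_sketch(int(fraction))
    return words


print(integer_to_words_sketch(1_000_000))
print(read_number_sketch("1.234,5"))
```

The two special cases (yüz and bin take no leading bir, while bir milyon does) are the main irregularities at this scale.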

Adding a new concept

Subclass Normalizer and register it with @register:

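import re

from turkificate import Normalizer, register


@register
class EmojiNormalizer(Normalizer):
    name = "emoji"  # the name used in features=[...]

    def configure(self, **opts):
        # Compile the pattern once instead of on every apply() call.
        self._re = re.compile(r":\)")

    def apply(self, text):
        # Replace every ":)" with its spoken form ("smiling face").
        return self._re.sub("gülen yüz", text)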

It is now usable via TurkishNormalizer(features=["emoji", ...]) or "all".

Architecture

  • Strategy — each concept is an independent class with a common Normalizer interface.
  • Pipeline (Chain) — normalizers run in order; number-bearing concepts run before the bare numbers concept to avoid double conversion.
  • Registry + Facade — concepts are selected by name; TurkishNormalizer composes them.
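
The three patterns above can be illustrated with a toy, self-contained sketch. Everything in it is a stand-in written for this example (the local register, Normalizer, RUN_ORDER and PipelineSketch are not the library's classes), and it only handles single digits so that the ordering issue stays visible.

```python
import re

DIGITS = {"0": "sıfır", "1": "bir", "2": "iki", "3": "üç", "4": "dört",
          "5": "beş", "6": "altı", "7": "yedi", "8": "sekiz", "9": "dokuz"}

REGISTRY = {}                       # Registry: concept name -> class
RUN_ORDER = ["percent", "numbers"]  # number-bearing concepts run first


def register(cls):
    """Class decorator that records a normalizer under its name."""
    REGISTRY[cls.name] = cls
    return cls


class Normalizer:
    """Strategy interface: every concept implements apply(text) -> text."""
    name = ""

    def apply(self, text):
        raise NotImplementedError


@register
class PercentSketch(Normalizer):
    name = "percent"
    _re = re.compile(r"%([0-9])\b")  # single digit only, for brevity

    def apply(self, text):
        return self._re.sub(lambda m: "yüzde " + DIGITS[m.group(1)], text)


@register
class NumbersSketch(Normalizer):
    name = "numbers"
    _re = re.compile(r"\b[0-9]\b")   # single digit only, for brevity

    def apply(self, text):
        return self._re.sub(lambda m: DIGITS[m.group(0)], text)


class PipelineSketch:
    """Facade: pick concepts by name and chain them in RUN_ORDER."""

    def __init__(self, features):
        ordered = [n for n in RUN_ORDER if n in features]
        self._steps = [REGISTRY[n]() for n in ordered]

    def normalize(self, text):
        for step in self._steps:
            text = step.apply(text)
        return text


print(PipelineSketch(["numbers", "percent"]).normalize("%5 ve 3"))
```

If the bare numbers step ran first, "%5" would become "%beş" and the percent pattern would no longer match, which is the double-conversion problem the ordering is meant to avoid.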

Optimization: every regex is compiled once in the constructor; the abbreviation, unit and currency dictionaries are compiled into a single alternation regex; the number-to-words engine is cached with lru_cache; and because apply is pure, a single instance can be reused across thousands of calls.
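
As an illustration of the single-alternation idea, a dictionary can be turned into one compiled pattern roughly as follows. This is a sketch, not the library's code: compile_alternation and expand_abbreviations_sketch are hypothetical names, and the two-entry table is illustrative only.

```python
import re

# Illustrative table; the library's real dictionary is not shown here.
ABBREVIATIONS = {"Dr.": "doktor", "Prof.": "profesör"}


def compile_alternation(table):
    """Build one regex matching any key of `table`."""
    # Longest keys first, so a longer key wins over a key that is its prefix.
    keys = sorted(table, key=len, reverse=True)
    body = "|".join(re.escape(k) for k in keys)
    # (?<!\w) prevents matching in the middle of a word.
    return re.compile(r"(?<!\w)(?:" + body + r")")


_ABBR_RE = compile_alternation(ABBREVIATIONS)  # compiled once, at import


def expand_abbreviations_sketch(text):
    """Replace every known abbreviation in a single pass over the text."""
    return _ABBR_RE.sub(lambda m: ABBREVIATIONS[m.group(0)], text)


print(expand_abbreviations_sketch("Dr. Ahmet ve Prof. Ayşe"))
```

One pass with one pattern avoids looping over the dictionary and re-scanning the text once per entry.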

Known limits (roadmap)

  • The period ordinal form (3.) is disabled by default because it clashes with a sentence-final period; enable it with period_ordinals=True.
  • The number engine is one-way; the reverse direction (words → digits) is not yet implemented.
  • Context-sensitive suffixes (5'te → "beşte") are not handled yet.
  • Roman numerals, phone numbers and fractions (3/4) are not supported yet and could be added later.
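
The clash behind the first limit can be seen in a small standalone sketch. The lookahead heuristic used here (a digit plus period counts as an ordinal only when a lowercase word follows) is an assumption made for this illustration, not the library's documented rule, and normalize_ordinals_sketch is a hypothetical name covering 1 to 10 only.

```python
import re

ORDINALS = {1: "birinci", 2: "ikinci", 3: "üçüncü", 4: "dördüncü",
            5: "beşinci", 6: "altıncı", 7: "yedinci", 8: "sekizinci",
            9: "dokuzuncu", 10: "onuncu"}

# Unambiguous written form: 5'inci, 3'üncü, 6'ncı ...
_APOSTROPHE_RE = re.compile(
    r"\b([0-9]+)'(?:inci|ıncı|uncu|üncü|nci|ncı|ncu|ncü)\b")

# Ambiguous form: "3." may be an ordinal or the end of a sentence.
# Heuristic (assumption): ordinal only if a lowercase word follows.
_PERIOD_RE = re.compile(r"\b([0-9]+)\.(?=\s+[a-zçğıöşü])")


def _to_ordinal(match):
    # Numbers outside the small table are left untouched.
    return ORDINALS.get(int(match.group(1)), match.group(0))


def normalize_ordinals_sketch(text, period_ordinals=False):
    text = _APOSTROPHE_RE.sub(_to_ordinal, text)
    if period_ordinals:
        text = _PERIOD_RE.sub(_to_ordinal, text)
    return text


print(normalize_ordinals_sketch("3. kat", period_ordinals=True))
print(normalize_ordinals_sketch("Saat 3.", period_ordinals=True))
```

Even with such a heuristic, a sentence that ends in a number and is followed by a lowercase word would be misread, which is one reason to keep the period form opt-in.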

Development

pip install -e ".[dev]"
pytest

Publishing

This repo ships a GitHub Actions workflow (.github/workflows/publish.yml) that publishes to PyPI via Trusted Publishing (no API tokens) when you create a GitHub Release. See the PyPI docs on trusted publishers for setup details.

License

MIT — see LICENSE.
