Durak: modular Turkish NLP preprocessing toolkit.

Durak

Durak is a Turkish natural language processing toolkit focused on reliable preprocessing building blocks. It offers configurable cleaning, tokenization, stopword management, lemmatization adapters, and frequency statistics, so projects can bootstrap robust text pipelines quickly.

Quickstart

Install from PyPI:

pip install durak-nlp

Clean and tokenize Turkish text in seconds:

from durak import clean_text, process_text, tokenize, remove_stopwords

text = "Bu bir test. Durak kolaylaştırır."
tokens = tokenize(clean_text(text))
print(tokens)
# ['bu', 'bir', 'test', '.', 'durak', 'kolaylaştırır', '.']

filtered = remove_stopwords(tokens)
print(filtered)
# ['test', '.', 'durak', 'kolaylaştırır', '.']

processed = process_text(text, remove_stopwords=True)
print(processed)
# ['test', '.', 'durak', 'kolaylaştırır', '.']

# Repair detached suffix tokens (e.g., `Ankara ' da` → `ankara'da`) on demand:
suffix_fixed = process_text(
    "Ankara ' da kaldım ya",
    rejoin_suffixes=True,
    remove_stopwords=True,
)
print(suffix_fixed)
# ["ankara'da", 'kaldım']
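
To illustrate what repairing detached suffix tokens involves, here is a minimal, self-contained sketch. It is not Durak's implementation: the function name `rejoin_detached_suffixes` and the `APOSTROPHES` set are hypothetical, and the sketch assumes the tokens have already been lowercased and split so that the apostrophe stands alone.

```python
from typing import List

# Assumption: both the ASCII and the typographic apostrophe can separate a suffix.
APOSTROPHES = {"'", "’"}


def rejoin_detached_suffixes(tokens: List[str]) -> List[str]:
    """Merge `word`, `'`, `suffix` triples into a single `word'suffix` token."""
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        # A detached suffix shows up as: alphabetic word, apostrophe, alphabetic suffix.
        if (
            i + 2 < len(tokens)
            and tokens[i].isalpha()
            and tokens[i + 1] in APOSTROPHES
            and tokens[i + 2].isalpha()
        ):
            # Normalize to the ASCII apostrophe when joining.
            merged.append(f"{tokens[i]}'{tokens[i + 2]}")
            i += 3
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = ["ankara", "'", "da", "kaldım", "ya"]
rejoined = rejoin_detached_suffixes(tokens)
print(rejoined)
# ["ankara'da", 'kaldım', 'ya']
```

Stopword removal is left out of the sketch, which is why `ya` survives here but not in the `process_text` call above.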

Need a quick lookup? is_stopword("ve") returns True, while list_stopwords()[:5] reveals the first few entries of the curated base set.
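
To make the stopword helpers more concrete, the sketch below shows one way a base set could be combined with custom additions and a keep-list. It is an illustration only: `BASE_STOPWORDS`, `build_stopwords`, and `filter_tokens` are hypothetical names, and the six-word base set is a stand-in, not Durak's curated list.

```python
from typing import FrozenSet, Iterable, List

# Tiny illustrative base set; Durak's curated list is larger and may differ.
BASE_STOPWORDS: FrozenSet[str] = frozenset({"bu", "bir", "ve", "ya", "da", "ile"})


def build_stopwords(
    additions: Iterable[str] = (), keep: Iterable[str] = ()
) -> FrozenSet[str]:
    """Return the base set plus `additions`, minus anything listed in `keep`."""
    return (BASE_STOPWORDS | frozenset(additions)) - frozenset(keep)


def filter_tokens(tokens: Iterable[str], stopwords: FrozenSet[str]) -> List[str]:
    """Drop every token that appears in `stopwords`, preserving order."""
    return [token for token in tokens if token not in stopwords]


tokens = ["bu", "bir", "test", "ve", "durak"]
default_filtered = filter_tokens(tokens, build_stopwords())
custom_filtered = filter_tokens(
    tokens, build_stopwords(additions={"test"}, keep={"bir"})
)
print(default_filtered)  # ['test', 'durak']
print(custom_filtered)   # ['bir', 'durak']
```

Applying the keep-list last means a word listed in both `additions` and `keep` is retained, which is the safer default when a domain term must never be filtered.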

Features

  • Unicode-aware cleaning utilities tuned for Turkish content (social, news, informal text).
  • Configurable stopword management with keep-lists, custom additions, is_stopword, and list_stopwords helpers.
  • Regex-based tokenizer and sentence splitter with clitic and diacritic preservation.
  • Lightweight corpus validator to guard Turkish-specific artefacts.
  • Ready for extension with future lemmatization and subword adapters.
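
As a rough illustration of what regex-based tokenization with clitic and diacritic preservation can look like, consider the sketch below. It does not reproduce Durak's tokenizer: `TOKEN_RE`, `turkish_lower`, and `simple_tokenize` are hypothetical names, and the pattern is a simplified assumption. The lowercasing helper handles the Turkish dotted and dotless I explicitly, because Python's default `str.lower()` maps `I` to `i` rather than `ı`.

```python
import re
from typing import List

# A run of word characters (Unicode-aware, so ç, ğ, ı, ö, ş, ü are kept),
# optionally followed by apostrophe + suffix; otherwise one punctuation mark.
TOKEN_RE = re.compile(r"\w+(?:['’]\w+)*|[^\w\s]")


def turkish_lower(text: str) -> str:
    """Lowercase using Turkish casing rules: I -> ı and İ -> i."""
    return text.replace("I", "ı").replace("İ", "i").lower()


def simple_tokenize(text: str) -> List[str]:
    """Lowercase the text, then split it into word and punctuation tokens."""
    return TOKEN_RE.findall(turkish_lower(text))


print(simple_tokenize("Bu bir test. Durak kolaylaştırır."))
# ['bu', 'bir', 'test', '.', 'durak', 'kolaylaştırır', '.']
print(simple_tokenize("İstanbul'da ILIK bir gün"))
# ["istanbul'da", 'ılık', 'bir', 'gün']
```

Keeping the apostrophe inside the word pattern is what lets `istanbul'da` stay a single token instead of splitting into three.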

Development Setup

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
pytest

Before submitting changes, run:

ruff check .
mypy src
pytest

Refer to CONTRIBUTING.md for the full workflow, coding standards, and release process. The project roadmap lives in ROADMAP.md, and notable changes are tracked in CHANGELOG.md.

Community & Support

License

Durak is distributed under the Durak License v1.2. Commercial or institutional use requires explicit written permission from the author.
