Skip to main content

Durak: modular Turkish NLP preprocessing toolkit.

Project description

Durak

PyPI Python Versions Tests License

Durak logo

Durak is a Turkish natural language processing toolkit focused on reliable preprocessing building blocks. It offers configurable cleaning, tokenisation, stopword management, lemmatisation adapters, and frequency statistics so projects can bootstrap robust text pipelines quickly.

Quickstart

Install from PyPI:

pip install durak-nlp

Clean and tokenize Turkish text in seconds:

from durak import clean_text, process_text, tokenize, remove_stopwords

text = "Bu bir test. Durak kolaylaştırır."
tokens = tokenize(clean_text(text))
print(tokens)
# ['bu', 'bir', 'test', '.', 'durak', 'kolaylaştırır', '.']

filtered = remove_stopwords(tokens)
print(filtered)
# ['test', '.', 'durak', 'kolaylaştırır', '.']

processed = process_text(text, remove_stopwords=True)
print(processed)
# ['test', '.', 'durak', 'kolaylaştırır', '.']

Need a quick lookup? is_stopword("ve") returns True, while list_stopwords()[:5] reveals the first few entries of the curated base set.

Features

  • Unicode-aware cleaning utilities tuned for Turkish content (social, news, informal text).
  • Configurable stopword management with keep-lists, custom additions, is_stopword, and list_stopwords helpers.
  • Regex-based tokenizer and sentence splitter with clitic and diacritic preservation.
  • Lightweight corpus validator to guard Turkish-specific artefacts.
  • Ready for extension with future lemmatization and subword adapters.

Development Setup

python -m venv .venv
source .venv/bin/activate
pip install -e .[dev]
pytest

Before submitting changes, run:

ruff check .
mypy src
pytest

Refer to CONTRIBUTING.md for the full workflow, coding standards, and release process. The project roadmap lives in ROADMAP.md, and notable changes are tracked in CHANGELOG.md.

Community & Support

License

Durak is distributed under the Durak License v1.2. Commercial or institutional use requires explicit written permission from the author.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

durak_nlp-0.2.0.tar.gz (18.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

durak_nlp-0.2.0-py3-none-any.whl (15.5 kB view details)

Uploaded Python 3

File details

Details for the file durak_nlp-0.2.0.tar.gz.

File metadata

  • Download URL: durak_nlp-0.2.0.tar.gz
  • Upload date:
  • Size: 18.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for durak_nlp-0.2.0.tar.gz
Algorithm Hash digest
SHA256 909f09d6de2842ec2e77ddd8618420549c37dba54c0199f13d8da4f53d5ff2f5
MD5 0f3dcccd7fa59bc633ddcffe4d4049a7
BLAKE2b-256 8e657bf6be445722cec76acd7e7f7c114f5aeb7260902fc377f85fad9c3e6514

See more details on using hashes here.

File details

Details for the file durak_nlp-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: durak_nlp-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 15.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for durak_nlp-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 d9c9dd9479dce9c6877708c48b81bbc1cc2b6d0015e924eb88abbbaff92d0d6f
MD5 7b42f727d2670d130516eec0695dfe04
BLAKE2b-256 3783336310476721253583269790d669530f02632fa7b17ba5774c611fbb5292

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page