Durak: modular Turkish NLP preprocessing toolkit.

Durak

Durak is a Turkish natural language processing toolkit focused on reliable preprocessing building blocks. It offers configurable cleaning, tokenisation, stopword management, lemmatisation adapters, and frequency statistics so projects can bootstrap robust text pipelines quickly.

Quickstart

Install from PyPI:

pip install durak-nlp

Clean and tokenize Turkish text in seconds:

from durak import clean_text, process_text, tokenize, remove_stopwords

text = "Bu bir test. Durak kolaylaştırır."
tokens = tokenize(clean_text(text))
print(tokens)
# ['bu', 'bir', 'test', '.', 'durak', 'kolaylaştırır', '.']

filtered = remove_stopwords(tokens)
print(filtered)
# ['test', '.', 'durak', 'kolaylaştırır', '.']

processed = process_text(text, remove_stopwords=True)
print(processed)
# ['test', '.', 'durak', 'kolaylaştırır', '.']

Need a quick lookup? is_stopword("ve") returns True, while list_stopwords()[:5] reveals the first few entries of the curated base set.
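
The lookup helpers above depend on two things this section does not spell out: how tokens are case-folded, and how a base set is combined with custom additions and keep-lists. The following is a standalone, standard-library sketch of one way this can work. Every name in it (turkish_lower, build_stopwords, is_stopword_sketch, BASE_STOPWORDS) is hypothetical and is not Durak's API, and the five-word base set is a toy sample, not Durak's curated list.

```python
# Illustrative sketch only: all names here are hypothetical, not Durak's API.
BASE_STOPWORDS = frozenset({"bu", "bir", "ve", "ile", "için"})  # toy sample set


def turkish_lower(text: str) -> str:
    """Lowercase using Turkish casing rules: 'I' -> 'ı' and 'İ' -> 'i'."""
    # Map the two Turkish-specific capitals first. Plain str.lower() would
    # turn 'I' into 'i' and 'İ' into 'i' followed by a combining dot (U+0307).
    return text.replace("I", "ı").replace("İ", "i").lower()


def build_stopwords(base, additions=(), keep=()):
    """Return base plus additions, minus every word on the keep-list."""
    words = {turkish_lower(w) for w in base}
    words |= {turkish_lower(w) for w in additions}
    return frozenset(words - {turkish_lower(w) for w in keep})


def is_stopword_sketch(token, stopwords=BASE_STOPWORDS):
    """Case-insensitive membership test with Turkish-aware folding."""
    return turkish_lower(token) in stopwords


# Add "şey" as a custom stopword and protect "bir" via the keep-list.
custom = build_stopwords(BASE_STOPWORDS, additions=["Şey"], keep=["bir"])

print(turkish_lower("IŞIK"))              # ışık
print(is_stopword_sketch("VE"))           # True
print(is_stopword_sketch("bir", custom))  # False
print(is_stopword_sketch("şey", custom))  # True
```

Applying the keep-list last means a kept word always survives, even if it also appears among the additions. Whether Durak resolves that conflict the same way is not stated here.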

Features

  • Unicode-aware cleaning utilities tuned for Turkish content (social, news, informal text).
  • Configurable stopword management with keep-lists, custom additions, is_stopword, and list_stopwords helpers.
  • Regex-based tokenizer and sentence splitter with clitic and diacritic preservation.
  • Lightweight corpus validator to guard against Turkish-specific artefacts.
  • Ready for extension with future lemmatization and subword adapters.
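
To make the tokenizer bullet above concrete, here is a minimal regex-based sketch that keeps apostrophe-attached suffixes (as in "Ankara'ya") and Turkish diacritics intact. It uses only the standard library. The patterns and the names tokenize_sketch and split_sentences_sketch are assumptions chosen for illustration, and Durak's actual rules may differ.

```python
import re

# A word, optionally followed by apostrophe-attached suffixes, or a single
# punctuation character. In Python 3, \w already matches letters such as
# ç, ğ, ı, ö, ş and ü in str patterns.
TOKEN_RE = re.compile(r"\w+(?:['’]\w+)*|[^\w\s]")

# Naive boundary: whitespace that directly follows '.', '!' or '?'.
SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+")


def tokenize_sketch(text):
    """Split text into word and punctuation tokens, keeping clitics attached."""
    return TOKEN_RE.findall(text)


def split_sentences_sketch(text):
    """Split on sentence-final punctuation followed by whitespace."""
    return [s for s in SENTENCE_END_RE.split(text.strip()) if s]


text = "Ankara'ya gidiyorum. Çağrı şimdi geldi!"
print(split_sentences_sketch(text))
# ["Ankara'ya gidiyorum.", 'Çağrı şimdi geldi!']
print(tokenize_sketch(text))
# ["Ankara'ya", 'gidiyorum', '.', 'Çağrı', 'şimdi', 'geldi', '!']
```

This sketch does not handle abbreviations such as "Dr." or decimal numbers, which a production sentence splitter would need to treat as non-boundaries.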

Development Setup

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
pytest

Before submitting changes, run:

ruff check .
mypy src
pytest

Refer to CONTRIBUTING.md for the full workflow, coding standards, and release process. The project roadmap lives in ROADMAP.md, and notable changes are tracked in CHANGELOG.md.

License

Durak is distributed under the Durak License v1.2. Commercial or institutional use requires explicit written permission from the author.
