Durak: modular Turkish NLP preprocessing toolkit.

Durak

Durak is a Turkish natural language processing toolkit focused on reliable preprocessing building blocks. It offers configurable cleaning, tokenization, stopword management, lemmatization adapters, and frequency statistics, so projects can bootstrap robust text pipelines quickly.

Quickstart

1. Install

pip install durak-nlp

2. Minimal pipeline

from durak import process_text

entries = [
    "Türkiye'de NLP zor. Durak kolaylaştırır.",
    "Ankara ' da kaldım.",
]

tokens = [
    process_text(
        entry,
        remove_stopwords=True,
        rejoin_suffixes=True,  # glue detached suffixes before filtering
    )
    for entry in entries
]

print(tokens[0])
# ["türkiye'de", "nlp", "zor", ".", "durak", "kolaylaştırır", "."]

print(tokens[1])
# ["ankara'da", "kaldım", "."]

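The pipeline runs its steps in a fixed order: clean → tokenize → rejoin detached suffixes (when enabled) → remove stopwords (when enabled). Rejoining before filtering means a detached suffix in a noisy social-media string, such as the "da" in "Ankara ' da", is reattached to its word before stopword removal runs, rather than being evaluated as a standalone token.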
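To illustrate why the step order matters, here is a minimal, self-contained sketch. It does not use Durak: every `toy_*` function and the `TOY_STOPWORDS` set are hypothetical stand-ins written for this example, and the real implementations are likely more thorough. The sketch only assumes that a detached suffix such as "da" can collide with a stopword.

```python
import re

# Hypothetical stand-ins, NOT Durak's implementation.
TOY_STOPWORDS = {"ve", "da", "de", "bir"}


def toy_clean(text: str) -> str:
    """Collapse whitespace and lowercase (ignores Turkish-specific casing)."""
    return re.sub(r"\s+", " ", text).strip().lower()


def toy_tokenize(text: str) -> list[str]:
    """Split into words (optionally with an attached 'suffix) and punctuation."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)


def toy_rejoin(tokens: list[str]) -> list[str]:
    """Glue the pattern [word, "'", suffix] back into a single token."""
    out: list[str] = []
    i = 0
    while i < len(tokens):
        if (
            i + 2 < len(tokens)
            and tokens[i + 1] == "'"
            and tokens[i].isalpha()
            and tokens[i + 2].isalpha()
        ):
            out.append(f"{tokens[i]}'{tokens[i + 2]}")
            i += 3
        else:
            out.append(tokens[i])
            i += 1
    return out


def toy_remove_stopwords(tokens: list[str]) -> list[str]:
    """Drop tokens that appear in the toy stopword set."""
    return [t for t in tokens if t not in TOY_STOPWORDS]


def toy_pipeline(text: str, rejoin_first: bool = True) -> list[str]:
    """Run clean -> tokenize, then rejoin and filter in the chosen order."""
    tokens = toy_tokenize(toy_clean(text))
    if rejoin_first:
        return toy_remove_stopwords(toy_rejoin(tokens))
    return toy_rejoin(toy_remove_stopwords(tokens))


print(toy_pipeline("Ankara ' da kaldım."))
# ["ankara'da", "kaldım", "."]

print(toy_pipeline("Ankara ' da kaldım.", rejoin_first=False))
# ["ankara'kaldım", "."]
```

With the order reversed, the toy filter removes "da" first, and the rejoin step then glues the apostrophe to the wrong word. Rejoining before filtering avoids that failure mode.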

Need a quick lookup? is_stopword("ve") returns True, while list_stopwords()[:5] reveals the first few entries of the curated base set.

3. Building blocks à la carte

from durak import (
    StopwordManager,
    attach_detached_suffixes,
    clean_text,
    remove_stopwords,
    tokenize,
)

text = "İstanbul ' a vapurla geçtik."
cleaned = clean_text(text)
tokens = tokenize(cleaned)
tokens = attach_detached_suffixes(tokens)

# Keep custom terms while extending the curated stopwords
manager = StopwordManager(additions=["vapurla"], keep=["istanbul'a"])
filtered = remove_stopwords(tokens, manager=manager)

print(filtered)
# ["istanbul'a", "geçtik", "."]
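One way to think about `additions` and `keep` is as set operations on the base stopword list. The sketch below is a hypothetical model, not Durak's `StopwordManager`: the class name `ToyStopwordManager`, the tiny base set, and the assumption that `keep` takes precedence over both the base set and `additions` are all illustrative.

```python
from typing import Iterable


class ToyStopwordManager:
    """Hypothetical model of a stopword manager; not Durak's implementation."""

    def __init__(
        self,
        base: Iterable[str],
        additions: Iterable[str] = (),
        keep: Iterable[str] = (),
    ) -> None:
        # Assumption: additions extend the base set, and the keep-list is
        # subtracted last, so it overrides both.
        self._stopwords = (set(base) | set(additions)) - set(keep)

    def is_stopword(self, token: str) -> bool:
        return token in self._stopwords


def toy_filter(tokens: list[str], manager: ToyStopwordManager) -> list[str]:
    """Keep only the tokens the manager does not treat as stopwords."""
    return [t for t in tokens if not manager.is_stopword(t)]


manager = ToyStopwordManager(
    base={"ve", "bir", "ama"},
    additions=["vapurla"],  # treat this domain term as a stopword
    keep=["ama"],           # never filter this token, even though it is in base
)

print(toy_filter(["istanbul'a", "vapurla", "geçtik", "ama", "."], manager))
# ["istanbul'a", "geçtik", "ama", "."]
```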

Features

  • Unicode-aware cleaning utilities tuned for Turkish content (social, news, informal text).
  • Configurable stopword management with keep-lists, custom additions, is_stopword, and list_stopwords helpers.
  • Regex-based tokenizer and sentence splitter with clitic and diacritic preservation.
  • Lightweight corpus validator to guard against Turkish-specific artefacts.
  • Ready for extension with future lemmatization and subword adapters.
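As background on why Turkish text benefits from Unicode-aware cleaning, consider lowercasing. Python's built-in `str.lower()` follows the default Unicode mapping, which turns "I" into "i" and "İ" into "i" plus a combining dot, while Turkish expects "I" to become "ı" and "İ" to become "i". The helper below, `turkish_lower`, is a hypothetical sketch of one common workaround and is not taken from Durak's source.

```python
def turkish_lower(text: str) -> str:
    """Lowercase with Turkish dotted/dotless i handling (illustrative only)."""
    # Map the two problematic capitals first, then defer to str.lower().
    return (
        text.replace("\u0130", "i")   # İ (dotted capital I) -> i
            .replace("I", "\u0131")   # I (dotless capital)  -> ı
            .lower()
    )


word = "\u0130stanbul"  # "İstanbul"

# The default mapping leaves a stray combining dot (U+0307) after the "i".
print(len(word.lower()))    # 9
print(turkish_lower(word))  # istanbul
print(turkish_lower("IĞDIR"))  # ığdır
```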

Development Setup

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"  # quotes keep shells such as zsh from expanding the brackets
pytest

Before submitting changes, run:

ruff check .
mypy src
pytest

Refer to CONTRIBUTING.md for the full workflow, coding standards, and release process. The project roadmap lives in ROADMAP.md, and notable changes are tracked in CHANGELOG.md.

License

Durak is distributed under the Durak License v1.2. Commercial or institutional use requires explicit written permission from the author.
