Durak: modular Turkish NLP preprocessing toolkit.
Durak
Durak is a Turkish natural language processing toolkit focused on reliable preprocessing building blocks. It offers configurable cleaning, tokenisation, stopword management, lemmatisation adapters, and frequency statistics so projects can bootstrap robust text pipelines quickly.
- Homepage: karagoz.io
- Repository: github.com/fbkaragoz/durak
- Issue tracker: github.com/fbkaragoz/durak/issues
Quickstart
1. Install
pip install durak-nlp
2. Minimal pipeline
from durak import process_text

entries = [
    "Türkiye'de NLP zor. Durak kolaylaştırır.",
    "Ankara ' da kaldım.",
]
tokens = [
    process_text(
        entry,
        remove_stopwords=True,
        rejoin_suffixes=True,  # glue detached suffixes before filtering
    )
    for entry in entries
]
print(tokens[0])
# ["türkiye'de", "nlp", "zor", ".", "durak", "kolaylaştırır", "."]
print(tokens[1])
# ["ankara'da", "kaldım", "."]
The pipeline executes the steps in order: clean → tokenize → rejoin detached suffixes (when enabled) → remove stopwords (when enabled). This keeps noisy social-media strings consistent before filtering.
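To illustrate why the step order matters, here is a minimal pure-Python sketch of the same four stages. It does not use Durak's code: every name in it (`toy_clean`, `toy_tokenize`, `toy_rejoin`, `toy_remove_stopwords`, `TOY_STOPWORDS`, `TOY_SUFFIXES`) is a hypothetical stand-in, and the stopword and suffix sets are tiny illustrative samples rather than Durak's curated lists.

```python
import re

# Hypothetical toy data, not Durak's curated resources.
TOY_STOPWORDS = {"ve", "bir", "da", "de"}
TOY_SUFFIXES = {"da", "de", "ta", "te", "a", "e", "ya", "ye"}
APOSTROPHES = {"'", "’"}


def toy_clean(text: str) -> str:
    """Collapse whitespace runs and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def toy_tokenize(text: str) -> list[str]:
    """Lowercase, then split into words (internal apostrophes kept) and punctuation."""
    return re.findall(r"\w+(?:['’]\w+)*|[^\w\s]", text.lower())


def toy_rejoin(tokens: list[str]) -> list[str]:
    """Glue `word ' suffix` triples back into a single `word'suffix` token."""
    out: list[str] = []
    i = 0
    while i < len(tokens):
        is_detached = (
            tokens[i] in APOSTROPHES
            and out
            and out[-1].isalpha()
            and i + 1 < len(tokens)
            and tokens[i + 1] in TOY_SUFFIXES
        )
        if is_detached:
            out[-1] = out[-1] + tokens[i] + tokens[i + 1]
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def toy_remove_stopwords(tokens: list[str]) -> list[str]:
    """Drop tokens that appear in the toy stopword set."""
    return [t for t in tokens if t not in TOY_STOPWORDS]


tokens = toy_tokenize(toy_clean("Ankara  '  da kaldım ve döndüm."))

# Documented order: rejoin first, then filter.
print(toy_remove_stopwords(toy_rejoin(tokens)))
# ["ankara'da", 'kaldım', 'döndüm', '.']

# Reversed order: the detached suffix "da" is filtered out as a stopword.
print(toy_rejoin(toy_remove_stopwords(tokens)))
# ['ankara', "'", 'kaldım', 'döndüm', '.']
```

In this sketch, filtering before rejoining loses the suffix because a detached `da` is indistinguishable from the conjunction `da`. That is one plausible reason for rejoining first, though Durak's actual rationale may differ.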
Need a quick lookup? is_stopword("ve") returns True, while list_stopwords()[:5] reveals the first few entries of the curated base set.
3. Build blocks à la carte
from durak import (
    StopwordManager,
    attach_detached_suffixes,
    clean_text,
    remove_stopwords,
    tokenize,
)

text = "İstanbul ' a vapurla geçtik."
cleaned = clean_text(text)
tokens = tokenize(cleaned)
tokens = attach_detached_suffixes(tokens)

# Keep custom terms while extending the curated stopwords
manager = StopwordManager(additions=["vapurla"], keep=["istanbul'a"])
filtered = remove_stopwords(tokens, manager=manager)
print(filtered)
# ["istanbul'a", "geçtik", "."]
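One way to picture how additions and keep-lists might combine is plain set arithmetic. The sketch below is not Durak's `StopwordManager`: `build_stopword_set` and `filter_tokens` are hypothetical helpers, the base set is a three-word sample, and the rule that a keep entry overrides both the base set and the additions is an assumption rather than documented behaviour.

```python
from collections.abc import Iterable


def build_stopword_set(
    base: Iterable[str],
    additions: Iterable[str] = (),
    keep: Iterable[str] = (),
) -> set[str]:
    """Effective set = (base | additions) - keep, so keep wins (assumed rule)."""
    return (set(base) | set(additions)) - set(keep)


def filter_tokens(tokens: list[str], stopwords: set[str]) -> list[str]:
    """Return the tokens that are not stopwords, preserving order."""
    return [t for t in tokens if t not in stopwords]


toy_base = {"ve", "bir", "ile"}  # illustrative sample, not the curated list
stopwords = build_stopword_set(toy_base, additions=["vapurla"], keep=["ile"])

tokens = ["istanbul'a", "vapurla", "ve", "arkadaşım", "ile", "geçtik", "."]
print(filter_tokens(tokens, stopwords))
# ["istanbul'a", 'arkadaşım', 'ile', 'geçtik', '.']
```

Here `vapurla` is dropped because it was added, `ve` is dropped as a base stopword, and `ile` survives because it is on the keep-list.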
Features
- Unicode-aware cleaning utilities tuned for Turkish content (social, news, informal text).
- Configurable stopword management with keep-lists, custom additions, and the is_stopword and list_stopwords helpers.
- Regex-based tokenizer and sentence splitter with clitic and diacritic preservation.
- Lightweight corpus validator to guard Turkish-specific artefacts.
- Ready for extension with future lemmatization and subword adapters.
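A concrete case of why Turkish text needs Unicode-aware handling is lowercasing. Python's built-in `str.lower()` maps `İ` to `i` plus a combining dot (U+0307) and maps `I` to `i` rather than `ı`. The sketch below shows one common workaround. `turkish_lower` is a hypothetical helper, not part of Durak's API, and whether Durak handles casing this way is an assumption.

```python
def turkish_lower(text: str) -> str:
    """Lowercase with Turkish dotted/dotless i rules applied before str.lower()."""
    # Map the two Turkish-specific capitals explicitly, then lowercase the rest.
    return text.replace("İ", "i").replace("I", "ı").lower()


# The default lowercase keeps a stray combining dot after the "i".
print(len("İ".lower()))
# 2

print(turkish_lower("İstanbul"))
# istanbul

print(turkish_lower("IĞDIR"))
# ığdır
```

This mapping is only appropriate for Turkish text: applied to mixed-language input it would also turn an English `I` into `ı`, so a real pipeline would need to decide how to treat such tokens.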
Development Setup
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
pytest
Before submitting changes, run:
ruff check .
mypy src
pytest
Refer to CONTRIBUTING.md for the full workflow, coding standards, and release process. The project roadmap lives in ROADMAP.md, and notable changes are tracked in CHANGELOG.md.
Community & Support
- Code of Conduct: CODE_OF_CONDUCT.md
- Security policy: SECURITY.md
- Citation guidance: CITATION.cff
- Topics: turkish-nlp, nlp, tokenization, lemmatization, text-processing, pre-processing, python
License
Durak is distributed under the Durak License v1.2. Commercial or institutional use requires explicit written permission from the author.
File details
Details for the file durak_nlp-0.2.4.tar.gz.
File metadata
- Download URL: durak_nlp-0.2.4.tar.gz
- Upload date:
- Size: 20.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d9f4f2caba95e571fe64dce308ce9c048624e16363dce3fbc36b2c07cd95f65d |
| MD5 | dcd3962bc915a6681d94bfebff29d779 |
| BLAKE2b-256 | fe17d1849971361fa516d8bcb1fb6a9bd031379f8be00be5880460c6eada42e7 |
File details
Details for the file durak_nlp-0.2.4-py3-none-any.whl.
File metadata
- Download URL: durak_nlp-0.2.4-py3-none-any.whl
- Upload date:
- Size: 17.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ba5ad3a626c6041a8ee34e0c744e7b30127dbd68cc585184654363132b78a6c2 |
| MD5 | 72925d9725eea1447b1ec0d8c0a2d240 |
| BLAKE2b-256 | 9d39ce1377d23f47dfa8fc6414646da44c55ea7e66628c1260d12615954ce13d |