KVKK-compliant Turkish PII detection library

kvkk-pii

KVKK-compliant Turkish PII detection library — fully on-premise, no cloud.

Detect, anonymize, and protect personally identifiable information in Turkish text. Built for KVKK (Turkish data protection law) compliance, with a 3-layer architecture that combines regex, NER, and zero-shot classification.

from kvkk_pii import PiiDetector

detector = PiiDetector()
result = detector.analyze("Ali Veli, TC: 10000000146, tel: 0532 123 45 67")

for e in result.entities:
    print(e)
# PiiEntity(type='TC_KIMLIK', text='10000000146', start=14, end=25, score=1.00, layer='regex')
# PiiEntity(type='TELEFON_TR', text='0532 123 45 67', start=32, end=46, score=1.00, layer='regex')

Features

  • Zero cloud — all models run locally, no data leaves your machine
  • 3-layer detection: Regex + checksum → XLM-RoBERTa NER → GLiNER zero-shot
  • KVKK Madde 6 support — special categories: health, religion, biometrics, political opinion
  • LLM proxy — mask PII before sending to AI, restore in the response, detect leakage
  • Compliance report — maps detected entities to KVKK articles and risk levels
  • Pluggable — add custom recognizers, tune thresholds per entity type
  • Async: AsyncPiiDetector for FastAPI / async applications
  • CLI: kvkk-pii scan, kvkk-pii anonymize

Installation

# Layer 1 only — regex + checksum (no dependencies)
pip install kvkk-pii

# + Layer 2 — XLM-RoBERTa NER (~450 MB, Turkish NER)
pip install "kvkk-pii[ner]"

# + Layer 3 — GLiNER zero-shot (~180 MB, KVKK Madde 6)
pip install "kvkk-pii[full]"

Models are downloaded from HuggingFace on first use and cached at ~/.cache/huggingface/hub.


Quickstart

Detect & Anonymize

from kvkk_pii import PiiDetector

detector = PiiDetector()  # regex only (Layer 1)

text = "Müşteri Ali Veli, IBAN: TR33 0006 1005 1978 6457 8413 26, e-posta: ali@example.com"
result = detector.analyze(text)

print(result.entities)
# [PiiEntity(type='IBAN_TR', ...), PiiEntity(type='EMAIL', ...)]

print(detector.anonymize(text))
# "Müşteri Ali Veli, IBAN: [IBAN_TR], e-posta: [EMAIL]"
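
Anonymization of this kind can be pictured as replacing each detected span with a bracketed type label. The sketch below illustrates that idea under stated assumptions; it is not the library's implementation, and `Span` and `anonymize_spans` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for a detected entity (not the library's PiiEntity)."""
    type: str
    start: int
    end: int


def anonymize_spans(text: str, spans: list[Span]) -> str:
    """Replace each span with "[TYPE]".

    Spans are processed right to left so that the offsets of spans
    earlier in the text stay valid while the string changes length.
    """
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"[{span.type}]" + text[span.end:]
    return text


text = "Ali Veli TC: 10000000146"
start = text.index("10000000146")
spans = [Span("TC_KIMLIK", start, start + 11)]
print(anonymize_spans(text, spans))
# → Ali Veli TC: [TC_KIMLIK]
```

Working from the end of the string backwards avoids having to recompute offsets after each replacement.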

With NER (Person, Location, Organization)

detector = PiiDetector(layers=["regex", "ner"])
# First run: prompts to download akdeniz27/xlm-roberta-base-turkish-ner (~450 MB)

result = detector.analyze("Ahmet Yılmaz, İstanbul'daki Türk Telekom şubesine gitti.")
# Detects: KISI_ADI (Ahmet Yılmaz), KONUM (İstanbul), KURUM (Türk Telekom)

With GLiNER — KVKK Madde 6 Special Categories

detector = PiiDetector(layers=["regex", "ner", "gliner"])

result = detector.analyze("Hasta diyabet tedavisi görüyor, Sünni mezhebine mensup.")
# Detects: SAGLIK_VERISI, DINI_INANC

Ready-Made Presets

from kvkk_pii import presets

detector = presets.turkish()      # Regex + NER (TR) + GLiNER — full KVKK coverage
detector = presets.german()       # Regex (DE) + GLiNER — DSGVO
detector = presets.french()       # Regex (FR) + GLiNER — RGPD
detector = presets.multilingual() # TR + DE + FR together

Layer Architecture

Layer  Method            Model                                    Speed  Detects
1      Regex + checksum  (none)                                   <1ms   TC Kimlik, IBAN, VKN, phone, plate, email, passport
2      NER               akdeniz27/xlm-roberta-base-turkish-ner   ~30ms  Person, Location, Organization
3      Zero-shot NER     urchade/gliner_multi-v2.1                ~80ms  KVKK Madde 6 special categories

Each layer only processes spans not already found by a previous layer, avoiding double-detection.
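
One way to picture this precedence rule is as an overlap filter: spans from earlier layers are kept, and any later-layer span that overlaps an accepted span is dropped. The sketch below filters after the fact, which may differ from how the library skips already-covered text internally; `Span`, `overlaps`, and `merge_layers` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical detected span; offsets are half-open [start, end)."""
    type: str
    start: int
    end: int
    layer: str


def overlaps(a: Span, b: Span) -> bool:
    """True when two half-open spans share at least one character."""
    return a.start < b.end and b.start < a.end


def merge_layers(*layer_results: list[Span]) -> list[Span]:
    """Merge per-layer results in priority order.

    Earlier layers win: a span is accepted only if it does not overlap
    any span that was already accepted.
    """
    accepted: list[Span] = []
    for spans in layer_results:
        for span in spans:
            if not any(overlaps(span, kept) for kept in accepted):
                accepted.append(span)
    return sorted(accepted, key=lambda s: s.start)


regex_hits = [Span("TC_KIMLIK", 14, 25, "regex")]
ner_hits = [Span("KISI_ADI", 0, 8, "ner"), Span("KONUM", 20, 25, "ner")]
merged = merge_layers(regex_hits, ner_hits)
print([(s.type, s.layer) for s in merged])
# → [('KISI_ADI', 'ner'), ('TC_KIMLIK', 'regex')]
```

The made-up `KONUM` span overlaps the regex hit, so only the regex result survives for that range.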


LLM Proxy

Protect PII when sending text to external AI services. Mask before sending, restore after, detect any leakage.

Session-Based Masking

detector = PiiDetector(layers=["regex", "ner"])

session = detector.create_session("Ali Veli TC: 10000000146 hakkında bilgi ver.")
masked = session.mask()
# → "[KISI_ADI_x7k] TC: [TC_KIMLIK_a3f] hakkında bilgi ver."

ai_response = call_openai(masked)  # your AI call

restored = session.restore(ai_response)
# Placeholders in AI response replaced back with originals
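
The mask/restore round trip can be sketched as a placeholder table kept on the client side. This is an illustration only, assuming entity offsets are already known; `MaskSession` is a hypothetical class, and the placeholder suffix format here is not necessarily the one the library generates.

```python
import secrets


class MaskSession:
    """Hypothetical sketch of a masking session (not the library's class)."""

    def __init__(self, text: str, entities: list[tuple[str, int, int]]):
        self.text = text
        self.entities = entities            # (type, start, end) triples
        self.mapping: dict[str, str] = {}   # placeholder -> original value

    def mask(self) -> str:
        """Replace every entity with a unique placeholder and remember it."""
        masked = self.text
        # Right to left, so offsets taken from the original text stay valid.
        for etype, start, end in sorted(self.entities, key=lambda e: e[1], reverse=True):
            placeholder = f"[{etype}_{secrets.token_hex(2)}]"
            self.mapping[placeholder] = self.text[start:end]
            masked = masked[:start] + placeholder + masked[end:]
        return masked

    def restore(self, response: str) -> str:
        """Put the original values back wherever a placeholder appears."""
        for placeholder, original in self.mapping.items():
            response = response.replace(placeholder, original)
        return response


text = "Ali Veli TC: 10000000146 hakkında bilgi ver."
session = MaskSession(text, [("KISI_ADI", 0, 8), ("TC_KIMLIK", 13, 24)])
masked = session.mask()
print(masked)                    # placeholders carry random suffixes
print(session.restore(masked))   # round-trips to the original text
```

Because the mapping never leaves the local process, the external service only ever sees the placeholders.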

Two-Way Proxy (mask → AI → leakage check → restore)

import openai

result = detector.two_way(
    prompt="Ali Veli'nin TC numarası 10000000146, özet çıkar.",
    call_fn=lambda masked: openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": masked}]
    ).choices[0].message.content,
    on_leak="warn",  # "raise" | "warn" | "ignore"
)

print(result.output)          # restored AI response
print(result.report.safe)     # True if no PII leaked
print(result.report.summary()) # leakage summary
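
The three `on_leak` values suggest a simple policy dispatch once leakage has been detected. The sketch below shows one plausible shape for that dispatch; `handle_leak` and `PiiLeakError` are hypothetical names, and the library's actual exception types and messages are not documented here.

```python
import warnings


class PiiLeakError(RuntimeError):
    """Hypothetical exception type used only in this sketch."""


def handle_leak(leaked: list[str], on_leak: str = "warn") -> bool:
    """Apply an on_leak policy and return True when the response is safe."""
    if not leaked:
        return True
    message = f"{len(leaked)} masked value(s) reappeared in the AI response"
    if on_leak == "raise":
        raise PiiLeakError(message)
    if on_leak == "warn":
        warnings.warn(message, stacklevel=2)
    elif on_leak != "ignore":
        raise ValueError(f"unknown on_leak policy: {on_leak!r}")
    return False


print(handle_leak([], on_leak="raise"))                 # → True
print(handle_leak(["10000000146"], on_leak="ignore"))   # → False
```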

Leakage Detection

from kvkk_pii import LeakageAnalyzer

analyzer = detector.leakage_analyzer()
report = analyzer.analyze(session, raw_ai_response)

report.safe            # bool
report.leaked          # entities that leaked through placeholders
report.new_pii         # PII in AI output not present in input (hallucination?)
report.risk_score      # 0.0–1.0
print(report.summary())
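
A basic form of leakage checking is to look for the original values in the raw, not yet restored AI response: if a masked value appears verbatim, the placeholder did not protect it. The sketch below assumes a placeholder-to-original mapping is available; `find_leaks` is a hypothetical helper and covers only this verbatim case, not the `new_pii` or `risk_score` fields.

```python
def find_leaks(mapping: dict[str, str], raw_response: str) -> list[str]:
    """Return placeholders whose original value appears verbatim
    in the raw (not yet restored) AI response."""
    return [ph for ph, original in mapping.items() if original in raw_response]


mapping = {"[KISI_ADI_x7k]": "Ali Veli", "[TC_KIMLIK_a3f]": "10000000146"}
safe_reply = "A record was found for [KISI_ADI_x7k]."
leaky_reply = "A record was found for Ali Veli."

print(find_leaks(mapping, safe_reply))    # → []
print(find_leaks(mapping, leaky_reply))   # → ['[KISI_ADI_x7k]']
```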

Compliance Report

Maps detected entities to KVKK articles with risk levels and recommendations.

report = detector.compliance_report(text)

print(report.summary())
# KVKK Uyum Raporu — 4 veri, genel risk: YÜKSEK
# KVKK Madde 6 (Özel Nitelikli Veri) tespit edildi!
#
#   [KRİTİK] SAGLIK_VERISI x 1
#     Dayanak: KVKK Madde 6 — Özel Nitelikli Kişisel Veri
#     Öneri  : Açık rıza zorunlu. Yetkili kurum olmadan işlenemez.
#   [YÜKSEK] TC_KIMLIK x 1
#     ...

report.has_madde6      # True if KVKK Article 6 data found
report.overall_risk    # "düşük" | "orta" | "yüksek" | "kritik"
report.to_dict()       # JSON-serializable
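
The per-type counts and the `has_madde6` flag can be pictured as a grouping step over detected entity types. The sketch below assumes the Madde 6 set is exactly the five Layer 3 types listed in this README; `summarize` is a hypothetical helper, and risk levels are left out because the library's scoring rule is not described here.

```python
from collections import Counter

# The five Layer 3 entity types this README lists under KVKK Madde 6.
MADDE6_TYPES = {
    "SAGLIK_VERISI",
    "DINI_INANC",
    "SIYASI_GORUS",
    "SENDIKA_UYELIGII",
    "BIYOMETRIK_VERI",
}


def summarize(entity_types: list[str]) -> dict:
    """Count entities per type and flag special-category data."""
    counts = Counter(entity_types)
    return {
        "total": sum(counts.values()),
        "counts": dict(counts),
        "has_madde6": any(t in MADDE6_TYPES for t in counts),
    }


print(summarize(["SAGLIK_VERISI", "TC_KIMLIK", "EMAIL", "EMAIL"]))
# → {'total': 4, 'counts': {'SAGLIK_VERISI': 1, 'TC_KIMLIK': 1, 'EMAIL': 2}, 'has_madde6': True}
```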

Async

import asyncio

from fastapi import FastAPI
from kvkk_pii import AsyncPiiDetector

app = FastAPI()
detector = AsyncPiiDetector(layers=["regex", "ner"])

# FastAPI example
@app.post("/scan")
async def scan(text: str):
    result = await detector.analyze(text)
    return [e.__dict__ for e in result.entities]

# Parallel processing (run inside an async function; texts is a list of strings)
results = await asyncio.gather(*[detector.analyze(t) for t in texts])

# Async two_way
result = await detector.two_way(prompt, async_call_fn)
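
For readers wondering how a blocking detector can be used from async code in general, a common pattern is to push the blocking call onto a worker thread. The sketch below shows that pattern with the standard library only; it makes no claim about how AsyncPiiDetector is implemented, and `analyze_sync` is a toy stand-in that finds 11-digit runs.

```python
import asyncio
import re


def analyze_sync(text: str) -> list[str]:
    """Toy stand-in for a blocking detector call: finds 11-digit runs only."""
    return re.findall(r"\b\d{11}\b", text)


async def analyze_async(text: str) -> list[str]:
    # asyncio.to_thread (Python 3.9+) runs the blocking call in a worker
    # thread so the event loop stays free for other requests.
    return await asyncio.to_thread(analyze_sync, text)


async def main() -> list[list[str]]:
    texts = ["TC: 10000000146", "no id here"]
    # gather preserves input order in its result list
    return await asyncio.gather(*(analyze_async(t) for t in texts))


results = asyncio.run(main())
print(results)
# → [['10000000146'], []]
```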

CLI

# Scan text
kvkk-pii scan "Ali Veli TC: 10000000146"

# Scan file
kvkk-pii scan belge.txt

# Pipe
cat belge.txt | kvkk-pii scan

# With NER layer
kvkk-pii scan --layer ner "Ahmet Yılmaz İstanbul'da"

# JSON output
kvkk-pii scan --format json "TC: 10000000146"

# Anonymize
kvkk-pii anonymize "Ali Veli TC: 10000000146"
# → "Ali Veli TC: [TC_KIMLIK]"

# Version
kvkk-pii version

Custom Recognizers

import re

from kvkk_pii import BaseRecognizer, PiiDetector, PiiEntity
from kvkk_pii.layers.regex_layer import DEFAULT_RECOGNIZERS

class SicilNoRecognizer(BaseRecognizer):
    entity_type = "SICIL_NO"

    def find(self, text: str) -> list[PiiEntity]:
        # Match IDs of the form "SCL-" followed by exactly six digits
        return [
            self._entity(m.group(), m.start(), m.end(), score=1.0)
            for m in re.finditer(r"\bSCL-\d{6}\b", text)
        ]

# Append the custom recognizer to the built-in Layer 1 set
detector = PiiDetector(recognizers=DEFAULT_RECOGNIZERS + [SicilNoRecognizer()])
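
Before wiring a pattern into a recognizer, it can help to check what the regex itself matches. The standalone sketch below exercises the same `\bSCL-\d{6}\b` pattern with only the standard library; `find_sicil` is a hypothetical helper, not part of kvkk-pii.

```python
import re

SICIL_PATTERN = re.compile(r"\bSCL-\d{6}\b")


def find_sicil(text: str) -> list[tuple[str, int, int]]:
    """Return (match, start, end) for every SCL-###### occurrence."""
    return [(m.group(), m.start(), m.end()) for m in SICIL_PATTERN.finditer(text)]


print(find_sicil("Employee SCL-004217 is on leave; SCL-12 is too short."))
# → [('SCL-004217', 9, 19)]
print(find_sicil("SCL-1234567 has seven digits"))
# → []
```

The trailing `\b` is what rejects the seven-digit case: there is no word boundary between two digits.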

Configuration

Fine-tune recognizer strictness via config dataclasses:

from kvkk_pii import PiiDetector
from kvkk_pii.config import NerConfig, GlinerConfig, TcKimlikConfig
from kvkk_pii.recognizers.tc_kimlik import TcKimlikRecognizer
from kvkk_pii.layers.regex_layer import DEFAULT_RECOGNIZERS

detector = PiiDetector(
    layers=["regex", "ner", "gliner"],
    recognizers=DEFAULT_RECOGNIZERS + [
        TcKimlikRecognizer(TcKimlikConfig(allow_spaced=True, require_checksum=True))
    ],
    download_policy="auto",   # "confirm" (default) | "auto" | "never"
    ner_config=NerConfig(
        min_score=0.85,       # higher = fewer false positives
        chunk_size=400,       # chars per chunk for long texts
    ),
    gliner_config=GlinerConfig(
        threshold=0.5,
    ),
)
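
The `chunk_size` and `min_score` settings can be read as two small steps: split long text into character windows, then keep only confident hits and shift their offsets back to the full text. The sketch below assumes fixed, non-overlapping windows, which may not match the library's chunking; `chunk_text` and `shift_and_filter` are hypothetical helpers.

```python
def chunk_text(text: str, chunk_size: int = 400) -> list[tuple[int, str]]:
    """Split text into fixed-size character windows, keeping each window's offset."""
    return [(i, text[i:i + chunk_size]) for i in range(0, len(text), chunk_size)]


def shift_and_filter(
    offset: int,
    hits: list[tuple[str, int, int, float]],
    min_score: float = 0.85,
) -> list[tuple[str, int, int, float]]:
    """Drop hits below min_score and convert chunk-relative offsets
    into offsets relative to the full text."""
    return [(t, s + offset, e + offset, sc) for t, s, e, sc in hits if sc >= min_score]


chunks = chunk_text("x" * 950)
print([(offset, len(chunk)) for offset, chunk in chunks])
# → [(0, 400), (400, 400), (800, 150)]

# Made-up hits from the second chunk (offset 400)
print(shift_and_filter(400, [("KONUM", 10, 18, 0.91), ("KURUM", 30, 42, 0.60)]))
# → [('KONUM', 410, 418, 0.91)]
```

Fixed windows can cut an entity in half at a boundary, so practical chunkers often overlap windows or split on sentence boundaries instead.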

Detected Entity Types

Layer 1 — Regex

Entity        Description                       Validation
TC_KIMLIK     Turkish national ID (11 digits)   Checksum
VKN           Tax ID (10 digits)                Checksum
IBAN_TR       IBAN (all country codes)          Mod97
KREDI_KARTI   Credit card number                Luhn
TELEFON_TR    Turkish phone numbers
EMAIL         Email address
IP_ADRESI     IPv4 address
PLAKA_TR      Turkish license plate
PASAPORT_TR   Turkish passport
SGK_NO        Social security number
ADRES         Street address
TARIH         Date
KISI_ADI      Person name (title-based)
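
The Validation column refers to well-known public algorithms. The sketch below shows standard versions of the TC Kimlik check digits, the Luhn check, and the IBAN mod-97 check; these are generic implementations written for illustration, not the library's own code, and the function names are hypothetical.

```python
def is_valid_tc_kimlik(value: str) -> bool:
    """Turkish national ID: 11 digits, first digit non-zero, two check digits."""
    if len(value) != 11 or not (value.isascii() and value.isdigit()) or value[0] == "0":
        return False
    d = [int(c) for c in value]
    # 10th digit: (sum of digits 1,3,5,7,9 * 7 - sum of digits 2,4,6,8) mod 10
    tenth = (sum(d[0:9:2]) * 7 - sum(d[1:8:2])) % 10
    # 11th digit: sum of the first ten digits mod 10
    eleventh = sum(d[:10]) % 10
    return d[9] == tenth and d[10] == eleventh


def is_valid_luhn(number: str) -> bool:
    """Luhn check as used for payment card numbers."""
    compact = number.replace(" ", "").replace("-", "")
    if len(compact) < 12 or not (compact.isascii() and compact.isdigit()):
        return False
    total = 0
    for i, ch in enumerate(reversed(compact)):
        digit = int(ch)
        if i % 2 == 1:          # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def is_valid_iban(iban: str) -> bool:
    """ISO 13616 mod-97 check: move the first four characters to the end,
    map letters to numbers (A=10 ... Z=35), and test remainder == 1."""
    compact = iban.replace(" ", "").upper()
    if len(compact) < 15 or not (compact.isascii() and compact.isalnum()):
        return False
    rearranged = compact[4:] + compact[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


print(is_valid_tc_kimlik("10000000146"))                       # → True
print(is_valid_luhn("4111 1111 1111 1111"))                    # → True
print(is_valid_iban("TR33 0006 1005 1978 6457 8413 26"))       # → True
```

Checksum validation is what lets a regex layer report high confidence: a random 11-digit number rarely satisfies both TC Kimlik check digits.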

Layer 2 — NER (akdeniz27/xlm-roberta-base-turkish-ner)

Entity     Description
KISI_ADI   Person name
KONUM      Location
KURUM      Organization

Layer 3 — GLiNER (urchade/gliner_multi-v2.1, KVKK Madde 6)

Entity             Special category (KVKK Madde 6)
SAGLIK_VERISI      Health data
DINI_INANC         Religious belief
SIYASI_GORUS       Political opinion
SENDIKA_UYELIGII   Trade union membership
BIYOMETRIK_VERI    Biometric / genetic data

Requirements

  • Python 3.10+
  • pip install kvkk-pii: no dependencies (regex only)
  • pip install kvkk-pii[ner]: transformers, torch, huggingface-hub
  • pip install kvkk-pii[full]: above + gliner
  • pip install kvkk-pii[server]: above + fastapi, uvicorn

License

MIT
