Zero-dependency PII + quality + noise audit for LLM datasets (TR/EU/US)

flexorch-audit

Zero-dependency PII detection, quality grading, and noise audit for LLM datasets — in a single function call.

Why

Before feeding documents into an LLM pipeline you need to answer three questions:

  1. Does this text contain personal data? Sending PII to a language model is a compliance risk.
  2. Is the text quality high enough? Short, noisy, or duplicate records hurt fine-tuning and RAG retrieval.
  3. How bad is the noise? Garbled encodings and control characters degrade model output silently.

Most tools that answer these questions require heavy NLP frameworks, model weights, or cloud APIs. flexorch-audit answers all three with one call — using only regex and Python's standard library. No model weights, no network calls, no external packages.

Features

  • Quality grade — A/B/C/D composite score: is this text LLM-ready at a glance?
  • PII detection — email, phone (TR mobile + E.164), credit card (Luhn), IPv4, IPv6, TCKN, VKN, IBAN (mod-97 validated), SSN, label-prefixed names
  • Batch audit: audit_batch() aggregates duplicate ratio and PII counts across an entire dataset in one call
  • Noise metrics — garbage character ratio, encoding health check
  • Masking — four strategies: redact, replace (synthetic), token, hash
  • Zero runtime dependencies — pure Python stdlib, Python 3.10+
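
The noise metrics listed above can be approximated with the standard library alone. The sketch below is not flexorch-audit's implementation: the function names are hypothetical, and the definition of a "garbage" character (the U+FFFD replacement character plus control characters other than newline, carriage return, and tab) is an assumption made for illustration.

```python
import unicodedata

# Control characters that are normal in text and should not count as noise.
ALLOWED_CONTROLS = {"\n", "\r", "\t"}


def garbage_ratio(text: str) -> float:
    """Fraction of characters that look like noise (assumed definition)."""
    if not text:
        return 0.0
    garbage = 0
    for ch in text:
        if ch == "\ufffd":
            # U+FFFD usually marks bytes that failed to decode.
            garbage += 1
        elif unicodedata.category(ch) == "Cc" and ch not in ALLOWED_CONTROLS:
            # Other control characters (NUL, BEL, ...) rarely belong in prose.
            garbage += 1
    return garbage / len(text)


def encoding_ok(text: str) -> bool:
    """True when the text contains no U+FFFD replacement characters."""
    return "\ufffd" not in text
```

For example, under these assumptions `garbage_ratio("abc\x00")` is 0.25, because one of the four characters is a NUL byte.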

Install

pip install flexorch-audit

Quick start

from flexorch_audit import audit, mask

# audit() expects plain text: extract it from PDF/DOCX files first
with open("contract.txt", encoding="utf-8") as f:
    text = f.read()
result = audit(text, locale="tr")

result.quality_grade   # "A"
result.quality_score   # 0.91  (0.0–1.0 composite)
result.pii_summary     # [{"type": "national_id_tr", "count": 3}, {"type": "email", "count": 1}]

# Full findings and raw metrics — dict access also works:
result["pii"]          # [{"type": "email", "value": "...", "start": 8, "end": 23}]
result["quality"]      # {"completeness": 1.0, "avg_length": 342, "duplicate_ratio": None}
result["noise"]        # {"garbage_ratio": 0.0, "encoding_ok": True}

clean = mask(text, result["pii"], strategy="redact")
# "Contact: [REDACTED_EMAIL]"

Batch audit

Use audit_batch() to audit an entire dataset and get aggregate metrics including duplicate_ratio:

from flexorch_audit import audit_batch

# dataset: your own iterable of records, each a dict with a "text" field
texts = [record["text"] for record in dataset]
batch = audit_batch(texts, locale="tr")

batch["duplicate_ratio"]    # 0.12 — fraction of exact-duplicate records
batch["avg_quality_score"]  # 0.78
batch["pii_summary"]        # [{"type": "email", "count": 47}, ...]
batch["results"]            # list of AuditResult, one per text
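
To make the duplicate_ratio figure concrete, here is a minimal sketch of one way to compute a "fraction of exact-duplicate records". It is an assumption about the definition, not the library's code: every copy of a text after its first occurrence counts as a duplicate, and the helper name is hypothetical.

```python
from collections import Counter


def duplicate_ratio(texts: list[str]) -> float:
    """Share of records that repeat an identical record (assumed definition)."""
    if not texts:
        return 0.0
    counts = Counter(texts)
    # For each distinct text, every occurrence beyond the first is a duplicate.
    duplicates = sum(n - 1 for n in counts.values())
    return duplicates / len(texts)
```

With `["a", "b", "a", "a"]` this gives 0.5: two of the four records repeat an earlier one. A different convention (counting every member of a duplicated group) would give 0.75, so check which one your pipeline expects.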

Locale support

| locale | Active detectors |
|---|---|
| "tr" (default) | email, iban, credit_card, ip, ip_v6 + TCKN, VKN, phone_tr, name |
| "us" | email, iban, credit_card, ip, ip_v6 + SSN, E.164 phone |
| "eu" | email, iban, credit_card, ip, ip_v6 + E.164 phone |
| "all" | All of the above (phone_tr takes precedence over generic phone) |

PII types

| Type | Description | Locale |
|---|---|---|
| email | RFC-5321 address | all |
| iban | ISO 13616 IBAN, mod-97 checksum validated | all |
| credit_card | 16-digit groups, Luhn-validated | all |
| ip | IPv4 address | all |
| ip_v6 | IPv6 address (full, compressed, loopback) | all |
| phone_tr | Turkish mobile (+90/0 prefix + 10 digits) | tr |
| national_id_tr | TCKN, 11-digit modular arithmetic checksum | tr |
| tax_id_tr | VKN, 10-digit Luhn-variant checksum | tr |
| name | Label-prefixed name (e.g. "Adı: Ali Yıldız", "Full Name: Jane Doe") | tr |
| phone | E.164 international phone | us, eu |
| ssn | US Social Security Number (###-##-####) | us |
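
Several of the types above are checksum-validated, which is what lets a regex-only tool reject most random digit runs. The sketch below shows the publicly documented Luhn, IBAN mod-97, and TCKN checks in plain Python. The function names are hypothetical and this is not the package's source; it only illustrates the kind of validation the table describes.

```python
def luhn_valid(number: str) -> bool:
    """Luhn check as used for payment card numbers."""
    compact = number.replace(" ", "").replace("-", "")
    if len(compact) < 2 or not (compact.isascii() and compact.isdigit()):
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, ch in enumerate(reversed(compact)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def iban_valid(iban: str) -> bool:
    """ISO 13616 mod-97 check (country-specific lengths are not verified)."""
    compact = iban.replace(" ", "").upper()
    if not (15 <= len(compact) <= 34 and compact.isascii() and compact.isalnum()):
        return False
    # Move the country code and check digits to the end, then map A..Z to 10..35.
    rearranged = compact[4:] + compact[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


def tckn_valid(tckn: str) -> bool:
    """Turkish national ID check: 11 digits, two trailing check digits."""
    if len(tckn) != 11 or not (tckn.isascii() and tckn.isdigit()) or tckn[0] == "0":
        return False
    d = [int(ch) for ch in tckn]
    odd_sum = sum(d[0:9:2])   # 1st, 3rd, 5th, 7th, 9th digits
    even_sum = sum(d[1:8:2])  # 2nd, 4th, 6th, 8th digits
    check10 = (odd_sum * 7 - even_sum) % 10
    check11 = sum(d[:10]) % 10
    return d[9] == check10 and d[10] == check11
```

The well-known test values `4111 1111 1111 1111` (card), `GB82 WEST 1234 5698 7654 32` (IBAN), and `10000000146` (TCKN) all pass these checks.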

Masking strategies

| Strategy | Example output |
|---|---|
| redact (default) | [REDACTED_EMAIL] |
| replace | user@example.com (static synthetic) |
| token | <PII_EMAIL_1> (unique per type per call) |
| hash | [3d4f9a1b2c8e7f0a] (SHA-256 first 16 hex chars) |
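
The four strategies can be pictured with a small standalone function. This is a sketch under stated assumptions, not the library's mask(): `mask_sketch` and `SYNTHETIC` are hypothetical names, findings are assumed to be dicts with the `type`, `value`, `start`, and `end` keys shown in the quick start, and tokens are assumed to be numbered per type in order of appearance.

```python
import hashlib

# Static synthetic stand-ins (illustrative values, only email is defined here).
SYNTHETIC = {"email": "user@example.com"}


def mask_sketch(text: str, findings: list[dict], strategy: str = "redact") -> str:
    """Replace each finding's span according to the chosen strategy."""
    counters: dict[str, int] = {}
    replacements = []
    for f in sorted(findings, key=lambda item: item["start"]):
        label = f["type"].upper()
        if strategy == "redact":
            new = f"[REDACTED_{label}]"
        elif strategy == "replace":
            # Fall back to redaction when no synthetic value is defined.
            new = SYNTHETIC.get(f["type"], f"[REDACTED_{label}]")
        elif strategy == "token":
            counters[label] = counters.get(label, 0) + 1
            new = f"<PII_{label}_{counters[label]}>"
        elif strategy == "hash":
            digest = hashlib.sha256(f["value"].encode("utf-8")).hexdigest()
            new = f"[{digest[:16]}]"
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        replacements.append((f["start"], f["end"], new))
    # Apply from right to left so earlier offsets stay valid.
    for start, end, new in reversed(replacements):
        text = text[:start] + new + text[end:]
    return text
```

Hashing keeps equal values linkable across records without exposing them, while tokens keep the text readable; which trade-off fits depends on whether downstream steps need to join on the masked value.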

Quality grade

quality_grade (A–D) and quality_score (0.0–1.0) are composite signals:

| Grade | Score | Signal |
|---|---|---|
| A | ≥ 0.85 | Ready for LLM training or RAG |
| B | ≥ 0.65 | Usable with minor cleanup |
| C | ≥ 0.40 | Review before use |
| D | < 0.40 | Not suitable: empty, too short, or high noise |

Score formula:

  quality_score = completeness × (0.4 × noise_score + 0.4 × length_score + 0.2)
  length_score  = min(char_count / 500, 1.0)
  noise_score   = max(0, 1 − garbage_ratio × 10)
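
A worked example helps when reading the formula. The sketch below transcribes it and the grade thresholds into Python; the function names are hypothetical and the inputs are supplied by hand rather than taken from a real audit() result.

```python
def quality_score(completeness: float, char_count: int, garbage_ratio: float) -> float:
    """Composite score following the formula stated in this section."""
    length_score = min(char_count / 500, 1.0)
    noise_score = max(0.0, 1.0 - garbage_ratio * 10)
    return completeness * (0.4 * noise_score + 0.4 * length_score + 0.2)


def quality_grade(score: float) -> str:
    """Map a 0.0-1.0 score to the A-D grade thresholds."""
    if score >= 0.85:
        return "A"
    if score >= 0.65:
        return "B"
    if score >= 0.40:
        return "C"
    return "D"


# 250 characters, 2% garbage, fully complete record:
# length_score = 0.5, noise_score = 0.8, score = 0.32 + 0.20 + 0.20 = 0.72
score = quality_score(completeness=1.0, char_count=250, garbage_ratio=0.02)
print(round(score, 2), quality_grade(score))
```

Because noise_score falls to zero once garbage_ratio reaches 0.1, a record with 10% or more garbage characters can score at most 0.6 and so cannot grade above C, however long it is.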

Limitations (v0.4)

  • Free-standing name detection (without a label prefix) requires NLP/NER — not included.
  • replace masking strategy uses static synthetic values; locale-aware realistic synthesis is not yet implemented.

Also available for JavaScript / TypeScript

npm install @flexorch/audit

Contributing

See CONTRIBUTING.md.

License

MIT
