Zero-dependency PII + quality + noise audit for LLM datasets (TR/EU/US)
flexorch-audit
Zero-dependency PII detection, quality grading, and noise audit for LLM datasets — in a single function call.
Why
Before feeding documents into an LLM pipeline you need to answer three questions:
- Does this text contain personal data? Sending PII to a language model is a compliance risk.
- Is the text quality high enough? Short, noisy, or duplicate records hurt fine-tuning and RAG retrieval.
- How bad is the noise? Garbled encodings and control characters degrade model output silently.
Most tools that answer these questions require heavy NLP frameworks, model weights, or cloud APIs. flexorch-audit answers all three with one call — using only regex and Python's standard library. No model weights, no network calls, no external packages.
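To make the regex-only approach concrete, here is a minimal sketch of pattern-based detection that emits findings in the same shape as the `result["pii"]` entries shown in the Quick start. The names `find_pii`, `EMAIL_RE`, and `IPV4_RE` are hypothetical, and the patterns are deliberately simplified; this is an illustration of the idea, not flexorch-audit's actual implementation.

```python
import ipaddress
import re

# Simplified, illustrative patterns (assumptions, not the library's own).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def _is_ipv4(candidate: str) -> bool:
    """Let the stdlib reject out-of-range octets such as 999.1.1.1."""
    try:
        ipaddress.IPv4Address(candidate)
    except ValueError:
        return False
    return True


def find_pii(text: str) -> list[dict]:
    """Return findings as dicts with type, value, start, and end keys."""
    findings = []
    for match in EMAIL_RE.finditer(text):
        findings.append({"type": "email", "value": match.group(),
                         "start": match.start(), "end": match.end()})
    for match in IPV4_RE.finditer(text):
        # The regex only finds candidates; validation happens separately.
        if _is_ipv4(match.group()):
            findings.append({"type": "ip", "value": match.group(),
                             "start": match.start(), "end": match.end()})
    return sorted(findings, key=lambda f: f["start"])
```

Splitting the work into a cheap regex candidate search followed by a validation step keeps false positives down without any external dependency.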
Features
- Quality grade — A/B/C/D composite score: is this text LLM-ready at a glance?
- PII detection — email, phone (TR mobile + E.164), credit card (Luhn), IPv4, IPv6, TCKN, VKN, IBAN (mod-97 validated), SSN, label-prefixed names
- Batch audit — `audit_batch()` aggregates duplicate ratio and PII counts across an entire dataset in one call
- Noise metrics — garbage character ratio, encoding health check
- Masking — four strategies: redact, replace (synthetic), token, hash
- Zero runtime dependencies — pure Python stdlib, Python 3.10+
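The README does not spell out how the noise metrics are computed. As a rough sketch, one plausible definition counts control characters (other than common whitespace) and the Unicode replacement character U+FFFD as garbage, and treats any U+FFFD as a sign of a failed decode. Both definitions, and the names `garbage_ratio_sketch` and `encoding_ok_sketch`, are assumptions made for illustration.

```python
import unicodedata

# Control characters that are normal in text and should not count as noise.
ALLOWED_CONTROLS = {"\n", "\r", "\t"}
REPLACEMENT_CHAR = "\ufffd"  # emitted by lossy decoders for undecodable bytes


def garbage_ratio_sketch(text: str) -> float:
    """Fraction of characters that look like noise (assumed definition)."""
    if not text:
        return 0.0
    bad = sum(
        1
        for ch in text
        if ch == REPLACEMENT_CHAR
        or (unicodedata.category(ch) == "Cc" and ch not in ALLOWED_CONTROLS)
    )
    return bad / len(text)


def encoding_ok_sketch(text: str) -> bool:
    """Assume the encoding is healthy when no replacement character appears."""
    return REPLACEMENT_CHAR not in text
```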
Install
pip install flexorch-audit
Quick start
from flexorch_audit import audit, mask
text = open("contract.txt").read() # extract from PDF/DOCX first
result = audit(text, locale="tr")
result.quality_grade # "A"
result.quality_score # 0.91 (0.0–1.0 composite)
result.pii_summary # [{"type": "national_id_tr", "count": 3}, {"type": "email", "count": 1}]
# Full findings and raw metrics — dict access also works:
result["pii"] # [{"type": "email", "value": "...", "start": 8, "end": 23}]
result["quality"] # {"completeness": 1.0, "avg_length": 342, "duplicate_ratio": None}
result["noise"] # {"garbage_ratio": 0.0, "encoding_ok": True}
clean = mask(text, result["pii"], strategy="redact")
# "Contact: [REDACTED_EMAIL]"
Batch audit
Use audit_batch() to audit an entire dataset and get aggregate metrics including duplicate_ratio:
from flexorch_audit import audit_batch
texts = [record["text"] for record in dataset]
batch = audit_batch(texts, locale="tr")
batch["duplicate_ratio"] # 0.12 — fraction of exact-duplicate records
batch["avg_quality_score"] # 0.78
batch["pii_summary"] # [{"type": "email", "count": 47}, ...]
batch["results"] # list of AuditResult, one per text
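For intuition about `duplicate_ratio`, here is a minimal sketch of one way to compute a fraction of exact-duplicate records with the standard library. It assumes every copy after the first occurrence counts as a duplicate and that no normalization (case, whitespace) is applied; the library may define it differently. The function name `duplicate_ratio_sketch` is hypothetical.

```python
from collections import Counter


def duplicate_ratio_sketch(texts: list[str]) -> float:
    """Share of records that repeat an earlier record exactly (assumed definition)."""
    if not texts:
        return 0.0
    counts = Counter(texts)
    # For a text seen n times, n - 1 of those records are redundant copies.
    repeats = sum(n - 1 for n in counts.values())
    return repeats / len(texts)
```

Under this definition, `["a", "b", "a", "a"]` has two redundant records out of four, giving a ratio of 0.5.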
Locale support
| `locale` | Active detectors |
|---|---|
| `"tr"` (default) | email, iban, credit_card, ip, ip_v6 + TCKN, VKN, phone_tr, name |
| `"us"` | email, iban, credit_card, ip, ip_v6 + SSN, E.164 phone |
| `"eu"` | email, iban, credit_card, ip, ip_v6 + E.164 phone |
| `"all"` | All of the above (phone_tr takes precedence over generic phone) |
PII types
| Type | Description | Locale |
|---|---|---|
| `email` | RFC-5321 address | all |
| `iban` | ISO 13616 IBAN, mod-97 checksum validated | all |
| `credit_card` | 16-digit groups, Luhn-validated | all |
| `ip` | IPv4 address | all |
| `ip_v6` | IPv6 address (full, compressed, loopback) | all |
| `phone_tr` | Turkish mobile (+90/0 prefix + 10 digits) | tr |
| `national_id_tr` | TCKN: 11-digit modular arithmetic checksum | tr |
| `tax_id_tr` | VKN: 10-digit Luhn-variant checksum | tr |
| `name` | Label-prefixed name (e.g. "Adı: Ali Yıldız", "Full Name: Jane Doe") | tr |
| `phone` | E.164 international phone | us, eu |
| `ssn` | US Social Security Number (###-##-####) | us |
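Several of the types above are checksum-validated, which is what lets a regex-only detector reject look-alike digit strings. The sketch below shows the standard public algorithms for the Luhn check, the IBAN mod-97 check, and the TCKN check digits. The function names (`luhn_ok`, `iban_ok`, `tckn_ok`) are hypothetical and the code is not taken from flexorch-audit; the IBAN check is also simplified in that it does not verify country-specific lengths.

```python
def luhn_ok(number: str) -> bool:
    """Luhn check: double every second digit from the right, sum, test mod 10."""
    cleaned = number.replace(" ", "").replace("-", "")
    if not (cleaned.isascii() and cleaned.isdigit()):
        return False
    total = 0
    for index, ch in enumerate(reversed(cleaned)):
        digit = int(ch)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


def iban_ok(iban: str) -> bool:
    """ISO 13616 mod-97 check: move the first four chars to the end, map A-Z to 10-35."""
    compact = iban.replace(" ", "").upper()
    if not (15 <= len(compact) <= 34) or not (compact.isascii() and compact.isalnum()):
        return False
    rearranged = compact[4:] + compact[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


def tckn_ok(value: str) -> bool:
    """TCKN check: the 10th and 11th digits are derived from the preceding digits."""
    if len(value) != 11 or not (value.isascii() and value.isdigit()) or value[0] == "0":
        return False
    d = [int(ch) for ch in value]
    # 10th digit: 7 x (sum of 1st, 3rd, 5th, 7th, 9th) minus (sum of 2nd, 4th, 6th, 8th), mod 10.
    tenth = (sum(d[0:9:2]) * 7 - sum(d[1:8:2])) % 10
    # 11th digit: sum of the first ten digits, mod 10.
    eleventh = sum(d[:10]) % 10
    return d[9] == tenth and d[10] == eleventh
```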
Masking strategies
| Strategy | Example output |
|---|---|
| `redact` (default) | `[REDACTED_EMAIL]` |
| `replace` | `user@example.com` (static synthetic) |
| `token` | `<PII_EMAIL_1>` (unique per type per call) |
| `hash` | `[3d4f9a1b2c8e7f0a]` (SHA-256 first 16 hex chars) |
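As an illustration of how the four strategies differ, the following sketch applies each one to findings shaped like the `result["pii"]` entries. It is a stand-in written for this example, not the library's `mask()`: the name `mask_sketch`, the `SYNTHETIC` table, the fallback for types without a synthetic value, and the assumption that findings do not overlap are all hypothetical.

```python
import hashlib
from collections import defaultdict

# Static stand-ins for the "replace" strategy. Only the email value appears in
# the README table; any other entry would be an assumption.
SYNTHETIC = {"email": "user@example.com"}


def mask_sketch(text: str, findings: list[dict], strategy: str = "redact") -> str:
    """Rebuild the text left to right, swapping each finding for a substitute."""
    counters = defaultdict(int)  # per-type counters for the "token" strategy
    pieces = []
    cursor = 0
    for finding in sorted(findings, key=lambda f: f["start"]):
        kind = finding["type"]
        label = kind.upper()
        original = text[finding["start"]:finding["end"]]
        if strategy == "redact":
            substitute = f"[REDACTED_{label}]"
        elif strategy == "replace":
            # Fall back to redaction when no synthetic value is known (assumption).
            substitute = SYNTHETIC.get(kind, f"[REDACTED_{label}]")
        elif strategy == "token":
            counters[kind] += 1
            substitute = f"<PII_{label}_{counters[kind]}>"
        elif strategy == "hash":
            digest = hashlib.sha256(original.encode("utf-8")).hexdigest()
            substitute = f"[{digest[:16]}]"
        else:
            raise ValueError(f"unknown strategy: {strategy!r}")
        pieces.append(text[cursor:finding["start"]])
        pieces.append(substitute)
        cursor = finding["end"]
    pieces.append(text[cursor:])
    return "".join(pieces)
```

Of the four, `token` and `hash` keep distinct values distinguishable after masking, while `redact` and `replace` collapse all values of a type into one placeholder.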
Quality grade
quality_grade (A–D) and quality_score (0.0–1.0) are composite signals:
| Grade | Score | Signal |
|---|---|---|
| A | ≥ 0.85 | Ready for LLM training or RAG |
| B | ≥ 0.65 | Usable with minor cleanup |
| C | ≥ 0.40 | Review before use |
| D | < 0.40 | Not suitable — empty, too short, or high noise |
Score formula: `completeness × (0.4 × noise_score + 0.4 × length_score + 0.2)`, where:
- `length_score = min(char_count / 500, 1.0)`
- `noise_score = max(0, 1 − garbage_ratio × 10)`
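The score formula and grade thresholds above translate directly into a few lines of Python. This is a worked sketch of the published formula only; the function names `quality_score_sketch` and `quality_grade_sketch` are hypothetical, and how the library derives `completeness`, `char_count`, and `garbage_ratio` from raw text is not covered here.

```python
def quality_score_sketch(completeness: float, char_count: int, garbage_ratio: float) -> float:
    """Composite score in the range 0.0 to 1.0, following the README formula."""
    length_score = min(char_count / 500, 1.0)
    noise_score = max(0.0, 1 - garbage_ratio * 10)
    return completeness * (0.4 * noise_score + 0.4 * length_score + 0.2)


def quality_grade_sketch(score: float) -> str:
    """Map a score to the A-D grade using the thresholds from the grade table."""
    if score >= 0.85:
        return "A"
    if score >= 0.65:
        return "B"
    if score >= 0.40:
        return "C"
    return "D"
```

For example, a complete, noise-free text of 250 characters gets `length_score = 0.5`, so its score is 0.4 + 0.2 + 0.2 = 0.8, which falls in grade B. The same text at 500 characters or more reaches 1.0 and grade A.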
Limitations (v0.4)
- Free-standing name detection (without a label prefix) requires NLP/NER — not included.
- The `replace` masking strategy uses static synthetic values; locale-aware realistic synthesis is not yet implemented.
Also available for JavaScript / TypeScript
npm install @flexorch/audit
Contributing
See CONTRIBUTING.md.
License
MIT