Zero-dependency PII + quality + noise audit for LLM datasets (TR/EU/US)
flexorch-audit
Zero-dependency PII detection, quality grading, and noise audit for LLM datasets — in a single function call.
Why
Before feeding documents into an LLM pipeline you need to answer three questions:
- Does this text contain personal data? Sending PII to a language model is a compliance risk.
- Is the text quality high enough? Short, noisy, or duplicate records hurt fine-tuning and RAG retrieval.
- How bad is the noise? Garbled encodings and symbol clutter degrade model output silently.
Most tools that answer these questions require heavy NLP frameworks, model weights, or cloud APIs. flexorch-audit answers all three with one call — using only regex and Python's standard library. No model weights, no network calls, no external packages.
Features
- Quality grade: A/B/C/D composite score that shows at a glance whether a text is LLM-ready
- Noise ratio: line-level symbol clutter detection (`noise_ratio`); values above 0.20 indicate likely extraction artifacts
- PII detection: 30+ types across 8 countries (TR/DE/FR/IT/NL/ES/UK/US) plus universal types; all regex-based, with checksum validation
- Batch audit: `audit_batch()` aggregates duplicate ratio and PII counts across an entire dataset in one call
- Masking: four strategies (redact, replace (synthetic), token, hash)
- Zero runtime dependencies: pure Python stdlib, Python 3.10+
Install
pip install flexorch-audit
Quick start
from flexorch_audit import audit, mask
text = open("contract.txt").read() # extract from PDF/DOCX first
result = audit(text) # "und" by default — all detectors active
# result = audit(text, locale="tr") # restrict to TR detectors (universal detectors stay active)
result.quality_grade # "B"
result.quality_score # 0.73 (0.0–1.0 composite)
result.noise_ratio # 0.04 (fraction of blank/garbage lines; >0.20 = low quality)
result.detected_language # "und" (locale you passed in; caller controls language)
result.pii_summary # [{"type": "email", "count": 2}, {"type": "national_id_tr", "count": 1}]
# Full findings and raw metrics — dict access also works:
result["pii"] # [{"type": "email", "value": "ali@example.com", "start": 8, "end": 23}]
result["quality"] # {"completeness": 1.0, "avg_length": 342, "duplicate_ratio": None}
result["noise"] # {"garbage_ratio": 0.0, "encoding_ok": True}
clean = mask(text, result["pii"], strategy="redact")
# "Contact: [REDACTED_EMAIL]"
Batch audit
from flexorch_audit import audit_batch
texts = [record["text"] for record in dataset]
batch = audit_batch(texts) # locale="und" by default
batch["duplicate_ratio"] # 0.12 — fraction of exact-duplicate records
batch["avg_quality_score"] # 0.78
batch["pii_summary"] # [{"type": "email", "count": 47}, ...]
batch["results"] # list of AuditResult, one per text
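The `duplicate_ratio` field is described only as the fraction of exact-duplicate records. One plausible reading, sketched below, counts every repeat after the first occurrence. The helper name `exact_duplicate_ratio` is hypothetical, and the library may compute the value differently (for example, after normalizing whitespace).

```python
from collections import Counter


def exact_duplicate_ratio(texts: list[str]) -> float:
    """Fraction of records that repeat an earlier record verbatim (assumed definition)."""
    if not texts:
        return 0.0
    counts = Counter(texts)
    # Every occurrence beyond the first one of a given text counts as a duplicate.
    repeats = sum(n - 1 for n in counts.values())
    return repeats / len(texts)


# "a" appears three times, so two of the five records are repeats.
print(exact_duplicate_ratio(["a", "b", "a", "c", "a"]))
```

Under this reading, a ratio of 0.12 would mean that roughly one record in eight could be dropped without losing any distinct text.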
Country coverage
| locale | Detectors activated |
|---|---|
| "und" (default) | All locales combined; use when document language is unknown |
| "all" | Alias for "und" |
| "tr" | TCKN · VKN · phone_tr · name · IBAN_TR · company_name_tr · MERSIS · postal_code_tr · province_tr |
| "de" | Steueridentifikationsnummer · Sozialversicherungsnummer |
| "fr" | SIREN · SIRET · INSEE/NIR |
| "it" | Codice Fiscale · Partita IVA |
| "nl" | BSN · KvK |
| "es" | DNI/NIE · CIF |
| "uk" | NI number · UTR |
| "us" | SSN · EIN · ITIN |
| "eu" | E.164 phone · IBAN (EU+GB+CH+NO) · company name |
Universal detectors (always active regardless of locale): email · iban · credit_card · ip · ip_v6
Language detection: `flexorch-audit` is zero-dependency, so no language detection library is included. Pass the correct `locale` yourself, or use "und" (default) to activate all detectors.
PII types
Universal
| Type | Description |
|---|---|
| `email` | RFC-5321 email address |
| `iban` | ISO 13616 IBAN, mod-97 validated; suppressed when `iban_tr` or `iban_intl` fires on the same span |
| `credit_card` | 16-digit groups, Luhn-validated |
| `ip` | IPv4 address |
| `ip_v6` | IPv6: full, compressed `::`, and loopback forms |
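The two checksums named in this table are standard and easy to reproduce. The sketch below shows the Luhn check and the ISO 13616 mod-97 check in plain Python. The function names `luhn_ok` and `iban_mod97_ok` are illustrative only and are not part of the `flexorch_audit` API; the library's own validators may differ in details such as length tables.

```python
def luhn_ok(number: str) -> bool:
    """Luhn check: double every second digit from the right, then test the sum mod 10."""
    digits = [int(c) for c in number if c.isascii() and c.isdigit()]  # skips spaces/dashes
    if len(digits) < 2:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the doubled value
        total += d
    return total % 10 == 0


def iban_mod97_ok(iban: str) -> bool:
    """Mod-97 check: move the first 4 chars to the end, map letters to 10..35, test mod 97."""
    s = "".join(iban.split()).upper()
    if len(s) < 5 or not (s.isascii() and s.isalnum()):
        return False
    rearranged = s[4:] + s[:4]
    # int(c, 36) maps "0".."9" to 0..9 and "A".."Z" to 10..35.
    numeric = "".join(str(int(c, 36)) for c in rearranged)
    return int(numeric) % 97 == 1


print(luhn_ok("4111 1111 1111 1111"))
print(iban_mod97_ok("GB82 WEST 1234 5698 7654 32"))
```

Checksum validation is what keeps a regex-only detector from flagging every 16-digit order number as a credit card.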
Turkey (locale="tr")
| Type | Description |
|---|---|
| `national_id_tr` | TCKN: 11-digit, modular arithmetic checksum |
| `tax_id_tr` | VKN: 10-digit, Luhn-variant checksum |
| `phone_tr` | Turkish mobile: +90/0 prefix + 10 digits |
| `name` | Label-prefixed name: Adı:, Full Name:, Customer Name:, etc. |
| `iban_tr` | Turkish IBAN (TR + 24 chars), mod-97 validated |
| `company_name_tr` | Company with TR legal suffix: A.Ş. · Ltd.Şti. · Koll.Şti. · Koop. · T.A.Ş. |
| `mersis_no` | MERSIS: 16-digit company registry number |
| `postal_code_tr` | Turkish postal code (province plate 01–81) |
| `province_tr` | All 81 Turkish provinces |
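For reference, the publicly documented TCKN rule uses two check digits. The sketch below implements that rule; `tckn_ok` is a hypothetical name, and it is an assumption that the library's "modular arithmetic checksum" follows the same steps.

```python
def tckn_ok(value: str) -> bool:
    """Validate an 11-digit TCKN using the commonly documented two-check-digit rule."""
    if len(value) != 11 or not (value.isascii() and value.isdigit()):
        return False
    if value[0] == "0":  # a TCKN never starts with zero
        return False
    d = [int(c) for c in value]
    odd = sum(d[0:9:2])   # digits at positions 1, 3, 5, 7, 9
    even = sum(d[1:8:2])  # digits at positions 2, 4, 6, 8
    # 10th digit: (7 * odd_sum - even_sum) mod 10
    if (odd * 7 - even) % 10 != d[9]:
        return False
    # 11th digit: sum of the first ten digits mod 10
    return sum(d[:10]) % 10 == d[10]


print(tckn_ok("10000000146"))
```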
Germany (locale="de")
| Type | Description |
|---|---|
| `tax_id_de` | Steueridentifikationsnummer: 11 digits, ISO 7064 MOD 11,2 checksum |
| `social_id_de` | Sozialversicherungsnummer: area + DOB + letter + serial |
France (locale="fr")
| Type | Description |
|---|---|
| `siret_fr` | SIRET: 14 digits, label-prefix gated |
| `company_id_fr` | SIREN: 9 digits, label-prefix gated |
| `social_id_fr` | INSEE/NIR: 15 digits, starts with 1 or 2 |
Italy (locale="it")
| Type | Description |
|---|---|
| `national_id_it` | Codice Fiscale: 16 alphanumeric chars, normalized to uppercase |
| `tax_id_it` | Partita IVA: 11 digits, Agenzia delle Entrate checksum |
Netherlands (locale="nl")
| Type | Description |
|---|---|
| `national_id_nl` | BSN: 9 digits, 11-check (weighted sum mod 11) |
| `company_id_nl` | KvK: 8 digits, label-prefix gated |
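The BSN "11-check" is a weighted sum whose last weight is negative. A minimal sketch follows, with the hypothetical name `bsn_ok`; rejecting the all-zero string is an assumption added here, not something the table states.

```python
def bsn_ok(value: str) -> bool:
    """BSN 11-check: weights 9..2 for the first eight digits, -1 for the last; sum mod 11 == 0."""
    if len(value) != 9 or not (value.isascii() and value.isdigit()):
        return False
    if value == "000000000":  # assumption: the trivial all-zero number is not a real BSN
        return False
    weights = (9, 8, 7, 6, 5, 4, 3, 2, -1)
    total = sum(w * int(c) for w, c in zip(weights, value))
    return total % 11 == 0


print(bsn_ok("111222333"))
```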
Spain (locale="es")
| Type | Description |
|---|---|
| `national_id_es` | DNI (8 digits + letter, mod-23) and NIE (X/Y/Z prefix, same check) |
| `tax_id_es` | CIF: letter prefix + 7 digits + control character |
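The mod-23 check for DNI and NIE can be sketched in a few lines. The letter table is the standard one published for the DNI; the function name `dni_nie_ok` is illustrative and not taken from the library.

```python
import re

# Control letter lookup: index = number mod 23.
_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"
_DNI = re.compile(r"[0-9]{8}[A-Z]")
_NIE = re.compile(r"[XYZ][0-9]{7}[A-Z]")


def dni_nie_ok(value: str) -> bool:
    """Check the control letter of a Spanish DNI or NIE."""
    v = value.strip().upper()
    if _DNI.fullmatch(v):
        number = int(v[:8])
    elif _NIE.fullmatch(v):
        # NIE: the leading X/Y/Z is replaced by 0/1/2 before the same mod-23 check.
        number = int(str("XYZ".index(v[0])) + v[1:8])
    else:
        return False
    return _LETTERS[number % 23] == v[8]


print(dni_nie_ok("12345678Z"))
print(dni_nie_ok("X1234567L"))
```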
United Kingdom (locale="uk")
| Type | Description |
|---|---|
| `social_id_uk` | NI number: 2 letters + 6 digits + A/B/C/D; HMRC forbidden prefixes excluded |
| `tax_id_uk` | UTR: 10 digits, label-prefix gated |
United States (locale="us")
| Type | Description |
|---|---|
| `ssn` | SSN: ###-##-####, invalid prefixes (000/666/9xx) excluded |
| `tax_id_us` | EIN: XX-XXXXXXX, IRS invalid area prefixes excluded |
| `national_id_us` | ITIN: 9XX-7X/8X/9X-XXXX, middle group validated |
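Prefix exclusion of the kind described for `ssn` fits naturally into a negative lookahead. The pattern below is a sketch of that idea, not the library's actual regex; `find_ssns` is a hypothetical helper.

```python
import re

# ###-##-#### with area numbers 000, 666 and 900-999 rejected up front.
_SSN = re.compile(r"\b(?!000|666|9[0-9]{2})[0-9]{3}-[0-9]{2}-[0-9]{4}\b")


def find_ssns(text: str) -> list[str]:
    """Return every SSN-shaped match whose area number is not an excluded prefix."""
    return [m.group() for m in _SSN.finditer(text)]


sample = "ok 123-45-6789, bad 000-12-3456, bad 666-12-3456, bad 900-12-3456"
print(find_ssns(sample))
```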
EU / International (locale="eu")
| Type | Description |
|---|---|
| `phone_intl` | E.164 international phone: 7–15 digits, TR (+90) excluded |
| `iban_intl` | IBAN for EU+GB+CH+NO: ISO 13616 country+length table + mod-97 |
| `company_name_intl` | Company with international suffix: GmbH · LLC · S.r.l. · B.V. · SAS · Inc. · Ltd. etc. |
Noise detection
noise_ratio measures the fraction of lines that are blank or contain symbol clutter:
result = audit("clean line\n@@@garbage\n\nclean")
result.noise_ratio # 0.5 (2 noisy lines out of 4)
A line is "noisy" when it is blank (after strip) or contains 3+ consecutive characters from @ # ! ~ * =.
| noise_ratio | Signal |
|---|---|
| < 0.05 | Clean: likely well-extracted text |
| 0.05–0.20 | Acceptable: minor formatting artifacts |
| > 0.20 | Low quality: likely OCR noise or extraction failure |
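The line rule and the bands above can be reproduced with a few lines of standard-library Python. This is a sketch under two assumptions: "3+ consecutive characters" is read as any mix of the six listed symbols, and empty input yields 0.0. The names `noise_ratio_sketch` and `noise_band` are hypothetical.

```python
import re

# Three or more consecutive characters drawn from @ # ! ~ * = (any mix; assumed reading).
_CLUTTER = re.compile(r"[@#!~*=]{3,}")


def noise_ratio_sketch(text: str) -> float:
    """Fraction of lines that are blank after strip() or contain symbol clutter."""
    lines = text.splitlines()
    if not lines:
        return 0.0  # assumption: empty input has no noisy lines
    noisy = sum(1 for line in lines if not line.strip() or _CLUTTER.search(line))
    return noisy / len(lines)


def noise_band(ratio: float) -> str:
    """Map a ratio onto the three bands from the table."""
    if ratio < 0.05:
        return "clean"
    if ratio <= 0.20:
        return "acceptable"
    return "low quality"


ratio = noise_ratio_sketch("clean line\n@@@garbage\n\nclean")
print(ratio, noise_band(ratio))
```

On the four-line example from this section, the sketch flags the `@@@garbage` line and the blank line, which matches the documented 0.5.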
Masking strategies
clean = mask(text, result["pii"], strategy="redact") # default
clean = mask(text, result["pii"], strategy="token")
clean = mask(text, result["pii"], strategy="hash")
clean = mask(text, result["pii"], strategy="replace")
| Strategy | Example output |
|---|---|
| `redact` (default) | `[REDACTED_EMAIL]` |
| `replace` | `user@example.com` (static synthetic) |
| `token` | `<PII_EMAIL_1>` (unique per type per call) |
| `hash` | `[3d4f9a1b2c8e7f0a]` (SHA-256 first 16 hex chars) |
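To make the output formats concrete, here is a rough re-implementation of three of the strategies over findings shaped like the `result["pii"]` entries (`type`, `value`, `start`, `end`). It is a sketch, not the library's `mask()`: the name `mask_sketch` is hypothetical, hashing the raw value without a salt is an assumption, and `replace` is left out because the synthetic values are not documented here.

```python
import hashlib


def mask_sketch(text: str, findings: list[dict], strategy: str = "redact") -> str:
    """Replace each finding's span with a redact, token, or hash placeholder."""
    counters: dict[str, int] = {}
    replacements = []
    for f in sorted(findings, key=lambda item: item["start"]):
        kind = f["type"].upper()
        if strategy == "redact":
            repl = f"[REDACTED_{kind}]"
        elif strategy == "token":
            counters[kind] = counters.get(kind, 0) + 1  # numbered per type, per call
            repl = f"<PII_{kind}_{counters[kind]}>"
        elif strategy == "hash":
            digest = hashlib.sha256(f["value"].encode("utf-8")).hexdigest()
            repl = f"[{digest[:16]}]"  # first 16 hex chars of SHA-256
        else:
            raise ValueError(f"unsupported strategy in this sketch: {strategy}")
        replacements.append((f["start"], f["end"], repl))
    # Apply from right to left so earlier offsets stay valid.
    for start, end, repl in reversed(replacements):
        text = text[:start] + repl + text[end:]
    return text


text = "Contact: ali@example.com"
start = text.index("ali@example.com")
findings = [{"type": "email", "value": "ali@example.com",
             "start": start, "end": start + len("ali@example.com")}]
print(mask_sketch(text, findings, "redact"))
print(mask_sketch(text, findings, "token"))
```

Token and hash outputs keep distinct values distinguishable after masking, while redact keeps only the PII type.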
Quality grade
quality_grade (A–D) and quality_score (0.0–1.0) are composite signals:
| Grade | Score | Signal |
|---|---|---|
| A | ≥ 0.85 | Ready for LLM training or RAG |
| B | ≥ 0.65 | Usable with minor cleanup |
| C | ≥ 0.40 | Review before use |
| D | < 0.40 | Not suitable — empty, too short, or high noise |
Score formula: completeness × (0.4 × noise_score + 0.4 × length_score + 0.2)
length_score = min(char_count / 500, 1.0) · noise_score = max(0, 1 − garbage_ratio × 10)
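The formula and the grade thresholds translate directly into code. The sketch below is a worked example of the arithmetic only; `quality_score_sketch` and `quality_grade_sketch` are hypothetical names, and how the library derives `completeness` and `garbage_ratio` is not covered here.

```python
def quality_score_sketch(completeness: float, garbage_ratio: float, char_count: int) -> float:
    """completeness * (0.4 * noise_score + 0.4 * length_score + 0.2)."""
    length_score = min(char_count / 500, 1.0)
    noise_score = max(0.0, 1.0 - garbage_ratio * 10)
    return completeness * (0.4 * noise_score + 0.4 * length_score + 0.2)


def quality_grade_sketch(score: float) -> str:
    """Map a 0.0-1.0 score to the A-D bands from the table."""
    if score >= 0.85:
        return "A"
    if score >= 0.65:
        return "B"
    if score >= 0.40:
        return "C"
    return "D"


# A complete, clutter-free text of 250 characters:
# 1.0 * (0.4 * 1.0 + 0.4 * 0.5 + 0.2) = 0.8
score = quality_score_sketch(completeness=1.0, garbage_ratio=0.0, char_count=250)
print(round(score, 2), quality_grade_sketch(score))
```

Two properties follow from the formula: a `garbage_ratio` of 0.10 or more zeroes the noise term, and a text shorter than 500 characters can never reach a score of 1.0.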
Limitations
- No automatic language detection: `flexorch-audit` has zero dependencies. Pass `locale` explicitly, or use the default "und" to activate all detectors. See LIMITATIONS.md.
- Free-standing name detection (without a label prefix) requires NLP/NER and is not included.
- `replace` masking uses static synthetic values; locale-aware realistic synthesis is not implemented.
- The library audits plain text. PDF/DOCX parsing, e-invoice extraction, and pipeline orchestration are out of scope.
Also available for JavaScript / TypeScript
npm install @flexorch/audit
Contributing
See CONTRIBUTING.md.
License
MIT