Zero-dependency PII + quality + noise audit for LLM datasets (TR/EU/US)
flexorch-audit
Zero-dependency PII + quality + noise audit for LLM datasets. Answers one question: is this dataset ready for LLM training?
- PII detection — email, phone (TR + E.164), credit card (Luhn), IP, TCKN, IBAN, SSN, label-prefixed names
- Quality metrics — completeness, average length, duplicate ratio
- Noise metrics — garbage character ratio, encoding health
- Masking — redact / replace / token / hash strategies
- Zero runtime dependencies — pure Python stdlib, Python 3.10+
```python
from flexorch_audit import audit, mask

# `text` is the raw string to audit; the output shown below is illustrative.
result = audit(text, locale="tr")
# {
#   "pii": [{"type": "email", "value": "ali@example.com", "start": 8, "end": 23}],
#   "quality": {"completeness": 1.0, "avg_length": 342, "duplicate_ratio": None},
#   "noise": {"garbage_ratio": 0.0, "encoding_ok": True},
# }

clean = mask(text, result["pii"], strategy="redact")
# "Contact: [REDACTED_EMAIL]"
```
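To turn per-record results into a single "ready or not" answer, one option is a small aggregation step. The sketch below is not part of the package: `summarize` and the `max_garbage_ratio` threshold are hypothetical, and it only assumes that each result has the dictionary shape shown in the example above.

```python
from collections import Counter


def summarize(results, max_garbage_ratio=0.05):
    """Aggregate audit()-shaped dicts into a dataset-level summary (hypothetical helper)."""
    # Count findings by PII type across all records.
    pii_counts = Counter(f["type"] for r in results for f in r["pii"])
    # A record is "noisy" if it exceeds the (assumed) threshold or fails the encoding check.
    noisy = sum(
        1
        for r in results
        if r["noise"]["garbage_ratio"] > max_garbage_ratio
        or not r["noise"]["encoding_ok"]
    )
    return {
        "records": len(results),
        "pii_counts": dict(pii_counts),
        "noisy_records": noisy,
        "ready": not pii_counts and noisy == 0,
    }


# Hand-written sample results in the shape returned by audit().
sample = [
    {
        "pii": [{"type": "email", "value": "ali@example.com", "start": 8, "end": 23}],
        "quality": {"completeness": 1.0, "avg_length": 342, "duplicate_ratio": None},
        "noise": {"garbage_ratio": 0.0, "encoding_ok": True},
    },
    {
        "pii": [],
        "quality": {"completeness": 1.0, "avg_length": 120, "duplicate_ratio": None},
        "noise": {"garbage_ratio": 0.0, "encoding_ok": True},
    },
]
summary = summarize(sample)
```

What counts as "ready" is a policy decision; the zero-PII, zero-noise rule here is only one possible choice.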
Install
```
pip install flexorch-audit
```
Locale support
| locale | Active detectors |
|---|---|
| `"tr"` (default) | email, iban, credit_card, ip + TCKN, phone_tr, name |
| `"us"` | email, iban, credit_card, ip + SSN, E.164 phone |
| `"eu"` | email, iban, credit_card, ip + E.164 phone |
| `"all"` | All of the above (phone_tr takes precedence over generic phone) |
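The precedence rule for `"all"` can be pictured as a post-processing step over the findings list. The sketch below is an assumption about how such a rule could work, not the package's code: `prefer_phone_tr` is a hypothetical helper that drops any generic `phone` finding whose span overlaps a `phone_tr` finding.

```python
def prefer_phone_tr(findings):
    """Drop generic `phone` findings that overlap a `phone_tr` finding (hypothetical helper)."""
    tr_spans = [(f["start"], f["end"]) for f in findings if f["type"] == "phone_tr"]

    def overlaps_tr(f):
        # Two half-open spans [start, end) overlap if each starts before the other ends.
        return any(f["start"] < end and start < f["end"] for start, end in tr_spans)

    return [f for f in findings if not (f["type"] == "phone" and overlaps_tr(f))]


findings = [
    {"type": "phone_tr", "value": "+905321234567", "start": 0, "end": 13},
    {"type": "phone", "value": "+905321234567", "start": 0, "end": 13},
    {"type": "email", "value": "ali@example.com", "start": 20, "end": 35},
]
resolved = prefer_phone_tr(findings)
```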
PII types
| Type | Description | Locale |
|---|---|---|
| `email` | RFC 5321 address | all |
| `iban` | ISO 13616 IBAN (any country) | all |
| `credit_card` | 16-digit groups, Luhn-validated | all |
| `ip` | IPv4 address | all |
| `phone_tr` | Turkish mobile (+90/0 prefix + 10 digits) | tr |
| `national_id_tr` | TCKN, 11 digits with a modular-arithmetic checksum | tr |
| `name` | Label-prefixed name (e.g. "Adı: Ali Yıldız", "Full Name: Jane Doe") | tr |
| `phone` | E.164 international phone | us, eu |
| `ssn` | US Social Security Number (###-##-####) | us |
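The two checksum-based detectors rely on well-known public algorithms. The following is a minimal sketch of the standard Luhn check and the standard TCKN checksum rules; `luhn_ok` and `tckn_ok` are illustrative names, and the package's own implementation may differ in details such as accepted separators.

```python
def luhn_ok(number: str) -> bool:
    """Standard Luhn check for a 16-digit card number (spaces and hyphens allowed)."""
    compact = number.replace(" ", "").replace("-", "")
    if len(compact) != 16 or not (compact.isascii() and compact.isdigit()):
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, ch in enumerate(reversed(compact)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


def tckn_ok(value: str) -> bool:
    """Standard TCKN rules: 11 digits, non-zero first digit, two check digits."""
    if len(value) != 11 or not (value.isascii() and value.isdigit()) or value[0] == "0":
        return False
    d = [int(ch) for ch in value]
    odd = sum(d[0:9:2])   # 1st, 3rd, 5th, 7th, 9th digits
    even = sum(d[1:8:2])  # 2nd, 4th, 6th, 8th digits
    tenth_ok = (odd * 7 - even) % 10 == d[9]
    eleventh_ok = sum(d[:10]) % 10 == d[10]
    return tenth_ok and eleventh_ok


card_valid = luhn_ok("4111 1111 1111 1111")  # common test card number
tckn_valid = tckn_ok("10000000146")          # commonly used test TCKN
```

Checksums of this kind cut false positives on arbitrary digit runs, but a passing checksum does not prove that the number belongs to a real card or person.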
Masking strategies
| Strategy | Example output |
|---|---|
| `redact` (default) | `[REDACTED_EMAIL]` |
| `replace` | `user@example.com` (realistic synthetic) |
| `token` | `<PII_EMAIL_1>` (unique per type) |
| `hash` | `[3d4f9a1b2c8e7f0a]` (first 16 hex chars of the SHA-256 digest) |
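To make the output formats concrete, here is a rough sketch of how three of these strategies could be applied to a findings list. `mask_sketch` is a hypothetical stand-in, not the package's `mask`; it assumes findings are non-overlapping and carry the `type`, `value`, `start`, and `end` keys shown earlier, and it leaves out `replace` because the synthetic-value generator is not described here.

```python
import hashlib


def mask_sketch(text, findings, strategy="redact"):
    """Replace each finding's span with a strategy-specific placeholder (illustrative only)."""
    pieces, cursor = [], 0
    counters = {}  # per-type counters for the "token" strategy
    for f in sorted(findings, key=lambda f: f["start"]):
        label = f["type"].upper()
        if strategy == "redact":
            placeholder = f"[REDACTED_{label}]"
        elif strategy == "token":
            counters[label] = counters.get(label, 0) + 1
            placeholder = f"<PII_{label}_{counters[label]}>"
        elif strategy == "hash":
            digest = hashlib.sha256(f["value"].encode("utf-8")).hexdigest()
            placeholder = f"[{digest[:16]}]"
        else:
            raise ValueError(f"unsupported strategy in this sketch: {strategy}")
        # Copy the untouched text before the span, then the placeholder.
        pieces.append(text[cursor:f["start"]])
        pieces.append(placeholder)
        cursor = f["end"]
    pieces.append(text[cursor:])
    return "".join(pieces)


text = "Contact: ali@example.com"
findings = [{"type": "email", "value": "ali@example.com", "start": 9, "end": 24}]
redacted = mask_sketch(text, findings, "redact")  # "Contact: [REDACTED_EMAIL]"
tokenized = mask_sketch(text, findings, "token")  # "Contact: <PII_EMAIL_1>"
hashed = mask_sketch(text, findings, "hash")
```

A truncated unsalted hash is deterministic, so equal values map to equal placeholders; that keeps joins possible but also means low-entropy values can be guessed by hashing candidates.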
Quality & noise
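`duplicate_ratio` is `None` for single-string input, because a single string has nothing to be compared against. To compute it across a dataset:

```python
from flexorch_audit import audit

# `dataset` is any iterable of records with a "text" field.
texts = [record["text"] for record in dataset]
results = [audit(t) for t in texts]

# A text counts as a duplicate if it was already seen; seen.add() returns None.
seen = set()
duplicates = sum(1 for t in texts if t in seen or seen.add(t))
duplicate_ratio = duplicates / len(texts) if texts else None
```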
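The exact definitions behind `garbage_ratio` and `encoding_ok` are not spelled out here, so the following is only one plausible reading. It assumes "garbage" means the U+FFFD replacement character plus control, unassigned, and private-use code points, and that encoding health means the raw bytes decode as UTF-8; both function names are illustrative.

```python
import unicodedata

# Unicode general categories treated as garbage in this sketch:
# Cc = control, Cn = unassigned, Co = private use.
_GARBAGE_CATEGORIES = {"Cc", "Cn", "Co"}
_ALLOWED_CONTROLS = "\n\r\t"


def garbage_ratio_sketch(text: str) -> float:
    """Share of characters that look like decoding debris (assumed definition)."""
    if not text:
        return 0.0
    bad = 0
    for ch in text:
        if ch == "\ufffd":  # replacement character left by a lossy decode
            bad += 1
        elif ch not in _ALLOWED_CONTROLS and unicodedata.category(ch) in _GARBAGE_CATEGORIES:
            bad += 1
    return bad / len(text)


def encoding_ok_sketch(raw: bytes) -> bool:
    """True if the bytes decode cleanly as UTF-8 (assumed definition)."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


clean_ratio = garbage_ratio_sketch("merhaba dünya")
dirty_ratio = garbage_ratio_sketch("a\ufffdb\x00")  # 2 of 4 characters are garbage
```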
Limitations (v0.1)
- Free-standing name detection (without a label prefix) requires NLP/NER — not included.
- `duplicate_ratio` is per-call; aggregate across your dataset manually (see above).
- IPv6 not detected.
- IBAN format-only check; mod-97 validation not performed.
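If mod-97 validation matters for your data, it can be applied as a post-filter on `iban` findings. The sketch below follows the standard ISO 13616 check (move the first four characters to the end, map letters to 10..35, require remainder 1 modulo 97); `iban_mod97_ok` is a hypothetical helper and does not verify country-specific lengths.

```python
def iban_mod97_ok(iban: str) -> bool:
    """Standard IBAN mod-97 check; country-specific lengths are not verified."""
    compact = iban.replace(" ", "").upper()
    if not (15 <= len(compact) <= 34):
        return False
    if not (compact.isascii() and compact.isalnum()):
        return False
    # Country code must be two letters, check digits two digits.
    if not (compact[:2].isalpha() and compact[2:4].isdigit()):
        return False
    rearranged = compact[4:] + compact[:4]
    # int(ch, 36) maps "0".."9" to 0..9 and "A".."Z" to 10..35.
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


gb_valid = iban_mod97_ok("GB82 WEST 1234 5698 7654 32")  # widely published example IBAN
de_valid = iban_mod97_ok("DE89370400440532013000")       # widely published example IBAN
```

Filtering with something like `[f for f in result["pii"] if f["type"] != "iban" or iban_mod97_ok(f["value"])]` would drop format-only matches that fail the checksum.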
License
MIT