Zero-dependency PII + quality + noise audit for LLM datasets (TR/EU/US)

flexorch-audit

Zero-dependency PII detection, quality grading, and noise audit for LLM datasets — in a single function call.

Why

Before feeding documents into an LLM pipeline you need to answer three questions:

  1. Does this text contain personal data? Sending PII to a language model is a compliance risk.
  2. Is the text quality high enough? Short, noisy, or duplicate records hurt fine-tuning and RAG retrieval.
  3. How bad is the noise? Garbled encodings and symbol clutter degrade model output silently.

Most tools that answer these questions require heavy NLP frameworks, model weights, or cloud APIs. flexorch-audit answers all three with one call — using only regex and Python's standard library. No model weights, no network calls, no external packages.

Features

  • Quality grade: A/B/C/D composite score that shows at a glance whether a text is LLM-ready
  • Noise ratio: line-level symbol clutter detection (noise_ratio); values above 0.20 indicate likely extraction artifacts
  • PII detection: 30+ types across 8 countries (TR/DE/FR/IT/NL/ES/UK/US) plus universal types; all regex-based, with checksum validation where the identifier defines one
  • Batch audit: audit_batch() aggregates duplicate ratio and PII counts across an entire dataset in one call
  • Masking: four strategies, namely redact, replace (synthetic), token, and hash
  • Zero runtime dependencies: pure Python stdlib, Python 3.10+

Install

pip install flexorch-audit

Quick start

from flexorch_audit import audit, mask

with open("contract.txt", encoding="utf-8") as f:  # extract from PDF/DOCX first
    text = f.read()

result = audit(text)               # "und" by default — all detectors active
# result = audit(text, locale="tr")  # restrict to TR-only detectors

result.quality_grade      # "B"
result.quality_score      # 0.73  (0.0–1.0 composite)
result.noise_ratio        # 0.04  (fraction of blank/garbage lines; >0.20 = low quality)
result.detected_language  # "und" (locale you passed in; caller controls language)
result.pii_summary        # [{"type": "email", "count": 2}, {"type": "national_id_tr", "count": 1}]

# Full findings and raw metrics — dict access also works:
result["pii"]    # [{"type": "email", "value": "ali@example.com", "start": 8, "end": 23}]
result["quality"]  # {"completeness": 1.0, "avg_length": 342, "duplicate_ratio": None}
result["noise"]    # {"garbage_ratio": 0.0, "encoding_ok": True}

clean = mask(text, result["pii"], strategy="redact")
# "Contact: [REDACTED_EMAIL]"

One-shot redaction

from flexorch_audit import redact_for_llm

clean = redact_for_llm("TCKN: 12345678950, email: ali@example.com", locale="tr")
# "TCKN: [REDACTED_NATIONAL_ID_TR], email: [REDACTED_EMAIL]"

# Different masking strategies
redact_for_llm(text, locale="tr", strategy="token")   # <PII_NATIONAL_ID_TR_1>
redact_for_llm(text, locale="tr", strategy="hash")    # [3d4f9a1b2c8e7f0a]
redact_for_llm(text, locale="tr", strategy="replace") # static synthetic value

No PII found → original text returned unchanged.

Token estimation

from flexorch_audit import estimate_tokens

estimate_tokens("The quick brown fox jumps over the lazy dog.")  # → 12 (9 words × 4/3)
estimate_tokens("")  # → 0

Heuristic: words × 4/3 — no tiktoken required. Accuracy within ~15% of the real tokenizer for English and most European languages; treat as a planning estimate for context window sizing and cost forecasting.
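
The heuristic can be sketched in a few lines. This is a minimal illustration, assuming words are whitespace-separated and the result is rounded to the nearest integer; the name estimate_tokens_sketch is hypothetical, and the library's own splitting and rounding may differ.

```python
def estimate_tokens_sketch(text: str) -> int:
    """Approximate a token count as words * 4/3 (hypothetical helper, not the library's code)."""
    words = len(text.split())  # assumption: words are whitespace-separated
    # Integer form of round(words * 4 / 3); a remainder of 2/3 rounds up, 1/3 rounds down.
    return (words * 4 + 1) // 3


print(estimate_tokens_sketch("audit every record before fine tuning"))  # 6 words -> 8
print(estimate_tokens_sketch(""))                                       # 0
```

Staying in integer arithmetic avoids floating-point rounding surprises, which matters little for a planning estimate but keeps the result deterministic.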

Batch audit

from flexorch_audit import audit_batch

texts = [record["text"] for record in dataset]
batch = audit_batch(texts)           # locale="und" by default

batch["duplicate_ratio"]    # 0.12 — fraction of exact-duplicate records
batch["avg_quality_score"]  # 0.78
batch["pii_summary"]        # [{"type": "email", "count": 47}, ...]
batch["results"]            # list of AuditResult, one per text
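
To make the duplicate_ratio figure concrete, here is a small sketch of one way to compute a fraction of exact-duplicate records. It assumes that the first occurrence of a text counts as the original and every later identical text counts as a duplicate; duplicate_ratio_sketch is a hypothetical name, and the library may count differently.

```python
def duplicate_ratio_sketch(texts: list[str]) -> float:
    """Fraction of records that exactly repeat an earlier record (assumed definition)."""
    if not texts:
        return 0.0
    seen: set[str] = set()
    duplicates = 0
    for text in texts:
        if text in seen:
            duplicates += 1  # an identical record appeared earlier
        else:
            seen.add(text)
    return duplicates / len(texts)


print(duplicate_ratio_sketch(["a", "b", "a", "a"]))  # 2 repeats out of 4 records -> 0.5
```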

Country coverage

| locale | Detectors activated |
|---|---|
| "und" (default) | All locales combined; use when the document language is unknown |
| "all" | Alias for "und" |
| "tr" | TCKN · VKN · phone_tr · name · IBAN_TR · company_name_tr · MERSIS · postal_code_tr · province_tr |
| "de" | Steueridentifikationsnummer · Sozialversicherungsnummer |
| "fr" | SIREN · SIRET · INSEE/NIR |
| "it" | Codice Fiscale · Partita IVA |
| "nl" | BSN · KvK |
| "es" | DNI/NIE · CIF |
| "uk" | NI number · UTR |
| "us" | SSN · EIN · ITIN |
| "eu" | E.164 phone · IBAN (EU+GB+CH+NO) · company name |

Universal detectors (always active regardless of locale): email · iban · credit_card · ip · ip_v6

Language detection: flexorch-audit is zero-dependency — no language detection library is included. Pass the correct locale yourself, or use "und" (default) to activate all detectors.

PII types

Universal

| Type | Description |
|---|---|
| email | RFC-5321 email address |
| iban | ISO 13616 IBAN, mod-97 validated; suppressed when iban_tr or iban_intl fires on the same span |
| credit_card | 16-digit groups, Luhn-validated |
| ip | IPv4 address |
| ip_v6 | IPv6: full, compressed (::), and loopback forms |
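
The two checksum rules named in this table are standard and easy to sketch with the standard library. The helpers below (luhn_ok and iban_mod97_ok, both hypothetical names) illustrate the Luhn and mod-97 checks only; they are not the library's detectors and skip the regex matching and the per-country IBAN length table.

```python
def luhn_ok(number: str) -> bool:
    """Luhn check: double every second digit from the right, sum, test mod 10."""
    digits = [int(c) for c in number if "0" <= c <= "9"]  # ignore spaces and dashes
    if len(digits) < 2:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:      # every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9      # same as adding the two digits of the product
        total += d
    return total % 10 == 0


def iban_mod97_ok(iban: str) -> bool:
    """ISO 13616 check: move the first 4 chars to the end, map A-Z to 10-35, test mod 97 == 1."""
    compact = iban.replace(" ", "").upper()
    if len(compact) < 5 or not (compact.isascii() and compact.isalnum()):
        return False
    rearranged = compact[4:] + compact[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)  # int("A", 36) == 10
    return int(numeric) % 97 == 1


print(luhn_ok("4111 1111 1111 1111"))                # True (common test card number)
print(iban_mod97_ok("GB82 WEST 1234 5698 7654 32"))  # True (widely used example IBAN)
```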

Turkey (locale="tr")

| Type | Description |
|---|---|
| national_id_tr | TCKN: 11 digits, modular arithmetic checksum |
| tax_id_tr | VKN: 10 digits, Luhn-variant checksum |
| phone_tr | Turkish mobile: +90/0 prefix + 10 digits |
| name | Label-prefixed name: Adı:, Full Name:, Customer Name:, etc. |
| iban_tr | Turkish IBAN (TR + 24 chars), mod-97 validated |
| company_name_tr | Company with TR legal suffix: A.Ş. · Ltd.Şti. · Koll.Şti. · Koop. · T.A.Ş. |
| mersis_no | MERSIS: 16-digit company registry number |
| postal_code_tr | Turkish postal code (province plate 01–81) |
| province_tr | All 81 Turkish provinces |
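
As an illustration of the "modular arithmetic checksum" for national_id_tr, the sketch below applies the commonly documented TCKN rules: the first digit is non-zero, the tenth digit is derived from the odd and even positions, and the eleventh is the sum of the first ten modulo 10. tckn_ok is a hypothetical helper, not the library's detector.

```python
def tckn_ok(value: str) -> bool:
    """Validate a TCKN using the publicly documented check-digit rules (sketch)."""
    if len(value) != 11 or not (value.isascii() and value.isdigit()) or value[0] == "0":
        return False
    d = [int(c) for c in value]
    odd_sum = sum(d[0:9:2])   # digits 1, 3, 5, 7, 9 (1-based positions)
    even_sum = sum(d[1:8:2])  # digits 2, 4, 6, 8
    tenth = (odd_sum * 7 - even_sum) % 10
    eleventh = sum(d[:10]) % 10
    return d[9] == tenth and d[10] == eleventh


print(tckn_ok("12345678950"))  # True: the sample value used in the redaction example
print(tckn_ok("12345678951"))  # False: last check digit is wrong
```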

Germany (locale="de")

| Type | Description |
|---|---|
| tax_id_de | Steueridentifikationsnummer: 11 digits, ISO 7064 MOD 11,10 checksum |
| social_id_de | Sozialversicherungsnummer: area + DOB + letter + serial |

France (locale="fr")

| Type | Description |
|---|---|
| siret_fr | SIRET: 14 digits, label-prefix gated |
| company_id_fr | SIREN: 9 digits, label-prefix gated |
| social_id_fr | INSEE/NIR: 15 digits, starts with 1 or 2 |

Italy (locale="it")

| Type | Description |
|---|---|
| national_id_it | Codice Fiscale: 16 alphanumeric chars, normalized to uppercase |
| tax_id_it | Partita IVA: 11 digits, Agenzia delle Entrate checksum |

Netherlands (locale="nl")

| Type | Description |
|---|---|
| national_id_nl | BSN: 9 digits, 11-check (weighted sum mod 11) |
| company_id_nl | KvK: 8 digits, label-prefix gated |
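
The 11-check for a BSN can be sketched as follows, using the commonly documented weights 9 down to 2 for the first eight digits and -1 for the last. bsn_ok is a hypothetical helper; rejecting the all-zero value is an extra assumption made here, and the library's detector may apply further rules.

```python
def bsn_ok(value: str) -> bool:
    """BSN 11-check: weighted digit sum must be divisible by 11 (sketch)."""
    if len(value) != 9 or not (value.isascii() and value.isdigit()):
        return False
    if value == "000000000":  # assumption: reject the trivial all-zero value
        return False
    weights = (9, 8, 7, 6, 5, 4, 3, 2, -1)
    total = sum(w * int(c) for w, c in zip(weights, value))
    return total % 11 == 0


print(bsn_ok("111222333"))  # True: weighted sum is 66
print(bsn_ok("123456789"))  # False: weighted sum is 147
```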

Spain (locale="es")

| Type | Description |
|---|---|
| national_id_es | DNI (8 digits + letter, mod-23) and NIE (X/Y/Z prefix, same check) |
| tax_id_es | CIF: letter prefix + 7 digits + control character |
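
The mod-23 check shared by DNI and NIE is compact enough to show in full. The sketch below uses the standard control-letter table and maps the NIE prefixes X/Y/Z to 0/1/2 before the same calculation; dni_nie_ok is a hypothetical name and the pattern is a simplification of whatever the library matches.

```python
import re

_CONTROL_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"  # index = number mod 23
_DNI_NIE = re.compile(r"^([XYZ]\d{7}|\d{8})([A-Z])$", re.ASCII)


def dni_nie_ok(value: str) -> bool:
    """Check the control letter of a Spanish DNI or NIE (sketch)."""
    match = _DNI_NIE.match(value.upper())
    if not match:
        return False
    body, letter = match.groups()
    if body[0] in "XYZ":
        # NIE: replace the leading letter with 0, 1, or 2, then treat as a DNI
        body = str("XYZ".index(body[0])) + body[1:]
    return _CONTROL_LETTERS[int(body) % 23] == letter


print(dni_nie_ok("12345678Z"))  # True: 12345678 mod 23 = 14 -> "Z"
print(dni_nie_ok("X1234567L"))  # True: 01234567 mod 23 = 19 -> "L"
```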

United Kingdom (locale="uk")

| Type | Description |
|---|---|
| social_id_uk | NI number: 2 letters + 6 digits + A/B/C/D; HMRC forbidden prefixes excluded |
| tax_id_uk | UTR: 10 digits, label-prefix gated |

United States (locale="us")

| Type | Description |
|---|---|
| ssn | SSN: ###-##-####, invalid prefixes (000/666/9xx) excluded |
| tax_id_us | EIN: XX-XXXXXXX, IRS invalid area prefixes excluded |
| national_id_us | ITIN: 9XX-7X/8X/9X-XXXX, middle group validated |
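
Prefix exclusion of the kind described for ssn can be expressed with a negative lookahead. The pattern below is a minimal sketch covering only the three rules listed in the table (000, 666, and 9xx area numbers); SSN_SKETCH and find_ssn_candidates are hypothetical names, and the library's actual pattern is not shown in this document.

```python
import re

# ###-##-#### where the first group is not 000, 666, or 900-999
SSN_SKETCH = re.compile(r"\b(?!000|666|9\d{2})\d{3}-\d{2}-\d{4}\b", re.ASCII)


def find_ssn_candidates(text: str) -> list[str]:
    """Return SSN-shaped substrings that pass the prefix rules (sketch)."""
    return SSN_SKETCH.findall(text)


sample = "valid 123-45-6789, rejected 666-12-3456 and 900-12-3456"
print(find_ssn_candidates(sample))  # ['123-45-6789']
```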

EU / International (locale="eu")

| Type | Description |
|---|---|
| phone_intl | E.164 international phone: 7–15 digits, TR (+90) excluded |
| iban_intl | IBAN for EU+GB+CH+NO: ISO 13616 country+length table + mod-97 |
| company_name_intl | Company with international suffix: GmbH · LLC · S.r.l. · B.V. · SAS · Inc. · Ltd. etc. |

Noise detection

noise_ratio measures the fraction of lines that are blank or contain symbol clutter:

result = audit("clean line\n@@@garbage\n\nclean")
result.noise_ratio   # 0.5  (2 noisy lines out of 4)

A line is "noisy" when it is blank (after strip) or contains 3+ consecutive characters from @ # ! ~ * =.

| noise_ratio | Signal |
|---|---|
| < 0.05 | Clean: likely well-extracted text |
| 0.05–0.20 | Acceptable: minor formatting artifacts |
| > 0.20 | Low quality: likely OCR noise or extraction failure |
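
The line rule and the thresholds above can be reproduced with a short sketch. It assumes lines are split on "\n" and that "3+ consecutive characters" means any run of three or more characters drawn from the listed set, mixed or repeated; noise_ratio_sketch and noise_band are hypothetical names.

```python
import re

_CLUTTER = re.compile(r"[@#!~*=]{3,}")  # assumption: any mix of the six symbols counts


def noise_ratio_sketch(text: str) -> float:
    """Fraction of lines that are blank or contain a run of symbol clutter."""
    lines = text.split("\n")
    noisy = sum(1 for line in lines if not line.strip() or _CLUTTER.search(line))
    return noisy / len(lines)


def noise_band(ratio: float) -> str:
    """Map a ratio to the three bands in the table."""
    if ratio < 0.05:
        return "clean"
    if ratio <= 0.20:
        return "acceptable"
    return "low quality"


ratio = noise_ratio_sketch("clean line\n@@@garbage\n\nclean")
print(ratio, noise_band(ratio))  # 0.5 low quality
```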

Masking strategies

clean = mask(text, result["pii"], strategy="redact")   # default
clean = mask(text, result["pii"], strategy="token")
clean = mask(text, result["pii"], strategy="hash")
clean = mask(text, result["pii"], strategy="replace")

| Strategy | Example output |
|---|---|
| redact (default) | [REDACTED_EMAIL] |
| replace | user@example.com (static synthetic) |
| token | <PII_EMAIL_1> (unique per type per call) |
| hash | [3d4f9a1b2c8e7f0a] (SHA-256 first 16 hex chars) |
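
To show how the four strategies differ mechanically, here is a minimal sketch that rewrites a string from a list of findings shaped like the result["pii"] entries in the quick start (type, value, start, end). mask_sketch is a hypothetical function; the synthetic value table and the fallback for unknown types are assumptions, not the library's behavior.

```python
import hashlib


def mask_sketch(text: str, findings: list[dict], strategy: str = "redact") -> str:
    """Replace each finding's span according to the chosen strategy (sketch)."""
    synthetic = {"email": "user@example.com"}  # assumption: one static value per type
    counters: dict[str, int] = {}
    replacements = []
    # Number tokens in reading order, left to right.
    for finding in sorted(findings, key=lambda f: f["start"]):
        kind = finding["type"]
        label = kind.upper()
        if strategy == "redact":
            new = f"[REDACTED_{label}]"
        elif strategy == "token":
            counters[kind] = counters.get(kind, 0) + 1
            new = f"<PII_{label}_{counters[kind]}>"
        elif strategy == "hash":
            digest = hashlib.sha256(finding["value"].encode("utf-8")).hexdigest()
            new = f"[{digest[:16]}]"  # first 16 hex chars of SHA-256
        elif strategy == "replace":
            new = synthetic.get(kind, f"[REDACTED_{label}]")  # assumed fallback
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        replacements.append((finding["start"], finding["end"], new))
    # Apply right to left so earlier offsets stay valid.
    for start, end, new in reversed(replacements):
        text = text[:start] + new + text[end:]
    return text


findings = [{"type": "email", "value": "ali@example.com", "start": 9, "end": 24}]
print(mask_sketch("Contact: ali@example.com", findings))           # Contact: [REDACTED_EMAIL]
print(mask_sketch("Contact: ali@example.com", findings, "token"))  # Contact: <PII_EMAIL_1>
```

Applying replacements from the end of the string backwards is the key design choice: each substitution changes the text length, so working right to left keeps the remaining start/end offsets correct.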

Quality grade

quality_grade (A–D) and quality_score (0.0–1.0) are composite signals:

| Grade | Score | Signal |
|---|---|---|
| A | ≥ 0.85 | Ready for LLM training or RAG |
| B | ≥ 0.65 | Usable with minor cleanup |
| C | ≥ 0.40 | Review before use |
| D | < 0.40 | Not suitable: empty, too short, or high noise |

Score formula: completeness × (0.4 × noise_score + 0.4 × length_score + 0.2)
length_score = min(char_count / 500, 1.0) · noise_score = max(0, 1 − garbage_ratio × 10)
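
A worked example makes the formula easier to check by hand. The sketch below transcribes the score formula and the grade thresholds as written; quality_score_sketch and grade_for are hypothetical names, and how the library derives completeness, garbage_ratio, and char_count from a text is not covered here.

```python
def quality_score_sketch(completeness: float, garbage_ratio: float, char_count: int) -> float:
    """completeness * (0.4 * noise_score + 0.4 * length_score + 0.2), as stated above."""
    noise_score = max(0.0, 1.0 - garbage_ratio * 10)
    length_score = min(char_count / 500, 1.0)
    return completeness * (0.4 * noise_score + 0.4 * length_score + 0.2)


def grade_for(score: float) -> str:
    """Map a score to A/B/C/D using the thresholds in the table."""
    if score >= 0.85:
        return "A"
    if score >= 0.65:
        return "B"
    if score >= 0.40:
        return "C"
    return "D"


# 250 characters, no garbage, fully complete:
# noise_score = 1.0, length_score = 0.5, so the score is about 0.4 + 0.2 + 0.2 = 0.8
score = quality_score_sketch(completeness=1.0, garbage_ratio=0.0, char_count=250)
print(round(score, 2), grade_for(score))  # 0.8 B
```

Note that garbage_ratio is multiplied by 10, so a garbage ratio of 0.10 or more already drives noise_score to zero.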

Limitations

  • No automatic language detection: flexorch-audit has zero dependencies, so no language detection library is bundled. Pass locale explicitly, or use the default "und" to activate all detectors. See LIMITATIONS.md.
  • Free-standing name detection (without a label prefix) requires NLP/NER — not included.
  • replace masking uses static synthetic values; locale-aware realistic synthesis is not implemented.
  • The library audits plain text. PDF/DOCX parsing, e-invoice extraction, and pipeline orchestration are out of scope.

Integrations

flexorch-audit slots into any LangChain or LlamaIndex pipeline as a pre-load filter — audit quality, detect PII, and optionally mask before your documents reach the LLM.

LangChain: examples/langchain_loader.py

from examples.langchain_loader import AuditedLoader  # copy to your project

loader = AuditedLoader(
    texts=my_texts,
    locale="tr",       # or "de", "fr", "us", "und" (all)
    mask_pii=True,     # redact PII before loading
    min_grade="B",     # skip low-quality documents
)
docs = loader.load()
# doc.metadata → {"quality_grade": "A", "quality_score": 0.91, "pii_summary": [...], ...}

LlamaIndex: examples/llamaindex_reader.py

from examples.llamaindex_reader import AuditedReader  # copy to your project

reader = AuditedReader(locale="tr", mask_pii=True)
docs = reader.load_data(my_texts, min_grade="B")
# doc.extra_info → {"quality_grade": "A", "quality_score": 0.91, "pii_summary": [...], ...}

Both loaders are thin wrappers (~60 lines) with no new dependencies beyond langchain-core or llama-index-core. Copy them into your project — no framework lock-in.

Also available for JavaScript / TypeScript

npm install @flexorch/audit

Contributing

See CONTRIBUTING.md.

License

MIT
