trollfab-data-cleaner

Production-quality data cleaning, validation, similarity, anonymization, and Swedish-specific utilities — built for Trollfabriken AITrix AB document processing pipelines. Pure Python, no required dependencies.


Modules

| Module | Purpose |
| --- | --- |
| TextCleaner | HTML strip, Unicode normalize, whitespace, Swedish mojibake |
| ContentCleaner | OCR/web content cleaning with 88+ Swedish legal patterns |
| HTMLConverter | Parse, sanitize, extract from HTML |
| MarkdownConverter | HTML→Markdown with LLM-optimized output |
| TextSimilarity | Levenshtein, n-gram, cosine, deduplication |
| DuplicateDetector | Weighted similarity, Union-Find clustering |
| Schema + validators | 30+ composable validators (Email, URL, Range, UUID…) |
| Anonymizer | GDPR-compliant, role-aware, 6 strategies |
| repair_json | Fix truncated/malformed LLM JSON output |
| swedish.* | Personnummer, org numbers, phone, banking normalization |

Installation

# Core (no external deps)
pip install trollfab-data-cleaner

# With fuzzy matching acceleration
pip install "trollfab-data-cleaner[similarity]"

# With HTML parser (better HTML→Markdown conversion)
pip install "trollfab-data-cleaner[html]"

# Everything
pip install "trollfab-data-cleaner[all]"

Quick start

Text cleaning

from data_cleaner import TextCleaner, CleanerConfig

cleaner = TextCleaner()
result = cleaner.clean("<p>Göteborg &amp; Stockholm  </p>")
print(result.text)  # "Göteborg & Stockholm"

# Custom config
cfg = CleanerConfig(strip_html=True, fix_swedish_mojibake=True, normalize_unicode=True)
cleaner = TextCleaner(cfg)
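
To make the cleaning steps concrete, here is a minimal stdlib-only sketch of the same kind of pipeline (tag stripping, entity decoding, mojibake repair, Unicode normalization, whitespace collapsing). It is not the package's implementation: `clean_text` and `fix_mojibake` are hypothetical names, and the mojibake repair assumes UTF-8 text that was wrongly decoded as Windows-1252.

```python
import html
import re
import unicodedata

_TAG_RE = re.compile(r"<[^>]+>")
_WS_RE = re.compile(r"\s+")


def fix_mojibake(text: str) -> str:
    """Undo UTF-8 text that was mis-decoded as cp1252 (heuristic, assumption)."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except UnicodeError:
        # Not representable in cp1252, or not valid UTF-8 afterwards:
        # treat the text as already correct.
        return text


def clean_text(raw: str) -> str:
    """Strip tags, decode entities, repair mojibake, normalize, collapse spaces."""
    no_tags = _TAG_RE.sub(" ", raw)          # naive tag removal, not a full parser
    decoded = html.unescape(no_tags)         # &amp; -> &
    repaired = fix_mojibake(decoded)
    normalized = unicodedata.normalize("NFC", repaired)
    return _WS_RE.sub(" ", normalized).strip()


print(clean_text("<p>Göteborg &amp; Stockholm  </p>"))  # Göteborg & Stockholm
print(fix_mojibake("GÃ¶teborg"))                         # Göteborg
```

The regex-based tag removal is only adequate for simple fragments; real HTML with comments, scripts, or attributes containing `>` needs a proper parser.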

Validation framework

from data_cleaner.validation import Schema, Required, Email, MinLength, Range

schema = Schema({
    "name":  Required() | MinLength(2),
    "email": Required() | Email(),
    "age":   Required() | Range(0, 150),
})
result = schema.validate({"name": "Anna", "email": "anna@example.com", "age": 30})
print(result.valid, result.errors)
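
The `|` composition used above can be built on Python's `__or__` operator. The following is a toy re-implementation that illustrates the idea; the class internals, the error messages, and the `validate` helper are assumptions and are not taken from the package.

```python
from typing import Any, Dict, Optional


class Validator:
    """Base class: check() returns an error message, or None if the value passes."""

    def check(self, value: Any) -> Optional[str]:
        raise NotImplementedError

    def __or__(self, other: "Validator") -> "Chain":
        return Chain(self, other)


class Chain(Validator):
    """Runs validators in order and stops at the first failure."""

    def __init__(self, *validators: Validator) -> None:
        self.validators = validators

    def check(self, value: Any) -> Optional[str]:
        for validator in self.validators:
            error = validator.check(value)
            if error is not None:
                return error
        return None

    def __or__(self, other: Validator) -> "Chain":
        return Chain(*self.validators, other)  # keep the chain flat


class Required(Validator):
    def check(self, value: Any) -> Optional[str]:
        return "is required" if value is None or value == "" else None


class MinLength(Validator):
    def __init__(self, n: int) -> None:
        self.n = n

    def check(self, value: Any) -> Optional[str]:
        return None if len(value) >= self.n else f"must have at least {self.n} characters"


class Range(Validator):
    def __init__(self, lo: float, hi: float) -> None:
        self.lo, self.hi = lo, hi

    def check(self, value: Any) -> Optional[str]:
        return None if self.lo <= value <= self.hi else f"must be between {self.lo} and {self.hi}"


def validate(schema: Dict[str, Validator], data: Dict[str, Any]) -> Dict[str, str]:
    """Return a field -> error message mapping; empty means valid."""
    errors = {}
    for name, validator in schema.items():
        error = validator.check(data.get(name))
        if error is not None:
            errors[name] = error
    return errors


schema = {"name": Required() | MinLength(2), "age": Required() | Range(0, 150)}
print(validate(schema, {"name": "Anna", "age": 30}))  # {}
print(validate(schema, {"name": "A"}))
```

Putting `Required()` first means later validators in the chain never see a missing value, which keeps each validator simple.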

Text similarity & deduplication

from data_cleaner import TextSimilarity, DuplicateDetector

sim = TextSimilarity()
print(sim.levenshtein_ratio("Göteborg", "Goteborg"))   # ~0.88
print(sim.cosine_tfidf(["doc one", "doc two"], "doc one"))

# Dedup a list of strings
texts = ["Göteborg", "Goteborg", "Stockholm"]  # sample input
deduped = sim.deduplicate(texts, threshold=0.85)

# Cluster similar documents (`documents` is your own collection)
detector = DuplicateDetector()
groups = detector.find_duplicates(documents)
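
For readers who want to see what a Levenshtein ratio and Union-Find clustering look like in practice, here is a small stdlib sketch. It assumes the ratio is defined as `1 - distance / max(len)` and uses a plain pairwise comparison; the package may define the ratio, the weighting, and the clustering differently, and all function names here are hypothetical.

```python
from typing import Dict, List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical (assumed definition)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def cluster(texts: List[str], threshold: float = 0.85) -> List[List[str]]:
    """Group texts whose pairwise ratio reaches the threshold, via Union-Find."""
    parent = list(range(len(texts)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if levenshtein_ratio(texts[i], texts[j]) >= threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    groups: Dict[int, List[str]] = {}
    for i, text in enumerate(texts):
        groups.setdefault(find(i), []).append(text)
    return list(groups.values())


print(levenshtein_ratio("Göteborg", "Goteborg"))  # 0.875
print(cluster(["Göteborg", "Goteborg", "Stockholm", "Stockholmm"]))
```

Union-Find makes the grouping transitive: if A matches B and B matches C, all three land in one cluster even when A and C fall below the threshold. The pairwise loop is quadratic, so this sketch only suits small inputs.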

HTML to Markdown (LLM-optimized)

from data_cleaner import MarkdownConverter

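html_string = "<h1>Report</h1><p>See the <a href='https://example.com'>source</a>.</p>"  # sample input
conv = MarkdownConverter()
result = conv.convert(html_string)
print(result.markdown, result.compression_ratio)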
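
As a rough illustration of how HTML can be reduced to Markdown with only the standard library, the sketch below handles headings, paragraphs, bold/italic, and links, and drops `script`/`style` content. It is a toy converter under those assumptions, not the package's `MarkdownConverter`; the names `MiniMarkdown` and `to_markdown` are hypothetical.

```python
import re
from html.parser import HTMLParser


class MiniMarkdown(HTMLParser):
    """Toy HTML-to-Markdown converter for a handful of tags."""

    INLINE = {"b": "**", "strong": "**", "i": "*", "em": "*"}
    HEADINGS = {f"h{n}": "#" * n for n in range(1, 7)}
    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.out = []          # output fragments, joined at the end
        self.hrefs = []        # stack of open link targets
        self.skip_depth = 0    # > 0 while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag in self.INLINE:
            self.out.append(self.INLINE[tag])
        elif tag in self.HEADINGS:
            self.out.append("\n\n" + self.HEADINGS[tag] + " ")
        elif tag == "p":
            self.out.append("\n\n")
        elif tag == "a":
            self.hrefs.append(dict(attrs).get("href") or "")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.INLINE:
            self.out.append(self.INLINE[tag])
        elif tag == "a" and self.hrefs:
            self.out.append(f"]({self.hrefs.pop()})")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(re.sub(r"\s+", " ", data))


def to_markdown(html_text: str) -> str:
    parser = MiniMarkdown()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    text = "".join(parser.out)
    text = re.sub(r"[ \t]*\n[ \t]*", "\n", text)    # trim spaces around newlines
    return re.sub(r"\n{3,}", "\n\n", text).strip()  # at most one blank line


print(to_markdown("<h1>Title</h1><p>Hello <b>world</b></p><script>x()</script>"))
```

Dropping markup, scripts, and styles is where most of the size reduction comes from; lists, tables, and code blocks would need additional handlers.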

JSON repair

from data_cleaner import repair_json, safe_parse

# Fix truncated LLM output
fixed = repair_json('{"name": "Anna", "items": [1, 2')

# Parse a raw LLM response; safe_parse returns None on failure
llm_response_text = '{"status": "ok"}'  # sample input
data = safe_parse(llm_response_text)
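
One common way to repair truncated JSON is to track which brackets and strings are still open and append the missing closers. The sketch below shows only that single strategy; it is not the package's five-strategy fixer, and `close_truncated_json` and `try_parse` are hypothetical names.

```python
import json
from typing import Any, Optional


def close_truncated_json(text: str) -> str:
    """Append the closers for any string, array, or object left open."""
    closers = []        # stack of closing characters still owed
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            closers.append("}" if ch == "{" else "]")
        elif ch in "}]" and closers:
            closers.pop()
    if in_string:
        repaired = text + '"'
    else:
        repaired = text.rstrip().rstrip(",")  # drop a dangling trailing comma
    return repaired + "".join(reversed(closers))


def try_parse(text: str) -> Optional[Any]:
    """Parse as-is, then after repair; return None if both attempts fail."""
    for candidate in (text, close_truncated_json(text)):
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None


print(close_truncated_json('{"name": "Anna", "items": [1, 2'))
# {"name": "Anna", "items": [1, 2]}
print(try_parse("not json"))  # None
```

This approach cannot recover input cut off after a key or colon (for example `{"a":`), which is one reason a real fixer needs several strategies.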

GDPR anonymization

from data_cleaner import Anonymizer

text = "Contact Anna Svensson at anna@example.com."  # sample input
anon = Anonymizer()
result = anon.anonymize(text)
print(result.anonymized_text, result.pii_found)
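
Pattern-based redaction is one simple way to approach this kind of anonymization. The sketch below replaces e-mail addresses and personnummer-shaped numbers with placeholders and reports what it found. The regular expressions, the placeholder format, and the `redact` function are assumptions for illustration; the package's strategies and role-aware behavior are not shown here.

```python
import re
from typing import List, Tuple

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# 10 or 12 digits with an optional "-" or "+" before the last four.
PNR_RE = re.compile(r"\b(?:\d{2})?\d{6}[-+]?\d{4}\b")


def redact(text: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Replace matches with [KIND] placeholders and list what was found."""
    found: List[Tuple[str, str]] = []

    def replacer(kind: str):
        def _sub(match: "re.Match[str]") -> str:
            found.append((kind, match.group(0)))
            return f"[{kind}]"
        return _sub

    text = EMAIL_RE.sub(replacer("EMAIL"), text)
    text = PNR_RE.sub(replacer("PERSONNUMMER"), text)
    return text, found


redacted, pii = redact("Contact anna@example.com, id 19811218-9876.")
print(redacted)  # Contact [EMAIL], id [PERSONNUMMER].
print(pii)
```

Regex matching alone will miss names and free-text identifiers, and it can flag number sequences that only look like a personnummer, so a checksum test is a sensible second filter.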

Swedish utilities

from data_cleaner.swedish import (
    validate_personnummer,
    mask_personnummer,
    extract_personnummer,
    validate_org_number,
    classify_org_number,
    validate_phone_se,
    parse_swedish_amount,
    normalize_vendor,
)

# Personnummer
r = validate_personnummer("19850312-4564")
print(r.valid, r.birth_date, r.age, r.gender)
print(mask_personnummer("19850312-4564"))  # "19850312-XXXX"

# Org number
print(validate_org_number("556703-7687"))   # True
info = classify_org_number("556703-7687")
print(info.type_name)  # "Aktiebolag"

# Swedish phone
result = validate_phone_se("+46 31 123 456")
print(result.valid, result.normalized)

# Swedish amounts
print(parse_swedish_amount("1 234 567,89"))  # 1234567.89

# Vendor names
print(normalize_vendor("PAYPAL *ADOBE"))     # "ADOBE"
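
Swedish personnummer and organisation numbers carry a Luhn (modulus 10) check digit, and Swedish amounts use spaces as thousands separators with a decimal comma. The sketch below shows both ideas with the standard library only. The function names are hypothetical, the code checks the checksum but not the date, and it is not the package's implementation.

```python
import re
from typing import Optional


def luhn_ok(digits: str) -> bool:
    """Luhn check for a 10-digit string: weights 2,1,2,1,... from the left."""
    total = 0
    for index, char in enumerate(digits):
        value = int(char) * (2 if index % 2 == 0 else 1)
        total += value - 9 if value > 9 else value  # sum the digits of the product
    return total % 10 == 0


def checksum_ok(number: str) -> bool:
    """Checksum test for a personnummer or org number (format check only)."""
    digits = re.sub(r"\D", "", number)
    if len(digits) == 12:      # drop the century from YYYYMMDDNNNC
        digits = digits[2:]
    return len(digits) == 10 and luhn_ok(digits)


def parse_amount(text: str) -> Optional[float]:
    """Parse '1 234 567,89' style amounts; return None if not numeric."""
    compact = re.sub(r"\s", "", text).replace(",", ".")  # \s also covers no-break spaces
    try:
        return float(compact)
    except ValueError:
        return None


print(checksum_ok("19811218-9876"))  # True
print(checksum_ok("19811218-9877"))  # False
print(parse_amount("1 234 567,89"))  # 1234567.89
```

A full validator would also verify that the first six digits form a real date (or a coordination number) before trusting the derived age and gender.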

Package structure

data_cleaner/
├── __init__.py             ← Public API
├── py.typed                ← PEP 561
├── text_cleaner.py         ← TextCleaner (HTML, unicode, whitespace, mojibake)
├── content_cleaner.py      ← OCR/web content cleaner (88+ Swedish patterns)
├── html_converter.py       ← HTMLConverter (sanitize, extract, XSS prevention)
├── html_to_markdown.py     ← MarkdownConverter (LLM-optimized, ~80% token reduction)
├── text_similarity.py      ← TextSimilarity (levenshtein, n-gram, cosine, dedup)
├── duplicate_detector.py   ← DuplicateDetector (weighted, Union-Find clustering)
├── validation.py           ← Schema + 30+ validators (Email, URL, Range, UUID…)
├── validators.py           ← Functional validators (email, url, json, credit card)
├── anonymizer.py           ← Anonymizer (GDPR, 6 strategies, role-aware)
├── json_repair.py          ← repair_json / safe_parse (5-strategy LLM JSON fixer)
└── swedish/
    ├── __init__.py
    ├── personnummer.py      ← validate/mask/extract/format (Luhn, age, gender)
    ├── org_number.py        ← validate/classify org numbers + public record URLs
    ├── validators.py        ← validate_phone_se, validate_personnummer, sanitize_text
    ├── anonymizer.py        ← Swedish PII anonymizer (role-aware, 6 levels)
    └── banking.py           ← safe_str/float/int, parse_swedish_amount, normalize_vendor

© 2025 Trollfabriken AITrix AB — MIT License
