Production-quality data cleaning, validation, similarity, anonymization, and Swedish-specific utilities — pure stdlib core
trollfab-data-cleaner
Production-quality data cleaning, validation, similarity, anonymization, and Swedish-specific utilities — built for Trollfabriken AITrix AB document processing pipelines. Pure Python, no required dependencies.
Modules
| Module | Purpose |
|---|---|
| TextCleaner | HTML strip, Unicode normalize, whitespace, Swedish mojibake |
| ContentCleaner | OCR/web content cleaning with 88+ Swedish legal patterns |
| HTMLConverter | Parse, sanitize, extract from HTML |
| MarkdownConverter | HTML→Markdown with LLM-optimized output |
| TextSimilarity | Levenshtein, n-gram, cosine, deduplication |
| DuplicateDetector | Weighted similarity, Union-Find clustering |
| Schema + validators | 30+ composable validators (Email, URL, Range, UUID…) |
| Anonymizer | GDPR-compliant, role-aware, 6 strategies |
| repair_json | Fix truncated/malformed LLM JSON output |
| swedish.* | Personnummer, org numbers, phone, banking normalization |
Installation
# Core (no external deps)
pip install trollfab-data-cleaner
# With fuzzy matching acceleration
pip install "trollfab-data-cleaner[similarity]"
# With HTML parser (better HTML→Markdown conversion)
pip install "trollfab-data-cleaner[html]"
# Everything
pip install "trollfab-data-cleaner[all]"
Quick start
Text cleaning
from data_cleaner import TextCleaner, CleanerConfig
cleaner = TextCleaner()
result = cleaner.clean("<p>Göteborg & Stockholm </p>")
print(result.text) # "Göteborg & Stockholm"
# Custom config
cfg = CleanerConfig(strip_html=True, fix_swedish_mojibake=True, normalize_unicode=True)
cleaner = TextCleaner(cfg)
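To make the cleaning steps concrete, here is a minimal stdlib-only sketch of the kind of pipeline the options above suggest (tag stripping, mojibake repair, Unicode normalization, whitespace collapsing). It is not the package's implementation: `fix_mojibake` and `clean_text` are hypothetical names, and the mojibake step assumes the common case of UTF-8 bytes that were wrongly decoded as cp1252.

```python
import html
import re
import unicodedata


def fix_mojibake(text: str) -> str:
    """Reverse UTF-8 bytes that were wrongly decoded as cp1252.

    Returns the input unchanged when the round trip fails, which is
    what happens for text that was decoded correctly in the first place.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def clean_text(text: str) -> str:
    """Hypothetical pipeline: strip tags, repair, normalize, collapse spaces."""
    text = re.sub(r"<[^>]+>", " ", text)        # crude tag removal
    text = fix_mojibake(text)                   # before unescaping entities
    text = html.unescape(text)                  # &amp; -> &, &nbsp; -> U+00A0
    text = unicodedata.normalize("NFC", text)   # compose combining characters
    return re.sub(r"\s+", " ", text).strip()    # \s also matches U+00A0


# "G\u00c3\u00b6teborg" is "Göteborg" after a UTF-8 -> cp1252 mix-up
print(clean_text("<p>G\u00c3\u00b6teborg &amp; Stockholm&nbsp;</p>"))
```

The whole-string round trip is deliberately coarse: a string that mixes correct and broken characters is left untouched, so a real cleaner would likely repair known sequences individually.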
Validation framework
from data_cleaner.validation import Schema, Required, Email, MinLength, Range
schema = Schema({
"name": Required() | MinLength(2),
"email": Required() | Email(),
"age": Required() | Range(0, 150),
})
result = schema.validate({"name": "Anna", "email": "anna@example.com", "age": 30})
print(result.valid, result.errors)
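The `|` composition used above can be illustrated with a small standalone sketch. This is an assumption about how such chaining might work, not the library's code: `Rule`, `Chain`, `NotMissing`, `LengthAtLeast`, `Between`, and `validate` are hypothetical names chosen to avoid confusion with the real validators.

```python
from typing import Any, Dict, List, Optional


class Rule:
    """Base class: check() returns an error message, or None if the value passes."""

    def check(self, value: Any) -> Optional[str]:
        raise NotImplementedError

    def __or__(self, other: "Rule") -> "Chain":
        return Chain([self, other])


class Chain(Rule):
    """Runs rules in order and stops at the first failure."""

    def __init__(self, rules: List[Rule]) -> None:
        self.rules = rules

    def __or__(self, other: Rule) -> "Chain":
        return Chain(self.rules + [other])

    def check(self, value: Any) -> Optional[str]:
        for rule in self.rules:
            error = rule.check(value)
            if error is not None:
                return error
        return None


class NotMissing(Rule):
    def check(self, value: Any) -> Optional[str]:
        return "value is required" if value in (None, "") else None


class LengthAtLeast(Rule):
    def __init__(self, n: int) -> None:
        self.n = n

    def check(self, value: Any) -> Optional[str]:
        ok = len(str(value)) >= self.n
        return None if ok else f"must be at least {self.n} characters"


class Between(Rule):
    def __init__(self, low: float, high: float) -> None:
        self.low, self.high = low, high

    def check(self, value: Any) -> Optional[str]:
        ok = self.low <= value <= self.high
        return None if ok else f"must be between {self.low} and {self.high}"


def validate(schema: Dict[str, Rule], data: Dict[str, Any]) -> Dict[str, str]:
    """Return a field -> error mapping; an empty dict means the data is valid."""
    errors = {}
    for field, rule in schema.items():
        error = rule.check(data.get(field))
        if error is not None:
            errors[field] = error
    return errors


schema = {"name": NotMissing() | LengthAtLeast(2), "age": NotMissing() | Between(0, 150)}
print(validate(schema, {"name": "A", "age": 200}))
```

Stopping at the first failing rule keeps later rules from running on missing values, which is why the required check goes first in each chain.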
Text similarity & deduplication
from data_cleaner import TextSimilarity, DuplicateDetector
sim = TextSimilarity()
print(sim.levenshtein_ratio("Göteborg", "Goteborg")) # ~0.88
print(sim.cosine_tfidf(["doc one", "doc two"], "doc one"))
# Dedup a list of strings (sample input)
texts = ["Göteborg", "Goteborg", "Stockholm"]
deduped = sim.deduplicate(texts, threshold=0.85)
# Cluster similar documents
detector = DuplicateDetector()
groups = detector.find_duplicates(documents)
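For readers who want to see the two ideas named above in isolation, the following sketch computes a Levenshtein ratio and groups near-duplicates with a Union-Find structure. It is an illustration under stated assumptions, not the package's algorithm: the ratio is defined here as `1 - distance / max(len)`, the comparison is a plain O(n²) pairwise loop, and `levenshtein`, `ratio`, and `cluster` are hypothetical names.

```python
from typing import Dict, List


def levenshtein(a: str, b: str) -> int:
    """Edit distance using a rolling row of the dynamic-programming table."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def ratio(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def cluster(texts: List[str], threshold: float = 0.85) -> List[List[str]]:
    """Group texts whose pairwise ratio meets the threshold (Union-Find)."""
    parent = list(range(len(texts)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if ratio(texts[i], texts[j]) >= threshold:
                parent[find(j)] = find(i)

    groups: Dict[int, List[str]] = {}
    for i, text in enumerate(texts):
        groups.setdefault(find(i), []).append(text)
    return list(groups.values())


print(ratio("Göteborg", "Goteborg"))                    # 0.875
print(cluster(["Göteborg", "Goteborg", "Stockholm"]))
```

Union-Find makes the grouping transitive: if A matches B and B matches C, all three end up in one cluster even when A and C fall below the threshold.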
HTML to Markdown (LLM-optimized)
from data_cleaner import MarkdownConverter
conv = MarkdownConverter()
html_string = "<h1>Title</h1><p>Some <b>bold</b> text.</p>"  # sample input
result = conv.convert(html_string)
print(result.markdown, result.compression_ratio)
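As a rough illustration of what an HTML-to-Markdown pass involves, here is a small converter built on the standard library's `html.parser`. It is a sketch, not the package's converter: `MiniMarkdown` and `html_to_markdown` are hypothetical names, only a handful of tags are handled, and the ratio is assumed to mean `1 - len(markdown) / len(html)`.

```python
import re
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class MiniMarkdown(HTMLParser):
    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
    SKIP = {"script", "style"}  # content that carries no readable text

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: List[str] = []
        self.skip_depth = 0
        self.href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.HEADINGS:
            self.parts.append("\n\n" + self.HEADINGS[tag])
        elif tag == "p":
            self.parts.append("\n\n")
        elif tag in ("ul", "ol"):
            self.parts.append("\n")
        elif tag == "li":
            self.parts.append("\n- ")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "a":
            self.parts.append(f"]({self.href or ''})")
            self.href = None

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(re.sub(r"\s+", " ", data))


def html_to_markdown(source: str) -> Tuple[str, float]:
    """Return (markdown, compression ratio) for a small subset of HTML."""
    parser = MiniMarkdown()
    parser.feed(source)
    parser.close()
    lines = [line.strip() for line in "".join(parser.parts).splitlines()]
    markdown = re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()
    ratio = 1 - len(markdown) / len(source) if source else 0.0
    return markdown, ratio


sample = ("<h1>Title</h1><p>See <a href='https://example.com'>docs</a> now.</p>"
          "<script>x()</script><ul><li>one</li><li>two</li></ul>")
markdown, ratio = html_to_markdown(sample)
print(markdown)
```

Dropping `script` and `style` content and collapsing whitespace is where most of the size reduction comes from in this sketch; tables, emphasis, and nested lists are left out for brevity.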
JSON repair
from data_cleaner import repair_json, safe_parse
# Fix truncated LLM output
fixed = repair_json('{"name": "Anna", "items": [1, 2')
llm_response_text = '{"status": "ok", "items": [1, 2'  # sample truncated response
data = safe_parse(llm_response_text) # returns None on failure
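One common way to repair truncated JSON is to track which strings and brackets are still open and close them in reverse order. The sketch below shows only that single strategy and makes no claim about how `repair_json` works internally; `close_truncated_json` and `parse_or_none` are hypothetical names.

```python
import json
from typing import Any, List, Optional


def close_truncated_json(text: str) -> str:
    """Append the quote and brackets needed to close a cut-off JSON document."""
    stack: List[str] = []   # closing brackets still owed, innermost last
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]" and stack:
            stack.pop()
    repaired = text + ('"' if in_string else "")
    repaired = repaired.rstrip().rstrip(",")  # drop a dangling trailing comma
    return repaired + "".join(reversed(stack))


def parse_or_none(text: str) -> Optional[Any]:
    """Parse as-is, then retry once after closing open structures."""
    for candidate in (text, close_truncated_json(text)):
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None


print(close_truncated_json('{"name": "Anna", "items": [1, 2'))
print(parse_or_none('{"name": "Anna", "items": [1, 2'))
```

This approach cannot recover input that stops right after a key and colon, such as `{"a":`, because there is no value to close; in that case the sketch returns `None`.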
GDPR anonymization
from data_cleaner import Anonymizer
anon = Anonymizer()
text = "Contact Anna at anna@example.com or 19850312-4564."  # sample input
result = anon.anonymize(text)
print(result.anonymized_text, result.pii_found)
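To show the general shape of pattern-based anonymization, here is a small regex sketch with two replacement strategies (a fixed placeholder and a short hash). It is an illustration only: the patterns, the strategy names `redact` and `hash`, and the `anonymize` function are assumptions, and regex matching alone does not make a pipeline GDPR-compliant.

```python
import hashlib
import re
from typing import List, Tuple

# Deliberately simple patterns; the personnummer one can also match other
# 10- or 12-digit numbers, so a real detector would validate the candidates.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "personnummer": re.compile(r"\b(?:\d{2})?\d{6}[-+]?\d{4}\b"),
}


def anonymize(text: str, strategy: str = "redact") -> Tuple[str, List[Tuple[str, str]]]:
    """Replace matches and return (anonymized text, list of (kind, original))."""
    found: List[Tuple[str, str]] = []

    def replacer(kind: str):
        def _sub(match: "re.Match[str]") -> str:
            value = match.group(0)
            found.append((kind, value))
            if strategy == "hash":
                # Same input gives the same token, so records stay linkable.
                digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:8]
                return f"[{kind}:{digest}]"
            return f"[{kind.upper()}]"
        return _sub

    for kind, pattern in PATTERNS.items():
        text = pattern.sub(replacer(kind), text)
    return text, found


masked, hits = anonymize("Contact anna@example.com, id 19850312-4564.")
print(masked)   # Contact [EMAIL], id [PERSONNUMMER].
print(hits)
```

An unsalted, truncated hash like the one above is pseudonymization at best; whether that is sufficient depends on the data and the legal context.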
Swedish utilities
from data_cleaner.swedish import (
validate_personnummer,
mask_personnummer,
extract_personnummer,
validate_org_number,
classify_org_number,
validate_phone_se,
parse_swedish_amount,
normalize_vendor,
)
# Personnummer
r = validate_personnummer("19850312-4564")
print(r.valid, r.birth_date, r.age, r.gender)
print(mask_personnummer("19850312-4564")) # "19850312-XXXX"
# Org number
print(validate_org_number("556703-7687")) # True
info = classify_org_number("556703-7687")
print(info.type_name) # "Aktiebolag"
# Swedish phone
result = validate_phone_se("+46 31 123 456")
print(result.valid, result.normalized)
# Swedish amounts
print(parse_swedish_amount("1 234 567,89")) # 1234567.89
print(normalize_vendor("PAYPAL *ADOBE")) # "ADOBE"
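The checks behind these helpers can be sketched with the standard library. The code below is an independent illustration, not the package's implementation: it applies the Luhn checksum to the last ten digits of a personnummer, masks the final four digits, and parses a Swedish-formatted amount. The names `luhn_valid`, `check_personnummer`, `mask_last_four`, and `parse_amount` are hypothetical, the date part is not validated, and `Decimal` is used here instead of `float` as a design choice for money.

```python
import re
from decimal import Decimal


def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit counting from the right."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def check_personnummer(value: str) -> bool:
    """Accept YYMMDD-XXXX or YYYYMMDD-XXXX; the century is not part of the checksum."""
    match = re.fullmatch(r"(\d{2})?(\d{6})[-+]?(\d{4})", value.strip())
    if match is None:
        return False
    return luhn_valid(match.group(2) + match.group(3))


def mask_last_four(value: str) -> str:
    """Hide the four trailing digits, keeping the birth date readable."""
    return re.sub(r"\d{4}$", "XXXX", value.strip())


def parse_amount(text: str) -> Decimal:
    """Parse '1 234 567,89' style amounts (space thousands, comma decimals)."""
    compact = re.sub(r"\s", "", text).replace(",", ".")
    return Decimal(compact)


print(check_personnummer("811218-9876"))    # True (a commonly cited sample number)
print(mask_last_four("19850312-4564"))      # 19850312-XXXX
print(parse_amount("1 234 567,89"))         # 1234567.89
```

The same Luhn routine applies to ten-digit organisation numbers, which is why a single checksum helper is usually shared between the two validators.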
Package structure
data_cleaner/
├── __init__.py ← Public API
├── py.typed ← PEP 561
├── text_cleaner.py ← TextCleaner (HTML, unicode, whitespace, mojibake)
├── content_cleaner.py ← OCR/web content cleaner (88+ Swedish patterns)
├── html_converter.py ← HTMLConverter (sanitize, extract, XSS prevention)
├── html_to_markdown.py ← MarkdownConverter (LLM-optimized, ~80% token reduction)
├── text_similarity.py ← TextSimilarity (levenshtein, n-gram, cosine, dedup)
├── duplicate_detector.py ← DuplicateDetector (weighted, Union-Find clustering)
├── validation.py ← Schema + 30+ validators (Email, URL, Range, UUID…)
├── validators.py ← Functional validators (email, url, json, credit card)
├── anonymizer.py ← Anonymizer (GDPR, 6 strategies, role-aware)
├── json_repair.py ← repair_json / safe_parse (5-strategy LLM JSON fixer)
└── swedish/
    ├── __init__.py
    ├── personnummer.py ← validate/mask/extract/format (Luhn, age, gender)
    ├── org_number.py ← validate/classify org numbers + public record URLs
    ├── validators.py ← validate_phone_se, validate_personnummer, sanitize_text
    ├── anonymizer.py ← Swedish PII anonymizer (role-aware, 6 levels)
    └── banking.py ← safe_str/float/int, parse_swedish_amount, normalize_vendor
© 2025 Trollfabriken AITrix AB — MIT License