
Document normalization engine: learn a template from examples and convert any document automatically via LLM.


template-engine

Audit-grade document normalization engine. Regex-first, LLM-as-judge, zero LibreOffice. Built for regulated environments where document content cannot leak.

Python 3.11+ · License: Apache 2.0 · Code style: ruff · Typed

Docs: https://luizhcrs.github.io/template-engine/
Threat model + provider data residency: SECURITY-MODEL.md
README: English (this file) · Português

Why this exists

Three problems off-the-shelf solutions don't solve together:

  • Cost — paying the LLM per doc when 95% of fields are extractable mechanically. Answer: a regex-first hybrid mapper; only the fields regex couldn't fill go to the LLM, in a single batched call.
  • Compliance — regulators want auditability plus a guarantee that LGPD/HIPAA data never reached an external API. Answer: local_only=True raises before any remote call; PII masking, an append-only audit log, and a deterministic regex path replayable bit-for-bit.
  • Verification — "did the candidate doc match the standard?" Text alone isn't enough: structure, layout, and required formats matter too. Answer: the multi-dimensional check_conformity — text + structural + visual + design + technical, each scored independently, with a weighted overall verdict.

How it works

Two operations. One pipeline. Five dimensions of conformity.

                  template (.docx)              source docs (N x .docx/.pdf)
                        │                                  │
                        ▼                                  ▼
        ┌──────────────────────────┐         ┌──────────────────────────┐
        │ schema_inference         │         │ extractor                │
        │  detects placeholders    │         │  text + tables           │
        │  ({{X}}, [X], ___, ...)  │         └──────────────────────────┘
        └──────────────────────────┘                       │
                        │                                  ▼
                        ▼                  ┌──────────────────────────┐
        ┌──────────────────────────┐       │ pattern_inference        │
        │ FieldSchema list         │──┐    │  10 predefined shapes    │
        │  {name, type, required}  │  │    │  + grex (optional)       │
        └──────────────────────────┘  │    └──────────────────────────┘
                                      │                    │
                                      ▼                    ▼
                          ┌─────────────────────────────────────┐
                          │ hybrid_mapper                        │
                          │  Tier 1: regex per field (free)      │
                          │  Tier 2: LLM batched on missing only │
                          │  Output: source ∈ {regex, llm, miss} │
                          └─────────────────────────────────────┘
                                          │
                                          ▼
                          ┌─────────────────────────────────────┐
                          │ batch._apply_mapping_to_template     │
                          │  token substitution in docx copy     │
                          └─────────────────────────────────────┘
                                          │
                                          ▼
                          ┌─────────────────────────────────────┐
                          │ semantic_diff (LLM as judge)         │
                          │  flags missing_in_output / mismatch  │
                          │  / extra_in_output discrepancies     │
                          └─────────────────────────────────────┘
                                          │
                                          ▼
                            BatchReport: high / medium / low / error
                            per-doc mapping summary + discrepancies
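
The two-tier flow above can be sketched in a few lines of plain Python. This is an illustrative re-implementation, not the library's API: the field patterns, the `hybrid_map` helper, and the `llm_batch` callable are all assumptions for the sketch.

```python
import re

def hybrid_map(fields, text, llm_batch):
    """Tier 1: regex per field (free). Tier 2: one batched call for the rest."""
    mapping, missing = {}, []
    for name, pattern in fields.items():
        m = re.search(pattern, text)
        if m:
            mapping[name] = {"value": m.group(0), "source": "regex"}
        else:
            missing.append(name)
    if missing:  # a single batched LLM call, only for fields regex could not fill
        for name, value in llm_batch(missing, text).items():
            mapping[name] = {"value": value, "source": "llm" if value else "miss"}
    return mapping

fields = {
    "CODIGO": r"[A-Z]{3}-\d{3}",
    "DATA": r"\d{4}-\d{2}-\d{2}",
    "RESPONSAVEL": r"Responsável:\s*(\S+)",  # will not match -> falls back to the LLM
}
text = "Doc ABC-042 emitido em 2026-01-15 por Joao Silva"
fake_llm = lambda names, t: {n: "Joao Silva" for n in names}  # stands in for the batched call
result = hybrid_map(fields, text, fake_llm)
```

Two of the three fields resolve on the free regex tier; only RESPONSAVEL reaches the (single) LLM call, which is the cost model the table below assumes.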

For verification, the same primitives feed check_conformity:

                             check_conformity(template, candidate)
                                          │
            ┌─────────┬─────────┬─────────┼─────────┬─────────┐
            ▼         ▼         ▼         ▼         ▼         ▼
          text   structural  visual    design    technical    │
         (LLM)   (no LLM)   (no LLM)  (LLM mm)   (no LLM)    │
            │         │         │         │         │         │
            └─────────┴─────────┴─────────┴─────────┘         │
                              │                                │
                              ▼                                │
                  weighted score + threshold                   │
                              │                                │
                              ▼                                │
            is_conformant = (score >= 0.85) AND (zero critical) ◄
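
The verdict rule can be sketched as follows. The per-dimension weights here are illustrative assumptions; only the 0.85 threshold and the zero-criticals veto come from this README.

```python
def conformity_verdict(scores, weights, critical_failures, threshold=0.85):
    """Weighted mean across dimensions; any critical failure vetoes the verdict."""
    total = sum(weights[d] for d in scores)
    score = sum(scores[d] * weights[d] for d in scores) / total
    return score, (score >= threshold and not critical_failures)

weights = {"text": 0.4, "structural": 0.3, "visual": 0.2, "technical": 0.1}  # illustrative
scores = {"text": 0.95, "structural": 1.0, "visual": 0.9, "technical": 0.8}

score, ok = conformity_verdict(scores, weights, critical_failures=[])          # 0.94 -> pass
_, vetoed = conformity_verdict(scores, weights, critical_failures=["invalid CPF"])
```

Even though the weighted score (0.94) clears the threshold in both calls, the second one fails: a single critical finding overrides the score.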

Cost by tier (Gemini Flash, ~3K input tokens per LLM call):

  Path                                               LLM calls   ~$/doc
  Regex resolves everything                          0           $0.0000
  Some fields fall back to the LLM                   1           $0.0006
  With semantic_diff enabled                         2           $0.0012
  With check_conformity(dimensions=[text, design])   4           $0.0024

Section-mapper pipeline (Wave L)

For structural templates that ship with named-but-empty sections (industrial procedures, NR-12/13, ABNT-shaped academic documents) and that rely on heading hierarchy plus tables instead of {{X}} tokens, use engine.section_mapper.map_sections:

from pathlib import Path
from engine.section_mapper import map_sections

report = map_sections(
    template_path=Path("template.docx"),
    source_path=Path("source.docx"),
    output_path=Path("output.docx"),
    # similarity_mode="auto" + auto_tables=True are the defaults
)

print(f"mapped {report.mapped_count} sections; {report.tables_filled} tables filled")

End-to-end on the Engeman dados.docx pair with zero config (rules mode): 7/8 sections mapped; header populated (IT.PRO.URE.387.0005, Rev. 01, Elaborado: ..., (PARTIDA DA ÁREA DE SÍNTESE)); Histórico table extracted from the source's revisions; Responsabilidade table populated from the Compete à gerência / Compete aos supervisores paragraphs.

Vendor-agnostic LLM mode (Wave M)

map_sections_async(..., mode="llm", llm=provider) runs a single batched LLM call that handles any template + source pair, with no hardcoded vendor heuristics. Validated against:

  • Engeman pair (PT-BR; XXXX/(TITULO)/Elaborado:/etc. placeholders; Atividades | Responsabilidade | Responsabilidade table) — full DocStream parity.
  • Vendor B pair (English corporate; {{DOC_CODE}}/[Title]/Author:/Reviewer: placeholders; Activity | Owner table) — fixtures in tests/vendor_b/. The migration row's text follows the source's language automatically.

import asyncio
from pathlib import Path
from engine.llm.openai_provider import OpenAIProvider
from engine.section_mapper import map_sections_async

async def run():
    provider = OpenAIProvider(api_key="sk-...", model="gpt-4o", timeout=300.0)
    await map_sections_async(
        template_path=Path("template.docx"),
        source_path=Path("source.docx"),
        output_path=Path("output.docx"),
        mode="llm",
        llm=provider,
    )

asyncio.run(run())

See Section mapper for the full pipeline (parser → numbering resolver → similarity matcher → renderer with line-kind decoration → tables → header filler) and the LLM-mode profilers + auto-mapper that drive Wave M.
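
A heading-similarity matcher of the kind the rules mode relies on can be approximated with stdlib difflib. This is a sketch under assumptions: the real matcher's normalization, scoring, and cutoff are not documented in this README.

```python
from difflib import SequenceMatcher

def best_section_match(template_heading, source_headings, cutoff=0.6):
    """Pick the source heading most similar to a template heading, or None."""
    def norm(s):  # drop numbering/punctuation such as "2. " before comparing
        return "".join(c for c in s.lower() if c.isalpha() or c.isspace()).strip()
    scored = [(SequenceMatcher(None, norm(template_heading), norm(h)).ratio(), h)
              for h in source_headings]
    ratio, best = max(scored)
    return best if ratio >= cutoff else None

headings = ["1. Objetivo", "2. Responsabilidade", "3. Histórico de Revisões"]
match = best_section_match("Responsabilidades", headings)  # tolerates the plural
```

A template heading with no plausible counterpart (e.g. "Anexos" against the list above) falls below the cutoff and stays unmapped, which is how a 7/8 result like the Engeman run can arise.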

Typical batch run

template-engine normalize \
  --template ./padrao.docx \
  --source-dir ./entrada/ \
  --output-dir ./normalizados/ \
  --provider gemini \
  --gold-doc gold_01.docx --gold-doc gold_02.docx --gold-doc gold_03.docx \
  --field-examples ./examples.json \
  --report ./report.json

The report.json groups every input into a tier:

  • high — regex resolved everything, no critical diff. Ship without review.
  • medium — LLM filled at least one free-text field, or warning-level diff. Spot-check.
  • low — orphan placeholder, missing required field, or critical diff. Open and edit.
  • error — extraction or render failed.

Cost depends on what fraction of docs the regex tier resolves. When it covers all required fields, the LLM is never called and the run is free; otherwise the LLM is invoked once per missing-field doc and (optionally) once for the semantic-diff QA pass.
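
A caller can route documents for human review straight from these tiers. The report layout below is a hypothetical illustration for the sketch, not the actual report.json schema.

```python
# Hypothetical shape: report.json groups docs with a per-doc tier label.
report = {"docs": [
    {"path": "a.docx", "tier": "high"},    # ship without review
    {"path": "b.docx", "tier": "medium"},  # spot-check
    {"path": "c.docx", "tier": "low"},     # open and edit
]}

# Everything that is not high-tier goes to a reviewer queue.
needs_review = [d["path"] for d in report["docs"]
                if d["tier"] in ("medium", "low", "error")]
```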

Install

pip install template-engine-ia                # core
pip install "template-engine-ia[gemini]"      # + Google Gemini
pip install "template-engine-ia[openai]"      # + OpenAI
pip install "template-engine-ia[anthropic]"   # + Anthropic Claude
pip install "template-engine-ia[ollama]"      # + local LLMs (LGPD-safe)
pip install "template-engine-ia[inference]"   # + grex regex learner
pip install "template-engine-ia[all]"         # everything

Quickstart — normalize a directory

import asyncio
from pathlib import Path
from engine import normalize_batch
from engine.llm.gemini_free import GeminiFreeProvider

async def main():
    report = await normalize_batch(
        template_path=Path("template.docx"),
        source_dir=Path("docs/"),
        output_dir=Path("normalized/"),
        llm=GeminiFreeProvider(api_key="AIza..."),
        gold_docs=[p.read_text() for p in Path("gold/").glob("*.txt")],
        field_examples={
            "CODIGO":      ["ABC-001", "ABC-042", "ABC-099"],
            "DATA":        ["2026-01-15", "2026-04-26", "2026-07-30"],
            "RESPONSAVEL": ["Joao Silva", "Maria Souza", "Pedro Lima"],
        },
    )
    print(report.by_tier)         # {"high": 380, "medium": 15, "low": 5, "error": 0}
    print(report.llm_call_count)  # ~25 — the 380 high-tier docs needed no LLM calls

asyncio.run(main())

Conformity check

from pathlib import Path

from engine import check_conformity

# provider, schemas, and mapping come from the normalization steps above
report = await check_conformity(
    template_path=Path("padrao.docx"),
    candidate_path=Path("candidato.docx"),
    llm=provider,
    schemas=schemas,
    mapping=mapping,
    dimensions=["text", "structural", "visual", "technical"],
    threshold=0.85,
)

print(report.summary_line)
# CONFORMANT score=0.92 threshold=0.85 failures=1 (critical=0)

is_conformant = (score >= threshold) AND (zero critical failures). A single critical (invalid CPF, orphan placeholder, lost field) invalidates the doc regardless of weighted score.

CLI: template-engine conformity --template T --candidate C --provider gemini --threshold 0.85.

Local-only mode (LGPD/HIPAA)

report = await normalize_batch(
    template_path, source_dir, output_dir,
    llm=None,
    field_examples=examples,
    gold_docs=golds,
    local_only=True,   # raises RefusedRemoteCallError if any LLM is supplied
)

In local-only mode, only the regex tier runs. Missing fields stay missing. See SECURITY-MODEL.md for the full operating-mode matrix and per-provider data residency.

PII masking

import json

from engine.security import mask_pii, unmask

masked, mask = mask_pii(source_text)
# masked: "Cliente <CPF_001> nascido em <DATE>... contato <EMAIL_001>"
response = await llm.generate_structured(prompt(masked), schema)
restored = unmask(json.dumps(response), mask)

Detects CPF, CNPJ, email, BR phone, RG, CEP. Each unique original value gets one stable token; unmask restores originals from the response.
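
The stable-token behavior can be illustrated with a stdlib sketch covering emails only; this is not the library's implementation, just the idea.

```python
import re

def mask_emails(text):
    """Replace each unique email with a stable token; return masked text + map."""
    mask, by_value = {}, {}
    def repl(m):
        email = m.group(0)
        if email not in by_value:  # first sighting mints a new token
            token = f"<EMAIL_{len(by_value) + 1:03d}>"
            by_value[email] = token
            mask[token] = email
        return by_value[email]     # repeats reuse the same token
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", repl, text), mask

def unmask_text(text, mask):
    for token, original in mask.items():
        text = text.replace(token, original)
    return text

masked, mask = mask_emails("contato: ana@x.com, cc: bob@y.com, reply: ana@x.com")
```

Both occurrences of ana@x.com map to the same token, so an LLM response that mentions the token once or twice still round-trips cleanly through unmasking.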

Multi-provider with fallback

from engine.llm import LLMRouter
from engine.llm.groq_provider import GroqProvider
from engine.llm.gemini_free import GeminiFreeProvider
from engine.llm.openai_provider import OpenAIProvider

router = LLMRouter([
    GroqProvider(api_key=g_key),         # primary: fast + cheap
    GeminiFreeProvider(api_key=ge_key),  # fallback: free tier
    OpenAIProvider(api_key=o_key),       # last resort
])

report = await normalize_batch(template, source_dir, output_dir, llm=router, ...)

Only LLMRateLimit / LLMTimeout trigger fallback. Generic LLMError propagates so the caller sees provider-specific issues.
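
That routing rule can be sketched with stand-in exception classes (the real ones live in engine.llm.base); the route function here is illustrative, not LLMRouter's API.

```python
class LLMError(Exception): ...       # stand-ins for engine.llm.base exceptions
class LLMRateLimit(LLMError): ...
class LLMTimeout(LLMError): ...

def route(calls):
    """Try each provider call in order; only transient errors trigger fallback."""
    last = None
    for call in calls:
        try:
            return call()
        except (LLMRateLimit, LLMTimeout) as exc:
            last = exc   # rate limit / timeout: fall through to the next provider
        # any other LLMError is NOT caught here, so it propagates to the caller
    raise last           # every provider was rate-limited or timed out

def rate_limited():
    raise LLMRateLimit("groq: 429")

result = route([rate_limited, lambda: "gemini answer"])
```

A generic LLMError from the first provider never reaches the second one: it escapes route immediately, so the caller sees the provider-specific failure instead of a silently degraded answer.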

Design decisions (why it works)

  • Stateless. Path / bytes in, paths / bytes / dataclasses out. No web framework, no ORM, no app layer to bring along.
  • Frozen dataclasses across the public API. MappingResult, Failure, ConformityReport, etc. Equality + hashing for free, no accidental mutation across pipeline boundaries.
  • Protocol-based LLM provider (not ABC). Adding a provider is implementing one method. No inheritance, no registry magic.
  • Regex tier rejects over-generalization. When grex learns a pattern that collapses to \w+ without structural anchors, the lib falls back to free-text instead of accepting a false sense of precision.
  • is_conformant requires zero criticals. A high weighted score doesn't override a single critical failure (invalid CPF, orphan placeholder). Matches the regulator's mental model: "any deal-breaker = fail".
  • Audit hashes, not raw content. AuditLog records sha256 of inputs and outputs so reviewers can prove a document was processed without the audit file becoming a secondary data store.

Add your own provider

from engine.llm.base import LLMError, LLMRateLimit, LLMTimeout

class MyProvider:
    name = "my-provider"

    def __init__(self, api_key: str, model: str | None = None) -> None:
        if not api_key:
            raise RuntimeError("api_key required")
        self._api_key = api_key
        self.model = model or "default"

    async def generate_structured(self, prompt: str, json_schema: dict) -> dict:
        # call the API, parse JSON; raise LLMRateLimit / LLMTimeout / LLMError as needed
        ...

Development

pip install -e ".[dev]"
ruff check . && ruff format --check . && mypy src/engine && pytest

189 tests across providers, pattern inference (Wave A), batch orchestrator (Wave D), conformity validator (Wave F), security primitives (Wave G).

Roadmap

ROADMAP.md — Wave A/D/E/F/G/H shipped on v0.6.

Contributing

Issues and PRs welcome. See CONTRIBUTING.md. For security issues see SECURITY.md.

License

Apache 2.0 — Copyright 2026 luizhcrs.
