
Python SDK and CLI for ClicheFactory — structured data extraction from documents


ClicheFactory (Python SDK)

Introduction

ClicheFactory is a structured data extraction SDK. It parses documents (PDF, images, Office, email, etc.) and extracts structured data into Pydantic models — locally with your own LLM keys, or via the ClicheFactory service. Training is managed through ClicheFactory (BYOK only — supply your OpenAI, Gemini, or Anthropic key); the SDK consumes trained artifacts via artifact_id.

Installing

pip install clichefactory

For local parsing/OCR (Docling/PyMuPDF/Tesseract/etc.):

pip install "clichefactory[local]"

Quickstart

Local extraction from text

from pydantic import BaseModel
from clichefactory import Endpoint, factory

class Invoice(BaseModel):
    invoice_number: str | None = None
    total_amount: float | None = None

client = factory(
    mode="local",
    model=Endpoint(provider_model="gemini/gemini-3-flash-preview", api_key="..."),
)

c = client.cliche(Invoice)
invoice = c.extract(text="Invoice #123 total 99.00 EUR")
print(invoice)

Local mode does not pick a default model: you must pass model=Endpoint(...) (or llm= for compatibility) or set LLM_MODEL_NAME and LLM_API_KEY in the environment. If your parsing options use the OCR LLM fallback or VLM refinement, configure an OCR LLM the same way (ocr_model / OCR_MODEL_* env vars) or disable those fallbacks (see ParsingOptions).
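For example, a local client with an explicit extraction endpoint and a separate OCR endpoint (the model names and keys are placeholders):

from clichefactory import Endpoint, factory

# model= is required in local mode; ocr_model= is only needed when your
# parsing options rely on the OCR LLM fallback or VLM refinement.
client = factory(
    mode="local",
    model=Endpoint(provider_model="gemini/gemini-3-flash-preview", api_key="..."),
    ocr_model=Endpoint(provider_model="gemini/gemini-3-flash-preview", api_key="..."),
)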

Local extraction from file

Requires clichefactory[local]. Parses the document (OCR if needed), converts to markdown, then extracts structured data via the LLM.

invoice = c.extract(file="/path/to/invoice.pdf")

Fast extraction (file bytes direct to LLM)

Skips OCR/parsing entirely — sends the raw file to a multimodal LLM.

invoice = c.extract(file="invoice.pdf", mode="fast")

Service mode (SaaS)

from clichefactory import factory

client = factory(api_key="cliche-...")  # mode defaults to "service"

c = client.cliche(Invoice)
invoice = c.extract(file="/path/to/invoice.pdf")

Service URL: By default the SDK uses http://127.0.0.1:4000 (local aio-server). For production, set the environment variable CLICHEFACTORY_API_URL to https://api.clichefactory.com, or pass base_url= to factory() explicitly (this overrides the env var).

Local paths (and raw bytes) are automatically uploaded by the SDK before the service processes them.
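Passing raw bytes works the same way as a path; a minimal sketch (assuming, as described above, that file= also accepts bytes):

from pathlib import Path

# Raw bytes are uploaded by the SDK before the service processes them,
# just like a local path would be.
pdf_bytes = Path("/path/to/invoice.pdf").read_bytes()
invoice = c.extract(file=pdf_bytes)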

Retries and idempotency

Service-mode requests retry automatically on transient transport errors and on the standard transient HTTP statuses (408, 425, 429, 500, 502, 503, 504). Backoff is bounded (max 4 attempts, max 8 s per sleep) and Retry-After is honored when the server sends it (capped at 30 s). Non-retryable 4xx responses (e.g. invalid API key, validation errors) still fail fast — the SDK does not retry those.

Each retried request reuses the same idempotency key, so the service replays its cached response instead of re-running the work or re-billing for it. This is invisible to your code; you don't need to do anything to opt in.
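The backoff policy is roughly equivalent to the following sketch (illustrative only, not the SDK's actual code):

import random
import time

RETRYABLE_STATUSES = {408, 425, 429, 500, 502, 503, 504}

def sleep_before_retry(attempt: int, retry_after_header: str | None) -> None:
    # Honor Retry-After when the server sends it, capped at 30 s; otherwise
    # use bounded exponential backoff with jitter, capped at 8 s per sleep.
    if retry_after_header is not None:
        delay = min(float(retry_after_header), 30.0)
    else:
        delay = min(2 ** attempt + random.random(), 8.0)
    time.sleep(delay)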

Trained extraction

Training currently runs in BYOK mode only. Train through ClicheFactory using your own LLM key (OpenAI, Gemini, or Anthropic). Once you have a trained artifact, use it via artifact_id:

from clichefactory import factory

client = factory(api_key="cliche-...")
cliche = client.cliche(Invoice, artifact_id="art_8cee...")
result = cliche.extract(file="document.pdf")

API keys: Use a key from ClicheFactory → Settings → API Keys (cliche-...). Those keys authenticate as your account and are billed against your credits. They are not the same as internal aio-server operator keys used between services.

BYOK vs hosted (service mode):

  • BYOK — Pass model=Endpoint(..., api_key=...) (and optionally ocr_model=) so extraction/OCR use your LLM credentials. Billing uses the BYOK rate. Training requires this path.
  • Hosted (extraction only) — Omit model / ocr_model so the platform runs the LLMs. Your Pydantic schema must still match the trained pipeline's output shape (as exported from ClicheFactory). Hosted training is on the roadmap.

Explicit mode vs artifact default: You can pass mode= (e.g. mode="trained") or omit it. When the artifact defines a pipeline mode (e.g. robust-trained), the service can apply that mode automatically if you do not override it.

robust-trained: Requires an artifact trained with the verification pipeline (VerifiedExtractor). If you only trained a single-step extractor, use default extraction or mode="trained" instead of forcing robust-trained.
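For example, with a trained artifact you can rely on its default pipeline mode or force one explicitly:

# Rely on the artifact's default pipeline mode:
result = cliche.extract(file="document.pdf")

# Force the verification pipeline (requires an artifact trained with VerifiedExtractor):
result = cliche.extract(file="document.pdf", mode="robust-trained")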

Extraction modes

  • None (default): local + service. Parse document -> markdown -> LLM extraction.
  • "fast": local + service. Send raw file bytes directly to the LLM (no OCR).
  • "trained": service only. Uses a trained artifact (DSPy BaseExtractor on OCR text); trained via ClicheFactory (BYOK).
  • "robust": service only. Two-stage extract + verify.
  • "robust-trained": service only. Trained extract + verify; the artifact must be trained for verification.

invoice = c.extract(file="/path/to/invoice.pdf", mode="robust")

Document to markdown

Convert any supported file to a structured markdown representation.

doc = client.to_markdown(file="invoice.pdf")
print(doc.get_markdown())
print(doc.get_pages())

Service mode (set mode="service" on the call; it is separate from factory(mode=...)):

doc = client.to_markdown(file="invoice.pdf", mode="service")

# Fast mode (VLM-only, no parser pipeline)
doc = client.to_markdown(file="invoice.pdf", mode="service", parser="fast")

The returned document object provides:

  • get_markdown() — full markdown text
  • get_plain_text() — plain text without formatting
  • get_pages() — list of page objects
  • get_sections() — list of section objects
  • get_tables() — list of table objects
  • get_images() — list of image objects

Not every parsing pipeline populates all of these.
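A minimal sketch using only the documented accessors (assuming the getters return list-like objects):

doc = client.to_markdown(file="invoice.pdf")

# Pipelines that don't produce a given element type simply return fewer (or no) items.
pages = doc.get_pages()
tables = doc.get_tables()
print(f"{len(pages)} pages, {len(tables)} tables")
print(doc.get_plain_text()[:500])  # first 500 characters of plain text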

Batch operations

Process multiple files concurrently with configurable parallelism.

Batch extraction

results = c.extract_batch(
    files=["./data/doc1.pdf", "./data/doc2.pdf", "./data/doc3.pdf"],
    max_concurrency=5,
    mode="fast",
)
for invoice in results:
    print(invoice.invoice_number, invoice.total_amount)

Batch markdown

docs = client.to_markdown_batch(
    files=["a.pdf", "b.pdf", "c.pdf"],
    max_concurrency=5,
)
for doc in docs:
    print(len(doc.get_markdown()), "chars")

Service mode (presign + OCR on the server for each file):

docs = client.to_markdown_batch(
    files=["a.pdf", "b.pdf"],
    mode="service",
    max_concurrency=5,
)

Long documents (chunk + merge)

Cliche.extract is designed for documents that fit in one LLM context window: in practice, up to roughly 20 pages of dense text, more for sparse layouts. For longer files, use extract_long, which:

  1. Converts the document to markdown once.
  2. Splits the markdown into chunks (by default: token-sized).
  3. Extracts each chunk in parallel as a partial result.
  4. Merges per-chunk values field-by-field using resolvers you declare.
  5. Validates the merged dict against your Pydantic model, running the same coercion + postprocess pipeline as extract.

Every chunk is a separate extract call, so billing accrues per page across all chunks. Trained artifacts and mode="robust" / "robust-trained" are not supported in this SDK release.

Basic usage

from pydantic import BaseModel
from clichefactory import factory, Endpoint
from clichefactory.chunking import PageChunker
from clichefactory.resolvers import (
    concat_dedupe, first_non_null, last_non_null, sum_numeric,
)


class LineItem(BaseModel):
    description: str
    amount: float


class Invoice(BaseModel):
    invoice_number: str | None = None
    total: float | None = None
    customer_name: str | None = None
    line_items: list[LineItem] = []


client = factory(api_key="cliche-...", model=Endpoint(provider_model="openai/gpt-5"))

cliche = client.cliche(
    Invoice,
    resolvers={
        "invoice_number": first_non_null,
        "customer_name":  first_non_null,
        "total":          last_non_null,
        "line_items":     concat_dedupe(key="description"),
    },
)

result: Invoice = cliche.extract_long(
    file="big_invoice.pdf",
    chunker=PageChunker(pages_per_chunk=15, overlap_pages=1),
    max_concurrency=4,
)

Chunkers

clichefactory.chunking ships three strategies:

  • TokenChunker(max_tokens=..., overlap_tokens=...): the default; works anywhere with no special requirements.
  • PageChunker(pages_per_chunk=..., overlap_pages=...): for invoices, contracts, and other page-structured PDFs. Needs page markers in the markdown (<!-- cf:page N --> / <!-- page: N -->); falls back to token chunking (with a warning) otherwise.
  • HeadingChunker(max_tokens=..., min_heading_level=2): for manuals and long-form reports. Needs markdown headings.

You can also pass your own object implementing the ChunkStrategy protocol (async def chunks(markdown, meta) -> list[Chunk]).
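A minimal sketch of a custom strategy that preprocesses the markdown and then delegates splitting to the built-in token chunker (it assumes TokenChunker exposes the same async chunks() method as the protocol):

from clichefactory.chunking import TokenChunker

class AppendixSkippingChunker:
    # Drops everything after an "## Appendix" heading, then lets the
    # built-in token chunker do the actual splitting.
    def __init__(self, max_tokens: int = 4000):
        self._fallback = TokenChunker(max_tokens=max_tokens)

    async def chunks(self, markdown, meta):
        body = markdown.split("\n## Appendix", 1)[0]
        return await self._fallback.chunks(body, meta)

result = cliche.extract_long(file="big_invoice.pdf", chunker=AppendixSkippingChunker())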

Resolvers

A resolver reduces one field's per-chunk values to one final value. Built-ins live in clichefactory.resolvers:

Scalars: first_non_null, last_non_null, most_common, pick_by_confidence, sum_numeric, max_numeric, min_numeric.

Collections: concat, concat_dedupe(key=...), union_by(key). concat also has a factory form for strings: concat(separator="\n\n").

LLM-backed (opt-in, v1 stub falls back to most_common): llm_reconcile(instructions=..., model=...).
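For example, combining the string factory form of concat with a simple scalar resolver (field names are illustrative):

from clichefactory.resolvers import concat, most_common

resolvers = {
    "notes":    concat(separator="\n\n"),  # join per-chunk strings with blank lines
    "currency": most_common,               # majority vote across chunks
}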

Custom callables follow the signature (list[FieldValue], ResolverContext) -> Any:

def pick_longest(values, ctx):
    non_null = [fv for fv in values if fv.value]
    return max(non_null, key=lambda fv: len(fv.value)).value if non_null else None


cliche.extract_long(file=..., resolvers={"description": pick_longest})

String aliases for config-driven use:

resolvers = {
    "invoice_number": "first_non_null",
    "line_items":     "concat_dedupe_by=line_id",
    "notes":          "concat",
}

Default policy

Any field without an explicit resolver is resolved by a per-JSON-type default:

  • type: array → concat, with a UserWarning telling you which field and how to override (concat_dedupe(key=...)).
  • type: string | number | integer | boolean | object → first_non_null.

Warnings are intentionally loud so silent concatenation never surprises you.
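To avoid the warning for an array field, declare its resolver explicitly:

cliche = client.cliche(
    Invoice,
    resolvers={"line_items": concat_dedupe(key="description")},
)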

Debug / review surface

Pass include_chunk_results=True to get a LongExtractionResult[T]:

detailed = cliche.extract_long(file="big.pdf", include_chunk_results=True)

detailed.value          # Invoice — the resolved, validated model
detailed.chunks         # tuple[Chunk, ...] — what got split
detailed.per_chunk      # tuple[Invoice | PartialExtraction | ..., ...]
detailed.per_field      # dict[str, tuple[FieldValue, ...]]
detailed.resolutions    # dict[str, ResolutionTrace] — which resolver won
detailed.cost           # {"by_chunk": [...], "num_chunks": N, "total_usd": ...}
detailed.warnings       # tuple[str, ...]

This is also the shape Emio needs if you ever build a long-doc review UI.

SaaS pricing (service mode)

Billing applies only when using mode="service" with a ClicheFactory API key. Local runs are not metered by the platform.

Free tier

  • 10 lifetime extraction pages (metered per processed page). Those pages are free regardless of full-service vs BYOK.

Paid usage (credit balance)

  • After free extraction pages are exhausted, extraction is billed per page from your balance.
  • Full-service means the platform runs the LLMs. BYOK (bring your own key) applies when you supply your own LLM API key on the client (for example via Endpoint(..., api_key=...) or envelope config as implemented in the SDK).

Default rates (USD; the API may override these per deployment via stored rate rows):

  • Extraction (per page): $0.005 full-service, $0.0005 BYOK.
  • Training (per run): BYOK only during the MVP; flat fee, see ClicheFactory.
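For example, at the default rates a 200-page document costs 200 × $0.005 = $1.00 full-service, or 200 × $0.0005 = $0.10 with BYOK.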

Configuration

Endpoint (BYOK LLM config)

from clichefactory import Endpoint

model = Endpoint(
    provider_model="gemini/gemini-3-flash-preview",
    api_key="...",
    max_tokens=100000,
    temperature=1.0,
    num_retries=8,
    api_base=None,    # for Ollama: "http://localhost:11434"
)

client = factory(mode="local", model=model)

Advanced multi-model overrides

Most users should set only model. If you need role-specific endpoints, override per role:

client = factory(
    mode="service",
    api_key="cliche-...",
    model=Endpoint(provider_model="gemini/gemini-3-flash-preview", api_key="..."),  # extraction default
    ocr_model=Endpoint(provider_model="gemini/gemini-3-flash-preview", api_key="..."),  # optional
)

Per-call overrides are also available:

invoice = c.extract(file="/path/to/invoice.pdf", model=Endpoint(...), ocr_model=Endpoint(...))

ParsingOptions

Fine-grained control over local-mode document parsing. ParsingOptions only applies to local extraction — in service mode the platform selects the optimal parsing strategy and this parameter is ignored.

from clichefactory import ParsingOptions

parsing = ParsingOptions(
    pdf_image_parser="docling",              # "docling", "docling_vlm", "ocr_llm", "vision_layout" (SaaS-only)
    pdf_fallback_to_ocr_llm=True,            # fall back to LLM OCR when local parser fails
    pdf_structured_fallback_to_image=False,   # retry structured PDFs as image-scanned on failure
    pdf_ocr_engine="rapidocr",               # "rapidocr", "tesseract", "easyocr"
    pdf_ocr_lang="eng",                      # language code(s), see OCR language section below
    use_ocr_llm_body=True,                   # use LLM for body text when parser supports it

    image_parser="rapidocr",                 # "rapidocr", "pytesseract", "docling", "ocr_llm"
    image_parser_fallback=True,              # fall back to ocr_llm on failure
    image_parser_lang="eng",                 # language code(s), see OCR language section below
)

client = factory(mode="local", model=model, parsing=parsing)
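If you have no OCR LLM configured, you can also disable the LLM fallback paths entirely using the options above:

parsing = ParsingOptions(
    pdf_fallback_to_ocr_llm=False,    # never fall back to LLM OCR for PDFs
    image_parser_fallback=False,      # never fall back to ocr_llm for images
    use_ocr_llm_body=False,           # keep body text on the local parser
)

client = factory(mode="local", model=model, parsing=parsing)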

Environment variables

For local runs, the primary extraction defaults are:

  • Extraction LLM: LLM_MODEL_NAME, LLM_API_KEY. Also accepted: MODEL_NAME / MODEL_API_KEY and EXTRACTION_LLM_MODEL_NAME / EXTRACTION_LLM_API_KEY. There is no implicit default model; if unset, local extraction fails until you configure one.
  • OCR LLM (optional): OCR_MODEL_NAME, OCR_MODEL_API_KEY. Used when you set a separate OCR endpoint; otherwise OCR reuses the extraction model when your parsing options need an OCR LLM. Aliases include OCR_LLM_MODEL_NAME / OCR_LLM_API_KEY and OCR_API_KEY.

Optional endpoints override extraction/OCR on factory() via model and ocr_model.

For service mode, the only URL-related environment variable is CLICHEFACTORY_API_URL (unless you pass base_url= to factory(), which wins).

Ollama (local model inference)

curl -fsSL https://ollama.com/install.sh | sh
ollama run llama3.2:1b

client = factory(
    mode="local",
    model=Endpoint(provider_model="ollama/llama3.2:1b", api_key="", api_base="http://localhost:11434"),
)

Current scope: Ollama supports text extraction only (extract(text=...)). File parsing and OCR paths are not supported for Ollama.

PDF parser selection

  • Docling ("docling"): local OCR + table structure via Docling (default). Full structured output.
  • VLM direct ("fast" extraction mode): sends the whole PDF to the LLM. No layout structure; fastest.
  • Docling + VLM ("docling_vlm"): Docling for structure plus per-page VLM refinement.
  • OCR LLM ("ocr_llm"): per-page VLM OCR for scanned/image PDFs.
  • Vision Layout ("vision_layout"): more performant layout detection. SaaS-only.

Set via ParsingOptions(pdf_image_parser=...) or on factory(parsing=...).

OCR LLM fallback for Docling-based parsers

Docling-based parsers can fall back to OCR LLM (your configured vision-capable model) when Docling produces empty or degenerate output. Controlled by pdf_fallback_to_ocr_llm (default True). That path requires a configured OCR LLM (or the same model as extraction).

Parallel OCR LLM refinement calls

VLM-oriented parsers (e.g. docling_vlm) can issue multiple parallel OCR LLM calls per document (per-page or per-table) to keep latency under control.

ClicheFactory UI integration

Documents extracted via the SDK appear in ClicheFactory only when you set both project and task on the factory. Documents without explicit scope are extraction-only and won't appear in the ClicheFactory UI.

from clichefactory import factory

client = factory(
    api_key="cliche-...",
    project="42",   # ClicheFactory Project ID (visible in URL: /projects/42/)
    task="108",      # ClicheFactory Batch ID (visible in URL: /batch/108/)
)

# Extractions will appear under that project/batch in ClicheFactory
result = client.cliche(MySchema).extract(file="document.pdf")

Documents sync automatically every ~30 minutes, or immediately via the "Sync from SDK" button in the ClicheFactory UI.

If you omit project/task, extraction works normally — your data just won't be visible in ClicheFactory.

Tenant id (HTTP APIs): User API keys resolve to a tenant id stored with the key (typically your ClicheFactory user id as a string, e.g. "1"). Envelope tenant_id="default" is rewritten server-side to that tenant for inference. When calling aio-server REST endpoints directly (e.g. listing documents), pass tenant_id matching your key’s tenant, not the literal string "default", or the request will be rejected.

OCR language configuration

Languages are specified using Tesseract format everywhere — the SDK converts internally for each engine. Use + to combine multiple languages (e.g. "slv+eng" for Slovenian + English).

parsing = ParsingOptions(
    pdf_ocr_lang="deu+eng",     # German + English for PDFs
    image_parser_lang="fra",    # French for images
)

The default language is "eng" (English).

How languages work per OCR engine

  • Tesseract (pdf_ocr_engine="tesseract" / image_parser="pytesseract"): uses Tesseract format directly ("slv+eng"). Requires matching .traineddata files under $TESSDATA_PREFIX and the Tesseract binary on PATH.
  • RapidOCR (pdf_ocr_engine="rapidocr" / image_parser="rapidocr"): maps the language to a script family (e.g. "eng" → English model, "deu" → Latin model); no per-language model download needed. No system dependency (pure Python, ONNX).
  • EasyOCR (pdf_ocr_engine="easyocr" / image_parser="easyocr"): converts to ISO 639-1 codes (e.g. "eng" → "en", "deu" → "de"); downloads per-language models on first use. No system dependency (pure Python, PyTorch).
  • Docling (image_parser="docling"): uses Docling's built-in image conversion; no language parameter. No system dependency.
  • OCR LLM (image_parser="ocr_llm"): VLM-based; the model handles language detection automatically. Requires a configured ocr_model.

Common language codes

  • English: eng (default)
  • German: deu
  • French: fra
  • Spanish: spa
  • Italian: ita
  • Slovenian: slv
  • Polish: pol
  • Russian: rus (Cyrillic script)
  • Chinese (Simplified): chi_sim
  • Japanese: jpn
  • Korean: kor
  • Arabic: ara (RTL script)

Multi-language example: "slv+eng" (Slovenian + English), "deu+fra" (German + French).

RapidOCR script families

RapidOCR operates at the script level, not individual languages. Multiple Latin-script languages (German, French, Slovenian, etc.) all map to the latin model. The SDK handles this mapping automatically — you still specify languages in Tesseract format.

  • en: English (dedicated model)
  • latin: German, French, Spanish, Italian, Slovenian, Polish, etc.
  • cyrillic: Russian, Ukrainian, Bulgarian, Serbian
  • ch: Chinese (Simplified)
  • japan: Japanese
  • korean: Korean
  • arabic: Arabic
  • devanagari: Hindi, Bengali

Local parsing dependencies

DOC/ODT conversion

For legacy Office files (.doc, .odt), the parser converts files to PDF first, then processes them through the PDF pipeline.

Running this locally (rather than in service mode) requires external system tools:

  • pandoc for general Office -> PDF conversion
  • LibreOffice (soffice) for legacy .doc conversion

If these tools are missing, .doc/.odt parsing will fail at runtime.

Tesseract OCR

If using a Docling Tesseract-based OCR engine, ensure:

  • Tesseract is installed and on PATH
  • The language data directory is configured via TESSDATA_PREFIX

# macOS with Homebrew
export TESSDATA_PREFIX="/opt/homebrew/opt/tesseract/share/tessdata"

Languages configured in pdf_ocr_lang must have matching .traineddata files under $TESSDATA_PREFIX.

RapidOCR font

Docling uses RapidOCR which may try to download a font (FZYTK.TTF) at runtime. Set a local font path to avoid this:

export DOCLING_OCR_FONT_PATH="/path/to/a/unicode.ttf"

On macOS, a system font is usually available automatically when this variable is unset. On Linux/Windows or restricted environments, setting DOCLING_OCR_FONT_PATH is recommended.

Supported file types

  • .pdf: PdfRouterParser. Classifies structured vs scanned PDFs and routes accordingly.
  • .png, .jpg, .jpeg, .webp, .gif, .bmp: ImageRouterParser. Routes to the configured image parser.
  • .docx: DocxParser. Via Docling.
  • .doc, .odt: DocParser.
  • .xlsx: XlsxParser.
  • .csv: CsvParser. Auto-detects delimiter and header.
  • .eml: EmlParser. RFC 2822, with recursive attachment parsing.
  • .txt, .md: TextParser. Passthrough with encoding detection.
