Skip to main content

Turn a PDF into Markdown with first-class AI figure descriptions.

Project description

figmark

CI License: MIT Python

Turn a document into Markdown where every figure is described, not dropped.

figmark extracts a document's text and replaces each image and vector diagram with an AI-generated description, producing one coherent Markdown document. Think Docling, but with first-class figure interpretation: charts, photos, and diagrams become readable prose in reading order instead of vanishing.

You need a vision-capable model behind an OpenAI-compatible API — hosted or local (e.g. vLLM or Ollama). Point api.base_url / api.model in config.yaml at your endpoint and put its key in FIGMARK_API_KEY (the variable name is historical; a provider-neutral name is tracked in T-010).

What figmark is for

figmark exists to extract as much valuable information from a document as possible, in a form LLM-based products can use effectively — RAG ingestion, a document dropped into an assistant's chat context, or an OCR backend for a platform like LibreChat. It speaks the Mistral-OCR wire format (/v1/ocr) so those products can point at it unchanged — and aims to do the job better than plain OCR by also interpreting the parts of a document that text extraction alone cannot see: charts, diagrams, photos, and other figures that carry information.

Three consequences of that goal shape the design:

  • Extraction quality is a spectrum, not a binary. Plain text extraction already gets a downstream LLM most of the way; every figure description, reconstructed table, and inferred heading on top of that makes the representation better. Partial information about a chart is far more valuable than no information — a downstream LLM is forgiving and works well with an imperfect but honest representation. figmark therefore never withholds the text just because a richer structure could not be recovered, and never asserts structure it isn't sure of (see the table notes under Known limitations).
  • Figure interpretation is the differentiator. Anything that would drop a chart or image that carries meaning — a text-only extractor, a converter that rasterises figures away — defeats the purpose. This is why Office documents go through a full-fidelity conversion rather than a lightweight text extractor (T-054).
  • OCR of scans is a supporting capability, not the product. figmark handles scanned pages (Tesseract, with a vision-model rescue) so mixed corpora don't fail, but it is not built for large-scale OCR of scanned archives — for born-digital, figure-bearing documents it shines; for messy scans a dedicated VLM-OCR service will beat it (see Known limitations).

The same figure descriptions also serve accessibility — figmark began as an alt-text generator for formal Swedish ("myndighetssvenska") and can still emit an annotated or tagged PDF alongside the Markdown.

What the output looks like

A chart page in a Bank of Canada Monetary Policy Report comes out as (real, unedited output):

![Diagram, page 4](diagrams/page-004-diagram-01.png)

> **1. What the chart shows**
> The image contains two side-by-side line charts titled "Inflation has been
> slowing," showing the year-over-year percentage change of monthly inflation
> data.
> *   **X-axis:** Time, spanning from 2019 through the end of 2023.
> *   **Y-axis:** Percentage change (%). The left chart's scale ranges from
>     -2% to 12%.
>
> **2. Data series**
> *   **Canada:** Red line. **Canadian core CPI range:** a shaded red area.
> *   **United States:** Light blue line. **Euro area:** Green line. …

A text-only extractor drops that chart entirely; an OCR engine turns it into axis-label noise. figmark hands your LLM the chart's actual content.

What it does

  • Text + figures → Markdown. Output is a single <name>.md with figures embedded as ![...](images/…) followed by their description as a caption.
  • Vector diagram detection. Matplotlib-style charts (which get_images() misses) are found by clustering vector drawings, rendered, and described with a diagram-specific prompt.
  • Scanned PDFs. Falls back to OCR — Tesseract first, a vision model when Tesseract's quality is too low.
  • Configurable input formats. PDF by default, plus the PyMuPDF-native formats (EPUB, XPS, FB2, CBZ, MOBI) via an input.formats allowlist in config — no extra dependency. MS Office (docx/xlsx/pptx) works too, via a sandboxed LibreOffice-headless conversion (requires LibreOffice; a separate Office image variant is tracked in T-054). The gate sniffs the actual content (magic bytes + container inspection), so a mislabelled file fails loud instead of being mis-parsed.
  • Context-aware descriptions. Sends the surrounding text — plus a one-line summary of what kind of document it is — to the model, so a chart is interpreted in the report's context, not just visually.
  • Matches the document's language. Descriptions follow the document's own language by default (auto-detected), or you can force one — so an English PDF gets English captions, not Swedish ones.
  • Skips decorative images. A significance gate lets the model leave out logos, dividers, and icons that carry no information — no extra API calls.
  • Parallel + cached. Descriptions run concurrently and are cached on disk; a second run re-uses them and makes no API calls.
  • Fail loudly. No silent fallbacks — strategy switches are shouted with clear !!! banners.

Install

python -m venv .venv
source .venv/bin/activate
pip install -e .

For scanned PDFs you also need Tesseract:

# macOS
brew install tesseract tesseract-lang
# Debian/Ubuntu
sudo apt-get install tesseract-ocr tesseract-ocr-swe

Point figmark at your endpoint and set your API key:

cp config.example.yaml config.yaml
# edit config.yaml: api.base_url + api.model (your OpenAI-compatible endpoint)

cp .env.example .env
# edit .env and set FIGMARK_API_KEY (or FIGMARK_API_KEY=none for keyless local endpoints)

Usage

figmark path/to/document.pdf

Output lands in output/<pdf-name>/:

  • <pdf-name>.mdthe primary output: text with figure descriptions inlined
  • raw_text.txt — text only, no descriptions
  • images/, diagrams/ — extracted figures
  • descriptions/, diagram_descriptions/ — one .txt per figure (the cache)
  • document_summary.txt, document_language.txt — cached document-level context

Produce an accessibility-annotated copy of the source PDF too:

figmark path/to/document.pdf --annotate-pdf

Run as a service (container)

figmark also ships as a hardened HTTP service for air-gapped deployment — a single container that needs only a reachable OpenAI-compatible vision endpoint.

Prebuilt images are published to GHCR — every green build of main as :edge, and releases as :<version> + :latest:

docker pull ghcr.io/ztein/figmark:edge

Or run the stack with compose (no source checkout needed — just compose.yaml and a config):

cp config.example.yaml config.yaml   # edit api.base_url + api.model
mkdir -p secrets
printf '%s' 'a-strong-token' > secrets/auth_token
printf '%s' "$FIGMARK_API_KEY" > secrets/figmark_api_key
docker compose up -d                  # pulls ghcr.io/ztein/figmark:edge

curl -s -X POST http://127.0.0.1:8000/v1/convert \
  -H "Authorization: Bearer a-strong-token" \
  -F "file=@document.pdf;type=application/pdf"

Unlike the CLI (which writes files — <name>.md, figures.json, …), the HTTP surface returns everything inline as JSON:

Field Meaning
markdown the converted document (with <!-- page N --> markers for provenance)
page_count / figure_count / skipped_count pages processed, figures described, images skipped by the significance gate
language detected document language
usage prompt_tokens, completion_tokens, total_tokens, api_calls, calls_missing_usage
estimated_cost / currency monetary estimate — null unless both token prices are set in config.yaml (never a misleading 0)

Health/metadata endpoints are auth-free: GET /readyz and GET /version.

LibreChat / Mistral-OCR-compatible endpoint

The server also speaks the Mistral OCR wire format, so tools that expect that API — LibreChat in particular — can use figmark as a self-hosted, air-gappable OCR backend. Point the client's OCR_BASEURL at http(s)://<figmark-host>/v1 and set its OCR_API_KEY to the figmark bearer token. figmark implements the four calls LibreChat's default strategy makes: POST /v1/filesGET /v1/files/{id}/urlPOST /v1/ocrDELETE /v1/files/{id}, returning { "pages": [ { "index", "markdown", "images" } ] } (docs/tickets/T-052).

Why figmark rather than the OCR service this contract comes from: figmark fulfils the same API but aims to extract more of the document's information value — for born-digital, figure/diagram-heavy documents it describes figures and diagrams with a vision model instead of OCR'ing them into broken text or dropping them — and keeps the data on your own network. Limitation: figmark's raster OCR is Tesseract, not a vision-language model, so this backend is strongest on born-digital / figure-heavy PDFs and weaker than a VLM on messy scans and handwriting. It accepts the formats in the input.formats allowlist (PDF by default; EPUB and the other PyMuPDF-native formats are free to enable); anything else — including raster image input via image_url — returns 415. Do not deploy it expecting VLM-grade scan fidelity.

When a scanned page can't be OCR'd — the rendered page is too large for the vision model even after figmark downscales it, or the model rejects/returns nothing — the request fails loud with a 422 naming the page and the reason (and the remedy: lower the OCR render DPI, or use a model with a larger image-input limit), rather than a misleading generic backend error (docs/tickets/T-053).

The image is non-root, read-only-rootfs compatible, self-contained (Tesseract + language data baked in), and passes a hard Trivy scan in CI. Secrets come from files (never the image or plaintext env). Full runbook: docs/deployment.md; security model: SECURITY.md.

Configuration

Everything beyond the API key is controlled by your config.yaml (start from config.example.yaml):

  • api.model / api.base_url — which model and endpoint to use
  • language.output — output language for descriptions/diagrams/summary: auto follows the document's own language, or name one (Swedish, English) to force it
  • description.prompt / diagrams.prompt — the figure and diagram prompts (written in Swedish by default; they set the task and register, the output language is controlled separately by language.output)
  • concurrency.max_workers — parallel API calls
  • context.* — how much surrounding text to send for context
  • significance.enabled — let the model skip purely decorative images
  • document_summary.* — generate a document-type summary and pass it as context
  • ocr.language — Tesseract language

Technical thresholds (clustering, OCR, retries, render DPI) live as documented constants in src/figmark/<module>.py.

How it works

A PDF is classified as text-encoded or scanned and its text extracted (or OCR'd), then given structure (headings/lists inferred from typography), ruled tables reconstructed as Markdown, running headers/footers stripped, hyperlinks preserved, and images + vector diagrams found and described in parallel — all woven back into the text in column-aware reading order. A figures.json indexes every figure. For the full pipeline, module map, outputs, and the open Phase-2 items, see docs/architecture.md.

Known limitations

  • Broken text layers. figmark trusts the PDF's embedded text. A PDF with a missing or broken font encoding (no/garbled ToUnicode CMap) can carry plenty of characters that are actually mojibake; figmark extracts them as-is. It does not silently swallow this — pages whose text looks broken are flagged with a loud warning — but it does not yet auto-OCR them. For such files, re-export from the source or pre-OCR them before converting.
  • Tables. Ruled data tables are reconstructed as Markdown behind a conservative filter (docs/tickets/T-031). Quantitative data drawn as a chart is captured by the figure description instead. Borderless / whitespace-aligned tables (e.g. forecast appendices with no ruling lines) are not detected and fall through to the text path, where they are flattened: row labels and cell values land on separate lines and column headers can detach, so the column↔value link is lost in the raw text (docs/tickets/T-050). The data is all still present, and a downstream LLM can often recover it — the preserved <!-- page N --> markers let you point a model (or a reader) at the source page. This is deliberate: forcing detection on these pages (PyMuPDF's whitespace strategy) does find a grid, but mis-aligns its columns — chopping labels and splitting numbers — so it would emit a table asserting the wrong column↔value mapping, which is worse than honest flat text. We keep the raw text rather than guess a structure. For number-critical lookups over such documents, treat tables as a known gap.
  • Footnotes. Footnote text is kept (in reading order, at the page bottom) but not yet segregated/marked as footnotes (docs/tickets/T-044, Phase 2).
  • Tagged PDF. --tagged-pdf writes the structure-tree foundation (figure /Alt); full PDF/UA conformance is not yet claimed (docs/tickets/T-004).

Tests

pytest -m "not live and not docker"   # fast, offline, no API key, no Docker
pytest -m docker                       # builds the image + runs the compose stack
pytest -m "live"                       # against the real API (costs money, takes minutes)
pytest                                 # everything

See examples/README.md for sample documents.

Contributing

See CONTRIBUTING.md. Issues and PRs welcome.

Roadmap

  • 0.2 — configurable pipeline. Per-task provider/model selection (a different model for image description, diagram description, and vision-OCR) via a providers / tasks config, plus all technical knobs exposed in config.
  • Document model + more formats. A typed block model (heading/paragraph/list/table/figure) that PDF maps into and Markdown renders out of (docs/tickets/T-042), so the same structure work carries over to Word/Excel/PowerPoint inputs.

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

figmark-0.3.0.tar.gz (311.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

figmark-0.3.0-py3-none-any.whl (80.0 kB view details)

Uploaded Python 3

File details

Details for the file figmark-0.3.0.tar.gz.

File metadata

  • Download URL: figmark-0.3.0.tar.gz
  • Upload date:
  • Size: 311.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for figmark-0.3.0.tar.gz
Algorithm Hash digest
SHA256 a74b55036c45ed7d4554335ce68ac8509be48f170a732a9ed744c678c1e522a4
MD5 68638096871ef0dc4f21d2c18d441c92
BLAKE2b-256 44997babc63bd2f0fe580d79565b2625f3508c043c637bbd0d0e29b2a095088f

See more details on using hashes here.

Provenance

The following attestation bundles were made for figmark-0.3.0.tar.gz:

Publisher: release.yml on Ztein/figmark

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file figmark-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: figmark-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 80.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for figmark-0.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 449c1b66141505b58edfd612b88ab49fc9af3f3dcb99beca4cef900240f43dd8
MD5 a8be72fce1aae9f5f81dde5ff5b9d866
BLAKE2b-256 bc732e7d742f8577d5acb252691e4f39e0133ce3ac2600dbeb02b0118ca04a98

See more details on using hashes here.

Provenance

The following attestation bundles were made for figmark-0.3.0-py3-none-any.whl:

Publisher: release.yml on Ztein/figmark

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page