Turn a PDF into Markdown with first-class AI figure descriptions.
Project description
figmark
Turn a document into Markdown where every figure is described, not dropped.
figmark extracts a document's text and replaces each image and vector diagram with an AI-generated description, producing one coherent Markdown document. Think Docling, but with first-class figure interpretation: charts, photos, and diagrams become readable prose in reading order instead of vanishing.
You need a vision-capable model behind an OpenAI-compatible API — hosted or
local (e.g. vLLM or Ollama). Point api.base_url / api.model in config.yaml
at your endpoint and put its key in FIGMARK_API_KEY (the variable name is
historical; a provider-neutral name is tracked in
T-010).
What figmark is for
figmark exists to extract as much valuable information from a document as
possible, in a form LLM-based products can use effectively — RAG ingestion,
a document dropped into an assistant's chat context, or an OCR backend for a
platform like LibreChat. It speaks the Mistral-OCR wire format
(/v1/ocr) so those products can point at it unchanged — and aims to do the
job better than plain OCR by also interpreting the parts of a document that
text extraction alone cannot see: charts, diagrams, photos, and other figures
that carry information.
Three consequences of that goal shape the design:
- Extraction quality is a spectrum, not a binary. Plain text extraction already gets a downstream LLM most of the way; every figure description, reconstructed table, and inferred heading on top of that makes the representation better. Partial information about a chart is far more valuable than no information — a downstream LLM is forgiving and works well with an imperfect but honest representation. figmark therefore never withholds the text just because a richer structure could not be recovered, and never asserts structure it isn't sure of (see the table notes under Known limitations).
- Figure interpretation is the differentiator. Anything that would drop a chart or image that carries meaning — a text-only extractor, a converter that rasterises figures away — defeats the purpose. This is why Office documents go through a full-fidelity conversion rather than a lightweight text extractor (T-054).
- OCR of scans is a supporting capability, not the product. figmark handles scanned pages (Tesseract, with a vision-model rescue) so mixed corpora don't fail, but it is not built for large-scale OCR of scanned archives — for born-digital, figure-bearing documents it shines; for messy scans a dedicated VLM-OCR service will beat it (see Known limitations).
The same figure descriptions also serve accessibility — figmark began as an alt-text generator for formal Swedish ("myndighetssvenska") and can still emit an annotated or tagged PDF alongside the Markdown.
What the output looks like
A chart page in a Bank of Canada Monetary Policy Report comes out as (real, unedited output):

> **1. What the chart shows**
> The image contains two side-by-side line charts titled "Inflation has been
> slowing," showing the year-over-year percentage change of monthly inflation
> data.
> * **X-axis:** Time, spanning from 2019 through the end of 2023.
> * **Y-axis:** Percentage change (%). The left chart's scale ranges from
> -2% to 12%.
>
> **2. Data series**
> * **Canada:** Red line. **Canadian core CPI range:** a shaded red area.
> * **United States:** Light blue line. **Euro area:** Green line. …
A text-only extractor drops that chart entirely; an OCR engine turns it into axis-label noise. figmark hands your LLM the chart's actual content.
What it does
- Text + figures → Markdown. Output is a single
<name>.mdwith figures embedded asfollowed by their description as a caption. - Vector diagram detection. Matplotlib-style charts (which
get_images()misses) are found by clustering vector drawings, rendered, and described with a diagram-specific prompt. - Scanned PDFs. Falls back to OCR — Tesseract first, a vision model when Tesseract's quality is too low.
- Configurable input formats. PDF by default, plus the PyMuPDF-native
formats (EPUB, XPS, FB2, CBZ, MOBI) via an
input.formatsallowlist in config — no extra dependency. MS Office (docx/xlsx/pptx) works too, via a sandboxed LibreOffice-headless conversion (requires LibreOffice; a separate Office image variant is tracked in T-054). The gate sniffs the actual content (magic bytes + container inspection), so a mislabelled file fails loud instead of being mis-parsed. - Context-aware descriptions. Sends the surrounding text — plus a one-line summary of what kind of document it is — to the model, so a chart is interpreted in the report's context, not just visually.
- Matches the document's language. Descriptions follow the document's own language by default (auto-detected), or you can force one — so an English PDF gets English captions, not Swedish ones.
- Skips decorative images. A significance gate lets the model leave out logos, dividers, and icons that carry no information — no extra API calls.
- Parallel + cached. Descriptions run concurrently and are cached on disk; a second run re-uses them and makes no API calls.
- Fail loudly. No silent fallbacks — strategy switches are shouted with clear
!!!banners.
Install
python -m venv .venv
source .venv/bin/activate
pip install -e .
For scanned PDFs you also need Tesseract:
# macOS
brew install tesseract tesseract-lang
# Debian/Ubuntu
sudo apt-get install tesseract-ocr tesseract-ocr-swe
Point figmark at your endpoint and set your API key:
cp config.example.yaml config.yaml
# edit config.yaml: api.base_url + api.model (your OpenAI-compatible endpoint)
cp .env.example .env
# edit .env and set FIGMARK_API_KEY (or FIGMARK_API_KEY=none for keyless local endpoints)
Usage
figmark path/to/document.pdf
Output lands in output/<pdf-name>/:
<pdf-name>.md— the primary output: text with figure descriptions inlinedraw_text.txt— text only, no descriptionsimages/,diagrams/— extracted figuresdescriptions/,diagram_descriptions/— one.txtper figure (the cache)document_summary.txt,document_language.txt— cached document-level context
Produce an accessibility-annotated copy of the source PDF too:
figmark path/to/document.pdf --annotate-pdf
Run as a service (container)
figmark also ships as a hardened HTTP service for air-gapped deployment — a single container that needs only a reachable OpenAI-compatible vision endpoint.
Prebuilt images are published to GHCR — every green build of main as :edge,
and releases as :<version> + :latest:
docker pull ghcr.io/ztein/figmark:edge
Or run the stack with compose (no source checkout needed — just compose.yaml
and a config):
cp config.example.yaml config.yaml # edit api.base_url + api.model
mkdir -p secrets
printf '%s' 'a-strong-token' > secrets/auth_token
printf '%s' "$FIGMARK_API_KEY" > secrets/figmark_api_key
docker compose up -d # pulls ghcr.io/ztein/figmark:edge
curl -s -X POST http://127.0.0.1:8000/v1/convert \
-H "Authorization: Bearer a-strong-token" \
-F "file=@document.pdf;type=application/pdf"
Unlike the CLI (which writes files — <name>.md, figures.json, …), the HTTP
surface returns everything inline as JSON:
| Field | Meaning |
|---|---|
markdown |
the converted document (with <!-- page N --> markers for provenance) |
page_count / figure_count / skipped_count |
pages processed, figures described, images skipped by the significance gate |
language |
detected document language |
usage |
prompt_tokens, completion_tokens, total_tokens, api_calls, calls_missing_usage |
estimated_cost / currency |
monetary estimate — null unless both token prices are set in config.yaml (never a misleading 0) |
Health/metadata endpoints are auth-free: GET /readyz and GET /version.
LibreChat / Mistral-OCR-compatible endpoint
The server also speaks the Mistral OCR wire format, so tools that expect that
API — LibreChat in particular — can
use figmark as a self-hosted, air-gappable OCR backend. Point the client's
OCR_BASEURL at http(s)://<figmark-host>/v1 and set its OCR_API_KEY to the
figmark bearer token. figmark implements the four calls LibreChat's default
strategy makes: POST /v1/files → GET /v1/files/{id}/url → POST /v1/ocr →
DELETE /v1/files/{id}, returning { "pages": [ { "index", "markdown", "images" } ] }
(docs/tickets/T-052).
Why figmark rather than the OCR service this contract comes from: figmark
fulfils the same API but aims to extract more of the document's information
value — for born-digital, figure/diagram-heavy documents it describes
figures and diagrams with a vision model instead of OCR'ing them into broken
text or dropping them — and keeps the data on your own network. Limitation: figmark's
raster OCR is Tesseract, not a vision-language model, so this backend is strongest
on born-digital / figure-heavy PDFs and weaker than a VLM on messy scans and
handwriting. It accepts the formats in the input.formats allowlist (PDF by
default; EPUB and the other PyMuPDF-native formats are free to enable); anything
else — including raster image input via image_url — returns 415. Do not
deploy it expecting VLM-grade scan fidelity.
When a scanned page can't be OCR'd — the rendered page is too large for the
vision model even after figmark downscales it, or the model rejects/returns nothing
— the request fails loud with a 422 naming the page and the reason (and the
remedy: lower the OCR render DPI, or use a model with a larger image-input limit),
rather than a misleading generic backend error (docs/tickets/T-053).
The image is non-root, read-only-rootfs compatible, self-contained (Tesseract + language data baked in), and passes a hard Trivy scan in CI. Secrets come from files (never the image or plaintext env). Full runbook: docs/deployment.md; security model: SECURITY.md.
Configuration
Everything beyond the API key is controlled by your config.yaml (start from
config.example.yaml):
api.model/api.base_url— which model and endpoint to uselanguage.output— output language for descriptions/diagrams/summary:autofollows the document's own language, or name one (Swedish,English) to force itdescription.prompt/diagrams.prompt— the figure and diagram prompts (written in Swedish by default; they set the task and register, the output language is controlled separately bylanguage.output)concurrency.max_workers— parallel API callscontext.*— how much surrounding text to send for contextsignificance.enabled— let the model skip purely decorative imagesdocument_summary.*— generate a document-type summary and pass it as contextocr.language— Tesseract language
Technical thresholds (clustering, OCR, retries, render DPI) live as documented
constants in src/figmark/<module>.py.
How it works
A PDF is classified as text-encoded or scanned and its text extracted (or OCR'd),
then given structure (headings/lists inferred from typography), ruled tables
reconstructed as Markdown, running headers/footers stripped, hyperlinks preserved,
and images + vector diagrams found and described in parallel — all woven back into
the text in column-aware reading order. A figures.json indexes every figure. For
the full pipeline, module map, outputs, and the open Phase-2 items, see
docs/architecture.md.
Known limitations
- Broken text layers. figmark trusts the PDF's embedded text. A PDF with a missing or broken font encoding (no/garbled ToUnicode CMap) can carry plenty of characters that are actually mojibake; figmark extracts them as-is. It does not silently swallow this — pages whose text looks broken are flagged with a loud warning — but it does not yet auto-OCR them. For such files, re-export from the source or pre-OCR them before converting.
- Tables. Ruled data tables are reconstructed as Markdown behind a conservative
filter (
docs/tickets/T-031). Quantitative data drawn as a chart is captured by the figure description instead. Borderless / whitespace-aligned tables (e.g. forecast appendices with no ruling lines) are not detected and fall through to the text path, where they are flattened: row labels and cell values land on separate lines and column headers can detach, so the column↔value link is lost in the raw text (docs/tickets/T-050). The data is all still present, and a downstream LLM can often recover it — the preserved<!-- page N -->markers let you point a model (or a reader) at the source page. This is deliberate: forcing detection on these pages (PyMuPDF's whitespace strategy) does find a grid, but mis-aligns its columns — chopping labels and splitting numbers — so it would emit a table asserting the wrong column↔value mapping, which is worse than honest flat text. We keep the raw text rather than guess a structure. For number-critical lookups over such documents, treat tables as a known gap. - Footnotes. Footnote text is kept (in reading order, at the page bottom) but
not yet segregated/marked as footnotes (
docs/tickets/T-044, Phase 2). - Tagged PDF.
--tagged-pdfwrites the structure-tree foundation (figure/Alt); full PDF/UA conformance is not yet claimed (docs/tickets/T-004).
Tests
pytest -m "not live and not docker" # fast, offline, no API key, no Docker
pytest -m docker # builds the image + runs the compose stack
pytest -m "live" # against the real API (costs money, takes minutes)
pytest # everything
See examples/README.md for sample documents.
Contributing
See CONTRIBUTING.md. Issues and PRs welcome.
Roadmap
- 0.2 — configurable pipeline. Per-task provider/model selection (a different
model for image description, diagram description, and vision-OCR) via a
providers/tasksconfig, plus all technical knobs exposed in config. - Document model + more formats. A typed block model
(
heading/paragraph/list/table/figure) that PDF maps into and Markdown renders out of (docs/tickets/T-042), so the same structure work carries over to Word/Excel/PowerPoint inputs.
License
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file figmark-0.3.0.tar.gz.
File metadata
- Download URL: figmark-0.3.0.tar.gz
- Upload date:
- Size: 311.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a74b55036c45ed7d4554335ce68ac8509be48f170a732a9ed744c678c1e522a4
|
|
| MD5 |
68638096871ef0dc4f21d2c18d441c92
|
|
| BLAKE2b-256 |
44997babc63bd2f0fe580d79565b2625f3508c043c637bbd0d0e29b2a095088f
|
Provenance
The following attestation bundles were made for figmark-0.3.0.tar.gz:
Publisher:
release.yml on Ztein/figmark
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
figmark-0.3.0.tar.gz -
Subject digest:
a74b55036c45ed7d4554335ce68ac8509be48f170a732a9ed744c678c1e522a4 - Sigstore transparency entry: 2047515391
- Sigstore integration time:
-
Permalink:
Ztein/figmark@88ee35fd3830a0d5cb3d64bc823bc9badfb30e48 -
Branch / Tag:
refs/tags/v0.3.0 - Owner: https://github.com/Ztein
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@88ee35fd3830a0d5cb3d64bc823bc9badfb30e48 -
Trigger Event:
push
-
Statement type:
File details
Details for the file figmark-0.3.0-py3-none-any.whl.
File metadata
- Download URL: figmark-0.3.0-py3-none-any.whl
- Upload date:
- Size: 80.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
449c1b66141505b58edfd612b88ab49fc9af3f3dcb99beca4cef900240f43dd8
|
|
| MD5 |
a8be72fce1aae9f5f81dde5ff5b9d866
|
|
| BLAKE2b-256 |
bc732e7d742f8577d5acb252691e4f39e0133ce3ac2600dbeb02b0118ca04a98
|
Provenance
The following attestation bundles were made for figmark-0.3.0-py3-none-any.whl:
Publisher:
release.yml on Ztein/figmark
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
figmark-0.3.0-py3-none-any.whl -
Subject digest:
449c1b66141505b58edfd612b88ab49fc9af3f3dcb99beca4cef900240f43dd8 - Sigstore transparency entry: 2047515395
- Sigstore integration time:
-
Permalink:
Ztein/figmark@88ee35fd3830a0d5cb3d64bc823bc9badfb30e48 -
Branch / Tag:
refs/tags/v0.3.0 - Owner: https://github.com/Ztein
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@88ee35fd3830a0d5cb3d64bc823bc9badfb30e48 -
Trigger Event:
push
-
Statement type: