Convert DOCX and PDF files to structured HTML or JATS XML for Open Journal Systems (OJS)

OJS Galleon

OJS Galleon is an experimental application that attempts to convert DOCX and PDF files to structured HTML or JATS XML in order to provide an HTML galley for Open Journal Systems (OJS) sites.

While not perfect (this is a hard problem), OJS Galleon aims to get you about 90% of the way there: a clean, professional-looking HTML galley that is ADA-accessible out of the box and needs only minor copy editing.

How it works at a high level

OJS Galleon uses two libraries in tandem for PDF extraction: pdfplumber (built on pdfminer) for text and tables, and pymupdf (fitz) for images. pdfplumber gives word-level metadata, like each word's x/y position on the page and its font size, which is the foundation for everything else. Tables are detected via pdfplumber's find_tables(), which uses line detection to identify ruled grids and extract cell data. Images are pulled via pymupdf because it provides the reliable cross-reference IDs (xref) needed to extract the raw image bytes, which pdfplumber alone doesn't expose cleanly. Both libraries operate on the same PDF simultaneously, one per concern.

The text pipeline then works in several passes on each page. First, running headers and footers are identified by pre-scanning all pages and counting how often each margin line appears. Text that repeats on 40%+ of pages is flagged as boilerplate and suppressed. Then, for each page, a word-density histogram across the page width is built to detect whether a gutter (near-empty vertical strip) exists in the middle third of the page (this is the magic behind how a two-column layout is identified even if it's not perfect). When a gutter is found, individual lines are further classified as either "full-width" (words on both sides with a small gap, like a title or abstract) or "column-confined" (words on both sides but with a large inter-column gap, meaning they're two independent parallel lines). Full-width regions are read straight across; column regions are read left column first, then right (sorry for lack of international support). Finally, font-size heuristics promote lines with larger-than-median text into headings, and gap-based paragraph detection groups consecutive lines into <p> elements by measuring whether the vertical space between lines exceeds 1.6× the median line spacing on that page.
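
The gutter and paragraph passes described above can be sketched in plain Python. This is a minimal illustration, not OJS Galleon's actual code: the function names, the bin count, the `max_fill` and `min_bins` cut-offs, and the input shapes are all assumptions, and "line spacing" is taken here to mean the distance between the tops of consecutive lines.

```python
from statistics import median


def find_gutter(word_spans, page_width, bins=100, max_fill=0.05, min_bins=3):
    """Return (x0, x1) of the widest near-empty strip in the middle third.

    word_spans: list of (x0, x1) horizontal extents, one per word.
    Returns None when no strip of at least `min_bins` bins is found.
    """
    if not word_spans:
        return None
    bin_width = page_width / bins
    counts = [0] * bins
    for x0, x1 in word_spans:
        # Count the word in every bin its extent touches.
        first = max(0, int(x0 / bin_width))
        last = min(bins - 1, int(x1 / bin_width))
        for i in range(first, last + 1):
            counts[i] += 1
    limit = max(counts) * max_fill  # "near-empty" relative to the densest bin
    lo, hi = bins // 3, 2 * bins // 3
    best, start = None, None
    for i in range(lo, hi + 1):  # i == hi acts as a sentinel that closes a run
        empty = i < hi and counts[i] <= limit
        if empty and start is None:
            start = i
        elif not empty and start is not None:
            width = i - start
            if width >= min_bins and (best is None or width > best[1] - best[0]):
                best = (start, i)
            start = None
    if best is None:
        return None
    return best[0] * bin_width, best[1] * bin_width


def group_paragraphs(lines, factor=1.6):
    """Group (top, text) lines, sorted by top, into paragraph strings.

    A new paragraph starts when the gap to the previous line exceeds
    `factor` times the median gap on the page.
    """
    if not lines:
        return []
    gaps = [b[0] - a[0] for a, b in zip(lines, lines[1:])]
    typical = median(gaps) if gaps else 0
    paragraphs = [[lines[0][1]]]
    for gap, (_, text) in zip(gaps, lines[1:]):
        if gap > factor * typical:
            paragraphs.append([text])
        else:
            paragraphs[-1].append(text)
    return [" ".join(p) for p in paragraphs]
```

A real page contributes hundreds of words to the histogram, so ordinary inter-word gaps wash out and only a true column gutter stays near-empty; `min_bins` guards against narrow accidental gaps on sparse pages.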

Source	→ HTML	→ JATS XML
`.docx`	mammoth with a Word-style map	pandoc native JATS output
`.pdf`	pdfplumber (text + tables) + pymupdf (images)	same

PDF extraction features:

Two-column layout detection — left column is always read before right
Tables extracted as <table> / JATS <table-wrap> with accessible headers (scope="col")
Embedded raster images extracted and embedded as base64 data-URIs
Font-size heuristics distinguish headings from body text
Running headers/footers stripped by repetition detection (footnotes preserved)
Page numbers stripped
Bare URLs linkified as <a> tags
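
As an illustration of the font-size heuristic in the list above, here is a minimal sketch that tags lines as headings or body text. The `ratio` cut-off and the rule that maps larger sizes to shallower heading levels are assumptions made for this example; the source only states that larger-than-median text is promoted to headings.

```python
from statistics import median


def tag_lines(lines, ratio=1.15):
    """Tag (text, font_size) lines as headings or paragraphs.

    Lines whose size exceeds the median by `ratio` become headings.
    Distinct heading sizes are ranked: largest -> h1, next -> h2, ...
    (this ranking is an assumption of the sketch).
    """
    if not lines:
        return []
    body_size = median([size for _, size in lines])
    heading_sizes = sorted(
        {size for _, size in lines if size > body_size * ratio},
        reverse=True,
    )
    level_for = {
        size: min(rank, 6)  # HTML only defines h1..h6
        for rank, size in enumerate(heading_sizes, start=1)
    }
    return [
        (f"h{level_for[size]}" if size in level_for else "p", text)
        for text, size in lines
    ]
```

Because the decision rests on size alone, a heading set in bold at body size is tagged as `p` here, which matches the bold-only limitation noted later in this README.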

HTML output is always valid and accessible:

<html lang="..."> and <title> on every document (WCAG 2.4.2 / 3.1.1)
<main> landmark wrapping content (WCAG 1.3.6)
Empty <th> elements converted to <td> (ADA Title II / WCAG 1.3.1)
Self-contained — no external assets, images embedded as data-URIs
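
The self-contained output relies on data-URIs for images. The sketch below shows the general technique using only the standard library; the helper names and the default MIME type are hypothetical and are not taken from the OJS Galleon source.

```python
import base64
from html import escape


def image_data_uri(data: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data-URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def img_tag(data: bytes, alt: str, mime: str = "image/png") -> str:
    """Build an <img> element that carries its own image bytes."""
    src = image_data_uri(data, mime)
    # Escape the alt text so quotes or angle brackets cannot break the tag.
    return f'<img src="{src}" alt="{escape(alt, quote=True)}">'
```

Base64 inflates the payload by roughly a third, which is the trade-off for an HTML galley that needs no external assets.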

Requirements

Python ≥ 3.11
pandoc on $PATH (required for DOCX → JATS only)

brew install pandoc   # macOS

Optional: AI features

The AI alt text and AI accessibility review features require the anthropic package (included as a dependency) and an API key in the environment:

export CLAUDE_API=your-api-key-here

Without this variable set, both AI features are silently skipped and the standard output is returned.
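
Because the AI features are skipped silently, a caller may want to check for the key before requesting them. A minimal sketch follows; `ai_features_enabled` is a hypothetical helper written for this example, and treating a whitespace-only value as unset is an assumption of the helper rather than documented behavior.

```python
import os


def ai_features_enabled(environ=None) -> bool:
    """Report whether the CLAUDE_API variable is set to a non-blank value.

    `environ` defaults to os.environ; passing a dict makes this easy to test.
    """
    env = os.environ if environ is None else environ
    return bool(env.get("CLAUDE_API", "").strip())
```

The result can be logged up front, or used to decide whether to pass `improve_accessibility=True` or `generate_alt_text=True` when calling the library functions described under "As a library".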

Installation

From PyPI

pip install ojsgalleon
# or
uv add ojsgalleon

From source

git clone <repo-url>
cd ojsgalleon
uv sync

Usage

Command line

# PDF → HTML, saved to file
ojsgalleon convert paper.pdf --output paper.html

# DOCX → JATS XML
ojsgalleon convert paper.docx --format jats --output paper.xml

# Non-English document
ojsgalleon convert paper.pdf --lang fr --output article.html

# Start the API server
ojsgalleon serve
ojsgalleon serve --host 127.0.0.1 --port 9000

With uv run from a source checkout:

uv run ojsgalleon convert paper.pdf --output paper.html

usage: ojsgalleon <command> [options]

commands:
  convert   Convert a .docx or .pdf file
  serve     Start the HTTP API server

convert options:
  file                  Path to the input .docx or .pdf
  --format {html,jats}  Output format (default: html)
  --output, -o          Write to file instead of stdout
  --lang                BCP 47 language tag for html[lang] (default: en)

serve options:
  --host HOST           Bind host (default: 0.0.0.0)
  --port PORT           Bind port (default: 8000)

Mammoth warnings (e.g. unmapped Word styles) are written to stderr and do not appear in the output file.

As a library

from pathlib import Path

from ojsgalleon import pdf_to_html, pdf_to_jats, docx_to_html, docx_to_jats

# HTML converters return an (html, warnings) tuple
html, warnings = docx_to_html(Path("paper.docx").read_bytes())
html, warnings = pdf_to_html(Path("paper.pdf").read_bytes())

# JATS converters return a single value
jats = pdf_to_jats(Path("paper.pdf").read_bytes())
jats = docx_to_jats(Path("paper.docx").read_bytes())

Both pdf_to_html and docx_to_html return a (html: str, warnings: list[str]) tuple. Warnings include any issues reported by mammoth (DOCX) or the AI passes when enabled.
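
One way to use the (html, warnings) return shape is a small wrapper that writes the HTML next to a source file and routes warnings to stderr, mirroring what the CLI does. This is a sketch under assumptions: `convert_file` is a hypothetical helper, and the converter is passed in as a callable so the wrapper works with either `pdf_to_html` or `docx_to_html`.

```python
import sys
from pathlib import Path
from typing import Callable

# Any callable that takes file bytes and returns (html, warnings).
Converter = Callable[[bytes], tuple[str, list[str]]]


def convert_file(src: Path, converter: Converter, dest_dir: Path) -> Path:
    """Convert `src` with `converter` and write <stem>.html into `dest_dir`."""
    html, warnings = converter(src.read_bytes())
    for warning in warnings:
        # Keep warnings out of the output file, as the CLI does.
        print(f"{src.name}: {warning}", file=sys.stderr)
    dest = dest_dir / f"{src.stem}.html"
    dest.write_text(html, encoding="utf-8")
    return dest
```

With the package installed, a call would look like `convert_file(Path("paper.pdf"), pdf_to_html, Path("out"))`, assuming the `out` directory already exists.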

Optional parameters for pdf_to_html and docx_to_html:

Parameter	Type	Default	Description
`lang`	`str`	`"en"`	BCP 47 language tag for `html[lang]`
`style_overrides`	`dict[str, str] | None`	`None`	CSS variable overrides, e.g. `{"--accent": "#c0392b"}`
`improve_accessibility`	`bool`	`False`	Run AI accessibility review (requires `CLAUDE_API`)

pdf_to_html also accepts:

Parameter	Type	Default	Description
`generate_alt_text`	`bool`	`False`	Generate alt text for images with Claude Haiku (requires `CLAUDE_API`)

Web UI

ojsgalleon serve

Then open http://localhost:8000 in your browser. The UI supports:

Drag-and-drop or click-to-browse file upload
Output format selection (HTML or JATS XML)
Language tag input
AI alt text — generate image descriptions with Claude Haiku (PDF only; requires CLAUDE_API)
AI accessibility review — post-process the HTML with Claude Sonnet to apply WCAG 2.1 AA / ADA Title II fixes including skip navigation, heading hierarchy, focus styles, footnote labels, language tagging, and color contrast (HTML only; requires CLAUDE_API)
Galley styles — a collapsible color picker panel to customize the six main CSS design tokens before converting

API server

Interactive docs: http://localhost:8000/docs

`POST /api/convert`

Accepts multipart/form-data:

Field	Type	Required	Default	Description
`file`	file	yes	—	`.docx` or `.pdf` to convert
`output_format`	string	no	`html`	`html` or `jats`
`lang`	string	no	`en`	BCP 47 language tag

Response:

{
  "filename": "paper.docx",
  "format": "html",
  "content": "<!DOCTYPE html>...",
  "warnings": []
}

curl -X POST http://localhost:8000/api/convert \
  -F "file=@paper.pdf" \
  -F "output_format=html" \
  | jq -r .content > paper.html
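
The same request can be made from Python with only the standard library. This sketch assumes the server is running at http://localhost:8000 and that it accepts `application/octet-stream` as the part type for the uploaded file, which has not been verified here; `build_multipart` and `convert_via_api` are hypothetical helper names.

```python
import json
import os
import uuid
from urllib import request


def build_multipart(fields, filename, data, boundary=None):
    """Return (body, content_type) for a multipart/form-data upload.

    `fields` holds the plain form fields; the file goes in a part named "file".
    """
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f"{value}\r\n"
            ).encode()
        )
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    parts.append(head.encode() + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def convert_via_api(path, output_format="html", lang="en",
                    base_url="http://localhost:8000"):
    """POST a file to /api/convert and return the decoded JSON response."""
    with open(path, "rb") as fh:
        data = fh.read()
    body, content_type = build_multipart(
        {"output_format": output_format, "lang": lang},
        os.path.basename(path),
        data,
    )
    req = request.Request(
        f"{base_url}/api/convert",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```

Given the response shape shown above, `convert_via_api("paper.pdf")["content"]` would hold the HTML and `["warnings"]` any conversion warnings.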

Project structure

src/ojsgalleon/
├── __init__.py          # public API
├── api.py               # FastAPI app + REST endpoint
├── ui.py                # Web UI (HTMX + Tailwind) + /ui/convert endpoint
├── cli.py               # CLI (subcommands: convert, serve)
└── converters/
    ├── docx.py          # DOCX → HTML (mammoth) / JATS (pandoc)
    ├── pdf.py           # PDF → HTML or JATS (pdfplumber + pymupdf)
    └── html_wrap.py     # HTML wrapper, CSS design tokens, AI accessibility pass

Tuning PDF extraction

Two constants in src/ojsgalleon/converters/pdf.py control running header/footer suppression:

Constant	Default	Effect
`_MARGIN_RATIO`	`0.08`	Size of the top/bottom margin zone (8% of page height)
`_RUNNING_TEXT_THRESHOLD`	`0.40`	Fraction of pages a line must appear on to be suppressed

Increase _MARGIN_RATIO if a journal places running headers unusually deep into the text area. Lower _RUNNING_TEXT_THRESHOLD if headers only appear on roughly half the pages.
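
To make the effect of the two constants concrete, here is a minimal sketch of repetition-based header/footer detection. It is not the code in pdf.py: the function name, the input shape, and the text normalization are assumptions, and coordinates are measured from the top of the page as pdfplumber does.

```python
from collections import Counter

MARGIN_RATIO = 0.08            # mirrors _MARGIN_RATIO
RUNNING_TEXT_THRESHOLD = 0.40  # mirrors _RUNNING_TEXT_THRESHOLD


def find_running_text(pages, margin_ratio=MARGIN_RATIO,
                      threshold=RUNNING_TEXT_THRESHOLD):
    """Return the set of normalized margin lines that repeat across pages.

    pages: list of (page_height, lines), where lines is a list of
    (top, bottom, text) with coordinates measured from the top of the page.
    """
    if not pages:
        return set()
    counts = Counter()
    for height, lines in pages:
        margin = height * margin_ratio
        seen = set()  # count each distinct line at most once per page
        for top, bottom, text in lines:
            in_top_zone = top < margin
            in_bottom_zone = bottom > height - margin
            if in_top_zone or in_bottom_zone:
                # Assumed normalization: collapse whitespace, ignore case.
                seen.add(" ".join(text.split()).lower())
        counts.update(seen)
    return {t for t, n in counts.items() if n / len(pages) >= threshold}
```

Footnotes sit in the bottom zone but rarely repeat, so they fall below the threshold and survive; a larger `margin_ratio` widens the zones, and a lower `threshold` catches headers that appear on fewer pages.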

Known limitations

Scanned / image-only PDFs — text extraction requires a text layer; OCR is not included.
Vector graphics in PDFs — charts drawn with PDF path commands are not captured; only embedded raster images are extracted.
Bold-only headings in PDFs — section headings marked with bold weight but the same font size as body text cannot be detected by the font-size heuristic.
Borderless tables in PDFs — pdfplumber's table detection works well for ruled tables but may miss borderless ones.
JATS metadata — generated JATS lacks article metadata (title, authors, DOI); these must be filled in manually.
DOCX images — mammoth does not extract embedded images from DOCX files.

Release history

0.2.0 (this version): Apr 4, 2026
0.1.0: Apr 1, 2026

Download files

Source Distribution

ojsgalleon-0.2.0.tar.gz (3.6 MB)

Uploaded Apr 4, 2026 Source

Built Distribution

ojsgalleon-0.2.0-py3-none-any.whl (31.1 kB)

Uploaded Apr 4, 2026 Python 3

File details

Details for the file ojsgalleon-0.2.0.tar.gz.

File metadata

Download URL: ojsgalleon-0.2.0.tar.gz
Upload date: Apr 4, 2026
Size: 3.6 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.7.13

File hashes

Hashes for ojsgalleon-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`b2fd6d6f6b413ec227f586e2652807c9fafec90ad9f9f9985e88de0a003553c4`
MD5	`d93124a75a6623ea580d1b75f0b64bc4`
BLAKE2b-256	`a2bad02202badca7ce8ee261ef142926cc2a9f73f990b2561cc9e4c88c99597a`

File details

Details for the file ojsgalleon-0.2.0-py3-none-any.whl.

File metadata

Download URL: ojsgalleon-0.2.0-py3-none-any.whl
Upload date: Apr 4, 2026
Size: 31.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.7.13

File hashes

Hashes for ojsgalleon-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a1bbe1f541f8351d72d08e4be42c60501647a935befb0e6e21806b6c48729c77`
MD5	`7a5e7131e1f7bee6f31bffd0900e7462`
BLAKE2b-256	`ed96a4321551d26ad169ceb9ee0888f8585dfeebec742dea6b04de874fbd72e1`
