Convert DOCX and PDF files to structured HTML or JATS XML for Open Journal Systems (OJS)

OJS Galleon

Convert DOCX and PDF files to structured HTML or JATS XML for additional galleys on Open Journal Systems (OJS) sites.

How it works

Source	→ HTML	→ JATS XML
`.docx`	mammoth with a Word-style map	pandoc native JATS output
`.pdf`	pdfplumber (text + tables) + pymupdf (images)	same tools (pdfplumber + pymupdf)

PDF extraction features:

Two-column layout detection — left column is always read before right
Tables extracted as <table> / JATS <table-wrap> with accessible headers (scope="col")
Embedded raster images extracted and inlined as base64 data-URIs
Font-size heuristics distinguish headings from body text
Running headers/footers stripped by repetition detection (footnotes preserved)
Page numbers stripped
Bare URLs linkified as <a> tags
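
The two-column rule above can be illustrated with a small sketch. It is not the package's own code: the function name `reading_order` and the midline test are assumptions made for illustration, and the word dictionaries only mimic the `x0`, `x1`, `top`, and `text` keys that pdfplumber reports for extracted words.

```python
def reading_order(words, page_width):
    """Return words in reading order, left column before right.

    Hypothetical sketch: a page is treated as two-column only when no
    word crosses the vertical midline and both halves contain text.
    """
    mid = page_width / 2

    def key(word):
        # Top-to-bottom first, then left-to-right within a line.
        return (round(word["top"]), word["x0"])

    # A word that crosses the midline suggests a single-column page.
    straddles = any(w["x0"] < mid < w["x1"] for w in words)
    left = [w for w in words if w["x1"] <= mid]
    right = [w for w in words if w["x1"] > mid]

    if straddles or not left or not right:
        return sorted(words, key=key)
    # Read the whole left column, then the whole right column.
    return sorted(left, key=key) + sorted(right, key=key)


words = [
    {"text": "C", "x0": 350, "x1": 400, "top": 10},
    {"text": "A", "x0": 50, "x1": 100, "top": 10},
    {"text": "D", "x0": 350, "x1": 400, "top": 30},
    {"text": "B", "x0": 50, "x1": 100, "top": 30},
]
ordered = [w["text"] for w in reading_order(words, page_width=600)]
```

A plain top-to-bottom sort would interleave the columns (A, C, B, D); splitting at the midline first keeps each column contiguous.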

HTML output is always valid and accessible:

<html lang="..."> and <title> on every document (WCAG 2.4.2 / 3.1.1)
Empty <th> elements converted to <td> (ADA Title II / WCAG 1.3.1)
Self-contained — no external assets, images embedded as data-URIs
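
As a rough illustration of the empty-header rule, the sketch below demotes `<th>` cells that contain nothing but whitespace to plain `<td>` cells. The function name `demote_empty_headers` and the regular-expression approach are assumptions for this example; the package may well use an HTML parser instead.

```python
import re

# Matches a <th ...> element whose content is only whitespace or &nbsp;.
# The \b keeps the pattern from matching <thead>.
_EMPTY_TH = re.compile(r"<th\b[^>]*>(?:\s|&nbsp;)*</th>", re.IGNORECASE)


def demote_empty_headers(html: str) -> str:
    """Replace empty header cells with empty data cells.

    Attributes such as scope="col" are dropped on purpose, since they
    only make sense on header cells.
    """
    return _EMPTY_TH.sub("<td></td>", html)


row = '<tr><th scope="col"></th><th scope="col">Year</th></tr>'
fixed = demote_empty_headers(row)
```

Header cells that carry text are left untouched, so screen readers still announce real column headers.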

Requirements

Python ≥ 3.11
uv package manager
pandoc on $PATH (required for DOCX → JATS only)

brew install pandoc   # macOS

Installation

From PyPI

pip install ojsgalleon
# or
uv add ojsgalleon

From source

git clone <repo-url>
cd ojsgalleon
uv sync

Usage

Command line

# PDF → HTML, saved to file
ojsgalleon convert paper.pdf --output paper.html

# DOCX → JATS XML
ojsgalleon convert paper.docx --format jats --output paper.xml

# Non-English document
ojsgalleon convert paper.pdf --lang fr --output article.html

# Start the API server
ojsgalleon serve
ojsgalleon serve --host 127.0.0.1 --port 9000

With uv run from a source checkout:

uv run ojsgalleon convert paper.pdf --output paper.html

usage: ojsgalleon <command> [options]

commands:
  convert   Convert a .docx or .pdf file
  serve     Start the HTTP API server

convert options:
  file                  Path to the input .docx or .pdf
  --format {html,jats}  Output format (default: html)
  --output, -o          Write to file instead of stdout
  --lang                BCP 47 language tag for html[lang] (default: en)

serve options:
  --host HOST           Bind host (default: 0.0.0.0)
  --port PORT           Bind port (default: 8000)

Mammoth warnings (e.g. unmapped Word styles) are written to stderr and do not appear in the output file.
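
For batch jobs, the documented `convert` options can be driven from a short script. This is a minimal sketch: the helper names `build_convert_command` and `convert_all` are hypothetical, and it assumes the `ojsgalleon` executable is on `$PATH` and exits with a non-zero status on failure.

```python
import subprocess
from pathlib import Path

SUPPORTED = {".docx", ".pdf"}


def build_convert_command(source, fmt="html", lang="en"):
    """Build the argument list for one `ojsgalleon convert` call."""
    source = Path(source)
    if source.suffix.lower() not in SUPPORTED:
        raise ValueError(f"unsupported input: {source}")
    if fmt not in {"html", "jats"}:
        raise ValueError(f"unsupported format: {fmt}")
    # Write next to the input: paper.pdf -> paper.html or paper.xml.
    target = source.with_suffix(".html" if fmt == "html" else ".xml")
    cmd = ["ojsgalleon", "convert", str(source),
           "--format", fmt, "--output", str(target)]
    if fmt == "html":
        # --lang is documented as setting html[lang], so it is only
        # passed for HTML output here.
        cmd += ["--lang", lang]
    return cmd


def convert_all(folder, fmt="html", lang="en"):
    """Convert every .docx and .pdf in a folder, stopping on the first error."""
    for source in sorted(Path(folder).iterdir()):
        if source.suffix.lower() in SUPPORTED:
            subprocess.run(build_convert_command(source, fmt, lang), check=True)
```

Calling `convert_all("submissions", fmt="jats")` would then write one `.xml` file beside each input file.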

As a library

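from pathlib import Path

from ojsgalleon import pdf_to_html, pdf_to_jats, docx_to_html, docx_to_jats

# Each converter takes the raw bytes of the input file.
# docx_to_html returns an (html, warnings) pair; the others return a string.
html, warnings = docx_to_html(Path("paper.docx").read_bytes())
html            = pdf_to_html(Path("paper.pdf").read_bytes())
jats            = pdf_to_jats(Path("paper.pdf").read_bytes())
jats            = docx_to_jats(Path("paper.docx").read_bytes())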
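
Because only `docx_to_html` returns a warnings list, calling code may want one uniform entry point. The wrapper below is a sketch under that assumption; `converter_name` and `convert` are hypothetical helpers, not part of the package, and the four library functions are used exactly as shown above.

```python
from importlib import import_module
from pathlib import Path


def converter_name(path, fmt="html"):
    """Map an input path and output format to a library function name."""
    kind = Path(path).suffix.lower().lstrip(".")
    if kind not in {"docx", "pdf"} or fmt not in {"html", "jats"}:
        raise ValueError(f"unsupported combination: {path!r} -> {fmt!r}")
    return f"{kind}_to_{fmt}"


def convert(path, fmt="html"):
    """Return (content, warnings) for any supported input and format."""
    # Imported lazily so the name mapping can be used without the package.
    func = getattr(import_module("ojsgalleon"), converter_name(path, fmt))
    result = func(Path(path).read_bytes())
    if isinstance(result, tuple):
        content, warnings = result
        return content, list(warnings)
    # Converters that return a bare string get an empty warnings list.
    return result, []
```

With this in place, `content, warnings = convert("paper.docx")` and `content, warnings = convert("paper.pdf", "jats")` share the same call shape.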

API server

ojsgalleon serve

Interactive docs: http://localhost:8000/docs

`POST /api/convert`

Accepts multipart/form-data:

Field	Type	Required	Default	Description
`file`	file	yes	—	`.docx` or `.pdf` to convert
`output_format`	string	no	`html`	`html` or `jats`
`lang`	string	no	`en`	BCP 47 language tag

Response:

{
  "filename": "paper.docx",
  "format": "html",
  "content": "<!DOCTYPE html>...",
  "warnings": []
}

curl -X POST http://localhost:8000/api/convert \
  -F "file=@paper.pdf" \
  -F "output_format=html" \
  | jq -r .content > paper.html
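
The same request can be sent from Python using only the standard library. This sketch assumes the server is running at `http://localhost:8000` and accepts an upload labelled `application/octet-stream`; the helper names `build_multipart` and `convert_via_api` are hypothetical.

```python
import json
import uuid
from pathlib import Path
from urllib import request


def build_multipart(fields, filename, payload, boundary=None):
    """Encode text fields plus one `file` part as multipart/form-data."""
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode()
        )
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f'Content-Type: application/octet-stream\r\n\r\n'.encode()
    )
    parts.append(payload)
    parts.append(f"\r\n--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def convert_via_api(path, output_format="html", lang="en",
                    base_url="http://localhost:8000"):
    """POST a file to /api/convert and return the decoded JSON response."""
    path = Path(path)
    body, content_type = build_multipart(
        {"output_format": output_format, "lang": lang},
        path.name,
        path.read_bytes(),
    )
    req = request.Request(
        f"{base_url}/api/convert",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())
```

A caller would read the converted document from the `content` key and inspect `warnings`, matching the response shape shown above.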

Project structure

src/ojsgalleon/
├── __init__.py          # public API
├── api.py               # FastAPI app
├── cli.py               # CLI (subcommands: convert, serve)
└── converters/
    ├── docx.py          # DOCX → HTML (mammoth) / JATS (pandoc)
    ├── pdf.py           # PDF → HTML or JATS (pdfplumber + pymupdf)
    └── html_wrap.py     # Wraps fragments in a valid, styled HTML5 document

Tuning PDF extraction

Two constants in src/ojsgalleon/converters/pdf.py control running header/footer suppression:

Constant	Default	Effect
`_MARGIN_RATIO`	`0.08`	Size of the top/bottom margin zone (8% of page height)
`_RUNNING_TEXT_THRESHOLD`	`0.40`	Fraction of pages a line must appear on to be suppressed

Increase _MARGIN_RATIO if a journal places running headers unusually deep into the text area. Lower _RUNNING_TEXT_THRESHOLD if headers only appear on roughly half the pages.
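
To make the interaction of the two constants concrete, here is a simplified sketch of repetition detection. It is not the code in `pdf.py`: the function name `find_running_text` and the `(text, top, bottom)` line format are assumptions, with coordinates measured from the top of the page.

```python
from collections import Counter

MARGIN_RATIO = 0.08            # mirrors the documented default
RUNNING_TEXT_THRESHOLD = 0.40  # mirrors the documented default


def find_running_text(pages, page_height,
                      margin_ratio=MARGIN_RATIO,
                      threshold=RUNNING_TEXT_THRESHOLD):
    """Return the set of margin-zone lines that repeat across pages.

    `pages` is a list of pages; each page is a list of
    (text, top, bottom) tuples.
    """
    margin = page_height * margin_ratio
    counts = Counter()
    for lines in pages:
        # Collect each distinct line once per page, and only if it sits
        # entirely inside the top or bottom margin zone.
        in_margin = {
            text.strip()
            for text, top, bottom in lines
            if bottom <= margin or top >= page_height - margin
        }
        counts.update(in_margin)
    min_pages = threshold * len(pages)
    return {text for text, n in counts.items() if n >= min_pages}


header = ("Journal of Examples", 30, 45)
body = ("Introduction", 200, 215)
pages = [[header, body], [header, body], [body], [header, body], [body]]
running = find_running_text(pages, page_height=1000)
```

In this example the header appears in the margin zone on three of five pages (60%), so it clears the 40% threshold; raising the threshold to 0.7 would leave it in the output.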

Known limitations

Scanned / image-only PDFs — text extraction requires a text layer; OCR is not included.
Vector graphics in PDFs — charts drawn with PDF path commands are not captured; only embedded raster images are extracted.
Bold-only headings in PDFs — section headings marked with bold weight but the same font size as body text cannot be detected by the font-size heuristic.
Borderless tables in PDFs — pdfplumber's table detection works well for ruled tables but may miss borderless ones.
JATS metadata — generated JATS lacks article metadata (title, authors, DOI); these must be filled in manually.
DOCX images — mammoth does not extract embedded images from DOCX files.
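
Filling in JATS metadata by hand can be scripted. The sketch below adds or updates the article title with the standard library; it assumes the generated document has an un-namespaced `<article>` root, and the helper name `add_article_title` is hypothetical. Note that ElementTree drops any DOCTYPE and XML declaration on output, and that a full JATS record expects `article-meta` children in a fixed order, so validate the result before uploading it.

```python
import xml.etree.ElementTree as ET


def _child(parent, tag, first=False):
    """Return the direct child `tag`, creating it if it is missing."""
    node = parent.find(tag)
    if node is None:
        node = ET.Element(tag)
        if first:
            parent.insert(0, node)   # <front> must precede <body>
        else:
            parent.append(node)
    return node


def add_article_title(jats_xml: str, title: str) -> str:
    """Set front/article-meta/title-group/article-title to `title`."""
    root = ET.fromstring(jats_xml)
    front = _child(root, "front", first=True)
    meta = _child(front, "article-meta")
    group = _child(meta, "title-group")
    _child(group, "article-title").text = title
    return ET.tostring(root, encoding="unicode")


updated = add_article_title(
    "<article><body><p>Text</p></body></article>",
    "A Study of Galleys",
)
```

Authors and a DOI could be added in the same way with `contrib-group` and `article-id` elements.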

Release history

0.2.0 (Apr 4, 2026)
0.1.0 (Apr 1, 2026)
