Convert DOCX and PDF files to structured HTML or JATS XML for Open Journal Systems (OJS)
OJS Galleon
Convert DOCX and PDF files to structured HTML or JATS XML for additional galleys on Open Journal Systems (OJS) sites.
How it works
| Source | → HTML | → JATS XML |
|---|---|---|
| .docx | mammoth with a Word-style map | pandoc native JATS output |
| .pdf | pdfplumber (text + tables) + pymupdf (images) | same |
PDF extraction features:
- Two-column layout detection — left column is always read before right
- Tables extracted as `<table>` / JATS `<table-wrap>` with accessible headers (`scope="col"`)
- Embedded raster images extracted and embedded as base64 data-URIs
- Font-size heuristics distinguish headings from body text
- Running headers/footers stripped by repetition detection (footnotes preserved)
- Page numbers stripped
- Bare URLs linkified as `<a>` tags
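As a rough illustration of the two-column rule in the list above, the sketch below orders word boxes so that the left column is read before the right. It is not the package's actual code: the function name `reading_order`, the gutter width, and the 5% crossing tolerance are assumptions made for the example. The word dictionaries use the `x0`, `x1`, `top`, and `text` keys that pdfplumber's `extract_words()` returns.

```python
def reading_order(words, page_width, gutter_ratio=0.04, max_crossing=0.05):
    """Return words in reading order, left column first on two-column pages.

    Each word is a dict with "x0", "x1", "top" and "text" keys.
    gutter_ratio and max_crossing are illustrative values, not ojsgalleon's.
    """
    if not words:
        return []
    mid = page_width / 2
    half_gutter = page_width * gutter_ratio / 2

    # Words that overlap the central strip suggest a single-column layout.
    crossing = [
        w for w in words
        if w["x0"] < mid + half_gutter and w["x1"] > mid - half_gutter
    ]
    two_column = len(crossing) / len(words) <= max_crossing

    def top_then_left(w):
        return (w["top"], w["x0"])

    if not two_column:
        return sorted(words, key=top_then_left)

    # Assign each word to a column by its horizontal centre.
    left = [w for w in words if (w["x0"] + w["x1"]) / 2 < mid]
    right = [w for w in words if (w["x0"] + w["x1"]) / 2 >= mid]
    return sorted(left, key=top_then_left) + sorted(right, key=top_then_left)
```

On a page where almost no word crosses the centre strip, everything in the left half is emitted before anything in the right half; otherwise the words are simply sorted top to bottom.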
HTML output is always valid and accessible:
- `<html lang="...">` and `<title>` on every document (WCAG 2.4.2 / 3.1.1)
- Empty `<th>` elements converted to `<td>` (ADA Title II / WCAG 1.3.1)
- Self-contained: no external assets, images embedded as data-URIs
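To make the "self-contained" point concrete, here is a minimal sketch of embedding image bytes as a base64 data-URI using only the standard library. The helper names `image_to_data_uri` and `img_tag` are hypothetical and are not part of the ojsgalleon API; how the package builds its own `<img>` elements is not documented here.

```python
import base64
import html


def image_to_data_uri(data: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a data-URI that needs no external file."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def img_tag(data: bytes, alt: str, mime: str = "image/png") -> str:
    """Build an <img> element with the image inlined and the alt text escaped."""
    src = image_to_data_uri(data, mime)
    return f'<img src="{src}" alt="{html.escape(alt, quote=True)}">'
```

Because the image travels inside the `src` attribute, the resulting HTML file can be uploaded as a single galley without companion assets, at the cost of a file roughly one third larger than the raw image bytes.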
Requirements
brew install pandoc # macOS
Installation
From PyPI
pip install ojsgalleon
# or
uv add ojsgalleon
From source
git clone <repo-url>
cd ojsgalleon
uv sync
Usage
Command line
# PDF → HTML, saved to file
ojsgalleon convert paper.pdf --output paper.html
# DOCX → JATS XML
ojsgalleon convert paper.docx --format jats --output paper.xml
# Non-English document
ojsgalleon convert paper.pdf --lang fr --output article.html
# Start the API server
ojsgalleon serve
ojsgalleon serve --host 127.0.0.1 --port 9000
With uv run from a source checkout:
uv run ojsgalleon convert paper.pdf --output paper.html
usage: ojsgalleon <command> [options]

commands:
  convert               Convert a .docx or .pdf file
  serve                 Start the HTTP API server

convert options:
  file                  Path to the input .docx or .pdf
  --format {html,jats}  Output format (default: html)
  --output, -o          Write to file instead of stdout
  --lang                BCP 47 language tag for html[lang] (default: en)

serve options:
  --host HOST           Bind host (default: 0.0.0.0)
  --port PORT           Bind port (default: 8000)
Mammoth warnings (e.g. unmapped Word styles) are written to stderr and do not appear in the output file.
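For batch work, the CLI can be driven from a short script. The sketch below assumes `ojsgalleon` is on the `PATH` and uses only the `convert` options listed above; the helper names `build_convert_command` and `convert_all` are hypothetical, and passing `--lang` only for HTML output is a choice made for the example because the option is documented as setting `html[lang]`.

```python
import subprocess
from pathlib import Path


def build_convert_command(source, output, fmt="html", lang="en"):
    """Build the argument list for one `ojsgalleon convert` call."""
    cmd = ["ojsgalleon", "convert", str(source), "--format", fmt,
           "--output", str(output)]
    if fmt == "html":
        # --lang is documented as the value of html[lang].
        cmd += ["--lang", lang]
    return cmd


def convert_all(folder, fmt="html", lang="en"):
    """Convert every .docx and .pdf in a folder, echoing anything on stderr."""
    suffix = ".xml" if fmt == "jats" else ".html"
    for source in sorted(Path(folder).iterdir()):
        if source.suffix.lower() not in {".docx", ".pdf"}:
            continue
        cmd = build_convert_command(source, source.with_suffix(suffix), fmt, lang)
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.stderr:
            # Warnings go to stderr, so they never end up in the output file.
            print(f"{source.name}: {result.stderr.strip()}")
```

Calling `convert_all("submissions", fmt="jats")` would write a `.xml` file next to each input in a `submissions` folder.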
As a library
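from pathlib import Path

from ojsgalleon import pdf_to_html, pdf_to_jats, docx_to_html, docx_to_jats

# docx_to_html returns the HTML together with any conversion warnings
html, warnings = docx_to_html(Path("paper.docx").read_bytes())
# The other converters return the converted document only
html = pdf_to_html(Path("paper.pdf").read_bytes())
jats = pdf_to_jats(Path("paper.pdf").read_bytes())
jats = docx_to_jats(Path("paper.docx").read_bytes())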
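Since `docx_to_html` returns a `(html, warnings)` pair while the other three functions return a single value, calling code may want one uniform entry point. The following is a sketch of such a wrapper, not something the package provides: `convert_file` and its `converters` parameter are hypothetical, and the return shapes are taken from the snippet above.

```python
from pathlib import Path


def convert_file(path, fmt="html", converters=None):
    """Convert one file and always return (content, warnings)."""
    if converters is None:
        # Imported lazily so the helper can be exercised with stand-ins.
        import ojsgalleon
        converters = {
            (".docx", "html"): ojsgalleon.docx_to_html,
            (".docx", "jats"): ojsgalleon.docx_to_jats,
            (".pdf", "html"): ojsgalleon.pdf_to_html,
            (".pdf", "jats"): ojsgalleon.pdf_to_jats,
        }
    path = Path(path)
    key = (path.suffix.lower(), fmt)
    if key not in converters:
        raise ValueError(f"unsupported conversion: {key}")
    result = converters[key](path.read_bytes())
    # Normalise the two return shapes into one.
    if isinstance(result, tuple):
        content, warnings = result
        return content, list(warnings)
    return result, []
```

A call such as `content, warnings = convert_file("paper.docx")` then behaves the same way for every supported input and format.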
API server
ojsgalleon serve
Interactive docs: http://localhost:8000/docs
POST /api/convert
Accepts multipart/form-data:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| file | file | yes | (none) | .docx or .pdf to convert |
| output_format | string | no | html | html or jats |
| lang | string | no | en | BCP 47 language tag |
Response:
{
  "filename": "paper.docx",
  "format": "html",
  "content": "<!DOCTYPE html>...",
  "warnings": []
}
curl -X POST http://localhost:8000/api/convert \
-F "file=@paper.pdf" \
-F "output_format=html" \
| jq -r .content > paper.html
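The same request can be made from Python. This is a minimal client sketch, assuming the third-party `requests` package is installed and the server is running at the default address; `parse_convert_response` and `convert_via_api` are hypothetical helper names, and the shape of error responses is not documented above, so the sketch only relies on the HTTP status code.

```python
import json


def parse_convert_response(body):
    """Return (content, warnings) from the JSON body of POST /api/convert."""
    payload = json.loads(body)
    missing = {"filename", "format", "content", "warnings"} - payload.keys()
    if missing:
        raise ValueError(f"response is missing fields: {sorted(missing)}")
    return payload["content"], payload["warnings"]


def convert_via_api(path, output_format="html", lang="en",
                    base_url="http://localhost:8000"):
    """Upload a file as multipart/form-data and return (content, warnings)."""
    import requests  # third-party; assumed to be installed

    with open(path, "rb") as fh:
        response = requests.post(
            f"{base_url}/api/convert",
            files={"file": fh},
            data={"output_format": output_format, "lang": lang},
            timeout=120,
        )
    response.raise_for_status()
    return parse_convert_response(response.text)
```

Keeping the parsing step separate makes it possible to check the response handling without a running server.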
Project structure
src/ojsgalleon/
├── __init__.py          # public API
├── api.py               # FastAPI app
├── cli.py               # CLI (subcommands: convert, serve)
└── converters/
    ├── docx.py          # DOCX → HTML (mammoth) / JATS (pandoc)
    ├── pdf.py           # PDF → HTML or JATS (pdfplumber + pymupdf)
    └── html_wrap.py     # Wraps fragments in a valid, styled HTML5 document
Tuning PDF extraction
Two constants in src/ojsgalleon/converters/pdf.py control running header/footer suppression:
| Constant | Default | Effect |
|---|---|---|
| `_MARGIN_RATIO` | 0.08 | Size of the top/bottom margin zone (8% of page height) |
| `_RUNNING_TEXT_THRESHOLD` | 0.40 | Fraction of pages a line must appear on to be suppressed |
Increase _MARGIN_RATIO if a journal places running headers unusually deep into the text area. Lower _RUNNING_TEXT_THRESHOLD if a running header appears on fewer than 40% of pages, for example when odd and even pages carry different headers and some pages carry none.
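To show how the two constants interact, here is a simplified sketch of repetition-based detection. It is an illustration under stated assumptions rather than the code in `pdf.py`: the function name `find_running_text`, the `(text, top)` line representation, and the use of `>=` for the threshold comparison are all assumptions.

```python
from collections import Counter

MARGIN_RATIO = 0.08            # mirrors the documented default
RUNNING_TEXT_THRESHOLD = 0.40  # mirrors the documented default


def find_running_text(pages, page_height,
                      margin_ratio=MARGIN_RATIO,
                      threshold=RUNNING_TEXT_THRESHOLD):
    """Return the set of margin-zone lines that repeat across enough pages.

    pages is a list of pages; each page is a list of (text, top) tuples,
    where top is the distance of the line from the top of the page.
    """
    if not pages:
        return set()
    margin = page_height * margin_ratio
    counts = Counter()
    for lines in pages:
        # Count each distinct line at most once per page.
        seen = {
            text.strip()
            for text, top in lines
            if top < margin or top > page_height - margin
        }
        counts.update(seen)
    return {
        text for text, count in counts.items()
        if count / len(pages) >= threshold
    }
```

A footnote sits in the bottom margin zone but normally appears on a single page, so it stays below the threshold and is kept, while a journal title repeated on every page is flagged for removal.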
Known limitations
- Scanned / image-only PDFs — text extraction requires a text layer; OCR is not included.
- Vector graphics in PDFs — charts drawn with PDF path commands are not captured; only embedded raster images are extracted.
- Bold-only headings in PDFs — section headings marked with bold weight but the same font size as body text cannot be detected by the font-size heuristic.
- Borderless tables in PDFs — pdfplumber's table detection works well for ruled tables but may miss borderless ones.
- JATS metadata — generated JATS lacks article metadata (title, authors, DOI); these must be filled in manually.
- DOCX images — mammoth does not extract embedded images from DOCX files.