Convert DOCX and PDF files to structured HTML or JATS XML for Open Journal Systems (OJS)

OJS Galleon

Convert DOCX and PDF files to structured HTML or JATS XML for additional galleys on Open Journal Systems (OJS) sites.

How it works

Source   → HTML                                           → JATS XML
.docx    mammoth with a Word-style map                    pandoc native JATS output
.pdf     pdfplumber (text + tables) + pymupdf (images)    same extractors (pdfplumber + pymupdf)

PDF extraction features:

  • Two-column layout detection — left column is always read before right
  • Tables extracted as <table> / JATS <table-wrap> with accessible headers (scope="col")
  • Embedded raster images extracted and embedded as base64 data-URIs
  • Font-size heuristics distinguish headings from body text
  • Running headers/footers stripped by repetition detection (footnotes preserved)
  • Page numbers stripped
  • Bare URLs linkified as <a> tags
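
The font-size heuristic in the list above can be pictured roughly as follows. This is a minimal sketch rather than the package's actual code: the function name classify_lines, the heading_ratio value of 1.15, and the (text, font_size) input shape are all assumptions made for illustration.

```python
from collections import Counter


def classify_lines(lines, heading_ratio=1.15):
    """Label each (text, font_size) pair as 'heading' or 'body'.

    The body size is taken to be the font size that covers the most
    characters; any line at least `heading_ratio` times larger is
    treated as a heading. Both choices are illustrative assumptions.
    """
    if not lines:
        return []
    chars_per_size = Counter()
    for text, size in lines:
        chars_per_size[round(size, 1)] += len(text)
    body_size = chars_per_size.most_common(1)[0][0]
    return [
        (text, "heading" if size >= body_size * heading_ratio else "body")
        for text, size in lines
    ]


sample = [
    ("Introduction", 14.0),
    ("Open access publishing has grown steadily.", 10.0),
    ("Journals increasingly need accessible galleys.", 10.0),
    ("Methods", 14.0),
    ("We converted a corpus of articles.", 10.0),
]
labels = classify_lines(sample)
```

A line set in bold at the body size would be labelled as body text by a size-only rule like this one, which is consistent with the bold-only headings entry under Known limitations.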

HTML output is always valid and accessible:

  • <html lang="..."> and <title> on every document (WCAG 2.4.2 / 3.1.1)
  • Empty <th> elements converted to <td> (ADA Title II / WCAG 1.3.1)
  • Self-contained — no external assets, images embedded as data-URIs
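
To illustrate the empty <th> rule, here is a small sketch of one way such a clean-up could be done with a regular expression. The helper name demote_empty_headers is hypothetical, and the package may well use an HTML parser instead; the point is only to show the transformation.

```python
import re

# Matches a <th> element (with any attributes) whose content is empty or
# whitespace only. The \b keeps the pattern from matching <thead>.
_EMPTY_TH = re.compile(r"<th\b[^>]*>\s*</th>", re.IGNORECASE)


def demote_empty_headers(html):
    """Replace empty header cells with plain data cells.

    Attributes such as scope are dropped because they are not meaningful
    on a <td>.
    """
    return _EMPTY_TH.sub("<td></td>", html)


row = '<tr><th scope="col"></th><th scope="col">Year</th></tr>'
fixed = demote_empty_headers(row)
```

Header cells that contain text are left untouched, so the scope="col" attribute stays on the real column headers.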

Requirements

  • Python ≥ 3.11
  • uv package manager
  • pandoc on $PATH (required for DOCX → JATS only)

brew install pandoc   # macOS

Installation

From PyPI

pip install ojsgalleon
# or
uv add ojsgalleon

From source

git clone <repo-url>
cd ojsgalleon
uv sync

Usage

Command line

# PDF → HTML, saved to file
ojsgalleon convert paper.pdf --output paper.html

# DOCX → JATS XML
ojsgalleon convert paper.docx --format jats --output paper.xml

# Non-English document
ojsgalleon convert paper.pdf --lang fr --output article.html

# Start the API server
ojsgalleon serve
ojsgalleon serve --host 127.0.0.1 --port 9000

With uv run from a source checkout:

uv run ojsgalleon convert paper.pdf --output paper.html

usage: ojsgalleon <command> [options]

commands:
  convert   Convert a .docx or .pdf file
  serve     Start the HTTP API server

convert options:
  file                  Path to the input .docx or .pdf
  --format {html,jats}  Output format (default: html)
  --output, -o          Write to file instead of stdout
  --lang                BCP 47 language tag for html[lang] (default: en)

serve options:
  --host HOST           Bind host (default: 0.0.0.0)
  --port PORT           Bind port (default: 8000)

Mammoth warnings (e.g. unmapped Word styles) are written to stderr and do not appear in the output file.

As a library

from pathlib import Path

from ojsgalleon import docx_to_html, docx_to_jats, pdf_to_html, pdf_to_jats

# Each function takes the raw file bytes; docx_to_html also returns a list of warnings.
html, warnings = docx_to_html(Path("paper.docx").read_bytes())
html = pdf_to_html(Path("paper.pdf").read_bytes())
jats = pdf_to_jats(Path("paper.pdf").read_bytes())
jats = docx_to_jats(Path("paper.docx").read_bytes())

API server

ojsgalleon serve

Interactive docs: http://localhost:8000/docs

POST /api/convert

Accepts multipart/form-data:

Field          Type    Required  Default  Description
file           file    yes       -        .docx or .pdf to convert
output_format  string  no        html     html or jats
lang           string  no        en       BCP 47 language tag

Response:

{
  "filename": "paper.docx",
  "format": "html",
  "content": "<!DOCTYPE html>...",
  "warnings": []
}

Example request:

curl -X POST http://localhost:8000/api/convert \
  -F "file=@paper.pdf" \
  -F "output_format=html" \
  | jq -r .content > paper.html
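
For callers who would rather handle the response in Python than pipe it through jq, the sketch below parses a response body shaped like the JSON shown above and writes the content field to disk. The helper name save_conversion and the mapping from format to file extension are assumptions for illustration; only the four response fields come from the documentation above.

```python
import json
import sys
import tempfile
from pathlib import Path

# Assumed mapping, mirroring the file names used in the CLI examples.
EXTENSIONS = {"html": ".html", "jats": ".xml"}


def save_conversion(response_body, out_dir):
    """Write the converted document from an /api/convert response to out_dir."""
    payload = json.loads(response_body)
    # Surface any converter warnings without mixing them into the output file.
    for warning in payload.get("warnings", []):
        print(f"warning: {warning}", file=sys.stderr)
    stem = Path(payload["filename"]).stem
    target = Path(out_dir) / (stem + EXTENSIONS[payload["format"]])
    target.write_text(payload["content"], encoding="utf-8")
    return target


# A stand-in response body with the documented fields.
body = json.dumps(
    {
        "filename": "paper.docx",
        "format": "html",
        "content": "<!DOCTYPE html><html lang=\"en\"></html>",
        "warnings": [],
    }
)
with tempfile.TemporaryDirectory() as tmp:
    saved = save_conversion(body, tmp)
    saved_name = saved.name
    saved_text = saved.read_text(encoding="utf-8")
```

The response body itself can come from any HTTP client that supports multipart/form-data uploads.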

Project structure

src/ojsgalleon/
├── __init__.py          # public API
├── api.py               # FastAPI app
├── cli.py               # CLI (subcommands: convert, serve)
└── converters/
    ├── docx.py          # DOCX → HTML (mammoth) / JATS (pandoc)
    ├── pdf.py           # PDF → HTML or JATS (pdfplumber + pymupdf)
    └── html_wrap.py     # Wraps fragments in a valid, styled HTML5 document

Tuning PDF extraction

Two constants in src/ojsgalleon/converters/pdf.py control running header/footer suppression:

Constant                 Default  Effect
_MARGIN_RATIO            0.08     Size of the top/bottom margin zone (8% of page height)
_RUNNING_TEXT_THRESHOLD  0.40     Fraction of pages a line must appear on to be suppressed

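Increase _MARGIN_RATIO if a journal places running headers unusually deep into the text area. Lower _RUNNING_TEXT_THRESHOLD if running headers survive because each one appears on fewer than 40% of pages, as can happen with alternating odd/even page headers that each cover only roughly half the pages.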
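
To make the interaction between the two constants concrete, here is a minimal sketch of repetition-based detection. It is not the code in pdf.py: the function name find_running_text, the (text, top) input shape, and the use of >= for the threshold comparison are assumptions for illustration.

```python
from collections import Counter

MARGIN_RATIO = 0.08            # top/bottom zone as a fraction of page height
RUNNING_TEXT_THRESHOLD = 0.40  # fraction of pages a line must appear on


def find_running_text(pages, page_height):
    """Return the set of margin-zone lines that repeat across enough pages.

    `pages` is a list of pages; each page is a list of (text, top) pairs,
    where `top` is the distance of the line from the top of the page.
    """
    margin = page_height * MARGIN_RATIO
    counts = Counter()
    for lines in pages:
        # Count each distinct margin line once per page.
        in_margin = {
            text.strip()
            for text, top in lines
            if top < margin or top > page_height - margin
        }
        counts.update(in_margin)
    min_pages = RUNNING_TEXT_THRESHOLD * len(pages)
    return {text for text, n in counts.items() if n >= min_pages}


header = ("Journal of Examples", 30.0)
body = ("Body paragraph.", 400.0)
footnote = ("1 See the appendix.", 770.0)
pages = [
    [header, body],
    [header, body, footnote],
    [header, body],
    [body],
    [body],
]
running = find_running_text(pages, page_height=800.0)
```

In this sample the header sits in the margin zone on 3 of 5 pages and is flagged, while the footnote appears in the margin zone on only one page and is kept. The body line repeats on every page but lies outside the margin zone, so it is never counted.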

Known limitations

  • Scanned / image-only PDFs — text extraction requires a text layer; OCR is not included.
  • Vector graphics in PDFs — charts drawn with PDF path commands are not captured; only embedded raster images are extracted.
  • Bold-only headings in PDFs — section headings marked with bold weight but the same font size as body text cannot be detected by the font-size heuristic.
  • Borderless tables in PDFs — pdfplumber's table detection works well for ruled tables but may miss borderless ones.
  • JATS metadata — generated JATS lacks article metadata (title, authors, DOI); these must be filled in manually.
  • DOCX images — mammoth does not extract embedded images from DOCX files.
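
Regarding the JATS metadata limitation, one way to fill in the missing fields after conversion is with the standard library's ElementTree. The sketch below is an assumption-laden illustration: the helper name add_article_meta is hypothetical, the title, author, and DOI values are placeholders, and it assumes the generated document has an <article> root with no populated <article-meta>.

```python
import xml.etree.ElementTree as ET


def add_article_meta(jats_xml, title, authors, doi):
    """Return jats_xml with a <front>/<article-meta> block filled in.

    `authors` is a list of (given_names, surname) tuples.
    """
    root = ET.fromstring(jats_xml)

    # Reuse an existing <front> if present; otherwise add it as the first child.
    front = root.find("front")
    if front is None:
        front = ET.Element("front")
        root.insert(0, front)
    meta = front.find("article-meta")
    if meta is None:
        meta = ET.SubElement(front, "article-meta")

    # JATS expects article-id, then title-group, then contrib-group.
    ET.SubElement(meta, "article-id", {"pub-id-type": "doi"}).text = doi
    title_group = ET.SubElement(meta, "title-group")
    ET.SubElement(title_group, "article-title").text = title
    contribs = ET.SubElement(meta, "contrib-group")
    for given_names, surname in authors:
        contrib = ET.SubElement(contribs, "contrib", {"contrib-type": "author"})
        name = ET.SubElement(contrib, "name")
        ET.SubElement(name, "surname").text = surname
        ET.SubElement(name, "given-names").text = given_names

    return ET.tostring(root, encoding="unicode")


# Placeholder values for illustration only.
filled = add_article_meta(
    "<article><body><p>Text</p></body></article>",
    title="An Example Article",
    authors=[("Ada", "Lovelace")],
    doi="10.1234/example",
)
```

ElementTree does not preserve a DOCTYPE declaration when serializing, so if the converter's output includes one it would need to be re-added to the result.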
