Local-first DOCX formatter for academic papers with a content-fingerprint guard that proves the text was left untouched — only the formatting changed.

Paper Format Agent

An open-source DOCX formatter for academic papers that can prove it didn't rewrite your text.

Paper Format Agent reformats a thesis or paper — fonts, indents, alignment, spacing, headings, captions — to match a target format guide, and it ships with a verifiable content fingerprint so you can confirm the wording of your paper came out unchanged. It compares a fingerprint of your body and table text (with whitespace and stray bullet characters normalized out) before and after formatting; the run is fail-closed, so if that text changed it aborts instead of writing a file. Everything runs locally on your machine. It's also packaged as an installable agent skill (SKILL.md + agents/openai.yaml), so tools like Claude Code or Codex CLI can invoke it directly instead of a human clicking through a GUI.
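To make the fingerprint idea concrete, here is a minimal sketch of a normalized content hash. It is an illustration, not the project's actual code: the SHA-256 choice, the bullet character set, the removal of all whitespace, and the names `normalize` and `content_fingerprint` are assumptions.

```python
import hashlib
import re

# Assumed set of stray bullet characters; the real tool's list may differ.
BULLET_CHARS = "•·●▪◦■"


def normalize(text: str) -> str:
    """Drop bullet characters and all whitespace so layout-only edits do not matter."""
    without_bullets = text.translate({ord(ch): None for ch in BULLET_CHARS})
    return re.sub(r"\s+", "", without_bullets)


def content_fingerprint(paragraphs) -> str:
    """Hash the normalized text of each paragraph, in document order."""
    digest = hashlib.sha256()
    for text in paragraphs:
        normalized = normalize(text)
        if normalized:  # paragraphs that are empty after normalization are skipped
            digest.update(normalized.encode("utf-8"))
            digest.update(b"\n")  # separator so paragraph boundaries are part of the hash
    return digest.hexdigest()


before = ["• Abstract", "  This paper studies  formatting. "]
after = ["Abstract", "This paper studies formatting."]
reworded = ["Abstract", "This paper examines formatting."]

print(content_fingerprint(before) == content_fingerprint(after))     # whitespace/bullet changes only
print(content_fingerprint(before) == content_fingerprint(reworded))  # wording changed
```

Under these assumptions, whitespace and bullet differences leave the hash unchanged, while any change in wording produces a different hash.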

Proof, not a promise

Real fields from an actual run (--engine python, the fully-guarded path), taken from the produced format_report.json:

{
  "content_fingerprint_before": "793e6533fd670418141d11fdcf014be19750408129ecff8b1b78a2641a3786db",
  "content_fingerprint_after":  "793e6533fd670418141d11fdcf014be19750408129ecff8b1b78a2641a3786db",
  "content_changed": false,
  "content_guard_enforced": true
}

The before/after fingerprints match, and a paragraph-by-paragraph .text comparison I ran across the whole document confirms every word survived. What did change on that same file: body text went from unset font/indent/alignment to SimSun (宋体) 12pt, a 2-character first-line indent, and justified alignment; the abstract title became SimSun 18pt centered; the Chinese keywords line became SimSun 12pt left-aligned. The same run also reported the real problems it found — char_below_min (document under the guide's minimum length) and blank_page_risk — rather than silently claiming a perfect score.
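A paragraph-by-paragraph comparison like the one described above could look roughly like this. It is a sketch, not the script that was actually run: the helper names are hypothetical, and `load_paragraph_texts` assumes the python-docx package is installed. Note that `Document.paragraphs` only covers body paragraphs, so table text would need a separate pass.

```python
from itertools import zip_longest


def load_paragraph_texts(path):
    """Read body paragraph text from a DOCX file (requires python-docx)."""
    from docx import Document  # imported lazily so the comparison works without it
    return [p.text for p in Document(path).paragraphs]


def paragraph_mismatches(before, after):
    """Return (index, before_text, after_text) for every paragraph that differs."""
    mismatches = []
    for i, (old, new) in enumerate(zip_longest(before, after)):
        if old != new:  # a missing paragraph on either side shows up as None
            mismatches.append((i, old, new))
    return mismatches


original = ["Abstract", "Body text stays the same.", "Keywords: docx"]
formatted = ["Abstract", "Body text stays the same.", "Keywords: docx"]
edited = ["Abstract", "Body text was rewritten.", "Keywords: docx", "Extra paragraph"]

print(paragraph_mismatches(original, formatted))
print(paragraph_mismatches(original, edited))
```

This comparison is stricter than the fingerprint: it does no normalization, so a paragraph that only lost a stray bullet character or some whitespace would be reported as a mismatch.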

Why This Exists

Every closed-source formatting service (论文无忧, WPS 论文排版, 大以论文, AIPoliDoc, and similar) asks you to trust that your content survives the reformatting pass — none of them let you verify it.

  • The content guard is the smallest honest promise: change the formatting, but not the wording of your body and table text — and if that can't be confirmed, the run aborts with an error (content guard failed) instead of shipping a silently-altered document. It's fail-closed and enforced by default. (Scope: it normalizes whitespace and stray bullet characters before comparing, and covers body paragraphs and tables; headers and footers, which the formatter sets on purpose, are out of scope. The fully-guarded path is --engine python; other engines run a local post-processor, e.g. to refresh the table of contents, after the check.)
  • Open-source and auditable: read the code, or just diff the fingerprint yourself.
  • Formatting-only automation across margins, fonts, line spacing, headings, captions, tables, and references, plus required-section checks (abstracts, keywords, table of contents) and running headers / centered page-number footers.
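  • Reports come in a machine-readable form (format_report.json) and a human-readable form (format_report.html), so students, supervisors, reviewers, and CI can all use them.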
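The fail-closed behavior in the first bullet can be sketched as a write step that refuses to run when the fingerprints differ. This is an illustration under assumptions: `ContentGuardError` and `write_if_unchanged` are hypothetical names, and only the error message text comes from the description above.

```python
import tempfile
from pathlib import Path


class ContentGuardError(RuntimeError):
    """Hypothetical exception type; the real tool may raise something else."""


def write_if_unchanged(fp_before: str, fp_after: str, data: bytes, out_path: Path) -> Path:
    """Write the output only when the content fingerprints match."""
    if fp_before != fp_after:
        # Fail closed: raise before anything is written to disk.
        raise ContentGuardError("content guard failed")
    out_path.write_bytes(data)
    return out_path


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "formatted.docx"
    try:
        write_if_unchanged("aaa", "bbb", b"payload", target)
        aborted = False
    except ContentGuardError:
        aborted = True
    wrote_on_mismatch = target.exists()

    write_if_unchanged("aaa", "aaa", b"payload", target)
    wrote_on_match = target.exists()

print(aborted, wrote_on_mismatch, wrote_on_match)
```

The check sits before the write rather than after it, so a failed guard leaves no output file behind.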

Status

This project is a practical open-source MVP. It is suitable for demos, internal pilots, agent workflows, and synthetic benchmark development. Before relying on it for high-stakes submissions, expand the regression corpus, template coverage, and object-level scoring for tables, figures, equations, footnotes, headers, and footers.

Agent Skill

This repository includes a top-level SKILL.md and agents/openai.yaml, so agent users can treat the repo as an installable skill.

The skill teaches an agent how to:

  • inspect input files safely
  • run the formatter in content-preserving mode
  • review format_report.json
  • validate changes before returning results
  • add new template rules with tests

MCP Server

The same pipeline is also exposed as an optional MCP server, so Claude Code, Codex CLI, or any MCP client can call it directly (requires Python 3.10+):

pip install "paper-format-agent[mcp]"
paper-format-agent-mcp

Tools: format_paper (content-guarded reformat), extract_format_rules, and score_paper (read-only). See docs/MCP.md for the client config and tool reference.

Quick Start

pip install -r requirements.txt

python -m paper_format_agent.cli \
  --format-file "format_guide.docx" \
  --paper-file "paper.docx" \
  --out-dir "./output" \
  --engine auto \
  --strict-required-sections

Optional GUI:

python run_gui.py

Batch processing:

python -m paper_format_agent.cli \
  --format-file "format_guide.docx" \
  --paper-dir "./papers" \
  --out-dir "./batch_output" \
  --engine python \
  --strict-required-sections

Batch mode writes one output folder per paper, plus a batch_summary.json that records the pass rate, score averages, content-change count, and per-paper report locations.
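To cross-check a batch run independently of batch_summary.json, one option is to scan the per-paper reports directly. The sketch below assumes each paper's folder sits directly under the output directory and contains a format_report.json; that layout and the `scan_batch` name are assumptions, and only the two field names come from the report shown earlier.

```python
import json
from pathlib import Path


def scan_batch(batch_dir):
    """Collect content-guard results from each per-paper format_report.json."""
    changed, unenforced = [], []
    reports = sorted(Path(batch_dir).glob("*/format_report.json"))
    for report_path in reports:
        report = json.loads(report_path.read_text(encoding="utf-8"))
        paper = report_path.parent.name
        if report.get("content_changed") is not False:
            changed.append(paper)  # changed, or the field is missing
        if report.get("content_guard_enforced") is not True:
            unenforced.append(paper)
    return {"papers": len(reports), "changed": changed, "unenforced": unenforced}


# Example call (assumes the batch command above wrote to ./batch_output):
# print(scan_batch("./batch_output"))
```

A missing field is treated the same as a failure, which keeps the check fail-closed like the guard itself.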

Template Packs And Synthetic Examples

The repository includes privacy-safe template packs and synthetic examples so users can try the workflow without uploading real papers:

  • templates/ contains JSON presets for Chinese thesis, journal article, and IEEE-style conference formatting.
  • examples/ contains a synthetic format guide and sample reports for demos, issues, and PRs.
  • docs/TEMPLATE_PACKS.md explains the template contract and contribution checklist.

Template files are intentionally plain JSON. They are easy to review, easy to customize locally, and safe to extend through small PRs.

Outputs

File                      Purpose
formatted_paper_v3.docx   repaired DOCX document
format_rules.json         extracted formatting rules
format_report.json        machine-readable score and checks
format_report.html        human-readable report
modify_log.json           formatting operation log
engine_report.json        Word COM / LibreOffice / Python post-process result
marker_dump.json          optional paragraph classification dump

Safety Model

By default, the pipeline enforces a content guard. Reports include:

  • content_changed
  • content_guard_enforced
  • content_fingerprint_before
  • content_fingerprint_after
  • diagnostics with severity, evidence, and suggested fixes for failed checks

For normal academic formatting, content_changed should be false.
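A reviewer or CI job can check these fields mechanically. The following is a minimal sketch: the `guard_problems` helper is hypothetical, and the sample values are the ones from the report excerpt earlier in this document.

```python
import json

REPORT = json.loads("""
{
  "content_fingerprint_before": "793e6533fd670418141d11fdcf014be19750408129ecff8b1b78a2641a3786db",
  "content_fingerprint_after":  "793e6533fd670418141d11fdcf014be19750408129ecff8b1b78a2641a3786db",
  "content_changed": false,
  "content_guard_enforced": true
}
""")


def guard_problems(report):
    """Return a list of human-readable problems; an empty list means the guard held."""
    problems = []
    if report.get("content_guard_enforced") is not True:
        problems.append("content guard was not enforced")
    if report.get("content_changed") is not False:
        problems.append("content_changed is not false")
    before = report.get("content_fingerprint_before")
    after = report.get("content_fingerprint_after")
    if not before or before != after:
        problems.append("fingerprints missing or different")
    return problems


tampered = dict(REPORT, content_fingerprint_after="0" * 64)

print(guard_problems(REPORT))
print(guard_problems(tampered))
```

In a CI step, a non-empty list would typically be turned into a non-zero exit code so the job fails.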

Validation

python tools/validate_skill.py
python -m unittest discover -s tests -p "test_*.py"
python tools/compile_check.py
python tools/release_audit.py

Before publishing from a local workspace, also run:

python tools/release_audit.py --include-local

This optional check includes untracked and ignored local artifacts, such as generated outputs, scratch files, caches, and private document formats.

Good First PRs

We want many small, reviewable PRs. Good contribution areas:

  • Add a synthetic test for a school, journal, or conference formatting rule.
  • Add a new synthetic template pack in templates/.
  • Improve a narrowly scoped rule extractor.
  • Add scoring coverage for tables, figures, references, equations, headers, or footers.
  • Improve report wording or diagnostics.
  • Add local-first integrations such as MCP, GitHub Actions, or batch processing.
  • Improve this repo's SKILL.md workflow for agent users.

New contributors can start from the task-ready board in docs/CONTRIBUTOR_TASKS.md. Each task lists user pain, expected PR shape, and suggested labels.

See CONTRIBUTING.md, ROADMAP.md, and AGENTS.md.

Architecture

format guide + paper.docx
  -> rule extraction
  -> paragraph type tagging
  -> style application
  -> numbering cleanup
  -> optional engine post-process
  -> scoring and reports

Privacy

Do not commit real papers, private school templates, reviewer comments, API keys, or generated documents. Use synthetic fixtures or anonymized snippets in tests.

License

MIT. See LICENSE.
