PaperFlow markdown post-processing for citations, figures, tables, and frontmatter.

PaperFlow

PaperFlow is an open-source post-processing layer for PDF-to-Markdown workflows.

It does not try to be the parser. It takes raw markdown from a parser you choose and upgrades it into structured, knowledge-ready markdown with:

  • normalized LaTeX delimiters
  • linked citations as standard footnotes
  • figure and table jump links
  • YAML frontmatter
  • cleaned repeated headers and footers

PyPI package: https://pypi.org/project/paperflow-postprocess/
GitHub: https://github.com/TylerMorrison21/paperflow
Project page: https://www.paperflowing.com

Install

pip install paperflow-postprocess

Quick Start

from paperflow_postprocess import enhance

raw_markdown = """
Text with citation [1].

## References

[1] Example Author. Example Paper.
"""

markdown = enhance(
    raw_markdown=raw_markdown,
    images={},
    metadata={
        "title": "Example Paper",
        "authors": ["Example Author"],
        "source": "https://example.com/paper",
        "date": "2026-03-11",
    },
)

print(markdown)
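
The metadata dict above is what ends up in the YAML frontmatter. As a rough illustration of that mapping, here is a minimal standard-library sketch that renders such a dict by hand. The function name `render_frontmatter` is hypothetical, and the package's actual field order, quoting, and extraction hash are not documented here, so treat the layout as an assumption rather than PaperFlow's output.

```python
import json


def render_frontmatter(metadata: dict) -> str:
    """Render a flat metadata dict as a YAML frontmatter block.

    Hypothetical helper for illustration only; not part of paperflow_postprocess.
    Strings are quoted with json.dumps, which yields valid YAML double-quoted scalars.
    """
    lines = ["---"]
    for key, value in metadata.items():
        if isinstance(value, (list, tuple)):
            # Lists (e.g. authors) become a YAML block sequence.
            lines.append(f"{key}:")
            lines.extend(
                f"  - {json.dumps(item, ensure_ascii=False)}" for item in value
            )
        else:
            lines.append(f"{key}: {json.dumps(value, ensure_ascii=False)}")
    lines.append("---")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_frontmatter({"title": "Example Paper", "authors": ["Example Author"]}))
```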

PDF Parsers - bring your own

PaperFlow is a post-processing layer. It enhances raw Markdown from any upstream PDF parser. You need to choose a parser:

Option 1: Datalab Marker API (recommended, easiest)

  • Sign up at datalab.to ($25/month in free credits)
  • Cloud API, no GPU needed
  • Set DATALAB_API_KEY in your .env
  • PaperFlow's built-in api/services/marker.py calls this automatically

Option 2: Marker (self-hosted, free)

  • github.com/datalab-to/marker
  • Run locally with GPU (CUDA) or CPU
  • Free for orgs under $5M revenue
  • You'll need to modify api/services/marker.py to call your local endpoint

Option 3: MinerU (self-hosted, free)

  • github.com/opendatalab/MinerU
  • Strong on Chinese docs, scientific papers, complex tables
  • Outputs Markdown + LaTeX, which is compatible with PaperFlow's post-processing
  • Needs GPU (~6GB VRAM minimum)
  • Replace marker.py with a MinerU client

Option 4: Docling, PyMuPDF4LLM, or any other parser

  • Any tool that outputs Markdown will work
  • Feed the raw Markdown into paperflow_postprocess.enhance() or save it and run it through the API pipeline

Using the pip package with any parser

from paperflow_postprocess import enhance

# Get raw markdown from ANY parser.
# `my_parser` is a placeholder for your parser's client object, not a real module.
raw_md = my_parser.convert("paper.pdf")

# Enhance with PaperFlow
result = enhance(raw_md, images={}, metadata={"title": "My Paper"})

# result has footnotes, fixed LaTeX, figure links, YAML frontmatter
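
If your parser writes its Markdown to disk instead of returning a string, the same call can be wrapped in a small file-to-file helper. This is a sketch under stated assumptions: `enhance_saved_markdown` is a hypothetical name, and the enhancer is passed in as a parameter so the helper itself does not depend on any particular package being installed.

```python
from pathlib import Path
from typing import Callable


def enhance_saved_markdown(
    src: Path,
    dst: Path,
    enhance_fn: Callable[..., str],
    title: str,
) -> Path:
    """Read parser output from `src`, enhance it, and write the result to `dst`.

    `enhance_fn` is expected to accept (raw_markdown, images=..., metadata=...)
    and return a string, mirroring the call shown above.
    """
    raw_md = src.read_text(encoding="utf-8")
    result = enhance_fn(raw_md, images={}, metadata={"title": title})
    dst.write_text(result, encoding="utf-8")
    return dst
```

In practice you would pass the real function, for example `enhance_saved_markdown(Path("paper.raw.md"), Path("paper.md"), enhance, "My Paper")` after `from paperflow_postprocess import enhance`; the file names here are placeholders.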

PaperFlow's premise is that PDF parsing is commoditized: Marker, MinerU, LlamaParse, Docling, and PyMuPDF4LLM can all produce decent raw output. The remaining value is in the post-processing layer, and that is what PaperFlow provides.

Visual Comparison

[Image: side-by-side comparison of the source PDF, generic converter output, and PaperFlow output]

Additional example:

[Image: calligraphy comparison]

What enhance() does

enhance() upgrades parser output into structured markdown with:

  • standard footnotes [^N] from inline citations like [1], [1, 2], or [1-3]
  • normalized LaTeX delimiters using $...$ and $$...$$
  • figure links like [[#^fig-3|Fig. 3]]
  • table links like [[#^tab-2|Table 2]]
  • YAML frontmatter with title, authors, source, date, and extraction hash
  • cleaned repeated headers, footers, and page number lines
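
To make the first item concrete, here is a minimal sketch of how inline markers such as [1], [1, 2], or [1-3] could be expanded into [^N] footnote references. It is an illustration of the technique, not the package's implementation: the names `expand_citation` and `citations_to_footnotes` are hypothetical, and the sketch only rewrites inline markers (it does not turn the reference list into footnote definitions).

```python
import re

# One citation item is a number or a closed range such as "1-3" (hyphen or en dash).
ITEM = r"\d+(?:\s*[-–]\s*\d+)?"
# A marker is one or more comma-separated items inside square brackets.
CITATION = re.compile(rf"\[({ITEM}(?:\s*,\s*{ITEM})*)\]")


def expand_citation(body: str) -> list[int]:
    """Expand the inside of a marker, e.g. '1, 2' or '1-3', into reference numbers."""
    numbers: list[int] = []
    for part in re.split(r"\s*,\s*", body.strip()):
        bounds = [int(x) for x in re.split(r"\s*[-–]\s*", part)]
        if len(bounds) == 2:
            # Inclusive range: "1-3" covers 1, 2, and 3.
            numbers.extend(range(bounds[0], bounds[1] + 1))
        else:
            numbers.append(bounds[0])
    return numbers


def citations_to_footnotes(text: str) -> str:
    """Replace each bracketed numeric citation with one [^N] reference per number."""

    def replace(match: re.Match) -> str:
        return "".join(f"[^{n}]" for n in expand_citation(match.group(1)))

    return CITATION.sub(replace, text)
```

For example, `citations_to_footnotes("See [1], [1, 2], and [1-3].")` yields `See [^1], [^1][^2], and [^1][^2][^3].` Brackets with non-numeric content, such as Markdown link text, do not match the pattern and are left alone.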

API

  • enhance(raw_markdown, images=None, metadata=None)
  • postprocess(raw_markdown, images=None, metadata=None) for backward compatibility
  • fix_latex_delimiters(md)
  • clean_headers_footers(md)
  • convert_to_footnotes(md)
  • linkify_figures(md)
  • linkify_tables(md)
  • inject_frontmatter(md, metadata)
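
Most of the helpers listed above share one shape: Markdown string in, Markdown string out. That makes them easy to chain. The sketch below shows the idea with two stand-in steps written from scratch. The names `normalize_latex_delimiters`, `link_figures`, and `run_pipeline` are hypothetical, the step order is an assumption, and the assumption that `\(...\)` and `\[...\]` are the input delimiters is mine; none of this is the package's own code.

```python
import re
from typing import Callable, Iterable

INLINE_MATH = re.compile(r"\\\((.+?)\\\)", re.DOTALL)
DISPLAY_MATH = re.compile(r"\\\[(.+?)\\\]", re.DOTALL)
# Skip references that already sit inside a [[...|Fig. N]] link label.
FIGURE_REF = re.compile(r"(?<!\|)\b(?:Fig\.|Figure)\s+(\d+)")


def normalize_latex_delimiters(md: str) -> str:
    """Rewrite \\(...\\) as $...$ and \\[...\\] as $$...$$."""
    md = DISPLAY_MATH.sub(lambda m: f"$${m.group(1).strip()}$$", md)
    md = INLINE_MATH.sub(lambda m: f"${m.group(1).strip()}$", md)
    return md


def link_figures(md: str) -> str:
    """Wrap 'Fig. N' / 'Figure N' mentions in [[#^fig-N|...]] jump links."""
    return FIGURE_REF.sub(lambda m: f"[[#^fig-{m.group(1)}|{m.group(0)}]]", md)


def run_pipeline(md: str, steps: Iterable[Callable[[str], str]]) -> str:
    """Apply each Markdown-to-Markdown step in order."""
    for step in steps:
        md = step(md)
    return md
```

A call such as `run_pipeline(raw_md, [normalize_latex_delimiters, link_figures])` applies both transforms in sequence. Note that this naive figure pattern would also link the caption line itself, which a real implementation would presumably treat as the jump target instead.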
