PaperFlow markdown post-processing for citations, figures, tables, and frontmatter.
PaperFlow
PaperFlow is an open-source post-processing layer for PDF-to-Markdown workflows.
It does not try to be the parser. It takes raw markdown from a parser you choose and upgrades it into structured, knowledge-ready markdown with:
- normalized LaTeX delimiters
- linked citations as standard footnotes
- figure and table jump links
- YAML frontmatter
- cleaned repeated headers and footers
PyPI package: https://pypi.org/project/paperflow-postprocess/
GitHub: https://github.com/TylerMorrison21/paperflow
Project page: https://www.paperflowing.com
Install
pip install paperflow-postprocess
Quick Start
from paperflow_postprocess import enhance
raw_markdown = """
Text with citation [1].
## References
[1] Example Author. Example Paper.
"""
markdown = enhance(
    raw_markdown=raw_markdown,
    images={},
    metadata={
        "title": "Example Paper",
        "authors": ["Example Author"],
        "source": "https://example.com/paper",
        "date": "2026-03-11",
    },
)
print(markdown)
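To make the citation step in the Quick Start more concrete, here is a minimal standalone sketch of how an inline citation such as [1] could be rewritten as a footnote marker. It is not PaperFlow's implementation: the names CITATION, expand_citation, and citations_to_footnotes are hypothetical, and emitting one [^N] marker per cited number is an assumption. The sketch only handles body text and leaves the References section alone.

```python
import re

# Bracketed numeric citations: [1], [1, 2], [1-3], or mixes like [1, 3-5].
CITATION = re.compile(r"\[(\d+(?:\s*-\s*\d+)?(?:\s*,\s*\d+(?:\s*-\s*\d+)?)*)\]")


def expand_citation(spec: str) -> list[int]:
    """Expand a citation body such as '1, 3-5' into [1, 3, 4, 5]."""
    numbers: list[int] = []
    for part in spec.split(","):
        if "-" in part:
            start, end = (int(p) for p in part.split("-"))
            numbers.extend(range(start, end + 1))
        else:
            numbers.append(int(part))
    return numbers


def citations_to_footnotes(text: str) -> str:
    """Replace each inline citation with one [^N] marker per cited number (assumed format)."""
    def replace(match: re.Match) -> str:
        return "".join(f"[^{n}]" for n in expand_citation(match.group(1)))

    return CITATION.sub(replace, text)


print(citations_to_footnotes("Text with citation [1]."))
```

A real converter also has to turn the reference list entries into footnote definitions and avoid touching bracketed numbers that are not citations, which this sketch does not attempt.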
PDF Parsers - bring your own
PaperFlow is a post-processing layer. It enhances raw Markdown from any upstream PDF parser. You need to choose a parser:
Option 1: Datalab Marker API (recommended, easiest)
- Sign up at datalab.to - $25/month free credits
- Cloud API, no GPU needed
- Set DATALAB_API_KEY in your .env
- PaperFlow's built-in api/services/marker.py calls this automatically
Option 2: Marker (self-hosted, free)
- github.com/datalab-to/marker
- Run locally with GPU (CUDA) or CPU
- Free for orgs under $5M revenue
- You'll need to modify api/services/marker.py to call your local endpoint
Option 3: MinerU (self-hosted, free)
- github.com/opendatalab/MinerU
- Strong on Chinese docs, scientific papers, complex tables
- Outputs Markdown + LaTeX - compatible with PaperFlow's postprocess
- Needs GPU (~6GB VRAM minimum)
- Replace marker.py with a MinerU client
Option 4: Docling, PyMuPDF4LLM, or any other parser
- Any tool that outputs Markdown will work
- Feed the raw Markdown into paperflow_postprocess.enhance(), or save it and run it through the API pipeline
Using the pip package with any parser
from paperflow_postprocess import enhance
# Get raw markdown from ANY parser
# (my_parser is a placeholder for whatever parser client you use)
raw_md = my_parser.convert("paper.pdf")
# Enhance with PaperFlow
result = enhance(raw_md, images={}, metadata={"title": "My Paper"})
# result has footnotes, fixed LaTeX, figure links, YAML frontmatter
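If your parser writes its raw Markdown to disk, the same call can be wrapped in a small file-to-file helper. This is a sketch only: enhance_file is a hypothetical name, and the post-processing function is passed in as a parameter so the helper itself has no dependency on any particular package. It assumes the function accepts the same arguments as the enhance() calls shown above.

```python
from pathlib import Path
from typing import Callable, Optional


def enhance_file(
    src: Path,
    dst: Path,
    enhance_fn: Callable[..., str],
    metadata: Optional[dict] = None,
) -> Path:
    """Read raw Markdown from src, post-process it, and write the result to dst."""
    raw_md = src.read_text(encoding="utf-8")
    # Same call shape as the examples above: raw markdown, images, metadata.
    result = enhance_fn(raw_md, images={}, metadata=metadata or {})
    dst.write_text(result, encoding="utf-8")
    return dst
```

With the package installed, a call would look like enhance_file(Path("paper.raw.md"), Path("paper.md"), enhance, {"title": "My Paper"}), where enhance is imported from paperflow_postprocess and the file names are placeholders.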
The whole point of PaperFlow is that parsing is commoditized. Marker, MinerU, LlamaParse, Docling, and PyMuPDF4LLM can all produce decent raw output. The post-processing layer is where the value is, and that is what PaperFlow does.
Visual Comparison
[Comparison images not available: source PDF, generic converter output, and PaperFlow output, side by side.]
What enhance() does
enhance() upgrades parser output into structured markdown with:
- standard footnotes [^N] from inline citations like [1], [1, 2], or [1-3]
- normalized LaTeX delimiters using $...$ and $$...$$
- figure links like [[#^fig-3|Fig. 3]]
- table links like [[#^tab-2|Table 2]]
- YAML frontmatter with title, authors, source, date, and extraction hash
- cleaned repeated headers, footers, and page number lines
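Two of the transformations listed above are easy to illustrate with regular expressions. The sketch below is not taken from PaperFlow's source; normalize_latex, link_figures, and link_tables are hypothetical names. It assumes the parser emitted \(...\) and \[...\] delimiters, and that figures and tables are referenced as "Fig. N", "Figure N", and "Table N".

```python
import re

FIGURE_REF = re.compile(r"\b(?:Fig\.|Figure)\s+(\d+)")
TABLE_REF = re.compile(r"\bTable\s+(\d+)")


def normalize_latex(md: str) -> str:
    r"""Rewrite \[...\] as $$...$$ and \(...\) as $...$ (assumed input delimiters)."""
    # Callables are used as replacements so backslashes in the math stay untouched.
    md = re.sub(r"\\\[(.+?)\\\]", lambda m: f"$${m.group(1).strip()}$$", md, flags=re.DOTALL)
    md = re.sub(r"\\\((.+?)\\\)", lambda m: f"${m.group(1).strip()}$", md, flags=re.DOTALL)
    return md


def link_figures(md: str) -> str:
    """Wrap 'Fig. 3' or 'Figure 3' as [[#^fig-3|Fig. 3]]."""
    return FIGURE_REF.sub(lambda m: f"[[#^fig-{m.group(1)}|{m.group(0)}]]", md)


def link_tables(md: str) -> str:
    """Wrap 'Table 2' as [[#^tab-2|Table 2]]."""
    return TABLE_REF.sub(lambda m: f"[[#^tab-{m.group(1)}|{m.group(0)}]]", md)


print(link_figures("See Fig. 3 for details."))
```

The sketch only rewrites the references. It does not create the ^fig-3 or ^tab-2 anchors at the captions that the links point to, and running it twice on the same text would wrap the links again.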
API
- enhance(raw_markdown, images=None, metadata=None)
- postprocess(raw_markdown, images=None, metadata=None) for backward compatibility
- fix_latex_delimiters(md)
- clean_headers_footers(md)
- convert_to_footnotes(md)
- linkify_figures(md)
- linkify_tables(md)
- inject_frontmatter(md, metadata)
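Of the helpers listed above, clean_headers_footers(md) is the least self-explanatory from its name alone. One plausible heuristic, shown here as a standalone sketch and not as PaperFlow's actual logic, is to drop bare page-number lines and any prose line that recurs several times across the document. The name strip_running_lines, the threshold of three repeats, and the list of protected prefixes are all assumptions.

```python
import re
from collections import Counter

# Bare page numbers such as "12", "3/10", or "Page 3 of 10".
PAGE_NUMBER = re.compile(r"^(?:page\s+)?\d+(?:\s*(?:/|of)\s*\d+)?$", re.IGNORECASE)
# Lines starting with these are treated as Markdown structure and never dropped.
STRUCTURAL_PREFIXES = ("#", "|", "-", "`", "$")


def strip_running_lines(md: str, min_repeats: int = 3) -> str:
    """Remove page-number lines and prose lines repeated min_repeats or more times."""
    lines = md.splitlines()
    counts = Counter(line.strip() for line in lines if line.strip())
    kept = []
    for line in lines:
        text = line.strip()
        if PAGE_NUMBER.match(text):
            continue  # page number left behind by the PDF layout
        is_prose = any(ch.isalpha() for ch in text) and not text.startswith(STRUCTURAL_PREFIXES)
        if is_prose and counts[text] >= min_repeats:
            continue  # likely a running header or footer
        kept.append(line)
    return "\n".join(kept)
```

A heuristic like this can misfire: a line containing only a year would be treated as a page number, and a sentence that legitimately repeats three times would be removed, so the threshold needs tuning per document.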
Download files
File details
Details for the file paperflow_postprocess-0.1.1.tar.gz.
File metadata
- Download URL: paperflow_postprocess-0.1.1.tar.gz
- Upload date:
- Size: 7.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3a7ee567cb635b53540dffe600937c1fefcc237d01f2d729b04f5ce7aa59fd47 |
| MD5 | f601953465c78dc023b51c5fff9394bf |
| BLAKE2b-256 | 70551836c9191ea42ca9e2df21986d42ffa99c8940dad1b3ac2b03b9863ea058 |
File details
Details for the file paperflow_postprocess-0.1.1-py3-none-any.whl.
File metadata
- Download URL: paperflow_postprocess-0.1.1-py3-none-any.whl
- Upload date:
- Size: 7.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e8db982540efc0ceaaa8cf612048f9238859d1e8affd3c9e44a6d9f9bf274361 |
| MD5 | 489ce163cb625ee04fe0031867e42c20 |
| BLAKE2b-256 | aaf33c1405e7234d2530c51efaf81422d7e93c96c4f6eacf1da1ed21e440da8a |
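If you download either file manually, you can check it against the SHA256 digest listed above with Python's standard hashlib module. The helper names sha256_of and matches_published_hash are illustrative, not part of the package.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, read in chunks to keep memory flat."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, matches_published_hash(Path("paperflow_postprocess-0.1.1.tar.gz"), "3a7ee567cb635b53540dffe600937c1fefcc237d01f2d729b04f5ce7aa59fd47") should return True for an intact copy of the source distribution. pip can enforce the same check at install time through its hash-checking mode (--require-hashes with --hash=sha256:... entries in a requirements file).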