Convert arXiv papers to clean Markdown. Particularly useful for prompting LLMs.

arxiv2md

Convert arXiv papers to clean Markdown for LLMs


Why?

I got tired of copy-pasting arXiv PDFs/HTML into LLMs and fighting references, TOCs, and token bloat, so I built the arXiv equivalent of gitingest.com.

The trick: in any arXiv URL, append 2md to "arxiv" in the domain name:

https://arxiv.org/abs/2501.11120v1  →  https://arxiv2md.org/abs/2501.11120v1
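
The same rewrite can be scripted. The sketch below is only an illustration of the URL trick, not code from the project: the function name `to_arxiv2md_url` and the set of accepted hosts are assumptions, and it does nothing beyond swapping the domain while keeping the path.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: only these hosts count as "an arXiv URL" for this sketch.
ARXIV_HOSTS = {"arxiv.org", "www.arxiv.org"}


def to_arxiv2md_url(arxiv_url: str) -> str:
    """Point an arXiv URL at arxiv2md.org, keeping path, query and fragment."""
    parts = urlsplit(arxiv_url)
    if parts.netloc.lower() not in ARXIV_HOSTS:
        raise ValueError(f"not an arXiv URL: {arxiv_url!r}")
    return urlunsplit(
        ("https", "arxiv2md.org", parts.path, parts.query, parts.fragment)
    )


print(to_arxiv2md_url("https://arxiv.org/abs/2501.11120v1"))
```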

Features

  • Zero friction: Change arxiv.org to arxiv2md.org in any arXiv URL (works with /abs/, /html/, /pdf/)
  • Section filtering: Remove references, appendix, or select only specific sections
  • Clean output: No messy PDFs or broken formatting
  • Section tree: Visual overview - click to include/exclude sections
  • LLM-optimized: Token counts, clean citations
  • Fast: Cached results, direct HTML parsing

How It Works

arxiv2md is fast because it takes advantage of arXiv's HTML format for papers. Instead of parsing PDFs (slow, error-prone), we directly parse the structured HTML that arXiv provides for newer papers. This gives us:

  • Clean section boundaries and hierarchies
  • Proper math rendering (MathML → Markdown)
  • Reliable table extraction
  • Fast processing (no OCR or PDF parsing)

The HTML is parsed with BeautifulSoup4 and converted to Markdown with custom logic for citations, math equations, and paper structure.
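
To make the idea concrete, here is a deliberately tiny sketch of turning structured HTML into Markdown. It is not the project's converter: arxiv2md uses BeautifulSoup4, while this example swaps in the standard library's `html.parser` so it runs without dependencies. The class and function names are hypothetical, and it handles only headings and paragraphs (no math, citations, or tables).

```python
from html.parser import HTMLParser

# Tags treated as block-level in this sketch.
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
BLOCK_TAGS = HEADING_TAGS | {"p"}


class HeadingsAndParagraphs(HTMLParser):
    """Collect heading and paragraph text as Markdown blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished Markdown blocks
        self._prefix = None   # Markdown prefix of the currently open block
        self._buf = []        # text fragments of the currently open block

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._prefix = "#" * int(tag[1]) + " "
        elif tag == "p":
            self._prefix = ""
        else:
            return  # inline tags such as <em> keep the current buffer
        self._buf = []

    def handle_data(self, data):
        if self._prefix is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if self._prefix is not None and tag in BLOCK_TAGS:
            text = " ".join("".join(self._buf).split())  # collapse whitespace
            if text:
                self.blocks.append(self._prefix + text)
            self._prefix = None


def html_to_markdown(html: str) -> str:
    parser = HeadingsAndParagraphs()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)


sample = "<h1>Title</h1><section><h2>1 Introduction</h2><p>Hello <em>world</em>.</p></section>"
print(html_to_markdown(sample))
```

Because the heading level comes straight from the tag, section boundaries and hierarchy fall out of the markup instead of being guessed from PDF layout.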

Usage

Web App

Visit arxiv2md.org and paste any arXiv URL, or change arxiv.org to arxiv2md.org in an arXiv URL in your browser.

CLI

# Install (editable, run from a clone of the repository)
pip install -e .

# Basic usage
arxiv2md 2501.11120v1 -o paper.md

# Only include specific sections
arxiv2md 2501.11120v1 --section-filter-mode include --sections "Abstract,Introduction" -o -

# Remove references and TOC
arxiv2md 2501.11120v1 --remove-refs --remove-toc -o -

# Include YAML frontmatter with paper metadata
arxiv2md 2501.11120v1 --frontmatter -o paper.md
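
If you drive the CLI from a script, the commands above can be assembled programmatically. This is a rough sketch that uses only the flags shown in this section; the helper `build_cli_args` is hypothetical, and treating `-o -` as "write to stdout" is an assumption based on the examples above.

```python
import shlex


def build_cli_args(paper_id, output="-", sections=None,
                   remove_refs=False, remove_toc=False, frontmatter=False):
    """Return an argv list for the arxiv2md CLI, using flags from the examples."""
    args = ["arxiv2md", paper_id]
    if sections:
        # Include mode: keep only the named sections.
        args += ["--section-filter-mode", "include",
                 "--sections", ",".join(sections)]
    if remove_refs:
        args.append("--remove-refs")
    if remove_toc:
        args.append("--remove-toc")
    if frontmatter:
        args.append("--frontmatter")
    args += ["-o", output]
    return args


cmd = build_cli_args("2501.11120v1", sections=["Abstract", "Introduction"])
print(shlex.join(cmd))  # shell-safe rendering of the command
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it, provided the `arxiv2md` command is installed and on your PATH.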

API

Two GET endpoints for programmatic access:

# JSON response (with metadata)
curl "https://arxiv2md.org/api/json?url=2312.00752"

# Raw markdown
curl "https://arxiv2md.org/api/markdown?url=2312.00752"

Parameters:

Param             Default   Description
url               required  arXiv URL or ID
remove_refs       true      Remove references
remove_toc        true      Remove table of contents
remove_citations  true      Remove inline citations
frontmatter       false     Prepend YAML frontmatter with paper metadata (/api/markdown only)
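
For use from Python, a request URL can be built from these parameters with the standard library. This is a sketch, not an official client: `build_api_url` and `fetch_markdown` are hypothetical names, and sending booleans as lowercase `true`/`false` strings and decoding the response as UTF-8 are assumptions.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

API_BASE = "https://arxiv2md.org/api"


def build_api_url(paper, fmt="markdown", **flags):
    """Build a URL for /api/markdown or /api/json.

    `paper` is an arXiv URL or ID; `flags` are the boolean parameters
    from the table (remove_refs, remove_toc, remove_citations, frontmatter).
    """
    if fmt not in ("markdown", "json"):
        raise ValueError("fmt must be 'markdown' or 'json'")
    query = {"url": paper}
    # Assumption: booleans are encoded as lowercase "true"/"false".
    query.update({name: str(value).lower() for name, value in flags.items()})
    return f"{API_BASE}/{fmt}?{urlencode(query)}"


def fetch_markdown(paper, **flags):
    """Fetch raw Markdown for a paper (performs a network request)."""
    with urlopen(build_api_url(paper, "markdown", **flags), timeout=30) as resp:
        return resp.read().decode("utf-8")  # assumption: UTF-8 response body


print(build_api_url("2312.00752", remove_refs=False))
```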

Rate limit: 30 requests/minute per IP.
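
A batch script can stay under this limit by spacing its requests. The sketch below is one simple client-side approach, assuming evenly spaced calls (at least 2 seconds apart for 30 per minute), which is stricter than the limit requires; the `Throttle` class is hypothetical and knows nothing about how the server actually counts requests.

```python
import time


class Throttle:
    """Space calls so that at most `max_per_minute` happen per minute."""

    def __init__(self, max_per_minute=30, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute  # 2.0 s for 30 requests/minute
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle()
print(throttle.min_interval)
```

Call `throttle.wait()` immediately before each API request; the first call returns at once and later calls sleep only as long as needed.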

Section Filtering

  • Exclude mode (default): Remove unwanted sections like References or Appendix
  • Include mode: Extract only what you need like "Abstract,Introduction,Conclusion"

The section tree in the UI lets you click sections to toggle them in/out.
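
The two modes boil down to a keep-or-drop decision per section. The following sketch shows that logic on a list of (title, body) pairs; `filter_sections` is a hypothetical helper, and matching titles exactly and case-insensitively is an assumption, since the tool's real matching rules are not described here.

```python
def filter_sections(sections, names, mode="exclude"):
    """Filter (title, body) pairs by title.

    mode="exclude" drops the named sections; mode="include" keeps only them.
    Assumption: titles are compared exactly, ignoring case and outer spaces.
    """
    wanted = {name.strip().lower() for name in names}
    if mode == "include":
        return [(t, b) for t, b in sections if t.strip().lower() in wanted]
    if mode == "exclude":
        return [(t, b) for t, b in sections if t.strip().lower() not in wanted]
    raise ValueError("mode must be 'include' or 'exclude'")


paper = [
    ("Abstract", "..."),
    ("Introduction", "..."),
    ("References", "..."),
    ("Appendix", "..."),
]

kept = filter_sections(paper, ["References", "Appendix"])  # exclude mode
print([title for title, _ in kept])

only = filter_sections(paper, "Abstract,Introduction".split(","), mode="include")
print([title for title, _ in only])
```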

Development

# Run locally
python -m venv .venv
source .venv/bin/activate
pip install -e ".[server]"
uvicorn server.main:app --reload --app-dir src

# Run tests
pip install -e ".[dev]"
pytest tests

Deployment

Deployment to DigitalOcean with Docker, Nginx, and SSL is handled by a single script:

git clone https://github.com/timf34/arxiv2md.git /root/arxiv2md
cd /root/arxiv2md
chmod +x deploy.sh
sudo ./deploy.sh

Contributing

PRs welcome! Fork the repo, create a feature branch, add tests if applicable, and submit a PR.

License

MIT


Inspired by gitingest for digesting Git repos.

Star this repo if you find it useful!
