Convert arXiv papers to clean Markdown. Particularly useful for prompting LLMs.
Project description
arxiv2md
Why?
I got tired of copy-pasting arXiv PDFs/HTML into LLMs and fighting references, TOCs, and token bloat. So I made gitingest.com but for arXiv papers.
The trick: Just append 2md to any arXiv URL:
https://arxiv.org/abs/2501.11120v1 → https://arxiv2md.org/abs/2501.11120v1
Features
- Zero friction: Append
2mdto any arXiv URL (works with/abs/,/html/,/pdf/) - Section filtering: Remove references, appendix, or select only specific sections
- Clean output: No messy PDFs or broken formatting
- Section tree: Visual overview - click to include/exclude sections
- LLM-optimized: Token counts, clean citations
- Fast: Cached results, direct HTML parsing
How It Works
arxiv2md is fast because it takes advantage of arXiv's HTML format for papers. Instead of parsing PDFs (slow, error-prone), we directly parse the structured HTML that arXiv provides for newer papers. This gives us:
- Clean section boundaries and hierarchies
- Proper math rendering (MathML → Markdown)
- Reliable table extraction
- Fast processing (no OCR or PDF parsing)
The HTML is converted to Markdown using BeautifulSoup4, with custom logic for handling citations, math equations, and paper structure.
Usage
Web App
Visit arxiv2md.org and paste any arXiv URL, or append 2md to an arXiv URL in your browser.
CLI
# Install
pip install -e .
# Basic usage
arxiv2md 2501.11120v1 -o paper.md
# Only include specific sections
arxiv2md 2501.11120v1 --section-filter-mode include --sections "Abstract,Introduction" -o -
# Remove references and TOC
arxiv2md 2501.11120v1 --remove-refs --remove-toc -o -
# Include YAML frontmatter with paper metadata
arxiv2md 2501.11120v1 --frontmatter -o paper.md
API
Two GET endpoints for programmatic access:
# JSON response (with metadata)
curl "https://arxiv2md.org/api/json?url=2312.00752"
# Raw markdown
curl "https://arxiv2md.org/api/markdown?url=2312.00752"
Parameters:
| Param | Default | Description |
|---|---|---|
url |
required | arXiv URL or ID |
remove_refs |
true |
Remove references |
remove_toc |
true |
Remove table of contents |
remove_citations |
true |
Remove inline citations |
frontmatter |
false |
Prepend YAML frontmatter with paper metadata (/api/markdown only) |
Rate limit: 30 requests/minute per IP.
Section Filtering
Exclude mode (default): Remove unwanted sections like References or Appendix Include mode: Extract only what you need like "Abstract,Introduction,Conclusion"
The section tree in the UI lets you click sections to toggle them in/out.
Development
# Run locally
python -m venv .venv
source .venv/bin/activate
pip install -e .[server]
uvicorn server.main:app --reload --app-dir src
# Run tests
pip install -e .[dev]
pytest tests
Deployment
One-command deployment to DigitalOcean with Docker, Nginx, and SSL:
git clone https://github.com/timf34/arxiv2md.git /root/arxiv2md
cd /root/arxiv2md
chmod +x deploy.sh
sudo ./deploy.sh
Contributing
PRs welcome! Fork the repo, create a feature branch, add tests if applicable, and submit a PR.
License
MIT
Inspired by gitingest for digesting Git repos.
Star this repo if you find it useful!
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distributions
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file arxiv2markdown-0.1.0-py3-none-any.whl.
File metadata
- Download URL: arxiv2markdown-0.1.0-py3-none-any.whl
- Upload date:
- Size: 1.8 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
5adabc46ca4e1a55921304df0b5d284efa7f5ec1d7c7684e96b19f5452355232
|
|
| MD5 |
65a31200b409c233f46eff64565ecd5e
|
|
| BLAKE2b-256 |
e4db758b5c589be8187e856e0820cf67a8565af99f8b6b6ee380217e459b0814
|