Convert arXiv TeX sources to Markdown
arxiv-md
Good enough™ LaTeX to Markdown converter. Optimized on hundreds of arXiv papers from 2016 to 2026.
arxiv-md is a Python package and CLI for turning arXiv e-print bundles,
source directories, or single .tex files into:
- document.md # rendered Markdown
- conversion.json # warnings, stats, options, paths
- images/ # optional rendered/copied figures
Unsupported TeX is preserved as raw LaTeX when possible and reported through typed diagnostics. Output locations are explicit: callers choose the exact output directory, and conversion never writes into the source tree.
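As a rough illustration of the output layout described above, the following sketch inspects an output directory after a conversion. The helper names `summarize_output` and `is_outside` are hypothetical and not part of arxiv-md. The schema of conversion.json is not assumed, so only its top-level keys are listed.

```python
import json
from pathlib import Path


def summarize_output(outdir):
    """Report which of the documented artifacts exist in an output directory."""
    outdir = Path(outdir)
    report_path = outdir / "conversion.json"
    summary = {
        "has_markdown": (outdir / "document.md").is_file(),
        "has_report": report_path.is_file(),
        "has_images": (outdir / "images").is_dir(),  # images/ is optional
        "report_keys": [],
    }
    if summary["has_report"]:
        # The report schema is not assumed here; only top-level keys are listed.
        data = json.loads(report_path.read_text(encoding="utf-8"))
        if isinstance(data, dict):
            summary["report_keys"] = sorted(data)
    return summary


def is_outside(outdir, source_dir):
    """Return True if outdir is neither source_dir nor one of its descendants."""
    out = Path(outdir).resolve()
    src = Path(source_dir).resolve()
    return out != src and src not in out.parents
```

A caller that wants to mirror the "never writes into the source tree" guarantee in its own tooling could check `is_outside(outdir, source_dir)` before starting a conversion.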
Install
# Core library and CLIs
uv add arxiv-md
# Add optional pypdfium2/Pillow-backed PDF/JPEG → PNG rendering
uv add 'arxiv-md[assets]'
# CLI-only install
uv tool install 'arxiv-md[assets]'
Supports Python 3.10–3.13. Core library has no required runtime dependencies.
Quickstart
# Single .tex file
tex-to-md paper.tex --outdir out/paper
# Source archive (.tar, .tar.gz, .tgz, .zip, .gz)
tex-to-md paper.tar.gz --outdir out/paper --json
# Download arXiv source by id and convert
arxiv-to-md 1706.03762 --outdir out
--outdir is mandatory. tex-to-md writes one document directly into that
folder. arxiv-to-md writes one subdirectory per paper.
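For scripted use, the tex-to-md invocation can be driven from Python with the standard library. This is a minimal sketch that uses only the arguments shown in the quickstart (the input path, --outdir, and --json). The function names are hypothetical, and the CLI's exit codes and output format are not assumed.

```python
import subprocess


def tex_to_md_command(source, outdir, json_output=False):
    """Build the argument list for a tex-to-md invocation."""
    cmd = ["tex-to-md", str(source), "--outdir", str(outdir)]
    if json_output:
        cmd.append("--json")
    return cmd


def run_tex_to_md(source, outdir, json_output=False):
    """Run tex-to-md and return the completed process (needs the CLI on PATH)."""
    return subprocess.run(
        tex_to_md_command(source, outdir, json_output),
        capture_output=True,
        text=True,
        check=False,  # leave returncode and stderr for the caller to inspect
    )
```

Building the argument list separately from running it keeps the command easy to log and to test without the CLI installed.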
Common asset modes:
- Default rasterize: PDF→PNG via pypdfium2, JPEG→PNG via Pillow; best compatibility, highest CPU/native-code exposure.
- --asset-mode copy: copy PDF/JPEG verbatim; faster, no optional asset deps needed.
- --asset-mode skip: resolve/count figures but write no images.
- --no-assets: text-only conversion; no images/ directory.
The asset rendering extra uses pypdfium2 and Pillow. The core install has no required dependencies and no AGPL PyMuPDF runtime dependency.
Full CLI reference: docs/cli.md.
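The asset modes listed above map onto command-line flags roughly as follows. This is a sketch, not the package's own option handling: `asset_flags` is a hypothetical helper, the label "none" is an invented name for the --no-assets case, and the default mode is expressed by passing no flag at all.

```python
def asset_flags(mode="rasterize"):
    """Map an asset choice to the CLI flags listed above."""
    if mode == "rasterize":
        return []  # default mode, so no flag is passed
    if mode in ("copy", "skip"):
        return ["--asset-mode", mode]
    if mode == "none":  # hypothetical label for text-only conversion
        return ["--no-assets"]
    raise ValueError(f"unknown asset mode: {mode!r}")
```

The returned list can be appended to a tex-to-md argument list before it is passed to `subprocess.run`.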
Python API
from pathlib import Path
from arxiv_md import ConvertOptions, convert_path, write_result
result = convert_path(Path("paper.tex"), ConvertOptions(render_assets=False))
print(result.markdown)
write_result(result, "out/paper")
convert_path returns a ConvertResult containing the rendered Markdown, a typed document IR, warnings, stats, and, when files are written, the output paths.
Full API reference: docs/api.md.
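For converting several inputs in one run, a small driver loop may be enough. The sketch below takes the convert and write steps as parameters so that it runs without arxiv-md installed. `convert_many` is a hypothetical helper, and the one-folder-per-input layout based on the file stem is an assumption rather than library behavior.

```python
from pathlib import Path


def convert_many(paths, outroot, convert, write):
    """Convert each input and write it to its own directory under outroot.

    convert and write are passed in so the loop stays independent of the
    library; with arxiv-md they would wrap convert_path and write_result.
    """
    outroot = Path(outroot)
    written, failed = [], []
    for path in map(Path, paths):
        outdir = outroot / path.stem  # assumed layout: one folder per input
        try:
            write(convert(path), outdir)
        except Exception as exc:  # keep going and record the failure
            failed.append((path, exc))
        else:
            written.append(outdir)
    return written, failed


# Usage with the calls from the example above (needs arxiv-md installed):
#   from arxiv_md import ConvertOptions, convert_path, write_result
#   opts = ConvertOptions(render_assets=False)
#   convert_many(["a.tex", "b.tex"], "out",
#                lambda p: convert_path(p, opts),
#                lambda r, d: write_result(r, str(d)))
```

Collecting failures instead of raising on the first one suits batch jobs over many papers, where a single unsupported source should not stop the run.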
At a glance
Inputs:
- .tex files.
- Source directories with a detectable main .tex file.
- .tar, .tar.gz, .tgz, .zip, and single-file .gz source bundles.
- arXiv IDs or search queries via arxiv-to-md.
Native handling covers common paper structure: sections, frontmatter, inline formatting, math, lists, figures, tables, bibliography, citations, cross-references, theorem-like environments, algorithm pseudocode, common glyphs, siunitx, and common macro definitions.
Unsupported inputs: .pdf, .html, .rar, .7z, and lone .bib files.
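A pre-flight check based on the extensions listed above can filter a file list before conversion. This is a sketch built only from those extension lists and is not the package's own input detection. `classify_input` and its return labels are hypothetical.

```python
ARCHIVE_SUFFIXES = (".tar.gz", ".tgz", ".tar", ".zip")
UNSUPPORTED_SUFFIXES = (".pdf", ".html", ".rar", ".7z", ".bib")


def classify_input(name):
    """Classify a file name by extension, following the lists above."""
    lower = name.lower()
    if lower.endswith(".tex"):
        return "tex"
    if lower.endswith(ARCHIVE_SUFFIXES):
        return "archive"
    if lower.endswith(".gz"):  # checked after .tar.gz, so this is a single file
        return "gzip-single-file"
    if lower.endswith(UNSUPPORTED_SUFFIXES):
        return "unsupported"
    return "unknown"  # for example a directory name or an arXiv id
```

Names that come back as "unknown" need a separate check, since source directories and arXiv IDs cannot be recognized from an extension.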
When to use it
Use arxiv-md when you need semantic Markdown for search, indexing, previews, datasets, or downstream text processing.
Do not use it when exact PDF layout matters. arxiv-md is not a TeX engine: it does not run pdflatex, reproduce page layout, or guarantee full macro and environment expansion.
Docs
- docs/README.md: documentation index.
- docs/cli.md: CLI usage, flags, output layout.
- docs/api.md: library options, result shape, document IR.
- docs/supported-latex.md: inputs, non-goals, support matrix.
- docs/diagnostics.md: warnings, fatal errors, JSON envelopes.
- docs/security.md: archive hardening, isolation, resource limits.
- docs/performance.md: benchmarks and asset-mode tradeoffs.
License
MIT. See LICENSE.
File details
Details for the file arxiv_md-0.1.0.tar.gz.
File metadata
- Download URL: arxiv_md-0.1.0.tar.gz
- Upload date:
- Size: 94.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.10.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 56c1be8474ba1d26dc9a65e473fb037f1f38f0d1b2ab320ecf93e582d2e2db1e |
| MD5 | cfde6386316607a3bd625a7d6fd42efb |
| BLAKE2b-256 | 7e203d35c9c595e77cd04ea6e1f5320ca50cb5e92ca8d56ba702209083d450ca |
File details
Details for the file arxiv_md-0.1.0-py3-none-any.whl.
File metadata
- Download URL: arxiv_md-0.1.0-py3-none-any.whl
- Upload date:
- Size: 116.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.10.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 26f49f463f0b650822cb09413d54719ef03e1afd6cf333bb5953eafc60393658 |
| MD5 | ff88a02ea8c65f9129121ae3e743e079 |
| BLAKE2b-256 | feab8535b786f55187b328370f1c60af196d35ffbefe9a39abb00ae5ed662bbd |
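A downloaded file can be checked against the SHA256 digests listed above with the standard library. This is a minimal sketch. The helper names are hypothetical and the file path is whatever local copy you downloaded.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path, expected_hex):
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_sha256("arxiv_md-0.1.0.tar.gz", "<SHA256 from the table>")` should return True for an intact download of the source distribution.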