Convert arXiv TeX sources to Markdown

arxiv-md

Good enough™ LaTeX to Markdown converter. Optimized on hundreds of arXiv papers from 2016 to 2026.

arxiv-md is a Python package and CLI for turning arXiv e-print bundles, source directories, or single .tex files into:

- document.md       # rendered Markdown
- conversion.json   # warnings, stats, options, paths
- images/           # optional rendered/copied figures

Unsupported TeX is preserved as raw LaTeX when possible and reported through typed diagnostics. Output locations are explicit: callers choose the exact output directory, and conversion never writes into the source tree.
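As a rough illustration of the layout above, the following sketch inspects an output directory after a conversion. The three file names come from the list above; the helper name `summarize_output` is hypothetical and not part of arxiv-md. Because the schema of conversion.json is not described here, the sketch only reports its top-level keys rather than assuming specific fields.

```python
import json
from pathlib import Path


def summarize_output(outdir):
    """Report which expected outputs exist in a conversion output directory.

    Hypothetical helper for illustration; it is not part of arxiv-md.
    """
    outdir = Path(outdir)
    report_path = outdir / "conversion.json"
    summary = {
        "has_markdown": (outdir / "document.md").is_file(),
        "has_report": report_path.is_file(),
        "has_images": (outdir / "images").is_dir(),  # optional output
        "report_keys": [],
    }
    if summary["has_report"]:
        data = json.loads(report_path.read_text(encoding="utf-8"))
        # The report schema is not assumed; only top-level keys are listed.
        if isinstance(data, dict):
            summary["report_keys"] = sorted(data)
    return summary
```

Calling `summarize_output("out/paper")` after a run gives a quick check that the Markdown and the report were written, without depending on any particular report field.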

Install

# Core library and CLIs
uv add arxiv-md

# Add optional pypdfium2/Pillow-backed PDF/JPEG → PNG rendering
uv add 'arxiv-md[assets]'

# CLI-only install
uv tool install 'arxiv-md[assets]'

Supports Python 3.10–3.13. Core library has no required runtime dependencies.

Quickstart

# Single .tex file
tex-to-md paper.tex --outdir out/paper

# Source archive (.tar, .tar.gz, .tgz, .zip, .gz)
tex-to-md paper.tar.gz --outdir out/paper --json

# Download arXiv source by id and convert
arxiv-to-md 1706.03762 --outdir out

--outdir is mandatory. tex-to-md writes one document directly into that folder. arxiv-to-md writes one subdirectory per paper.

Common asset modes:

  • Default rasterize — PDF→PNG via pypdfium2, JPEG→PNG via Pillow; best compatibility, highest CPU/native-code exposure.
  • --asset-mode copy — copy PDF/JPEG verbatim; faster, no optional asset deps needed.
  • --asset-mode skip — resolve/count figures but write no images.
  • --no-assets — text-only conversion; no images/ directory.
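For scripted runs, the options above can be assembled into a command line from Python. This is a minimal sketch that uses only the flags shown in this README (`--outdir`, `--json`, `--asset-mode copy|skip`, `--no-assets`). The helper names are hypothetical, and treating `--no-assets` and `--asset-mode` as mutually exclusive is an assumption of the sketch, not documented behavior. The default rasterize mode is selected by passing no asset flag at all.

```python
import subprocess

# Non-default --asset-mode values listed above; rasterize is the default.
ASSET_MODES = ("copy", "skip")


def build_tex_to_md_command(source, outdir, asset_mode=None,
                            no_assets=False, json_output=False):
    """Build an argv list for tex-to-md (hypothetical helper)."""
    cmd = ["tex-to-md", str(source), "--outdir", str(outdir)]
    if no_assets:
        # Assumption: --no-assets is used on its own, without --asset-mode.
        cmd.append("--no-assets")
    elif asset_mode is not None:
        if asset_mode not in ASSET_MODES:
            raise ValueError(f"unknown asset mode: {asset_mode!r}")
        cmd += ["--asset-mode", asset_mode]
    if json_output:
        cmd.append("--json")
    return cmd


def run_tex_to_md(source, outdir, **kwargs):
    """Run tex-to-md; raises CalledProcessError on a non-zero exit code."""
    cmd = build_tex_to_md_command(source, outdir, **kwargs)
    return subprocess.run(cmd, check=True)
```

For example, `run_tex_to_md("paper.tar.gz", "out/paper", asset_mode="copy")` would invoke the CLI with verbatim asset copying, which avoids the optional rendering dependencies.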

The assets extra uses pypdfium2 and Pillow. The core install has no required dependencies and no runtime dependency on the AGPL-licensed PyMuPDF.

Full CLI reference: docs/cli.md.

Python API

from pathlib import Path

from arxiv_md import ConvertOptions, convert_path, write_result

result = convert_path(Path("paper.tex"), ConvertOptions(render_assets=False))
print(result.markdown)
write_result(result, "out/paper")

convert_path returns a ConvertResult with the rendered Markdown, a typed document IR, warnings, stats, and, when files are written, the output paths.
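The same three calls can be looped over a folder of papers. The sketch below assumes a directory in which every .tex file is a standalone paper, and gives each one its own output directory named after the file stem. `plan_outputs` and `convert_all` are hypothetical names. The check that outputs stay outside the source tree is done by the sketch itself and only mirrors the rule stated earlier. Whether write_result creates missing directories is not stated here, so the sketch creates the parent directory first.

```python
from pathlib import Path


def plan_outputs(source_dir, out_root):
    """Map each .tex file directly inside source_dir to its own output dir."""
    source_dir = Path(source_dir).resolve()
    out_root = Path(out_root).resolve()
    # Keep outputs out of the source tree, mirroring the documented rule.
    if out_root == source_dir or source_dir in out_root.parents:
        raise ValueError("out_root must be outside the source directory")
    return {tex: out_root / tex.stem for tex in sorted(source_dir.glob("*.tex"))}


def convert_all(source_dir, out_root):
    """Convert every planned file using the API calls shown above."""
    # Imported lazily so plan_outputs works without arxiv-md installed.
    from arxiv_md import ConvertOptions, convert_path, write_result

    options = ConvertOptions(render_assets=False)  # text-only, as above
    for tex, outdir in plan_outputs(source_dir, out_root).items():
        # Assumption: creating the parent directory up front is harmless.
        outdir.parent.mkdir(parents=True, exist_ok=True)
        result = convert_path(tex, options)
        write_result(result, str(outdir))
```

With this layout, `convert_all("papers", "out")` would write `out/<stem>/` for each `papers/<stem>.tex`.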

Full API reference: docs/api.md.

At a glance

Inputs:

  • .tex files.
  • Source directories with a detectable main .tex file.
  • .tar, .tar.gz, .tgz, .zip, and single-file .gz source bundles.
  • arXiv IDs or search queries via arxiv-to-md.

Native handling covers common paper structure: sections, frontmatter, inline formatting, math, lists, figures, tables, bibliography, citations, cross-references, theorem-like environments, algorithm pseudocode, common glyphs, siunitx, and common macro definitions.

Unsupported inputs: .pdf, .html, .rar, .7z, and lone .bib files.
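A caller that wants to filter files before invoking the converter could pre-check names against the lists above. This is a sketch under stated assumptions: the suffix tuples are copied from this README, `classify_input` is a hypothetical helper, and the check looks only at the file name. It is not the detection logic arxiv-md itself uses, and it does not cover source directories or arXiv IDs.

```python
from pathlib import Path

# Suffixes copied from the input lists above.
SUPPORTED_SUFFIXES = (".tex", ".tar", ".tar.gz", ".tgz", ".zip", ".gz")
UNSUPPORTED_SUFFIXES = (".pdf", ".html", ".rar", ".7z", ".bib")


def classify_input(path):
    """Return 'supported', 'unsupported', or 'unknown' from the name alone."""
    name = Path(path).name.lower()
    if name.endswith(UNSUPPORTED_SUFFIXES):
        return "unsupported"
    if name.endswith(SUPPORTED_SUFFIXES):
        return "supported"
    # Anything else (including directories) is left for the caller to decide.
    return "unknown"
```

Returning "unknown" instead of rejecting unlisted names keeps the helper conservative: the converter remains the final authority on what it can read.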

When to use it

Use arxiv-md when you need semantic Markdown for search, indexing, previews, datasets, or downstream text processing.

Do not use it when exact PDF layout matters. arxiv-md is not a TeX engine: it does not run pdflatex, reproduce page layout, or guarantee full macro and environment expansion.

License

MIT. See LICENSE.
