Convert arXiv papers to plain text using LaTeXML

arxivparser

Convert arXiv papers to clean plain text using LaTeXML. arxivparser downloads the LaTeX source from arXiv, converts it to XML, and extracts the body text (title, abstract, sections, math, captions). There is no HTML intermediary, and the bibliography and footnotes are left out.

Prerequisites

  • Python >= 3.13
  • uv (or pip)
  • LaTeXML (v0.8.x), installed via Homebrew:
brew install latexml

Verify it's on your PATH:

latexml --VERSION
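
The same check can be made from Python before starting a long batch job. The following is a minimal sketch using only the standard library; `latexml_version` is a hypothetical helper name and is not part of arxivparser.

```python
import shutil
import subprocess
from typing import Optional


def latexml_version(executable: str = "latexml") -> Optional[str]:
    """Return the version banner printed by LaTeXML, or None if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    # The banner may be written to stdout or stderr, so read both streams.
    result = subprocess.run(
        [path, "--VERSION"], capture_output=True, text=True, check=False
    )
    return (result.stdout + result.stderr).strip() or None


if __name__ == "__main__":
    version = latexml_version()
    print(version if version else "latexml not found on PATH")
```

Returning None instead of raising lets the caller decide whether a missing LaTeXML install is fatal or just worth a warning.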

Install

# From PyPI
pip install arxivparse

# Or from source
git clone <repo-url> arxivparser
cd arxivparser
uv sync

Quick Start

from arxivparser import arxiv_to_text

text = arxiv_to_text("1706.03762")
print(text[:200])

CLI Usage

# Single paper
arxivparser 1706.03762

# Multiple papers (sequential)
arxivparser 1706.03762 2301.07041

# Custom output path
arxivparser -o output.txt 1706.03762

# Custom output directory
arxivparser -d ./papers 1706.03762 2301.07041

# Verbose output (download, convert, extract steps)
arxivparser -v 1706.03762

# Keep temp files for debugging
arxivparser --keep-temp 1706.03762

Unless -o is given, each paper produces an <arxiv_id>.txt file.
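
To drive the CLI from a script, the invocations above can be assembled programmatically. This is a sketch only: `build_command` is a hypothetical helper, and it uses just the flags listed in this section (`-o`, `-d`, `-v`, `--keep-temp`). It also assumes that `-o` is meant for a single paper, which the examples suggest but do not state.

```python
from pathlib import Path
from typing import List, Optional, Sequence


def build_command(
    arxiv_ids: Sequence[str],
    output: Optional[Path] = None,
    directory: Optional[Path] = None,
    verbose: bool = False,
    keep_temp: bool = False,
) -> List[str]:
    """Build an argument list for the arxivparser CLI."""
    if not arxiv_ids:
        raise ValueError("at least one arXiv ID is required")
    if output is not None and len(arxiv_ids) > 1:
        # Assumption: one output file only makes sense for one paper.
        raise ValueError("-o takes a single paper; use -d for several")
    command = ["arxivparser"]
    if output is not None:
        command += ["-o", str(output)]
    if directory is not None:
        command += ["-d", str(directory)]
    if verbose:
        command.append("-v")
    if keep_temp:
        command.append("--keep-temp")
    return command + list(arxiv_ids)


if __name__ == "__main__":
    cmd = build_command(["1706.03762", "2301.07041"], directory=Path("papers"))
    print(" ".join(cmd))
    # To run it (requires arxivparser and LaTeXML to be installed):
    # import subprocess; subprocess.run(cmd, check=True)
```

Passing the result to `subprocess.run` as a list avoids shell quoting issues with output paths that contain spaces.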

Library Usage

Simple: get text as a string

from arxivparser import arxiv_to_text

text = arxiv_to_text("1706.03762")

Full pipeline control

from arxivparser.pipeline import convert_arxiv_to_text
from arxivparser.errors import Arxiv2TextError

try:
    output_path = convert_arxiv_to_text("1706.03762")
    text = output_path.read_text(encoding="utf-8")
except Arxiv2TextError as e:
    print(f"Failed: {e}")
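
When converting many papers, it may be preferable to record failures and continue rather than stop at the first error. The sketch below takes the converter as a parameter so that it runs without the library installed; in practice one would pass `convert_arxiv_to_text` and catch `Arxiv2TextError` instead of the broad `Exception` used here. `convert_many` is a hypothetical helper, not part of arxivparser.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, Tuple


def convert_many(
    arxiv_ids: Iterable[str],
    convert: Callable[[str], Path],
) -> Tuple[Dict[str, Path], Dict[str, str]]:
    """Convert each ID and return (successes, failures), both keyed by arXiv ID."""
    successes: Dict[str, Path] = {}
    failures: Dict[str, str] = {}
    for arxiv_id in arxiv_ids:
        try:
            successes[arxiv_id] = convert(arxiv_id)
        except Exception as exc:  # narrow this to Arxiv2TextError in real use
            failures[arxiv_id] = f"{type(exc).__name__}: {exc}"
    return successes, failures


if __name__ == "__main__":
    def fake_convert(arxiv_id: str) -> Path:
        # Stand-in for convert_arxiv_to_text, used only for this demo.
        if arxiv_id == "0000.00000":
            raise RuntimeError("no such paper")
        return Path(f"{arxiv_id}.txt")

    done, failed = convert_many(["1706.03762", "0000.00000"], fake_convert)
    print(done)
    print(failed)
```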

Call the CLI from code

from main import main

main(["1706.03762", "2301.07041"])
main(["-o", "output.txt", "1706.03762"])

Arguments

Param        Type  Default          Description
arxiv_id     str   -                arXiv ID (e.g. "1706.03762")
output_path  Path  <arxiv_id>.txt   Where to write the output
keep_temp    bool  False            Keep temp files after conversion
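
Because `arxiv_id` is passed as a plain string, a quick format check before calling the library can catch typos early. The sketch below approximates the two identifier schemes arXiv has used: the current `YYMM.NNNNN` form (four digits after the dot before 2015) and the older `archive/YYMMNNN` form, each with an optional version suffix. Whether arxivparser accepts old-style or versioned IDs is not stated here, so treat that as an assumption; `is_arxiv_id` is a hypothetical helper.

```python
import re

# New style, e.g. "1706.03762" or "2301.07041v2".
_NEW_STYLE = re.compile(r"\d{4}\.\d{4,5}(?:v\d+)?")
# Old style, e.g. "hep-th/9901001" or "math.GT/0309136".
_OLD_STYLE = re.compile(r"[a-z]+(?:-[a-z]+)?(?:\.[A-Z]{2})?/\d{7}(?:v\d+)?")


def is_arxiv_id(candidate: str) -> bool:
    """Return True if the string looks like a new- or old-style arXiv ID."""
    return bool(_NEW_STYLE.fullmatch(candidate) or _OLD_STYLE.fullmatch(candidate))


if __name__ == "__main__":
    for value in ["1706.03762", "hep-th/9901001", "1706-03762"]:
        print(value, is_arxiv_id(value))
```

The check is purely syntactic: a well-formed ID can still refer to a paper that does not exist or has no LaTeX source.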

Error Handling

from arxivparser.errors import (
    DownloadError,         # network/HTTP failure
    NoLatexSourceError,    # paper is PDF-only
    ConversionError,       # latexml failed
    MainTexNotFoundError,  # no .tex file found in bundle
)
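
Of these, a download failure is the one most likely to be transient, whereas a PDF-only paper or a bundle without a .tex file will presumably fail the same way on every attempt. One way to act on that distinction is to retry only the transient case. The sketch below is library-agnostic so that it runs on its own: the function to call and the exception type to retry on are passed in (in practice, `arxiv_to_text` and `DownloadError`). `with_retry`, its parameters, and the backoff values are illustrative assumptions, not part of arxivparser.

```python
import time
from typing import Callable, Type, TypeVar

T = TypeVar("T")


def with_retry(
    func: Callable[[str], T],
    arxiv_id: str,
    retry_on: Type[Exception],
    attempts: int = 3,
    base_delay: float = 1.0,
) -> T:
    """Call func(arxiv_id), retrying only when retry_on is raised."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func(arxiv_id)
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")  # the loop always returns or raises
```

Any exception other than `retry_on` propagates immediately, so permanent failures are reported without delay.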

Build / Publish

# Build a wheel
uv build

# Publish to PyPI
uv publish

# Or with twine
twine upload dist/*

Output Format

The output is QASPER-style plain text: title, abstract, section headings, body paragraphs, inline math (in LaTeX notation, such as h_{t}), and figure/table captions. It contains no bibliography entries, footnotes, author affiliations, or citation numbers.

Download files

Source distribution: arxivparse-0.1.0.tar.gz (8.3 kB)

  • SHA256: e06d8daa0383969c30049b497f45acc951142e25b28510f41db41999316d7e50
  • MD5: 322cf682c2e978869bf8185882b0dff4
  • BLAKE2b-256: 294f2c43a2dbc608b4cd837a3c9db8e2b5c24f4eea986e72bcb46336b4c412b6

Built distribution: arxivparse-0.1.0-py3-none-any.whl (11.8 kB, Python 3)

  • SHA256: 7bd3e2d9306725ed616bc83c42d3f8bc4d762be0f6e644e29d146fdec748101b
  • MD5: 86b173729586a1dbd036ba9a0fc4c3b5
  • BLAKE2b-256: ba42f14247883ff7c4c88e0a170954245cf3349556e7000030993167530e82bc

Both files were uploaded via twine/6.1.0 (CPython/3.13.12) using Trusted Publishing. The attestation bundles name the publisher as python-publish.yml on flrjrf/arxivparse.
