Convert arXiv papers to plain text using LaTeXML

arxivparse

Convert arXiv papers to clean plain text using LaTeXML. Downloads LaTeX source from arXiv, converts to XML, and extracts body text (title, abstract, sections, math, captions) — no HTML intermediary, no bibliography, no footnotes.

Prerequisites

  • Python >= 3.13
  • uv (or pip)
  • LaTeXML (v0.8.x) — install via Homebrew:
brew install latexml

Verify it's on your PATH:

latexml --VERSION
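
The same check can be scripted before starting a batch run. This is a minimal sketch using only the standard library; the helper names `find_latexml` and `latexml_version` are hypothetical and not part of arxivparse.

```python
import shutil
import subprocess
from typing import Optional


def find_latexml(binary: str = "latexml") -> Optional[str]:
    """Return the absolute path of the executable, or None if it is not on PATH."""
    return shutil.which(binary)


def latexml_version(binary: str = "latexml") -> Optional[str]:
    """Return whatever `<binary> --VERSION` prints, or None if the binary is missing."""
    path = find_latexml(binary)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--VERSION"], capture_output=True, text=True, check=False
    )
    # The version banner may go to stdout or stderr, so fall back to stderr.
    return (result.stdout or result.stderr).strip()
```

A caller could stop early with a clear message when `find_latexml()` returns None, rather than letting the conversion step fail later.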

Install

# From PyPI
pip install arxivparse

# Or from source
git clone <repo-url> arxivparse
cd arxivparse
uv sync

Quick Start

from arxivparse import arxiv_to_text

text = arxiv_to_text("1706.03762")
print(text[:200])

CLI Usage

# Single paper
arxivparse 1706.03762

# Multiple papers (sequential)
arxivparse 1706.03762 2301.07041

# Custom output path
arxivparse -o output.txt 1706.03762

# Custom output directory
arxivparse -d ./papers 1706.03762 2301.07041

# Verbose output (download, convert, extract steps)
arxivparse -v 1706.03762

# Keep temp files for debugging
arxivparse --keep-temp 1706.03762

Each paper produces a <arxiv_id>.txt file.
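
Because output files are named after the arXiv ID, a script can work out which papers still need converting before it calls the CLI. The sketch below assumes that `-d <dir>` writes `<dir>/<arxiv_id>.txt`; the helpers `expected_output_path` and `pending_ids` are hypothetical and not part of arxivparse.

```python
from pathlib import Path
from typing import Iterable, List, Union


def expected_output_path(arxiv_id: str, directory: Union[str, Path] = ".") -> Path:
    """Path where the text for `arxiv_id` is assumed to be written."""
    return Path(directory) / f"{arxiv_id}.txt"


def pending_ids(arxiv_ids: Iterable[str], directory: Union[str, Path] = ".") -> List[str]:
    """Return the IDs that have no output file in `directory` yet, in input order."""
    return [
        arxiv_id
        for arxiv_id in arxiv_ids
        if not expected_output_path(arxiv_id, directory).is_file()
    ]
```

The list returned by `pending_ids` can be passed straight to `arxivparse -d ./papers ...` so that papers already converted are skipped.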

Library Usage

Simple: get text as a string

from arxivparse import arxiv_to_text

text = arxiv_to_text("1706.03762")

Full pipeline control

from arxivparse.pipeline import convert_arxiv_to_text
from arxivparse.errors import Arxiv2TextError

try:
    output_path = convert_arxiv_to_text("1706.03762")
    text = output_path.read_text(encoding="utf-8")
except Arxiv2TextError as e:
    print(f"Failed: {e}")

Call the CLI from code

from main import main

main(["1706.03762", "2301.07041"])
main(["-o", "output.txt", "1706.03762"])

Arguments

Param        Type  Default         Description
arxiv_id     str   -               arXiv ID (e.g. "1706.03762")
output_path  Path  <arxiv_id>.txt  Where to write the output
keep_temp    bool  False           Keep temp files after conversion

Error Handling

from arxivparse.errors import (
    DownloadError,         # network/HTTP failure
    NoLatexSourceError,    # paper is PDF-only
    ConversionError,       # latexml failed
    MainTexNotFoundError,  # no .tex file found in bundle
)
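
When processing several papers, it can help to record failures per ID instead of stopping at the first one. The following is a sketch only: `convert_many` is a hypothetical helper, and the converter and the exception classes are passed in as arguments so the pattern does not depend on how arxivparse raises its errors.

```python
from typing import Callable, Dict, Iterable, Tuple, Type


def convert_many(
    arxiv_ids: Iterable[str],
    convert: Callable[[str], str],
    expected_errors: Tuple[Type[Exception], ...] = (Exception,),
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Run `convert` on each ID and return (texts, failures), both keyed by ID."""
    texts: Dict[str, str] = {}
    failures: Dict[str, str] = {}
    for arxiv_id in arxiv_ids:
        try:
            texts[arxiv_id] = convert(arxiv_id)
        except expected_errors as exc:
            # Keep the exception class name so a PDF-only paper can be told
            # apart from, for example, a network failure.
            failures[arxiv_id] = f"{type(exc).__name__}: {exc}"
    return texts, failures
```

With arxivparse installed, one plausible call is `convert_many(ids, arxiv_to_text, (DownloadError, NoLatexSourceError, ConversionError, MainTexNotFoundError))`, assuming `arxiv_to_text` raises these classes. Exceptions outside the tuple still propagate.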

Build / Publish

# Build a wheel
uv build

# Publish to PyPI
uv publish

# Or with twine
twine upload dist/*

Output Format

QASPER-style plain text: title, abstract, section headings, body paragraphs, inline math (LaTeX notation like h_{t}), and figure/table captions. No bibliography entries, footnotes, author affiliations, or citation numbers.


Download files

Source Distribution

arxivparse-0.1.1.tar.gz (8.3 kB)

Uploaded Source

Built Distribution

arxivparse-0.1.1-py3-none-any.whl (11.8 kB)

Uploaded Python 3

File details

Details for the file arxivparse-0.1.1.tar.gz.

File metadata

  • Download URL: arxivparse-0.1.1.tar.gz
  • Upload date:
  • Size: 8.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arxivparse-0.1.1.tar.gz
Algorithm    Hash digest
SHA256       0df10c633954207902ba733d1f049f5bba2bfd19f216cefaeae61588fb570282
MD5          b014e5c6c1fd8df8a87e451e18d60757
BLAKE2b-256  8fd5baf2693a0b66eb68f9b3a39b18f50948ff407cbb7492f6c5362dc907bfd9
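
A downloaded file can be checked against its published SHA256 digest with the standard library. This is a minimal sketch; the helper names `sha256_of` and `matches_published` are hypothetical, and the file is read in chunks so large archives do not have to fit in memory.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"" at EOF.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: Path, expected_hex: str) -> bool:
    """Compare the file's digest with a published hex digest, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution, `matches_published(Path("arxivparse-0.1.1.tar.gz"), "0df10c63...")` with the full SHA256 value listed for that file should return True for an intact download.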

Provenance

The following attestation bundles were made for arxivparse-0.1.1.tar.gz:

Publisher: python-publish.yml on flrjrf/arxivparse

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file arxivparse-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: arxivparse-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 11.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arxivparse-0.1.1-py3-none-any.whl
Algorithm    Hash digest
SHA256       22af26aafb720ef9b7520768359c7d26cd3b20309db9367fbf89436411ce94ee
MD5          1c08fe5183952654b3f783c8610f31f5
BLAKE2b-256  1a5a447f70e7c2a6128bdbb46d0f9ad8cd8a1e22fbb03d337c7cbac5b1d8ba59

Provenance

The following attestation bundles were made for arxivparse-0.1.1-py3-none-any.whl:

Publisher: python-publish.yml on flrjrf/arxivparse
