Convert arXiv papers to plain text using LaTeXML

arxivparse

Convert arXiv papers to clean plain text using LaTeXML. Downloads LaTeX source from arXiv, converts to XML, and extracts body text (title, abstract, sections, math, captions) — no HTML intermediary, no bibliography, no footnotes.

Prerequisites

  • Python >= 3.13
  • uv (or pip)
  • LaTeXML (v0.8.x) — install via Homebrew:
brew install latexml

Verify it's on your PATH:

latexml --VERSION
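
The same check can be made from Python before starting a long batch. This is a minimal sketch using only the standard library; the helper name find_latexml is hypothetical and is not part of arxivparse.

```python
import shutil


def find_latexml(executable: str = "latexml") -> str:
    """Return the full path of the LaTeXML executable, or raise if it is missing."""
    path = shutil.which(executable)
    if path is None:
        raise RuntimeError(
            f"{executable!r} was not found on PATH; install LaTeXML first"
        )
    return path
```

Calling find_latexml() once at startup turns a missing install into an immediate, readable error instead of a failure partway through a conversion.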

Install

# From PyPI
pip install arxivparse

# Or from source
git clone <repo-url> arxivparse
cd arxivparse
uv sync

Quick Start

from arxivparse import arxiv_to_text

text = arxiv_to_text("1706.03762")
print(text[:200])

CLI Usage

# Single paper
arxivparse 1706.03762

# Multiple papers (sequential)
arxivparse 1706.03762 2301.07041

# Custom output path
arxivparse -o output.txt 1706.03762

# Custom output directory
arxivparse -d ./papers 1706.03762 2301.07041

# Verbose output (download, convert, extract steps)
arxivparse -v 1706.03762

# Keep temp files for debugging
arxivparse --keep-temp 1706.03762

Each paper produces an <arxiv_id>.txt file.

Library Usage

Simple: get text as a string

from arxivparse import arxiv_to_text

text = arxiv_to_text("1706.03762")

Full pipeline control

from arxivparse.pipeline import convert_arxiv_to_text
from arxivparse.errors import Arxiv2TextError

try:
    output_path = convert_arxiv_to_text("1706.03762")
    text = output_path.read_text(encoding="utf-8")
except Arxiv2TextError as e:
    print(f"Failed: {e}")

Call the CLI from code

from main import main

main(["1706.03762", "2301.07041"])
main(["-o", "output.txt", "1706.03762"])
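
When the argument list depends on runtime options, it can be assembled first and then handed to main. The sketch below uses only the flags shown under CLI Usage (-d, -v, --keep-temp); the helper name build_cli_args is hypothetical and is not part of arxivparse.

```python
from pathlib import Path
from typing import Optional, Sequence


def build_cli_args(
    arxiv_ids: Sequence[str],
    output_dir: Optional[Path] = None,
    verbose: bool = False,
    keep_temp: bool = False,
) -> list[str]:
    """Assemble an argument list using only the flags shown under CLI Usage."""
    if not arxiv_ids:
        raise ValueError("at least one arXiv ID is required")
    args: list[str] = []
    if output_dir is not None:
        args += ["-d", str(output_dir)]
    if verbose:
        args.append("-v")
    if keep_temp:
        args.append("--keep-temp")
    # Flags go first and IDs last, matching the order used in the CLI examples.
    args.extend(arxiv_ids)
    return args


print(build_cli_args(["1706.03762", "2301.07041"], output_dir=Path("papers")))
# ['-d', 'papers', '1706.03762', '2301.07041']
```

The returned list has the same shape as the lists passed to main in the examples above.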

Arguments

Param        Type  Default         Description
arxiv_id     str   -               arXiv ID (e.g. "1706.03762")
output_path  Path  <arxiv_id>.txt  Where to write the output
keep_temp    bool  False           Keep temp files after conversion
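
To show how these parameters fit together, here is a hedged sketch of a batch wrapper. It assumes the table describes convert_arxiv_to_text, and that the function accepts output_path and keep_temp as keyword arguments and returns the output path. The wrapper name convert_many is hypothetical. The converter is passed in as a parameter, so the sketch itself does not depend on those assumptions.

```python
from pathlib import Path
from typing import Callable, Iterable


def convert_many(
    arxiv_ids: Iterable[str],
    convert: Callable[..., Path],
    out_dir: Path,
    keep_temp: bool = False,
) -> tuple[dict[str, Path], dict[str, Exception]]:
    """Convert each ID into out_dir/<arxiv_id>.txt, collecting failures.

    Returns (successes, failures), both keyed by arXiv ID.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    successes: dict[str, Path] = {}
    failures: dict[str, Exception] = {}
    for arxiv_id in arxiv_ids:
        target = out_dir / f"{arxiv_id}.txt"
        try:
            # Assumed call shape, based on the parameter table above.
            successes[arxiv_id] = convert(
                arxiv_id, output_path=target, keep_temp=keep_temp
            )
        except Exception as exc:  # narrow to the package's error type in real use
            failures[arxiv_id] = exc
    return successes, failures
```

With the package installed, a call might look like convert_many(["1706.03762", "2301.07041"], convert_arxiv_to_text, Path("papers")). One failing paper is recorded in failures and does not stop the rest of the batch.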

Error Handling

from arxivparse.errors import (
    DownloadError,         # network/HTTP failure
    NoLatexSourceError,    # paper is PDF-only
    ConversionError,       # latexml failed
    MainTexNotFoundError,  # no .tex file found in bundle
)
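
One way these exceptions might be used is to turn a failure into a short message for logs. The sketch below is illustrative only: the helper explain_failure is hypothetical, the messages restate the comments above, and the check order assumes none of these classes subclasses another. Stand-in classes are defined when arxivparse is not installed so the example still runs.

```python
try:
    from arxivparse.errors import (
        ConversionError,
        DownloadError,
        MainTexNotFoundError,
        NoLatexSourceError,
    )
except ImportError:
    # Stand-ins so the sketch runs without arxivparse installed (illustration only).
    class DownloadError(Exception):
        pass

    class NoLatexSourceError(Exception):
        pass

    class ConversionError(Exception):
        pass

    class MainTexNotFoundError(Exception):
        pass


# Checked in order; assumes no class in this list is a subclass of another.
_MESSAGES = [
    (NoLatexSourceError, "paper is PDF-only; there is no LaTeX source to convert"),
    (MainTexNotFoundError, "no .tex file found in the source bundle"),
    (ConversionError, "latexml failed; --keep-temp keeps temp files for debugging"),
    (DownloadError, "network/HTTP failure while downloading the source"),
]


def explain_failure(exc: BaseException) -> str:
    """Map a known arxivparse error to a one-line explanation."""
    for error_type, message in _MESSAGES:
        if isinstance(exc, error_type):
            return message
    return f"unexpected error: {exc}"
```

A caller could wrap arxiv_to_text in try/except and log explain_failure(exc) for each paper that fails.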

Build / Publish

# Build a wheel
uv build

# Publish to PyPI
uv publish

# Or with twine
twine upload dist/*

Output Format

QASPER-style plain text: title, abstract, section headings, body paragraphs, inline math (LaTeX notation like h_{t}), and figure/table captions. No bibliography entries, footnotes, author affiliations, or citation numbers.

Download files

Source Distribution

arxivparse-0.1.3.tar.gz (8.5 kB view details)

Uploaded Source

Built Distribution

arxivparse-0.1.3-py3-none-any.whl (12.2 kB view details)

Uploaded Python 3

File details

Details for the file arxivparse-0.1.3.tar.gz.

File metadata

  • Download URL: arxivparse-0.1.3.tar.gz
  • Upload date:
  • Size: 8.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arxivparse-0.1.3.tar.gz
Algorithm Hash digest
SHA256 f5e0d9d60660ea9fdb62dac80c5a837706538ab5081428a261efcc3176a923c9
MD5 8792b7333dc10fdba7065b990987454b
BLAKE2b-256 eb89989370bc7f2e8d9b07d932b761ce7fbf252c57fe4ca329ab3c39ca56549d

Provenance

The following attestation bundles were made for arxivparse-0.1.3.tar.gz:

Publisher: python-publish.yml on flrjrf/arxivparse

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file arxivparse-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: arxivparse-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 12.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arxivparse-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 0bbd077e4f1db2932a0e54271793c77910cff1ad7d4744370ee8cbf006c55c69
MD5 cad1577164d37cf9b667244821af3521
BLAKE2b-256 4a95bce4ae630485e17f7678ba2f20cb95313d14d8fb32fca2f92b74e5cd18ce

Provenance

The following attestation bundles were made for arxivparse-0.1.3-py3-none-any.whl:

Publisher: python-publish.yml on flrjrf/arxivparse

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
