Convert arXiv papers to plain text using LaTeXML

Project description

arxivparse

Convert arXiv papers to clean plain text using LaTeXML. Downloads LaTeX source from arXiv, converts to XML, and extracts body text (title, abstract, sections, math, captions) — no HTML intermediary, no bibliography, no footnotes.

Prerequisites

  • Python >= 3.13
  • uv (or pip)
  • LaTeXML (v0.8.x) — install via Homebrew:
brew install latexml

Verify it's on your PATH:

latexml --VERSION

Install

# From PyPI
pip install arxivparse

# Or from source
git clone <repo-url> arxivparse
cd arxivparse
uv sync

Quick Start

from arxivparse import arxiv_to_text

text = arxiv_to_text("1706.03762")
print(text[:200])

CLI Usage

# Single paper
arxivparse 1706.03762

# Multiple papers (sequential)
arxivparse 1706.03762 2301.07041

# Custom output path
arxivparse -o output.txt 1706.03762

# Custom output directory
arxivparse -d ./papers 1706.03762 2301.07041

# Verbose output (download, convert, extract steps)
arxivparse -v 1706.03762

# Keep temp files for debugging
arxivparse --keep-temp 1706.03762

Each paper produces an <arxiv_id>.txt file.

Library Usage

Simple: get text as a string

from arxivparse import arxiv_to_text

text = arxiv_to_text("1706.03762")

Full pipeline control

from arxivparse.pipeline import convert_arxiv_to_text
from arxivparse.errors import Arxiv2TextError

try:
    output_path = convert_arxiv_to_text("1706.03762")
    text = output_path.read_text(encoding="utf-8")
except Arxiv2TextError as e:
    print(f"Failed: {e}")
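When converting several papers, the try/except pattern above can be wrapped in a loop so that one failure does not stop the rest. The sketch below takes the converter and the exception type as parameters, so it makes no assumptions about arxivparse internals; `convert_many` is a hypothetical helper, not a function the package provides.

```python
from collections.abc import Callable, Iterable
from pathlib import Path


def convert_many(
    arxiv_ids: Iterable[str],
    convert: Callable[[str], Path],
    expected_error: type[Exception] = Exception,
) -> tuple[dict[str, Path], dict[str, str]]:
    """Convert each ID, collecting successes and failures separately."""
    succeeded: dict[str, Path] = {}
    failed: dict[str, str] = {}
    for arxiv_id in arxiv_ids:
        try:
            succeeded[arxiv_id] = convert(arxiv_id)
        except expected_error as exc:
            # Record the message and carry on with the remaining papers.
            failed[arxiv_id] = str(exc)
    return succeeded, failed
```

With the package installed, a call might look like `convert_many(["1706.03762", "2301.07041"], convert_arxiv_to_text, Arxiv2TextError)`. This assumes `Arxiv2TextError` is a common base class for the package's errors, which the text above suggests but does not state.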

Call the CLI from code

from main import main

main(["1706.03762", "2301.07041"])
main(["-o", "output.txt", "1706.03762"])

Arguments

Param        Type  Default         Description
arxiv_id     str   (none)          arXiv ID (e.g. "1706.03762")
output_path  Path  <arxiv_id>.txt  Where to write the output
keep_temp    bool  False           Keep temp files after conversion
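Since `arxiv_id` is a plain string, it can help to check its shape before starting a download. The sketch below matches only the new-style arXiv identifier scheme (YYMM followed by a 4- or 5-digit number, with an optional version suffix such as v2). Whether arxivparse accepts version suffixes or old-style IDs like hep-th/9901001 is not stated here, so treat this as an illustrative pre-check; `is_new_style_arxiv_id` is a hypothetical helper.

```python
import re

# YYMM.NNNN or YYMM.NNNNN, optionally followed by a version such as v2.
_NEW_STYLE_ID = re.compile(r"(\d{2})(\d{2})\.(\d{4,5})(v\d+)?")


def is_new_style_arxiv_id(candidate: str) -> bool:
    """Return True if the string looks like a new-style arXiv identifier."""
    match = _NEW_STYLE_ID.fullmatch(candidate.strip())
    if match is None:
        return False
    # The second pair of digits is a month, so it must be 01 through 12.
    month = int(match.group(2))
    return 1 <= month <= 12
```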

Error Handling

from arxivparse.errors import (
    DownloadError,         # network/HTTP failure
    NoLatexSourceError,    # paper is PDF-only
    ConversionError,       # latexml failed
    MainTexNotFoundError,  # no .tex file found in bundle
)
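One way to use these classes is to turn each failure into a short, actionable message. The sketch below looks exceptions up by class name so that it runs without the package installed; in real code, separate `except` clauses on the imported classes would be more idiomatic. The hint texts paraphrase the comments above, and `describe_failure` is a hypothetical helper.

```python
# Hints keyed by the exception class names listed above.
_HINTS = {
    "DownloadError": "network or HTTP failure; check connectivity and retry",
    "NoLatexSourceError": "the paper is PDF-only, so there is no LaTeX to convert",
    "ConversionError": "latexml failed; rerun with verbose output to see why",
    "MainTexNotFoundError": "no .tex file was found in the source bundle",
}


def describe_failure(exc: Exception) -> str:
    """Return a one-line description of a conversion failure."""
    # Walk the class hierarchy so subclasses of a known error also match.
    for cls in type(exc).__mro__:
        hint = _HINTS.get(cls.__name__)
        if hint is not None:
            return f"{cls.__name__}: {hint}"
    # Unknown error types fall back to their own message.
    return f"{type(exc).__name__}: {exc}"
```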

Build / Publish

# Build a wheel
uv build

# Publish to PyPI
uv publish

# Or with twine
twine upload dist/*

Output Format

QASPER-style plain text: title, abstract, section headings, body paragraphs, inline math (LaTeX notation like h_{t}), and figure/table captions. No bibliography entries, footnotes, author affiliations, or citation numbers.

Download files

Source Distribution

arxivparse-0.1.2.tar.gz (8.4 kB)

Uploaded Source

Built Distribution

arxivparse-0.1.2-py3-none-any.whl (11.9 kB)

Uploaded Python 3

File details

Details for the file arxivparse-0.1.2.tar.gz.

File metadata

  • Download URL: arxivparse-0.1.2.tar.gz
  • Size: 8.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arxivparse-0.1.2.tar.gz
Algorithm Hash digest
SHA256 b51dee1c4f2b369a91dd8176802c1d58196dc0dc881af4f73e62cbeeb430dd45
MD5 73ebef27ecfb08f0f4c99e712bfb9760
BLAKE2b-256 91fc7711b441611aa987f7fb3619d975c68e542f0e403f8d89611b915863a012
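A downloaded file can be checked against the published SHA256 digest with the standard library. This is a minimal sketch; `sha256_of` and `matches_published_hash` are illustrative names, and the file path is whatever location the archive was saved to.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with a published value, ignoring case and spaces."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution, the expected value is the SHA256 row above, for example `matches_published_hash(Path("arxivparse-0.1.2.tar.gz"), "b51dee1c4f2b369a91dd8176802c1d58196dc0dc881af4f73e62cbeeb430dd45")`.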

Provenance

The following attestation bundles were made for arxivparse-0.1.2.tar.gz:

Publisher: python-publish.yml on flrjrf/arxivparse

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file arxivparse-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: arxivparse-0.1.2-py3-none-any.whl
  • Size: 11.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arxivparse-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 1b9ecc4578a557c0e06fbaf72ee605fbd9cd38d86384a79bb7117aa66a0fcb8e
MD5 67f3300c682848fc948db47b59cc5867
BLAKE2b-256 46b954d581bf4a6bd92a64c5b4052ecac4015590f8916a2e95832a13d4a381e3

Provenance

The following attestation bundles were made for arxivparse-0.1.2-py3-none-any.whl:

Publisher: python-publish.yml on flrjrf/arxivparse

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.