Paper Siphon

Extract clean Markdown from academic PDFs - like drinking through a straw.

PDFs of academic papers come with extraction artifacts: awkward page breaks, mangled tables, and even margin line numbers. Paper Siphon filters them out, leaving you with clean, readable Markdown.

paper-siphon paper.pdf

That's it. The converted paper is written to paper.md.


Features

  • Smart whitespace - Collapses excessive blank lines, normalizes spacing
  • Table preservation - Keeps your data tables intact and formatted
  • Formula support - Optional enrichment for mathematical expressions
  • Line number removal - Automatically strips the margin numbers (when present)
  • VLM pipeline - Use vision-language models for complex layouts
  • Apple Silicon acceleration - MLX support for fast processing on M-series Macs

Installation

# With uv (recommended)
uv pip install paper-siphon

# With pip
pip install paper-siphon

For Apple Silicon acceleration (optional):

uv pip install "paper-siphon[mlx]"        # Quotes keep zsh from globbing the brackets

Usage

Quick start (no install)

uvx paper-siphon paper.pdf                # Run directly with uvx

Basic

paper-siphon paper.pdf                    # Creates paper.md
paper-siphon paper.pdf -o notes.md        # Custom output path

From URL (including arXiv)

paper-siphon https://arxiv.org/pdf/1706.03762.pdf

Tip: For arXiv papers, just change /abs/ to /pdf/ in the URL:

https://arxiv.org/abs/1706.03762  →  https://arxiv.org/pdf/1706.03762.pdf

(That's "Attention Is All You Need" - the Transformer paper)
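If you script conversions, the /abs/ to /pdf/ rewrite from the tip above can be automated. The sketch below is only an illustration: `arxiv_pdf_url` is a hypothetical helper, not part of Paper Siphon, and it assumes the URL shapes shown in the tip.

```python
from urllib.parse import urlsplit


def arxiv_pdf_url(url: str) -> str:
    """Turn an arXiv abstract URL into the matching PDF URL.

    Hypothetical helper: anything that is not an arxiv.org /abs/ link
    is returned unchanged.
    """
    parts = urlsplit(url)
    if parts.netloc not in ("arxiv.org", "www.arxiv.org"):
        return url
    if not parts.path.startswith("/abs/"):
        return url

    # Swap the /abs/ prefix for /pdf/ and make sure the path ends in .pdf.
    path = "/pdf/" + parts.path[len("/abs/"):]
    if not path.endswith(".pdf"):
        path += ".pdf"
    return parts._replace(path=path).geturl()


print(arxiv_pdf_url("https://arxiv.org/abs/1706.03762"))
```

The result can be passed to paper-siphon like any other URL.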

Advanced

paper-siphon --vlm paper.pdf              # Use VLM for complex layouts
paper-siphon --enrich-formula paper.pdf   # Enable formula enrichment
paper-siphon --no-mlx --vlm paper.pdf     # VLM without MLX acceleration
paper-siphon -v paper.pdf                 # Verbose logging

How It Works

Paper Siphon uses Docling for PDF parsing, then applies post-processing to clean up common academic paper artifacts:

  1. PDF parsing - Extracts structure, text, and tables
  2. Line number filtering - Removes standalone 1-4 digit numbers (common in journal formats)
  3. Whitespace normalization - Collapses multiple blank lines
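The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not Paper Siphon's actual code: the function names are hypothetical, the regular expressions are an assumption about what "standalone 1-4 digit numbers" and "multiple blank lines" mean, and the Docling call follows Docling's documented quickstart (`DocumentConverter().convert(...)` followed by `export_to_markdown()`).

```python
import re

# Assumption: a "line number" is a line holding nothing but 1-4 digits.
LINE_NUMBER = re.compile(r"^\s*\d{1,4}\s*$")
# Three or more newlines in a row means at least two consecutive blank lines.
BLANK_RUN = re.compile(r"\n{3,}")


def strip_line_numbers(markdown: str) -> str:
    """Drop every line that consists only of a 1-4 digit number."""
    kept = [line for line in markdown.split("\n") if not LINE_NUMBER.match(line)]
    return "\n".join(kept)


def normalize_whitespace(markdown: str) -> str:
    """Trim trailing spaces and collapse runs of blank lines to one."""
    text = "\n".join(line.rstrip() for line in markdown.split("\n"))
    return BLANK_RUN.sub("\n\n", text).strip() + "\n"


def clean(markdown: str) -> str:
    """Apply steps 2 and 3 to already-extracted Markdown."""
    return normalize_whitespace(strip_line_numbers(markdown))


def pdf_to_markdown(source: str) -> str:
    """Step 1 via Docling, then the clean-up (requires docling installed)."""
    # Imported lazily so the clean-up helpers work without Docling.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(source)
    return clean(result.document.export_to_markdown())


raw = "Intro text\n12\n\n\n\nMore text in 2024 here\n 345 \nEnd\n"
print(clean(raw))
```

Note that a filter this simple also drops legitimate lines that happen to contain only a short number, which is one reason to treat it as a sketch rather than a drop-in replacement.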

Options

Flag               Description
-o, --output       Output file path (default: input with .md extension)
--vlm              Use VLM pipeline for complex layouts
--mlx/--no-mlx     Toggle MLX acceleration (Apple Silicon, default: on)
--enrich-formula   Enable formula enrichment (slow, CPU-bound)
-v, --verbose      Enable debug logging
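For batch jobs, the flags above can be assembled programmatically. The following is a rough sketch that shells out to the paper-siphon command; `build_command` and `convert_all` are hypothetical helpers, and the sketch assumes the CLI is on your PATH and accepts flags before the input path and -o after it, as in the usage examples.

```python
import subprocess
from pathlib import Path


def build_command(pdf, output=None, vlm=False, mlx=True,
                  enrich_formula=False, verbose=False):
    """Assemble a paper-siphon argument list from the documented flags."""
    cmd = ["paper-siphon"]
    if vlm:
        cmd.append("--vlm")
    if not mlx:
        # MLX is on by default, so only the opt-out needs to be passed.
        cmd.append("--no-mlx")
    if enrich_formula:
        cmd.append("--enrich-formula")
    if verbose:
        cmd.append("-v")
    cmd.append(str(pdf))
    if output is not None:
        cmd += ["-o", str(output)]
    return cmd


def convert_all(folder, **options):
    """Convert every PDF in `folder`, writing each .md next to its PDF."""
    for pdf in sorted(Path(folder).glob("*.pdf")):
        # Mirrors the documented default: same name, .md extension.
        out = pdf.with_suffix(".md")
        subprocess.run(build_command(pdf, output=out, **options), check=True)


print(build_command("paper.pdf", output="notes.md", vlm=True, mlx=False))
```

Calling `convert_all("papers", vlm=True)` would then run one conversion per PDF and stop at the first failure because of `check=True`.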

Development

# Clone and install
git clone https://github.com/mrshu/paper-siphon.git
cd paper-siphon
uv sync --dev

# Run tests
uv run pytest

# Run tests with coverage
uv run pytest --cov=paper_siphon

License

MIT


Stop wrestling with PDFs. Just siphon the good stuff.
