Paper Siphon

Extract clean Markdown from academic PDFs - like drinking through a straw.

Academic papers come with artifacts: awkward page breaks, mangled tables, or even line numbers. Paper Siphon filters them out, leaving you with clean, readable Markdown.

paper-siphon paper.pdf

That's it. Your paper is now paper.md.


Features

  • Smart whitespace - Collapses excessive blank lines, normalizes spacing
  • Table preservation - Keeps your data tables intact and formatted
  • Formula support - Optional enrichment for mathematical expressions
  • Line number removal - Automatically strips the margin numbers (when present)
  • VLM pipeline - Use vision-language models for complex layouts
  • Apple Silicon acceleration - MLX support for fast processing on M-series Macs

Installation

# With uv (recommended)
uv pip install paper-siphon

# With pip
pip install paper-siphon

For Apple Silicon acceleration (optional):

uv pip install "paper-siphon[mlx]"        # Quotes stop zsh from globbing the brackets

Usage

Quick start (no install)

uvx paper-siphon paper.pdf                # Run directly with uvx

Basic

paper-siphon paper.pdf                    # Creates paper.md
paper-siphon paper.pdf -o notes.md        # Custom output path

From URL (including arXiv)

paper-siphon https://arxiv.org/pdf/1706.03762.pdf

Tip: For arXiv papers, change /abs/ to /pdf/ in the URL and append .pdf:

https://arxiv.org/abs/1706.03762  →  https://arxiv.org/pdf/1706.03762.pdf

(That's "Attention Is All You Need" - the Transformer paper)
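The URL tip above is easy to script. The following is a minimal sketch, not part of Paper Siphon itself: the helper name `arxiv_abs_to_pdf` is hypothetical, and it assumes the abstract URL has the plain `/abs/<id>` shape shown in the example.

```python
from urllib.parse import urlsplit, urlunsplit


def arxiv_abs_to_pdf(url: str) -> str:
    """Rewrite an arXiv abstract URL to its PDF URL.

    Hypothetical helper for illustration; URLs that are not
    arXiv /abs/ links are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.netloc not in {"arxiv.org", "www.arxiv.org"}:
        return url
    if not parts.path.startswith("/abs/"):
        return url
    # Everything after /abs/ is the paper identifier, e.g. 1706.03762
    paper_id = parts.path[len("/abs/"):].rstrip("/")
    # Query string and fragment are dropped; the PDF URL does not need them
    return urlunsplit((parts.scheme, parts.netloc, f"/pdf/{paper_id}.pdf", "", ""))


if __name__ == "__main__":
    print(arxiv_abs_to_pdf("https://arxiv.org/abs/1706.03762"))
```

The resulting URL can be passed to paper-siphon exactly like the one in the example above.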

Advanced

paper-siphon --vlm paper.pdf              # Use VLM for complex layouts
paper-siphon --enrich-formula paper.pdf   # Enable formula enrichment
paper-siphon --no-mlx --vlm paper.pdf     # VLM without MLX acceleration
paper-siphon -v paper.pdf                 # Verbose logging

How It Works

Paper Siphon uses Docling for PDF parsing, then applies post-processing to clean up common academic paper artifacts:

  1. PDF parsing - Extracts structure, text, and tables
  2. Line number filtering - Removes standalone 1-4 digit numbers (common in journal formats)
  3. Whitespace normalization - Collapses multiple blank lines
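Steps 2 and 3 above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Paper Siphon's actual implementation: the function name `clean_markdown` and both regular expressions are hypothetical, and "standalone" is taken to mean a line that contains nothing but 1-4 digits.

```python
import re

# Assumption: a line number is a line holding only 1-4 digits,
# optionally surrounded by whitespace.
LINE_NUMBER = re.compile(r"^\s*\d{1,4}\s*$")
# Three or more consecutive newlines mean two or more blank lines.
BLANK_RUN = re.compile(r"\n{3,}")


def clean_markdown(text: str) -> str:
    """Drop standalone 1-4 digit lines, then collapse runs of blank lines."""
    kept = [line for line in text.splitlines() if not LINE_NUMBER.match(line)]
    collapsed = BLANK_RUN.sub("\n\n", "\n".join(kept))
    # Normalize to exactly one trailing newline
    return collapsed.strip() + "\n"


if __name__ == "__main__":
    raw = "Intro\n12\ntext 12 here\n\n\n\n345\nEnd\n"
    print(clean_markdown(raw))
```

A filter like this only touches lines that are entirely numeric, so numbers inside sentences or table rows survive. The trade-off is that a legitimate value sitting alone on a line, such as a year, would also be removed.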

Options

Flag               Description
-o, --output       Output file path (default: input with .md extension)
--vlm              Use VLM pipeline for complex layouts
--mlx/--no-mlx     Toggle MLX acceleration (Apple Silicon, default: on)
--enrich-formula   Enable formula enrichment (slow, CPU-bound)
-v, --verbose      Enable debug logging
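For batch jobs, the flags above can be assembled from a script. The sketch below is a hypothetical wrapper around the CLI, not an API that Paper Siphon provides: `default_output`, `build_command`, and `convert_folder` are illustrative names, only the flags listed in the table are used, and `default_output` merely mirrors the documented default for local files.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def default_output(pdf: Path) -> Path:
    """Mirror the documented default: the input path with a .md extension."""
    return pdf.with_suffix(".md")


def build_command(
    pdf: Path,
    output: Optional[Path] = None,
    vlm: bool = False,
    mlx: bool = True,
    enrich_formula: bool = False,
    verbose: bool = False,
) -> List[str]:
    """Build a paper-siphon argument list from the documented flags."""
    cmd = ["paper-siphon"]
    if vlm:
        cmd.append("--vlm")
    if not mlx:
        cmd.append("--no-mlx")
    if enrich_formula:
        cmd.append("--enrich-formula")
    if verbose:
        cmd.append("-v")
    cmd.append(str(pdf))
    if output is not None:
        cmd += ["-o", str(output)]
    return cmd


def convert_folder(folder: Path) -> None:
    """Run paper-siphon on every PDF in a folder (assumes the CLI is on PATH)."""
    for pdf in sorted(folder.glob("*.pdf")):
        subprocess.run(build_command(pdf), check=True)


if __name__ == "__main__":
    print(build_command(Path("paper.pdf"), output=Path("notes.md"), vlm=True))
```

Keeping command construction separate from execution makes the argument list easy to inspect or log before any conversion starts.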

Development

# Clone and install
git clone https://github.com/mrshu/paper-siphon.git
cd paper-siphon
uv sync --dev

# Run tests
uv run pytest

# Run tests with coverage
uv run pytest --cov=paper_siphon

License

MIT


Stop wrestling with PDFs. Just siphon the good stuff.
