Paper Siphon

Extract clean Markdown from academic PDFs - like drinking through a straw.

Academic PDFs come with artifacts: awkward page breaks, mangled tables, even margin line numbers. Paper Siphon filters them out and leaves you with clean, readable Markdown.

paper-siphon paper.pdf

That's it. Your paper is now paper.md.


Features

  • Smart whitespace - Collapses excessive blank lines, normalizes spacing
  • Table preservation - Keeps your data tables intact and formatted
  • Formula support - Optional enrichment for mathematical expressions
  • Line number removal - Automatically strips the margin numbers (when present)
  • VLM pipeline - Use vision-language models for complex layouts
  • Apple Silicon acceleration - MLX support for fast processing on M-series Macs

Installation

# With uv (recommended)
uv pip install paper-siphon

# With pip
pip install paper-siphon

For Apple Silicon acceleration (optional):

uv pip install "paper-siphon[mlx]"        # Quotes keep zsh from globbing the brackets

Usage

Quick start (no install)

uvx paper-siphon paper.pdf                # Run directly with uvx

Basic

paper-siphon paper.pdf                    # Creates paper.md
paper-siphon paper.pdf -o notes.md        # Custom output path

Advanced

paper-siphon --vlm paper.pdf              # Use VLM for complex layouts
paper-siphon --enrich-formula paper.pdf   # Enable formula enrichment
paper-siphon --no-mlx --vlm paper.pdf     # VLM without MLX acceleration
paper-siphon -v paper.pdf                 # Verbose logging
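To convert a whole folder of papers, the commands above can be driven from a short script. The following is a minimal sketch, not part of Paper Siphon: `build_command` and `convert_all` are hypothetical helper names, and the script assumes `paper-siphon` is installed and on your PATH. It only uses flags listed in this README.

```python
import subprocess
from pathlib import Path


def build_command(pdf, output_dir=None, vlm=False, enrich_formula=False):
    """Build the argv list for one paper-siphon run (hypothetical helper)."""
    pdf = Path(pdf)
    cmd = ["paper-siphon"]
    if vlm:
        cmd.append("--vlm")
    if enrich_formula:
        cmd.append("--enrich-formula")
    if output_dir is not None:
        # Write <output_dir>/<name>.md instead of placing it next to the PDF.
        cmd += ["-o", str(Path(output_dir) / (pdf.stem + ".md"))]
    cmd.append(str(pdf))
    return cmd


def convert_all(folder, **options):
    """Run paper-siphon once per PDF in `folder`, stopping on the first failure."""
    for pdf in sorted(Path(folder).glob("*.pdf")):
        subprocess.run(build_command(pdf, **options), check=True)


if __name__ == "__main__":
    convert_all("papers", vlm=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.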

How It Works

Paper Siphon uses Docling for PDF parsing, then applies post-processing to clean up common academic paper artifacts:

  1. PDF parsing - Extracts structure, text, and tables
  2. Line number filtering - Removes standalone 1-4 digit numbers (common in journal formats)
  3. Whitespace normalization - Collapses multiple blank lines
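The three steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Paper Siphon's actual source: the function names and the exact regular expressions are hypothetical. The parsing step uses Docling's `DocumentConverter`, imported lazily so the cleanup helpers work without Docling installed.

```python
import re

# Assumption: a "line number" is a line holding nothing but 1-4 digits.
LINE_NUMBER = re.compile(r"^\s*\d{1,4}\s*$")
# Three or more consecutive newlines, i.e. two or more blank lines in a row.
EXTRA_BLANK_LINES = re.compile(r"\n{3,}")


def strip_line_numbers(markdown: str) -> str:
    """Drop lines that consist solely of a 1-4 digit number."""
    kept = [line for line in markdown.split("\n") if not LINE_NUMBER.match(line)]
    return "\n".join(kept)


def normalize_whitespace(markdown: str) -> str:
    """Remove trailing spaces and collapse runs of blank lines to one."""
    text = re.sub(r"[ \t]+\n", "\n", markdown)
    return EXTRA_BLANK_LINES.sub("\n\n", text).strip() + "\n"


def clean(markdown: str) -> str:
    """Apply line number filtering, then whitespace normalization."""
    return normalize_whitespace(strip_line_numbers(markdown))


def pdf_to_markdown(path: str) -> str:
    """Parse a PDF with Docling, then clean the exported Markdown."""
    from docling.document_converter import DocumentConverter  # optional dependency

    result = DocumentConverter().convert(path)
    return clean(result.document.export_to_markdown())
```

A filter like this cannot tell a margin line number from a legitimate number that sits alone on a line, so values such as a lone figure in a list would also be removed.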

Options

| Flag | Description |
|------|-------------|
| -o, --output | Output file path (default: input with .md extension) |
| --vlm | Use VLM pipeline for complex layouts |
| --mlx/--no-mlx | Toggle MLX acceleration (Apple Silicon, default: on) |
| --enrich-formula | Enable formula enrichment (slow, CPU-bound) |
| -v, --verbose | Enable debug logging |
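To show how these flags and defaults fit together, here is a sketch of an equivalent command-line parser built with Python's standard `argparse`. It is an assumption for illustration only: the README does not say which CLI library Paper Siphon uses, and `build_parser` and `parse_args` are hypothetical names.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Mirror the documented flags (illustrative, not the real CLI code)."""
    parser = argparse.ArgumentParser(prog="paper-siphon")
    parser.add_argument("pdf", type=Path, help="Input PDF file")
    parser.add_argument("-o", "--output", type=Path, default=None,
                        help="Output file path (default: input with .md extension)")
    parser.add_argument("--vlm", action="store_true",
                        help="Use VLM pipeline for complex layouts")
    # BooleanOptionalAction (Python 3.9+) generates both --mlx and --no-mlx.
    parser.add_argument("--mlx", action=argparse.BooleanOptionalAction, default=True,
                        help="Toggle MLX acceleration (Apple Silicon)")
    parser.add_argument("--enrich-formula", action="store_true",
                        help="Enable formula enrichment (slow, CPU-bound)")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Enable debug logging")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    """Parse argv and fill in the default output path next to the input."""
    args = build_parser().parse_args(argv)
    if args.output is None:
        args.output = args.pdf.with_suffix(".md")
    return args
```

For example, `parse_args(["paper.pdf"])` yields an output path of `paper.md` with MLX enabled, matching the defaults in the table.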

Development

# Clone and install
git clone https://github.com/mrshu/paper-siphon.git
cd paper-siphon
uv sync --dev

# Run tests
uv run pytest

# Run tests with coverage
uv run pytest --cov=paper_siphon

License

MIT


Stop wrestling with PDFs. Just siphon the good stuff.


