Paper Siphon
Extract clean Markdown from academic PDFs - like drinking through a straw.
Academic papers come with artifacts: awkward page breaks, mangled tables, or even line numbers. Paper Siphon filters them out, leaving you with clean, readable Markdown.
paper-siphon paper.pdf
That's it. Your paper is now paper.md.
Features
- Smart whitespace - Collapses excessive blank lines, normalizes spacing
- Table preservation - Keeps your data tables intact and formatted
- Formula support - Optional enrichment for mathematical expressions
- Line number removal - Automatically strips the margin numbers (when present)
- VLM pipeline - Use vision-language models for complex layouts
- Apple Silicon acceleration - MLX support for fast processing on M-series Macs
Installation
# With uv (recommended)
uv pip install paper-siphon
# With pip
pip install paper-siphon
For Apple Silicon acceleration (optional):
uv pip install "paper-siphon[mlx]" # Quotes keep zsh from globbing the brackets
Usage
Quick start (no install)
uvx paper-siphon paper.pdf # Run directly with uvx
Basic
paper-siphon paper.pdf # Creates paper.md
paper-siphon paper.pdf -o notes.md # Custom output path
Advanced
paper-siphon --vlm paper.pdf # Use VLM for complex layouts
paper-siphon --enrich-formula paper.pdf # Enable formula enrichment
paper-siphon --no-mlx --vlm paper.pdf # VLM without MLX acceleration
paper-siphon -v paper.pdf # Verbose logging
How It Works
Paper Siphon uses Docling for PDF parsing, then applies post-processing to clean up common academic paper artifacts:
- PDF parsing - Extracts structure, text, and tables
- Line number filtering - Removes standalone 1-4 digit numbers (common in journal formats)
- Whitespace normalization - Collapses multiple blank lines
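The two post-processing steps above can be sketched in a few lines of Python. This is an illustrative approximation, not Paper Siphon's actual source: the name `clean_markdown` and both regular expressions are assumptions chosen to match the description (standalone 1-4 digit lines removed, runs of blank lines collapsed to one).

```python
import re

# Assumption: a "line number" is a line containing nothing but 1-4 digits.
LINE_NUMBER = re.compile(r"^\s*\d{1,4}\s*$")
# Three or more consecutive newlines means two or more blank lines in a row.
BLANK_RUN = re.compile(r"\n{3,}")


def clean_markdown(text: str) -> str:
    """Drop standalone 1-4 digit lines, then collapse runs of blank lines."""
    kept = [
        line.rstrip()
        for line in text.splitlines()
        if not LINE_NUMBER.match(line)
    ]
    collapsed = BLANK_RUN.sub("\n\n", "\n".join(kept))
    return collapsed.strip() + "\n"


sample = "# Results\n12\nAccuracy improved in 2024.\n\n\n\n13\nSee Table 2.\n"
print(clean_markdown(sample))
```

Numbers inside sentences or table rows survive because the pattern only matches whole lines. The trade-off of a filter like this is that a legitimate line consisting solely of a short number (for example a year on its own line) would also be removed.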
Options
| Flag | Description |
|---|---|
| -o, --output | Output file path (default: input with .md extension) |
| --vlm | Use VLM pipeline for complex layouts |
| --mlx/--no-mlx | Toggle MLX acceleration (Apple Silicon, default: on) |
| --enrich-formula | Enable formula enrichment (slow, CPU-bound) |
| -v, --verbose | Enable debug logging |
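To make the defaults in the table concrete, here is a minimal sketch of an equivalent command-line interface built with the standard library's `argparse`. It is a hypothetical stand-in: the real tool may be built with a different CLI library, and the names `build_parser` and `parse` are invented for this example. Only the flags and defaults come from the table above.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(prog="paper-siphon")
    parser.add_argument("pdf", type=Path, help="Input PDF")
    parser.add_argument("-o", "--output", type=Path, default=None,
                        help="Output file path")
    parser.add_argument("--vlm", action="store_true",
                        help="Use VLM pipeline for complex layouts")
    # BooleanOptionalAction (Python 3.9+) generates both --mlx and --no-mlx.
    parser.add_argument("--mlx", action=argparse.BooleanOptionalAction,
                        default=True, help="Toggle MLX acceleration")
    parser.add_argument("--enrich-formula", action="store_true",
                        help="Enable formula enrichment")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Enable debug logging")
    return parser


def parse(argv: list[str]) -> argparse.Namespace:
    args = build_parser().parse_args(argv)
    if args.output is None:
        # Documented default: the input path with a .md extension.
        args.output = args.pdf.with_suffix(".md")
    return args


args = parse(["--no-mlx", "--vlm", "paper.pdf"])
print(args.output, args.vlm, args.mlx)  # paper.md True False
```

Leaving `--output` as `None` until after parsing keeps the default tied to whatever input path was given, so `paper-siphon paper.pdf` resolves to `paper.md` while `-o notes.md` overrides it.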
Development
# Clone and install
git clone https://github.com/mrshu/paper-siphon.git
cd paper-siphon
uv sync --dev
# Run tests
uv run pytest
# Run tests with coverage
uv run pytest --cov=paper_siphon
License
MIT
Stop wrestling with PDFs. Just siphon the good stuff.