Extract and map content from academic papers for LLM processing
Papercut
Extract knowledge from academic papers. A CLI-first Python package for researchers.
Installation
pip install papercutter
With LLM features (summarization, reports, study aids):
pip install papercutter[llm]
With fast PDF processing (PyMuPDF):
pip install papercutter[fast]
All optional dependencies:
pip install papercutter[all]
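After installing, it can be useful to confirm which PDF backends are actually importable. The check below is only a sketch: it assumes the `fast` extra provides PyMuPDF (import name `pymupdf` or `fitz`) and that `pdfplumber` is the default backend, as the configuration section suggests. `available_backends` and `BACKEND_MODULES` are hypothetical names, not part of Papercut.

```python
from importlib.util import find_spec

# Candidate import names per backend (assumption: not taken from Papercut's source).
BACKEND_MODULES = {
    "pdfplumber": ("pdfplumber",),
    "pymupdf": ("pymupdf", "fitz"),
}


def available_backends(modules=BACKEND_MODULES):
    """Return {backend: bool}, True when any candidate module can be imported."""
    return {
        name: any(find_spec(candidate) is not None for candidate in candidates)
        for name, candidates in modules.items()
    }


if __name__ == "__main__":
    for name, ok in available_backends().items():
        print(f"{name}: {'installed' if ok else 'missing'}")
```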
Development Installation
git clone https://github.com/pranjalrawat007/papercut.git
cd papercut
pip install -e ".[dev]"
Quick Start
Fetch Papers
Download papers from various academic sources:
# From arXiv
papercut fetch arxiv 2301.00001
# From DOI
papercut fetch doi 10.1257/aer.20180779
# From SSRN
papercut fetch ssrn 4123456
# From NBER
papercut fetch nber w29000
# From direct URL
papercut fetch url "https://example.com/paper.pdf" --name smith_2024
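To fetch several papers in one go, the commands above can be driven from a short script. This is a minimal sketch that shells out to the CLI exactly as shown above. It assumes `papercut` exits with a non-zero status when a download fails, which the README does not confirm, and `build_fetch_command` and `fetch_all` are hypothetical helper names. The `url` source is left out because it takes an extra `--name` option.

```python
import subprocess

# Source names as they appear in the fetch commands above.
SOURCES = {"arxiv", "doi", "ssrn", "nber"}


def build_fetch_command(source, identifier):
    """Build the argv list for one `papercut fetch <source> <identifier>` call."""
    if source not in SOURCES:
        raise ValueError(f"unsupported source: {source!r}")
    return ["papercut", "fetch", source, identifier]


def fetch_all(items, runner=subprocess.run):
    """Run one fetch per (source, identifier) pair and return the pairs that failed."""
    failed = []
    for source, identifier in items:
        result = runner(build_fetch_command(source, identifier))
        # Assumption: a non-zero exit status signals a failed download.
        if result.returncode != 0:
            failed.append((source, identifier))
    return failed
```

A call such as `fetch_all([("arxiv", "2301.00001"), ("nber", "w29000")])` returns an empty list when every fetch succeeds.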
Extract Text
Extract clean text from PDFs:
# Full text to stdout
papercut extract text paper.pdf
# Save to file
papercut extract text paper.pdf --output paper.txt
# Chunk for LLM processing
papercut extract text paper.pdf --chunk-size 4000 --overlap 200
# Extract specific pages
papercut extract text paper.pdf --pages "1-10,15"
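The `--chunk-size` and `--overlap` options describe a sliding window over the extracted text. The sketch below illustrates the idea under the assumption that both values are measured in characters; the README does not say whether Papercut counts characters or tokens, and `chunk_text` is a hypothetical helper rather than Papercut's own implementation.

```python
def chunk_text(text, chunk_size=4000, overlap=200):
    """Split text into windows of chunk_size characters.

    Each window starts `overlap` characters before the previous one ends,
    so neighbouring chunks share that many characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a chunk boundary fully visible in at least one of the two neighbouring chunks, at the cost of some duplicated text.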
Extract Tables
Extract tables from PDFs as CSV or JSON:
# All tables to stdout as JSON
papercut extract tables paper.pdf
# Save as CSV files
papercut extract tables paper.pdf --output ./tables/ --format csv
# Extract from specific pages
papercut extract tables paper.pdf --pages "5-10" --format json
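The `--pages` option, also accepted by `extract text`, takes a comma-separated list of pages and ranges. The following sketch shows one way such a specification could be expanded. It assumes pages are 1-based and ranges are inclusive, which the README does not state, and `parse_pages` is a hypothetical helper, not Papercut's parser.

```python
def parse_pages(spec):
    """Expand a spec such as "1-10,15" into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            first, last = (int(n) for n in part.split("-", 1))
            if first > last:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(first, last + 1))  # inclusive on both ends
        else:
            pages.add(int(part))
    if any(page < 1 for page in pages):
        raise ValueError("page numbers start at 1")
    return sorted(pages)
```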
Extract References
Extract bibliography as BibTeX:
# BibTeX to stdout
papercut extract refs paper.pdf
# Save to file
papercut extract refs paper.pdf --output refs.bib
# As JSON
papercut extract refs paper.pdf --format json
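Once references are saved as BibTeX, a quick sanity check is to list the entry types and citation keys that were extracted. The sketch below uses a regular expression on standard BibTeX entry headers. It is not a full BibTeX parser and makes no assumption about how Papercut names its keys; `list_entries` is a hypothetical helper.

```python
import re

# Matches an entry header such as "@article{smith2024," and captures type and key.
ENTRY_START = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")


def list_entries(bibtex):
    """Return (entry_type, citation_key) pairs found in a BibTeX string."""
    return [(kind.lower(), key) for kind, key in ENTRY_START.findall(bibtex)]
```

For a file written with `--output refs.bib`, `list_entries(pathlib.Path("refs.bib").read_text())` gives a rough count of the references found.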
Configuration
Papercut stores configuration in ~/.papercut/config.yaml:
output:
  directory: ~/papers

extraction:
  backend: pdfplumber
  text:
    chunk_size: null
    chunk_overlap: 200
  tables:
    format: csv

# LLM settings (v0.2)
llm:
  default_provider: anthropic
  default_model: claude-sonnet-4-20250514
Environment variables take precedence over the config file:
export PAPERCUT_ANTHROPIC_API_KEY=sk-ant-...
export PAPERCUT_OPENAI_API_KEY=sk-...
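The precedence rule can be pictured as a lookup that checks the environment first and falls back to the config file. This is a sketch only: the variable naming pattern `PAPERCUT_<PROVIDER>_API_KEY` follows the two examples above, while the `llm.api_keys` config layout and the `resolve_api_key` helper are hypothetical and not confirmed by the README.

```python
import os


def resolve_api_key(provider, config, environ=None):
    """Return the API key for `provider`, preferring the environment over config."""
    environ = os.environ if environ is None else environ
    env_name = f"PAPERCUT_{provider.upper()}_API_KEY"
    value = environ.get(env_name)
    if value:
        return value  # the environment variable wins when it is set and non-empty
    # Hypothetical config layout: llm -> api_keys -> <provider>.
    return config.get("llm", {}).get("api_keys", {}).get(provider)
```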
Development
Run tests:
pytest tests/
Run linting:
ruff check src/
mypy src/
License
MIT License - see LICENSE for details.