Extract and map content from academic papers for LLM processing
Papercutter
Extract knowledge from academic papers. A CLI-first Python package for researchers.
Installation
pip install papercutter
With LLM features (summarization, reports, study aids):
pip install "papercutter[llm]"
With fast PDF processing (PyMuPDF):
pip install "papercutter[fast]"
All optional dependencies (the quotes keep shells such as zsh from treating the brackets as a glob pattern):
pip install "papercutter[all]"
Development Installation
git clone https://github.com/rawatpranjal/papercutter.git
cd papercutter
pip install -e ".[dev]"
Quick Start
Fetch Papers
Download papers from various academic sources:
# From arXiv
papercutter fetch arxiv 2301.00001
# From DOI
papercutter fetch doi 10.1257/aer.20180779
# From SSRN
papercutter fetch ssrn 4123456
# From NBER
papercutter fetch nber w29000
# From direct URL
papercutter fetch url "https://example.com/paper.pdf" --name smith_2024
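For a longer reading list, the fetch commands above can be scripted. The following is a minimal sketch, assuming `papercutter` is on your PATH. The helper names `build_fetch_commands` and `fetch_all` are hypothetical and not part of the package; only the `fetch` subcommands shown above are used.

```python
import subprocess

# Sources that appear in the examples above; anything else is rejected.
KNOWN_SOURCES = {"arxiv", "doi", "ssrn", "nber", "url"}


def build_fetch_commands(papers):
    """Turn (source, identifier) pairs into papercutter argv lists.

    Hypothetical helper for illustration; not part of papercutter.
    """
    commands = []
    for source, identifier in papers:
        if source not in KNOWN_SOURCES:
            raise ValueError(f"unknown source: {source}")
        commands.append(["papercutter", "fetch", source, identifier])
    return commands


def fetch_all(papers):
    """Run each fetch in turn, stopping at the first failure."""
    for argv in build_fetch_commands(papers):
        subprocess.run(argv, check=True)


reading_list = [("arxiv", "2301.00001"), ("doi", "10.1257/aer.20180779")]
commands = build_fetch_commands(reading_list)
```

Building the argument lists separately from running them makes it possible to inspect or log the commands before anything is downloaded; call `fetch_all(reading_list)` to actually run them.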
Extract Text
Extract clean text from PDFs:
# Full text to stdout
papercutter extract text paper.pdf
# Save to file
papercutter extract text paper.pdf --output paper.txt
# Chunk for LLM processing
papercutter extract text paper.pdf --chunk-size 4000 --overlap 200
# Extract specific pages
papercutter extract text paper.pdf --pages "1-10,15"
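To make the effect of --chunk-size and --overlap concrete, here is a minimal sketch of overlapping chunking. It is not Papercutter's implementation: the function name `chunk_text` is hypothetical, and it assumes sizes are counted in characters, which the text above does not specify.

```python
def chunk_text(text, chunk_size, overlap):
    """Split text into windows of chunk_size characters.

    Consecutive windows share `overlap` characters, so a sentence cut
    at a boundary still appears whole in one of the two chunks.
    Illustrative only; units (characters) are an assumption.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, overlap=1)
# chunks == ["abcd", "defg", "ghij"]
```

Each chunk starts `chunk_size - overlap` characters after the previous one, so with the values from the example above (4000 and 200) a new chunk would begin every 3800 units.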
Extract Tables
Extract tables from PDFs as CSV or JSON:
# All tables to stdout as JSON
papercutter extract tables paper.pdf
# Save as CSV files
papercutter extract tables paper.pdf --output ./tables/ --format csv
# Extract from specific pages
papercutter extract tables paper.pdf --pages "5-10" --format json
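The --pages values used above ("1-10,15", "5-10") combine ranges and single pages. The sketch below shows one way such a spec could be expanded. It assumes pages are 1-based and ranges are inclusive; `parse_pages` is a hypothetical name, not a Papercutter function.

```python
def parse_pages(spec):
    """Expand a page spec such as "1-10,15" into a sorted list of pages.

    Assumes 1-based page numbers and inclusive ranges (an assumption,
    not documented behavior).
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            first, last = part.split("-", 1)
            start, end = int(first), int(last)
            if start > end:
                raise ValueError(f"descending range: {part}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


selected = parse_pages("1-10,15")
```

Using a set means overlapping ranges such as "1-5,3-8" yield each page once.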
Extract References
Extract bibliography as BibTeX:
# BibTeX to stdout
papercutter extract refs paper.pdf
# Save to file
papercutter extract refs paper.pdf --output refs.bib
# As JSON
papercutter extract refs paper.pdf --format json
Configuration
Papercutter stores configuration in ~/.papercutter/config.yaml:
output:
  directory: ~/papers
extraction:
  backend: pdfplumber
  text:
    chunk_size: null
    chunk_overlap: 200
  tables:
    format: csv
# LLM settings (v0.2)
llm:
  default_provider: anthropic
  default_model: claude-sonnet-4-20250514
Environment variables override config:
export PAPERCUTTER_ANTHROPIC_API_KEY=sk-ant-...
export PAPERCUTTER_OPENAI_API_KEY=sk-...
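The precedence described above (environment first, then the config file) can be sketched as follows. This is an illustration rather than Papercutter's code: `resolve_api_key` is a hypothetical name, and the `api_key` entry under `llm` is an assumed fallback location that the configuration example does not show.

```python
import os


def resolve_api_key(provider, config, environ=None):
    """Return the API key for a provider, preferring the environment.

    The variable name follows the PAPERCUTTER_<PROVIDER>_API_KEY pattern
    shown above. The config fallback (`llm.api_key`) is hypothetical.
    """
    environ = os.environ if environ is None else environ
    env_name = f"PAPERCUTTER_{provider.upper()}_API_KEY"
    value = environ.get(env_name)
    if value:
        return value  # the environment variable wins when it is set
    return config.get("llm", {}).get("api_key")


key = resolve_api_key(
    "anthropic",
    {"llm": {"api_key": "from-file"}},
    {"PAPERCUTTER_ANTHROPIC_API_KEY": "from-env"},
)
```

Passing `environ` explicitly keeps the function easy to test without touching the real environment.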
Migration from Papercut
Papercutter is a direct rename of the original Papercut project. To upgrade an existing installation:
- Reinstall the package: pip uninstall papercut && pip install papercutter
- Update scripts and shell aliases to call papercutter instead of papercut.
- Rename your config directory if you have custom settings: mv ~/.papercut ~/.papercutter
- (Optional) Rename the cache directory to retain cached artifacts: mv ~/.cache/papercut ~/.cache/papercutter
- Update any PAPERCUT_* environment variables to the new PAPERCUTTER_* prefix.
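The environment variable rename in the last step can be checked with a short script. The sketch below works on a plain dictionary of variables; `migrate_env` is a hypothetical helper, not something the package provides.

```python
def migrate_env(environ):
    """Map PAPERCUT_* names to PAPERCUTTER_*, leaving other variables alone.

    If both the old and the new name are present, the value already
    stored under the new name is kept.
    """
    old_prefix, new_prefix = "PAPERCUT_", "PAPERCUTTER_"
    migrated = {}
    # First carry over renamed copies of the old variables...
    for name, value in environ.items():
        if name.startswith(old_prefix):
            migrated[new_prefix + name[len(old_prefix):]] = value
    # ...then everything else, so existing new-style names take priority.
    for name, value in environ.items():
        if not name.startswith(old_prefix):
            migrated[name] = value
    return migrated


result = migrate_env({"PAPERCUT_OPENAI_API_KEY": "sk-old", "HOME": "/home/me"})
# result == {"PAPERCUTTER_OPENAI_API_KEY": "sk-old", "HOME": "/home/me"}
```

Note that a name beginning with PAPERCUTTER_ does not match the old prefix, because its ninth character is "T" rather than "_", so already-migrated variables are never renamed twice.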
Development
Run tests:
pytest tests/
Run linting:
ruff check src/
mypy src/
License
MIT License - see LICENSE for details.