
Extract and map content from academic papers for LLM processing


Papercut

Requires Python 3.10+. License: MIT.

Extract knowledge from academic papers. Papercut is a CLI-first Python package for researchers; it is published on PyPI as papercutter and invoked from the shell as papercut.

Installation

pip install papercutter

With LLM features (summarization, reports, study aids):

pip install "papercutter[llm]"

With fast PDF processing (PyMuPDF):

pip install "papercutter[fast]"

All optional dependencies:

pip install "papercutter[all]"

Development Installation

git clone https://github.com/pranjalrawat007/papercut.git
cd papercut
pip install -e ".[dev]"

Quick Start

Fetch Papers

Download papers from arXiv, a DOI, SSRN, NBER, or a direct URL:

# From arXiv
papercut fetch arxiv 2301.00001

# From DOI
papercut fetch doi 10.1257/aer.20180779

# From SSRN
papercut fetch ssrn 4123456

# From NBER
papercut fetch nber w29000

# From direct URL
papercut fetch url "https://example.com/paper.pdf" --name smith_2024
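
To fetch several papers in one run, the commands above can be driven from a script. The following is a minimal sketch and not part of Papercut itself: `fetch_command` and `fetch_all` are hypothetical helper names, the `papercut` executable is assumed to be on PATH, and `--name` is passed only when given, since this README shows that flag only with `fetch url`.

```python
import subprocess
from typing import Iterable, List, Optional, Tuple


def fetch_command(source: str, identifier: str, name: Optional[str] = None) -> List[str]:
    """Build the argument list for one `papercut fetch` call (hypothetical helper)."""
    cmd = ["papercut", "fetch", source, identifier]
    if name is not None:
        # The README shows --name only with `fetch url`; using it with
        # other sources is an assumption.
        cmd += ["--name", name]
    return cmd


def fetch_all(items: Iterable[Tuple[str, str]], dry_run: bool = True) -> List[List[str]]:
    """Build, and optionally run, one fetch command per (source, identifier) pair."""
    commands = [fetch_command(source, identifier) for source, identifier in items]
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError if a command exits non-zero.
            subprocess.run(cmd, check=True)
    return commands
```

With the default `dry_run=True`, `fetch_all([("arxiv", "2301.00001"), ("nber", "w29000")])` only returns the argument lists, which makes it easy to inspect them before running anything.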

Extract Text

Extract clean text from PDFs:

# Full text to stdout
papercut extract text paper.pdf

# Save to file
papercut extract text paper.pdf --output paper.txt

# Chunk for LLM processing
papercut extract text paper.pdf --chunk-size 4000 --overlap 200

# Extract specific pages
papercut extract text paper.pdf --pages "1-10,15"
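
To illustrate what `--chunk-size 4000 --overlap 200` means, here is a minimal sketch of overlapping chunking. It is not Papercut's implementation: `chunk_text` is a hypothetical function, and it assumes sizes are counted in characters, which this README does not specify (they could equally be tokens).

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 4000, overlap: int = 200) -> List[str]:
    """Split text into windows of at most chunk_size characters.

    Each window starts (chunk_size - overlap) characters after the previous
    one, so consecutive chunks share `overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text, so the last
        # chunk is not followed by a redundant tail made only of overlap.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap means a sentence cut at a chunk boundary still appears whole in one of the two neighbouring chunks, at the cost of some duplicated text.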

Extract Tables

Extract tables from PDFs as CSV or JSON:

# All tables to stdout as JSON
papercut extract tables paper.pdf

# Save as CSV files
papercut extract tables paper.pdf --output ./tables/ --format csv

# Extract from specific pages
papercut extract tables paper.pdf --pages "5-10" --format json
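
The `--pages` values used in this README ("1-10,15" and "5-10") combine comma-separated page numbers with inclusive ranges. As a sketch of how such a specification can be expanded, assuming that reading of the syntax, the hypothetical `parse_pages` below turns it into a sorted list of page numbers. How Papercut itself parses the option is not documented here.

```python
from typing import List


def parse_pages(spec: str) -> List[int]:
    """Expand a page spec such as "1-10,15" into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            first_text, last_text = part.split("-", 1)
            first, last = int(first_text), int(last_text)
            if first > last:
                raise ValueError(f"invalid page range: {part!r}")
            # Ranges are treated as inclusive on both ends (an assumption).
            pages.update(range(first, last + 1))
        else:
            pages.add(int(part))
    return sorted(pages)
```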

Extract References

Extract the bibliography as BibTeX (the default) or JSON:

# BibTeX to stdout
papercut extract refs paper.pdf

# Save to file
papercut extract refs paper.pdf --output refs.bib

# As JSON
papercut extract refs paper.pdf --format json

Configuration

Papercut stores configuration in ~/.papercut/config.yaml:

output:
  directory: ~/papers

extraction:
  backend: pdfplumber
  text:
    chunk_size: null
    chunk_overlap: 200
  tables:
    format: csv

# LLM settings (v0.2)
llm:
  default_provider: anthropic
  default_model: claude-sonnet-4-20250514

Environment variables take precedence over values in the config file:

export PAPERCUT_ANTHROPIC_API_KEY=sk-ant-...
export PAPERCUT_OPENAI_API_KEY=sk-...
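
The precedence rule can be sketched as a lookup that checks the environment first and falls back to the parsed config. This is an illustration only: `resolve_api_key` is a hypothetical function, and the `llm.api_keys` config entry it falls back to is an assumed layout that does not appear in the config shown above. Only the `PAPERCUT_<PROVIDER>_API_KEY` variable names come from this README.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    provider: str,
    config: Mapping,
    environ: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the API key for a provider, preferring the environment over config."""
    env = os.environ if environ is None else environ
    # Variable names follow the pattern shown in the README,
    # e.g. PAPERCUT_ANTHROPIC_API_KEY.
    env_value = env.get(f"PAPERCUT_{provider.upper()}_API_KEY")
    if env_value:
        return env_value
    # Hypothetical config location; the real file may store keys elsewhere
    # or not at all.
    return config.get("llm", {}).get("api_keys", {}).get(provider)
```

Passing `environ` explicitly keeps the function easy to test without touching the real process environment.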

Development

Run tests:

pytest tests/

Run linting:

ruff check src/
mypy src/

License

MIT License - see LICENSE for details.

