Extract and map content from academic papers for LLM processing

Papercutter

Requires Python 3.10+. License: MIT.

Extract knowledge from academic papers. A CLI-first Python package for researchers.

Installation

pip install papercutter

With LLM features (summarization, reports, study aids):

pip install "papercutter[llm]"

With fast PDF processing (PyMuPDF):

pip install "papercutter[fast]"

All optional dependencies:

pip install "papercutter[all]"

Development Installation

git clone https://github.com/rawatpranjal/papercutter.git
cd papercutter
pip install -e ".[dev]"

Quick Start

Fetch Papers

Download papers from various academic sources:

# From arXiv
papercutter fetch arxiv 2301.00001

# From DOI
papercutter fetch doi 10.1257/aer.20180779

# From SSRN
papercutter fetch ssrn 4123456

# From NBER
papercutter fetch nber w29000

# From direct URL
papercutter fetch url "https://example.com/paper.pdf" --name smith_2024
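
To fetch several papers in one go, the commands above can be driven from a small Python script. This is a minimal sketch, not part of Papercutter itself: build_fetch_command and fetch_all are hypothetical helper names, and the only CLI arguments used are the ones shown in this README. Because --name appears here only with the url source, the sketch conservatively rejects it for other sources.

```python
import subprocess

# Sources documented in this README.
SOURCES = {"arxiv", "doi", "ssrn", "nber", "url"}


def build_fetch_command(source, identifier, name=None):
    """Build the argv list for one `papercutter fetch` call (hypothetical helper)."""
    if source not in SOURCES:
        raise ValueError(f"unknown source: {source!r}")
    cmd = ["papercutter", "fetch", source, identifier]
    if name is not None:
        # Assumption: --name is only shown with `url` in the README,
        # so this sketch does not pass it for other sources.
        if source != "url":
            raise ValueError("--name is only documented for the url source")
        cmd += ["--name", name]
    return cmd


def fetch_all(items):
    """Run one fetch per (source, identifier) pair, stopping on the first failure."""
    for source, identifier in items:
        subprocess.run(build_fetch_command(source, identifier), check=True)
```

Calling fetch_all([("arxiv", "2301.00001"), ("nber", "w29000")]) would run the two corresponding commands in order, assuming papercutter is on the PATH.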

Extract Text

Extract clean text from PDFs:

# Full text to stdout
papercutter extract text paper.pdf

# Save to file
papercutter extract text paper.pdf --output paper.txt

# Chunk for LLM processing
papercutter extract text paper.pdf --chunk-size 4000 --overlap 200

# Extract specific pages
papercutter extract text paper.pdf --pages "1-10,15"
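
To make the effect of --chunk-size and --overlap concrete, here is a minimal sketch of fixed-size chunking with overlap. It is an illustration only, not Papercutter's implementation: chunk_text is a hypothetical function, and it assumes sizes are measured in characters, which the README does not state (the tool may count tokens or split on sentence boundaries instead).

```python
def chunk_text(text, chunk_size=4000, overlap=200):
    """Split text into windows of chunk_size characters.

    Consecutive windows share `overlap` characters, so context that
    straddles a boundary appears in both chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With chunk_size=4000 and overlap=200, each window starts 3800 characters after the previous one, so the last 200 characters of one chunk are repeated at the start of the next.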

Extract Tables

Extract tables from PDFs as CSV or JSON:

# All tables to stdout as JSON
papercutter extract tables paper.pdf

# Save as CSV files
papercutter extract tables paper.pdf --output ./tables/ --format csv

# Extract from specific pages
papercutter extract tables paper.pdf --pages "5-10" --format json
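
The --pages values used in this README ("1-10,15" and "5-10") combine comma-separated entries with hyphenated ranges. The sketch below shows one way such a specification could be expanded into page numbers. parse_pages is a hypothetical helper, and it assumes pages are 1-based and ranges are inclusive; Papercutter's actual parsing rules are not documented here.

```python
def parse_pages(spec):
    """Expand a page spec such as "1-10,15" into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            first, last = part.split("-", 1)
            start, end = int(first), int(last)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))  # assumption: inclusive range
        else:
            pages.add(int(part))
    if any(page < 1 for page in pages):
        raise ValueError("page numbers are assumed to start at 1")
    return sorted(pages)
```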

Extract References

Extract bibliography as BibTeX:

# BibTeX to stdout
papercutter extract refs paper.pdf

# Save to file
papercutter extract refs paper.pdf --output refs.bib

# As JSON
papercutter extract refs paper.pdf --format json

Configuration

Papercutter stores configuration in ~/.papercutter/config.yaml:

output:
  directory: ~/papers

extraction:
  backend: pdfplumber
  text:
    chunk_size: null
    chunk_overlap: 200
  tables:
    format: csv

# LLM settings (v0.2)
llm:
  default_provider: anthropic
  default_model: claude-sonnet-4-20250514

Environment variables take precedence over values in the config file:

export PAPERCUTTER_ANTHROPIC_API_KEY=sk-ant-...
export PAPERCUTTER_OPENAI_API_KEY=sk-...
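
The precedence rule can be sketched as a simple lookup: check the environment first, then fall back to the loaded config. This is an illustration under stated assumptions, not Papercutter's code. resolve_api_key is a hypothetical helper, and the llm.api_keys config layout is invented for the example, since the README does not show where (or whether) keys live in config.yaml.

```python
import os

# Environment variable names taken from the README.
ENV_KEYS = {
    "anthropic": "PAPERCUTTER_ANTHROPIC_API_KEY",
    "openai": "PAPERCUTTER_OPENAI_API_KEY",
}


def resolve_api_key(provider, config, environ=None):
    """Return the API key for `provider`, preferring the environment over config."""
    environ = os.environ if environ is None else environ
    value = environ.get(ENV_KEYS[provider])
    if value:
        return value  # environment wins
    # Hypothetical config layout: {"llm": {"api_keys": {"anthropic": "..."}}}
    return config.get("llm", {}).get("api_keys", {}).get(provider)
```

Passing environ explicitly makes the rule easy to check without touching the real process environment.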

Migration from Papercut

Papercutter is a direct rename of the original Papercut project. To upgrade an existing installation:

  1. Reinstall the package: pip uninstall papercut && pip install papercutter.
  2. Update scripts and shell aliases to call papercutter instead of papercut.
  3. Rename your config directory if you have custom settings: mv ~/.papercut ~/.papercutter.
  4. (Optional) Rename the cache directory to retain cached artifacts: mv ~/.cache/papercut ~/.cache/papercutter.
  5. Update any PAPERCUT_* environment variables to the new PAPERCUTTER_* prefix.
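
Steps 3 to 5 above can be scripted. The following is a minimal sketch using only the standard library; migrate_dirs and legacy_env_vars are hypothetical helpers, not part of Papercutter. It defaults to a dry run and never overwrites an existing target directory.

```python
from pathlib import Path


def migrate_dirs(home, dry_run=True):
    """Rename the old config and cache directories under `home`.

    Returns the (old, new) pairs that were (or would be) renamed.
    """
    home = Path(home)
    pairs = [
        (home / ".papercut", home / ".papercutter"),
        (home / ".cache" / "papercut", home / ".cache" / "papercutter"),
    ]
    planned = []
    for old, new in pairs:
        # Skip missing sources and never clobber an existing target.
        if old.exists() and not new.exists():
            planned.append((old, new))
            if not dry_run:
                old.rename(new)
    return planned


def legacy_env_vars(environ):
    """Map each PAPERCUT_* variable name to its PAPERCUTTER_* replacement."""
    prefix = "PAPERCUT_"
    return {
        name: "PAPERCUTTER_" + name[len(prefix):]
        for name in environ
        if name.startswith(prefix)
    }
```

Running migrate_dirs(Path.home()) lists what would be renamed; pass dry_run=False to apply the renames. legacy_env_vars(os.environ) reports which variables still use the old prefix so they can be updated in shell profiles.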

Development

Run tests:

pytest tests/

Run linting:

ruff check src/
mypy src/

License

MIT License - see LICENSE for details.

