Extract structured data from academic papers into analysis-ready datasets

Papercutter

Turn your PDF collection into a dataset you can actually use.

For researchers doing systematic reviews, meta-analyses, or literature surveys who have PDFs piling up but need structured data for analysis.

Requires Python 3.10+ and an OpenAI or Anthropic API key.
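
A quick way to confirm both requirements before installing is a small preflight script. This is a minimal sketch, not part of Papercutter: `preflight` is a hypothetical helper, and the variable names `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` are the ones conventionally read by the official OpenAI and Anthropic SDKs, so it is an assumption that Papercutter picks up its key the same way.

```python
import os
import sys

# Assumption: conventional variable names used by the OpenAI and Anthropic
# SDKs; Papercutter's own configuration may differ.
API_KEY_VARS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def preflight(version_info=sys.version_info, environ=os.environ):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    # The README states Python 3.10+ as the minimum version.
    if tuple(version_info[:2]) < (3, 10):
        problems.append("Python 3.10 or newer is required")
    # One key is enough; an empty string counts as missing.
    if not any(environ.get(name) for name in API_KEY_VARS):
        problems.append("set one of: " + ", ".join(API_KEY_VARS))
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("problem:", problem)
```

Running the script prints nothing when the interpreter is recent enough and at least one key is set.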

Installation

pip install papercutter[full]

Optional extras:

pip install papercutter[docling]  # PDF processing only
pip install papercutter[llm]      # LLM extraction only
pip install papercutter[report]   # Report generation only

Usage

1. Ingest

Convert PDFs to Markdown and extract tables.

papercutter ingest ./pdfs/

Output: markdown/, tables/, figures/, inventory.json
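
To check that an ingest run produced everything listed above, a short script can look for the four outputs. This is a sketch under one assumption: that the outputs are written to the directory the command was run from. `missing_outputs` is a hypothetical helper, not a Papercutter function.

```python
from pathlib import Path

# Names taken from the "Output:" line above; a trailing slash marks a directory.
EXPECTED = ("markdown/", "tables/", "figures/", "inventory.json")


def missing_outputs(workdir="."):
    """Return the expected ingest outputs that are absent under workdir."""
    root = Path(workdir)
    missing = []
    for name in EXPECTED:
        path = root / name.rstrip("/")
        # Directories and files are checked with the matching predicate.
        present = path.is_dir() if name.endswith("/") else path.is_file()
        if not present:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print(missing_outputs("."))
```

An empty list means all four outputs are present.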

2. Configure

Generate an extraction schema from paper abstracts.

papercutter configure

Output: columns.yaml

columns:
  - key: sample_size
    description: "Total observations (N)"
    type: integer
  - key: method
    description: "Estimation strategy (DiD, RDD, OLS, etc.)"
    type: string
  - key: effect
    description: "Main treatment coefficient"
    type: float
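
To illustrate what the `type` field implies for downstream values, the sketch below casts a raw record to the types declared in the example schema. It is not Papercutter's implementation: `coerce_record` and `CASTS` are hypothetical names, the schema is repeated as a Python literal so no YAML parser is needed, and treating unparseable values as missing is one possible policy among several.

```python
# Schema mirroring the columns.yaml example above, written as a Python literal
# so the sketch needs no YAML parser.
COLUMNS = [
    {"key": "sample_size", "type": "integer"},
    {"key": "method", "type": "string"},
    {"key": "effect", "type": "float"},
]

# Assumption: the three type names map onto these Python constructors.
CASTS = {"integer": int, "string": str, "float": float}


def coerce_record(raw, columns=COLUMNS):
    """Cast raw values to the schema types; missing or bad values become None."""
    record = {}
    for column in columns:
        key = column["key"]
        value = raw.get(key)
        try:
            record[key] = None if value is None else CASTS[column["type"]](value)
        except (TypeError, ValueError):
            # A value that cannot be parsed is treated as missing.
            record[key] = None
    return record


if __name__ == "__main__":
    print(coerce_record({"sample_size": "1200", "method": "DiD", "effect": "0.35"}))
```

Mapping bad values to `None` keeps every record the same shape, which makes a later flattening step simpler.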

3. Extract

Extract data from all papers using an LLM.

papercutter extract

Output: extractions.json

4. Report

Generate the analysis outputs: a flattened dataset (matrix.csv) and a PDF review (review.pdf).

papercutter report              # matrix.csv + review.pdf
papercutter report --condensed  # appendix format
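
The four steps can also be driven from a script. The sketch below only wraps the CLI commands shown in steps 1 to 4 with the standard library's `subprocess`; it does not use a Papercutter Python API, and `pipeline_commands` and `run_pipeline` are hypothetical names. It assumes the steps can run back to back without manual input.

```python
import shutil
import subprocess


def pipeline_commands(pdf_dir="./pdfs/", condensed=False):
    """Build the four CLI invocations in the order the steps are documented."""
    report = ["papercutter", "report"]
    if condensed:
        report.append("--condensed")
    return [
        ["papercutter", "ingest", pdf_dir],
        ["papercutter", "configure"],
        ["papercutter", "extract"],
        report,
    ]


def run_pipeline(pdf_dir="./pdfs/", condensed=False):
    """Run each step in turn, stopping at the first non-zero exit code."""
    if shutil.which("papercutter") is None:
        raise RuntimeError("papercutter executable not found on PATH")
    for command in pipeline_commands(pdf_dir, condensed):
        # check=True raises CalledProcessError if a step fails.
        subprocess.run(command, check=True)
```

In practice it may be worth pausing after `configure` to review columns.yaml before running `extract`, since the schema determines what gets extracted.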

Book Pipeline

Process books or handbooks into chapter-level summaries.

papercutter book index ./book.pdf   # Detect chapters
papercutter book extract            # Extract text
papercutter book summarize          # Summarize chapters
papercutter book report             # Generate PDF

Output: output/book_summary.pdf

Output Files

| File | Description |
|------|-------------|
| inventory.json | Processing status for each PDF |
| columns.yaml | Extraction schema definition |
| extractions.json | Extracted data per paper |
| matrix.csv | Flattened dataset for R/Stata/Python |
| review.pdf | LaTeX report with structured summaries |
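
As an illustration of using matrix.csv from Python, the sketch below averages a numeric column per group with the standard library only. The layout is an assumption: one row per paper, one column per schema key from the earlier columns.yaml example, plus a hypothetical `paper` identifier column. The sample rows are placeholders invented for the example, not real results.

```python
import csv
import io
from statistics import mean

# Placeholder rows invented purely for illustration; they are not real results.
# Assumption: matrix.csv has one row per paper and one column per schema key.
SAMPLE = """paper,sample_size,method,effect
paper_a,1200,DiD,0.35
paper_b,800,RDD,
paper_c,450,DiD,0.10
"""


def mean_effect_by_method(handle):
    """Average the 'effect' column per method, skipping blank cells."""
    groups = {}
    for row in csv.DictReader(handle):
        if row["effect"]:
            groups.setdefault(row["method"], []).append(float(row["effect"]))
    return {method: mean(values) for method, values in groups.items()}


if __name__ == "__main__":
    print(mean_effect_by_method(io.StringIO(SAMPLE)))
```

For a real file, pass `open("matrix.csv", newline="")` in place of the `io.StringIO` wrapper.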

License

MIT
