Papercutter

Extract structured data from academic papers into analysis-ready datasets.

Papercutter converts PDF collections into structured Markdown using Docling, then applies LLM-based extraction to produce CSV datasets and LaTeX reports suitable for systematic reviews and meta-analyses.

Installation

pip install "papercutter[full]"

Requires Python 3.10+ and an OpenAI or Anthropic API key.

Optional extras:

pip install "papercutter[docling]"  # PDF processing only
pip install "papercutter[llm]"      # LLM extraction only
pip install "papercutter[report]"   # Report generation only

Usage

1. Ingest

Convert PDFs to Markdown and extract tables.

papercutter ingest ./pdfs/

Output: markdown/, tables/, figures/, inventory.json

2. Configure

Generate an extraction schema from paper abstracts.

papercutter configure

Output: columns.yaml

columns:
  - key: sample_size
    description: "Total observations (N)"
    type: integer
  - key: method
    description: "Estimation strategy (DiD, RDD, OLS, etc.)"
    type: string
  - key: effect
    description: "Main treatment coefficient"
    type: float
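
A schema like the one above can also drive simple type checks on extracted values. The following is a minimal sketch under stated assumptions: `COLUMNS` mirrors the example YAML as a Python literal, while `CASTERS` and `coerce_record` are hypothetical helpers and not part of Papercutter. How Papercutter itself validates values is not described here.

```python
# Hypothetical helper, not part of Papercutter: coerce one raw record
# to the types declared in a columns.yaml-style schema.

# Mirrors the example columns.yaml above as a plain Python literal.
COLUMNS = [
    {"key": "sample_size", "type": "integer"},
    {"key": "method", "type": "string"},
    {"key": "effect", "type": "float"},
]

# Assumed mapping from schema type names to Python constructors.
CASTERS = {"integer": int, "string": str, "float": float}


def coerce_record(raw, columns=COLUMNS):
    """Return a dict with one entry per schema column.

    Missing or unparseable values become None instead of raising.
    """
    record = {}
    for column in columns:
        value = raw.get(column["key"])
        cast = CASTERS[column["type"]]
        try:
            record[column["key"]] = None if value is None else cast(value)
        except (TypeError, ValueError):
            record[column["key"]] = None
    return record


example = coerce_record({"sample_size": "1200", "method": "DiD", "effect": "0.034"})
print(example)
```

Mapping bad values to None rather than raising keeps one malformed paper from aborting a whole batch; a stricter variant could collect the failures for manual review instead.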

3. Grind

Extract data from all papers.

papercutter grind

Output: extractions.json

4. Report

Generate analysis outputs.

papercutter report            # matrix.csv + review.pdf
papercutter report --condensed  # appendix format
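
The four steps above can be chained from a script. This is a minimal sketch, assuming `papercutter` is on the PATH and that each command runs without interactive prompts, which this README does not confirm. `PIPELINE` and `run_pipeline` are illustrative names, not part of Papercutter.

```python
import subprocess

# The four commands from the steps above, in order.
PIPELINE = [
    ["papercutter", "ingest", "./pdfs/"],
    ["papercutter", "configure"],
    ["papercutter", "grind"],
    ["papercutter", "report"],
]


def run_pipeline(commands=PIPELINE, runner=subprocess.run):
    """Run each command in order.

    check=True makes subprocess.run raise CalledProcessError on a
    non-zero exit code, so later steps never run on a failed one.
    """
    for command in commands:
        print("running:", " ".join(command))
        runner(command, check=True)
```

Calling `run_pipeline()` executes the real commands. The `runner` parameter exists so the sequence can be exercised with a stand-in function, for example in a dry run.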

Book Pipeline

Process books or handbooks with chapter-level extraction.

papercutter book index ./book.pdf   # Detect chapters
papercutter book extract            # Extract text
papercutter book grind              # Summarize chapters
papercutter book report             # Generate PDF

Output: output/book_summary.pdf

Output Files

File              Description
inventory.json    Processing status for each PDF
columns.yaml      Extraction schema definition
extractions.json  Extracted data per paper
matrix.csv        Flattened dataset for R/Stata/Python
review.pdf        LaTeX report with structured summaries
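
As a rough illustration of consuming matrix.csv from Python, the sketch below uses only the standard library. The column names follow the example columns.yaml, the `paper` identifier column is an assumption, and the sample rows are made up so the snippet runs without a real file; the actual layout of matrix.csv may differ.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration only; a real matrix.csv comes from
# `papercutter report`, and its exact columns depend on columns.yaml.
SAMPLE = """paper,sample_size,method,effect
smith2020,1200,DiD,0.034
lee2021,850,RDD,-0.012
park2022,,OLS,0.100
"""


def load_matrix(handle):
    """Read rows from a CSV handle, converting the numeric columns.

    Empty cells become None so missing extractions stay visible.
    """
    rows = []
    for row in csv.DictReader(handle):
        row["sample_size"] = int(row["sample_size"]) if row["sample_size"] else None
        row["effect"] = float(row["effect"]) if row["effect"] else None
        rows.append(row)
    return rows


rows = load_matrix(io.StringIO(SAMPLE))
methods = Counter(row["method"] for row in rows)
print(len(rows), "papers;", dict(methods))
```

For a file on disk, the same function can be given a handle opened with `open("matrix.csv", newline="")`.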

License

MIT
