Papercutter

Extract structured data from academic papers into analysis-ready datasets.

Papercutter converts PDF collections into structured Markdown using Docling, then applies LLM-based extraction to produce CSV datasets and LaTeX reports suitable for systematic reviews and meta-analyses.

Installation

pip install "papercutter[full]"

Requires Python 3.10+ and an OpenAI or Anthropic API key.
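
Before running the pipeline, it can help to confirm that the interpreter is recent enough and that a key is visible to the process. The sketch below assumes the keys live in the `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` environment variables, which are the default names used by the official OpenAI and Anthropic Python SDKs; whether Papercutter reads exactly these variables is an assumption, not something stated here.

```python
import os
import sys

# Assumed variable names: the defaults of the official OpenAI and Anthropic
# SDKs. Papercutter's own key lookup is not documented in this README.
KEY_VARS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def available_providers(env=None):
    """Return the key variables that are set to a non-empty value."""
    env = os.environ if env is None else env
    return [name for name in KEY_VARS if env.get(name, "").strip()]


found = available_providers()
print("Python 3.10+:", sys.version_info >= (3, 10))
print("API key found in:", ", ".join(found) if found else "none")
```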

Optional extras:

pip install "papercutter[docling]"  # PDF processing only
pip install "papercutter[llm]"      # LLM extraction only
pip install "papercutter[report]"   # Report generation only

Usage

1. Ingest

Convert PDFs to Markdown and extract tables.

papercutter ingest ./pdfs/

Output: markdown/, tables/, figures/, inventory.json
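
A quick way to check that ingestion produced something is to count the files in each output location. This is a minimal sketch that uses only the names from the output line above; it assumes the outputs are written to the current working directory, and it does not assume anything about the structure of `inventory.json`. The helper `summarize_ingest` is hypothetical and not part of Papercutter.

```python
from pathlib import Path

# Names taken from the "Output:" line of the ingest step.
EXPECTED_DIRS = ("markdown", "tables", "figures")
INVENTORY = "inventory.json"


def summarize_ingest(root):
    """Count files per output directory and note whether the inventory exists."""
    root = Path(root)
    summary = {}
    for name in EXPECTED_DIRS:
        folder = root / name
        if folder.is_dir():
            summary[name] = sum(1 for p in folder.iterdir() if p.is_file())
        else:
            # A missing directory is reported as None instead of raising.
            summary[name] = None
    summary[INVENTORY] = (root / INVENTORY).is_file()
    return summary


# Assumption: the ingest step was run from the current directory.
print(summarize_ingest("."))
```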

2. Configure

Generate an extraction schema from paper abstracts.

papercutter configure

Output: columns.yaml

columns:
  - key: sample_size
    description: "Total observations (N)"
    type: integer
  - key: method
    description: "Estimation strategy (DiD, RDD, OLS, etc.)"
    type: string
  - key: effect
    description: "Main treatment coefficient"
    type: float
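
To make the role of the schema concrete, here is a small sketch that checks a column list and casts raw string values to their declared types. It mirrors the example above as plain Python data so that no YAML parser is needed; with PyYAML installed, `yaml.safe_load(...)["columns"]` would give the same list. Only the three type names shown in the example are handled, and the helpers `check_columns` and `coerce_row` are hypothetical, not Papercutter functions.

```python
# Same content as the columns.yaml example, as plain Python data.
COLUMNS = [
    {"key": "sample_size", "description": "Total observations (N)", "type": "integer"},
    {"key": "method", "description": "Estimation strategy (DiD, RDD, OLS, etc.)", "type": "string"},
    {"key": "effect", "description": "Main treatment coefficient", "type": "float"},
]

# Assumption: only the type names that appear in the example are supported.
CASTERS = {"integer": int, "string": str, "float": float}


def check_columns(columns):
    """Raise ValueError on a missing field, unknown type, or duplicate key."""
    seen = set()
    for col in columns:
        missing = {"key", "description", "type"} - col.keys()
        if missing:
            raise ValueError(f"column {col!r} is missing {sorted(missing)}")
        if col["type"] not in CASTERS:
            raise ValueError(f"unknown type {col['type']!r} for {col['key']!r}")
        if col["key"] in seen:
            raise ValueError(f"duplicate key {col['key']!r}")
        seen.add(col["key"])


def coerce_row(columns, raw):
    """Cast raw values to their declared types; absent or empty values become None."""
    row = {}
    for col in columns:
        value = raw.get(col["key"])
        row[col["key"]] = None if value in (None, "") else CASTERS[col["type"]](value)
    return row


check_columns(COLUMNS)
print(coerce_row(COLUMNS, {"sample_size": "1200", "method": "DiD", "effect": "0.034"}))
```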

3. Grind

Extract data from all papers.

papercutter grind

Output: extractions.json

4. Report

Generate analysis outputs.

papercutter report              # matrix.csv + review.pdf
papercutter report --condensed  # appendix format
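
The four steps above can also be chained from a script. The sketch below only assembles the commands shown in this section and runs them in order with `subprocess`; the function names are hypothetical, and it assumes every step can run unattended, which this README does not confirm (for example, you may prefer to review `columns.yaml` by hand before grinding).

```python
import shlex
import subprocess


def pipeline_commands(pdf_dir, condensed=False):
    """Return the four CLI invocations from the Usage section, in order."""
    report = ["papercutter", "report"]
    if condensed:
        report.append("--condensed")
    return [
        ["papercutter", "ingest", str(pdf_dir)],
        ["papercutter", "configure"],
        ["papercutter", "grind"],
        report,
    ]


def run_pipeline(pdf_dir, condensed=False):
    """Run each step in turn; check=True stops at the first failing command."""
    for cmd in pipeline_commands(pdf_dir, condensed):
        print("$", shlex.join(cmd))
        subprocess.run(cmd, check=True)


# Print the plan without executing anything.
for cmd in pipeline_commands("./pdfs/"):
    print(shlex.join(cmd))
```

Calling `run_pipeline("./pdfs/")` would execute the steps for real; a failing step raises `subprocess.CalledProcessError`, so later steps never run on incomplete output.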

Book Pipeline

Process books or handbooks with chapter-level extraction.

papercutter book index ./book.pdf   # Detect chapters
papercutter book extract            # Extract text
papercutter book grind              # Summarize chapters
papercutter book report             # Generate PDF

Output: output/book_summary.pdf

Output Files

File              Description
inventory.json    Processing status for each PDF
columns.yaml      Extraction schema definition
extractions.json  Extracted data per paper
matrix.csv        Flattened dataset for R/Stata/Python
review.pdf        LaTeX report with structured summaries
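
As an illustration of using the flattened dataset from Python, the sketch below averages the treatment coefficient by estimation method with the standard library only. It assumes `matrix.csv` has a header row with one column per schema key, here the `sample_size`, `method`, and `effect` keys from the example schema; the actual layout may differ. The rows are made up for demonstration.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def mean_effect_by_method(csv_file):
    """Group rows by `method` and average `effect`, skipping blank effects."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        if row.get("effect"):
            groups[row["method"]].append(float(row["effect"]))
    return {method: mean(values) for method, values in groups.items()}


# Made-up rows; real data would come from open("matrix.csv", newline="").
sample = io.StringIO(
    "sample_size,method,effect\n"
    "1200,DiD,0.25\n"
    "800,DiD,0.75\n"
    "450,RDD,0.125\n"
    "300,OLS,\n"
)
print(mean_effect_by_method(sample))
```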

License

MIT
