Extract structured data from academic papers into analysis-ready datasets
Papercutter
Turn your PDF collection into a dataset you can actually use.
Papercutter is for researchers doing systematic reviews, meta-analyses, or literature surveys who have a growing pile of PDFs but need structured data for analysis.
Requires Python 3.10+ and an OpenAI or Anthropic API key.
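Before installing, it can help to confirm both requirements. The following is a minimal sketch, not part of Papercutter: `check_requirements` is a hypothetical helper, and the variable names `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` are the ones conventionally used by the OpenAI and Anthropic SDKs. Whether Papercutter reads exactly these names is an assumption.

```python
import os
import sys

# Assumed variable names: the conventional ones for the OpenAI and Anthropic SDKs.
API_KEY_VARS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def check_requirements(env=None, version=None):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    version = sys.version_info[:2] if version is None else version
    problems = []
    if tuple(version) < (3, 10):
        problems.append(f"Python 3.10+ required, found {version[0]}.{version[1]}")
    if not any(env.get(name) for name in API_KEY_VARS):
        problems.append("Set one of: " + ", ".join(API_KEY_VARS))
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print("Missing:", problem)
```

The `env` and `version` parameters exist only so the check can be exercised without touching the real environment.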
Installation
pip install papercutter[full]
Optional extras:
pip install papercutter[docling] # PDF processing only
pip install papercutter[llm] # LLM extraction only
pip install papercutter[report] # Report generation only
Usage
1. Ingest
Convert PDFs to Markdown and extract tables.
papercutter ingest ./pdfs/
Output: markdown/, tables/, figures/, inventory.json
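After ingesting, a quick check that the listed outputs exist can catch a failed run early. This sketch assumes the outputs are written to the current working directory, which the text above does not state; `missing_ingest_outputs` is a hypothetical helper, not a Papercutter function.

```python
from pathlib import Path

# Names taken from the "Output:" line above; a trailing slash marks a directory.
EXPECTED_OUTPUTS = ("markdown/", "tables/", "figures/", "inventory.json")


def missing_ingest_outputs(workdir="."):
    """Return the expected ingest outputs that are absent from workdir."""
    root = Path(workdir)
    missing = []
    for name in EXPECTED_OUTPUTS:
        path = root / name.rstrip("/")
        present = path.is_dir() if name.endswith("/") else path.is_file()
        if not present:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print("Missing outputs:", missing_ingest_outputs() or "none")
```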
2. Configure
Generate an extraction schema from paper abstracts.
papercutter configure
Output: columns.yaml
columns:
  - key: sample_size
    description: "Total observations (N)"
    type: integer
  - key: method
    description: "Estimation strategy (DiD, RDD, OLS, etc.)"
    type: string
  - key: effect
    description: "Main treatment coefficient"
    type: float
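The `type` field suggests that each extracted value ends up as an integer, string, or float. As an illustration of what such a schema implies, here is a sketch that casts raw values according to the schema. It mirrors the YAML as plain Python data to stay dependency-free; the `CASTS` mapping and `coerce_row` are hypothetical and do not describe Papercutter's internals.

```python
# The schema from columns.yaml, mirrored as plain Python data.
COLUMNS = [
    {"key": "sample_size", "description": "Total observations (N)", "type": "integer"},
    {"key": "method", "description": "Estimation strategy (DiD, RDD, OLS, etc.)", "type": "string"},
    {"key": "effect", "description": "Main treatment coefficient", "type": "float"},
]

# Assumed mapping from schema type names to Python constructors.
CASTS = {"integer": int, "string": str, "float": float}


def coerce_row(raw, columns=COLUMNS):
    """Cast raw values to the schema types; missing or invalid values become None."""
    row = {}
    for column in columns:
        value = raw.get(column["key"])
        try:
            row[column["key"]] = None if value is None else CASTS[column["type"]](value)
        except (ValueError, TypeError):
            row[column["key"]] = None
    return row


print(coerce_row({"sample_size": "1200", "method": "DiD", "effect": "0.034"}))
# → {'sample_size': 1200, 'method': 'DiD', 'effect': 0.034}
```

Mapping unparseable values to `None` rather than raising keeps one bad cell from discarding a whole paper's row.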
3. Extract
Extract data from all papers using an LLM.
papercutter extract
Output: extractions.json
4. Report
Generate analysis outputs.
papercutter report # matrix.csv + review.pdf
papercutter report --condensed # appendix format
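The four steps can be chained from a script. The sketch below uses only the commands documented above and Python's standard `subprocess` module; `run_pipeline` and its `dry_run` flag are hypothetical names introduced for illustration.

```python
import subprocess

# The four commands exactly as documented in the steps above.
PIPELINE = [
    ["papercutter", "ingest", "./pdfs/"],
    ["papercutter", "configure"],
    ["papercutter", "extract"],
    ["papercutter", "report"],
]


def run_pipeline(steps=PIPELINE, dry_run=False):
    """Run each step in order, stopping at the first failure."""
    executed = []
    for command in steps:
        print("$", " ".join(command))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(command, check=True)
        executed.append(command)
    return executed


if __name__ == "__main__":
    run_pipeline(dry_run=True)  # set dry_run=False to actually invoke the CLI
```

In practice you would likely pause after `configure` to review `columns.yaml` before running the extraction.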
Book Pipeline
Process books or handbooks with chapter-level summaries.
papercutter book index ./book.pdf # Detect chapters
papercutter book extract # Extract text
papercutter book summarize # Summarize chapters
papercutter book report # Generate PDF
Output: output/book_summary.pdf
Output Files
| File | Description |
|---|---|
| inventory.json | Processing status for each PDF |
| columns.yaml | Extraction schema definition |
| extractions.json | Extracted data per paper |
| matrix.csv | Flattened dataset for R/Stata/Python |
| review.pdf | LaTeX report with structured summaries |
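As an example of using the flattened dataset from Python, the sketch below reads CSV rows and averages one numeric column per group. It assumes `matrix.csv` has one row per paper and one column per schema key (`sample_size`, `method`, `effect`); the actual layout is not documented here and may differ. The sample rows are made up purely for illustration.

```python
import csv
import io
from statistics import mean


def load_matrix(handle):
    """Read matrix rows from an open text file as a list of dicts."""
    return list(csv.DictReader(handle))


def mean_effect_by_method(rows):
    """Average the `effect` column per `method`, skipping blank or non-numeric cells."""
    grouped = {}
    for row in rows:
        try:
            effect = float(row["effect"])
        except (KeyError, TypeError, ValueError):
            continue
        grouped.setdefault(row.get("method", ""), []).append(effect)
    return {method: mean(values) for method, values in grouped.items()}


# Made-up sample standing in for matrix.csv; the real file's columns may differ.
SAMPLE = "sample_size,method,effect\n1200,DiD,0.5\n800,DiD,0.25\n300,OLS,\n"
rows = load_matrix(io.StringIO(SAMPLE))
print(mean_effect_by_method(rows))
# → {'DiD': 0.375}
```

With a real file, replace the `io.StringIO` sample with `open("matrix.csv", newline="")` inside a `with` block.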
License
MIT
Download files
File details
Details for the file papercutter-3.1.0.tar.gz.
File metadata
- Download URL: papercutter-3.1.0.tar.gz
- Upload date:
- Size: 258.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2ee877fa9ba6143ed6ff81bd774b8998d0c10127c7527140e631a302ef1f994a |
| MD5 | 5210cf9910b12c5c45edf3540f5b4246 |
| BLAKE2b-256 | bed4144d65872f0d876bef68fc6db21133f8c7ea499476b3cd51e13fa9c9fc65 |
File details
Details for the file papercutter-3.1.0-py3-none-any.whl.
File metadata
- Download URL: papercutter-3.1.0-py3-none-any.whl
- Upload date:
- Size: 38.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bdf0dc1d8fc17a13899d82290c674331a3e33ce2bce79a6990deca32646a97d8 |
| MD5 | 397360bce9099664a7eb952d1575f48a |
| BLAKE2b-256 | 152853910dcf65426a9b2e9606d32a2fcc3bfac132cabb214c7e0df0ffc63390 |
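If you download either file manually, you can compare its SHA256 digest against the published value. The sketch below uses only the standard `hashlib` module; `sha256_of` and `verify` are hypothetical helper names, and the example assumes the file sits in the current directory.

```python
import hashlib

# SHA256 digests copied from the file hash tables above.
PUBLISHED_SHA256 = {
    "papercutter-3.1.0.tar.gz": "2ee877fa9ba6143ed6ff81bd774b8998d0c10127c7527140e631a302ef1f994a",
    "papercutter-3.1.0-py3-none-any.whl": "bdf0dc1d8fc17a13899d82290c674331a3e33ce2bce79a6990deca32646a97d8",
}


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large downloads do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected):
    """Return True when the file's SHA256 matches the expected hex digest."""
    return sha256_of(path) == expected.lower()
```

For example, `verify("papercutter-3.1.0.tar.gz", PUBLISHED_SHA256["papercutter-3.1.0.tar.gz"])` should return `True` for an intact download.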