Papercutter
Extract structured data from academic papers into analysis-ready datasets.
Papercutter converts PDF collections into structured Markdown using Docling, then applies LLM-based extraction to produce CSV datasets and LaTeX reports suitable for systematic reviews and meta-analyses.
Installation
pip install papercutter[full]
Requires Python 3.10+ and an OpenAI or Anthropic API key.
Optional extras:
pip install papercutter[docling] # PDF processing only
pip install papercutter[llm] # LLM extraction only
pip install papercutter[report] # Report generation only
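Because the LLM steps need an API key, a small preflight check before a long run can save a failed job. The sketch below assumes the keys are supplied through the `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` environment variables, which are the defaults read by the official OpenAI and Anthropic SDKs. Whether Papercutter reads these exact variables is an assumption, and `available_keys` is a hypothetical helper, not part of Papercutter.

```python
import os
from typing import Mapping

# Assumed variable names: the defaults used by the official OpenAI and
# Anthropic Python SDKs. Papercutter's own lookup may differ.
KEY_VARS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def available_keys(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of the API key variables that are set and non-empty."""
    return [name for name in KEY_VARS if env.get(name)]


if __name__ == "__main__":
    found = available_keys()
    if found:
        print("API key found in:", ", ".join(found))
    else:
        print("No API key found; set one of:", ", ".join(KEY_VARS))
```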
Usage
1. Ingest
Convert PDFs to Markdown and extract tables.
papercutter ingest ./pdfs/
Output: markdown/, tables/, figures/, inventory.json
2. Configure
Generate an extraction schema from paper abstracts.
papercutter configure
Output: columns.yaml
columns:
  - key: sample_size
    description: "Total observations (N)"
    type: integer
  - key: method
    description: "Estimation strategy (DiD, RDD, OLS, etc.)"
    type: string
  - key: effect
    description: "Main treatment coefficient"
    type: float
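To illustrate what the `type` field in the schema implies, here is a minimal sketch that casts raw extracted values to the declared types. The schema is copied from the example above as a Python literal to avoid a YAML dependency. The `CASTERS` mapping and the `coerce_row` helper are hypothetical and only show one plausible way to apply the types; they are not Papercutter's implementation.

```python
from typing import Any, Callable

# Schema mirroring the columns.yaml example, written as a Python literal.
COLUMNS = [
    {"key": "sample_size", "description": "Total observations (N)", "type": "integer"},
    {"key": "method", "description": "Estimation strategy (DiD, RDD, OLS, etc.)", "type": "string"},
    {"key": "effect", "description": "Main treatment coefficient", "type": "float"},
]

# Assumed mapping from schema type names to Python converters.
CASTERS: dict[str, Callable[[Any], Any]] = {
    "integer": int,
    "string": str,
    "float": float,
}


def coerce_row(raw: dict[str, Any], columns: list[dict[str, str]] = COLUMNS) -> dict[str, Any]:
    """Cast each value to its declared type.

    Missing or unparseable values become None so that one bad field
    does not discard the rest of the row.
    """
    row: dict[str, Any] = {}
    for col in columns:
        value = raw.get(col["key"])
        try:
            row[col["key"]] = None if value is None else CASTERS[col["type"]](value)
        except (ValueError, TypeError):
            row[col["key"]] = None
    return row


if __name__ == "__main__":
    print(coerce_row({"sample_size": "1200", "method": "DiD", "effect": "0.034"}))
```

Mapping failures to `None` rather than raising is a design choice for this sketch: in a review dataset, a missing cell is usually easier to audit than a dropped paper.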
3. Grind
Extract data from all papers.
papercutter grind
Output: extractions.json
4. Report
Generate analysis outputs.
papercutter report # matrix.csv + review.pdf
papercutter report --condensed # appendix format
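The four steps above can also be chained from a script. The sketch below only wraps the commands already shown (`ingest`, `configure`, `grind`, `report`) with `subprocess.run`; the helper names `pipeline_commands` and `run_pipeline` are hypothetical, and the script assumes the `papercutter` executable is on your `PATH`.

```python
import subprocess
from typing import Callable


def pipeline_commands(pdf_dir: str) -> list[list[str]]:
    """Return the four CLI invocations from the Usage section, in order."""
    return [
        ["papercutter", "ingest", pdf_dir],
        ["papercutter", "configure"],
        ["papercutter", "grind"],
        ["papercutter", "report"],
    ]


def run_pipeline(pdf_dir: str, run: Callable[..., object] = subprocess.run) -> None:
    """Run each step in order; check=True stops at the first failing command."""
    for cmd in pipeline_commands(pdf_dir):
        run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline("./pdfs/")
```

If you want to review or edit `columns.yaml` before extraction, run the steps separately instead of chaining them.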
Book Pipeline
Process books or handbooks with chapter-level extraction.
papercutter book index ./book.pdf # Detect chapters
papercutter book extract # Extract text
papercutter book grind # Summarize chapters
papercutter book report # Generate PDF
Output: output/book_summary.pdf
Output Files
| File | Description |
|---|---|
| inventory.json | Processing status for each PDF |
| columns.yaml | Extraction schema definition |
| extractions.json | Extracted data per paper |
| matrix.csv | Flattened dataset for R/Stata/Python |
| review.pdf | LaTeX report with structured summaries |
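As a rough illustration of using `matrix.csv` from Python, the sketch below reads one numeric column with the standard library. It assumes the CSV has a header row whose column names match the schema keys (here `effect`) and that missing values appear as empty cells; the actual layout of `matrix.csv` may differ. The sample rows and the `read_effects` helper are made up for the example.

```python
import csv
import io
from statistics import mean
from typing import TextIO


def read_effects(csv_file: TextIO) -> list[float]:
    """Return the non-empty values of the (assumed) `effect` column as floats."""
    reader = csv.DictReader(csv_file)
    return [float(row["effect"]) for row in reader if row.get("effect")]


# Hypothetical sample standing in for matrix.csv; the second row has no effect value.
sample = io.StringIO(
    "paper,sample_size,method,effect\n"
    "paper_a,1200,DiD,0.5\n"
    "paper_b,800,RDD,\n"
    "paper_c,450,OLS,0.25\n"
)

effects = read_effects(sample)
print(f"{len(effects)} papers with an effect, mean = {mean(effects)}")
```

With a real file, open it with `open("matrix.csv", newline="")` and pass the file object to `read_effects`.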
License
MIT