PDF to Data Factory: Extract structured data from academic papers
Papercutter Factory
Automated Evidence Synthesis Pipeline for Research
Papercutter Factory is a local, batch-processing pipeline designed to transform unstructured academic PDF collections into structured datasets and systematic review reports.
It addresses the specific tooling gap between reference managers (Zotero, Mendeley) and analysis software (R, Stata). Unlike generic "Chat with PDF" tools, Papercutter is architected for extraction reliability, reproducibility, and scale. It utilizes Docling to convert PDFs into structured Markdown and JSON before applying LLM-based extraction, ensuring tabular data and complex layouts are preserved.
Key Capabilities
- Pipeline Architecture: A resumable, checkpointed workflow. Processing status is tracked per file, allowing large batches to be paused and resumed without data loss.
- High-Fidelity Digitization: Utilizes IBM's Docling to convert PDFs into structured Markdown, preserving table geometry and section hierarchy better than standard text extraction.
- Schema Validation: Test extraction schemas on a sample of papers; every extracted data point is paired with a source quote so accuracy can be verified.
- Book Summarization: Process entire books or handbooks with chapter detection, extraction, and synthesis into formatted PDF reports.
Installation
```shell
pip install papercutter[full]
```
System Requirements:
- Python 3.10 or higher
- API Access: requires the `OPENAI_API_KEY` environment variable (or an Anthropic API key)
- For PDF reports: a LaTeX installation (MacTeX, TeX Live, or MiKTeX)
Modular Installation:
```shell
pip install papercutter           # Core only
pip install papercutter[docling]  # PDF processing
pip install papercutter[llm]      # LLM extraction
pip install papercutter[report]   # PDF report generation
```
Workflow Overview
The system operates in four phases to ensure data integrity.
1. Ingest (Digitization)
Converts raw PDFs into structured Markdown and extracts tables.
```shell
papercutter ingest ./pdfs/
```
- Process: Scans the directory and runs Docling conversion (PDF -> Markdown + tables)
- Output: `markdown/`, `tables/`, `figures/`, `inventory.json`
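Because status is tracked per file in `inventory.json`, a stalled batch can be inspected before resuming. A minimal sketch, assuming a hypothetical inventory layout (the real schema is whatever Papercutter writes):

```python
import json

# Hypothetical inventory.json payload; the real schema produced by
# `papercutter ingest` may differ -- treat this shape as an assumption.
RAW = '{"smith_2021.pdf": {"status": "done"}, "lee_2023.pdf": {"status": "pending"}}'


def pending_files(inventory):
    """Return the names of files that have not finished processing."""
    return sorted(name for name, record in inventory.items()
                  if record.get("status") != "done")


if __name__ == "__main__":
    print(pending_files(json.loads(RAW)))  # ['lee_2023.pdf']
```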
2. Configure (Schema Definition)
Generates an extraction schema by analyzing paper abstracts.
```shell
papercutter configure
```
- Process: Samples papers and uses LLM to propose extraction fields
- Output: `columns.yaml`
Example `columns.yaml`:

```yaml
columns:
  - key: sample_size
    description: "The total number of observations (N). Exclude year ranges."
    type: integer
  - key: estimation_method
    description: "The primary statistical strategy (e.g. DiD, RDD, OLS)."
    type: string
  - key: treatment_effect
    description: "The extracted coefficient for the main treatment."
    type: float
```
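A schema like this is easy to sanity-check before a long extraction run. A minimal sketch, with the column entries shown as already-parsed Python dicts (real parsing would use a YAML library) and the allowed type names as an assumption:

```python
# Allowed type names are an assumption for illustration; Papercutter's
# actual accepted types may differ.
ALLOWED_TYPES = {"integer", "string", "float", "boolean"}


def validate_columns(columns):
    """Return a list of human-readable problems found in the schema."""
    errors = []
    seen = set()
    for col in columns:
        key = col.get("key")
        if not key:
            errors.append("column missing 'key'")
        elif key in seen:
            errors.append(f"duplicate key: {key}")
        seen.add(key)
        if col.get("type") not in ALLOWED_TYPES:
            errors.append(f"{key}: unknown type {col.get('type')!r}")
    return errors


if __name__ == "__main__":
    columns = [
        {"key": "sample_size", "type": "integer"},
        {"key": "estimation_method", "type": "string"},
    ]
    print(validate_columns(columns))  # []
```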
3. Grind (Extraction)
Executes LLM-based extraction for all papers.
```shell
papercutter grind
```
- Process: Extracts metadata, narrative fields, and custom schema fields
- Output: `extractions.json`
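Downstream, the per-paper extractions can be flattened into one row per paper (which is essentially what the report phase does to build `matrix.csv`). A sketch under an assumed JSON layout; the real structure produced by `papercutter grind` may differ:

```python
import json

# Hypothetical extractions.json shape, keyed by paper id -- an assumption
# for illustration only.
RAW = """
{
  "smith_2021": {"sample_size": 1200, "estimation_method": "DiD", "treatment_effect": 0.15},
  "lee_2023": {"sample_size": 800, "estimation_method": "RDD", "treatment_effect": 0.03}
}
"""


def flatten(extractions):
    """Turn the per-paper mapping into a list of flat rows, one per paper."""
    return [{"paper": paper, **fields} for paper, fields in extractions.items()]


rows = flatten(json.loads(RAW))
```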
4. Report (Synthesis)
Compiles final artifacts for analysis and reading.
```shell
papercutter report
```
- Outputs:
  - `matrix.csv`: Flattened dataset ready for R/Stata/pandas
  - `review.pdf`: LaTeX document with structured summaries
Condensed Mode (for appendix tables):
```shell
papercutter report --condensed
```
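Once `matrix.csv` exists, analysis needs nothing beyond standard CSV tooling. A minimal sketch using the Python standard library, with a hypothetical excerpt of the file (the actual columns follow your `columns.yaml`):

```python
import csv
import io

# Hypothetical matrix.csv excerpt -- an assumption for illustration;
# real column names come from columns.yaml.
RAW = """paper,sample_size,estimation_method,treatment_effect
smith_2021,1200,DiD,0.15
lee_2023,800,RDD,0.03
"""

rows = list(csv.DictReader(io.StringIO(RAW)))
effects = [float(r["treatment_effect"]) for r in rows]
mean_effect = sum(effects) / len(effects)

if __name__ == "__main__":
    print(f"{mean_effect:.3f}")  # 0.090
```

The same file loads directly into R (`read.csv`), Stata (`import delimited`), or pandas (`read_csv`).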
Book Summarization Pipeline
Process entire books, textbooks, or handbooks with chapter-level analysis.
```shell
# 1. Detect chapters from PDF bookmarks
papercutter book index ./book.pdf

# 2. Extract chapter text
papercutter book extract

# 3. Summarize each chapter with LLM
papercutter book grind

# 4. Generate formatted PDF report
papercutter book report
```
Output: `output/book_summary.pdf`, with one page per chapter including:
- Main thesis and unique insights
- Key evidence and counterexamples
- Key terms and definitions
- Book-level synthesis with themes and intellectual journey
Project Structure
```text
my_project/
├── pdfs/              # Raw PDF repository
├── markdown/          # Docling-converted Markdown
├── tables/            # Extracted tables (JSON)
├── figures/           # Extracted figures
├── columns.yaml       # Extraction schema
├── inventory.json     # Processing status tracker
├── extractions.json   # Extracted data
├── matrix.csv         # Final dataset
└── review.pdf         # Compiled report
```
License
MIT License. Open for academic and commercial use.