PDF to Data Factory: Extract structured data from academic papers

Papercutter Factory

Automated Evidence Synthesis Pipeline for Research

Papercutter Factory is a local, batch-processing pipeline designed to transform unstructured academic PDF collections into structured datasets and systematic review reports.

It addresses the tooling gap between reference managers (Zotero, Mendeley) and analysis software (R, Stata). Unlike generic "Chat with PDF" tools, Papercutter is built for extraction reliability, reproducibility, and scale. It uses Docling to convert PDFs into structured Markdown and JSON before applying LLM-based extraction, so that tabular data and complex layouts are preserved.

Key Capabilities

- Pipeline Architecture: A resumable workflow. Processing status is tracked per file in inventory.json, so large batches can be paused and resumed without data loss.
- High-Fidelity Digitization: Uses IBM's Docling to convert PDFs into structured Markdown, preserving table geometry and section hierarchy better than plain text extraction.
- Schema Validation: Test an extraction schema on a sample of papers. Each extracted data point is returned with a source quote so its accuracy can be checked.
- Book Summarization: Process entire books or handbooks with chapter detection, extraction, and synthesis into formatted PDF reports.
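The per-file status tracking described above can be pictured with a short sketch. This is not Papercutter's implementation: the flat `{filename: status}` layout, the `process_batch` helper, and the `convert` callback are all hypothetical, chosen only to show why writing status after every file makes a batch resumable.

```python
import json
from pathlib import Path


def load_inventory(path: Path) -> dict:
    """Return the saved status map, or an empty one on the first run."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {}


def process_batch(pdf_dir: Path, inventory_path: Path, convert) -> list[str]:
    """Process every PDF not yet marked done; return the names handled in this run.

    `convert` is a hypothetical callback standing in for the real conversion step.
    """
    inventory = load_inventory(inventory_path)
    handled = []
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        if inventory.get(pdf.name) == "done":
            continue  # finished in an earlier run, so skip it
        try:
            convert(pdf)
            inventory[pdf.name] = "done"
        except Exception as exc:
            # Record the failure so the file is retried on the next run.
            inventory[pdf.name] = f"failed: {exc}"
        # Persist after every file so an interruption loses at most one item.
        inventory_path.write_text(json.dumps(inventory, indent=2), encoding="utf-8")
        handled.append(pdf.name)
    return handled
```

Calling `process_batch` a second time on the same directory skips every file already marked `done`, which is the behaviour the pause-and-resume claim relies on.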

Installation

pip install "papercutter[full]"

System Requirements:

- Python 3.10 or higher
- API Access: Requires OPENAI_API_KEY environment variable (or Anthropic API key)
- For PDF Reports: LaTeX installation (MacTeX, TeX Live, or MiKTeX)
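Before starting a long batch it can help to check these requirements up front. The sketch below is one way to do that with the standard library. The `preflight` function is hypothetical, `ANTHROPIC_API_KEY` is an assumed variable name (the requirements list names only `OPENAI_API_KEY`), and checking for `pdflatex` is an assumption about which LaTeX engine the report step calls.

```python
import os
import shutil
import sys


def preflight(env=None, python_version=None, which=shutil.which) -> list[str]:
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    version = sys.version_info[:2] if python_version is None else python_version
    problems = []
    if version < (3, 10):
        problems.append("Python 3.10 or higher is required")
    # ANTHROPIC_API_KEY is an assumed name; the README names only OPENAI_API_KEY.
    if not (env.get("OPENAI_API_KEY") or env.get("ANTHROPIC_API_KEY")):
        problems.append("no LLM API key found in the environment")
    # pdflatex is an assumed engine; LaTeX is only needed for PDF reports.
    if which("pdflatex") is None:
        problems.append("pdflatex not on PATH (PDF reports need LaTeX)")
    return problems
```

The `env`, `python_version`, and `which` parameters exist only so the check can be exercised without changing the real environment. Calling `preflight()` with no arguments inspects the current interpreter and shell.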

Modular Installation:

pip install papercutter             # Core only
pip install "papercutter[docling]"  # PDF processing
pip install "papercutter[llm]"      # LLM extraction
pip install "papercutter[report]"   # PDF report generation

Workflow Overview

The system operates in four phases to ensure data integrity.

1. Ingest (Digitization)

Converts raw PDFs into structured Markdown and extracts tables.

papercutter ingest ./pdfs/

Process: Scans directory, runs Docling conversion (PDF -> Markdown + Tables)
Output: markdown/, tables/, figures/, inventory.json

2. Configure (Schema Definition)

Generates an extraction schema by analyzing paper abstracts.

papercutter configure

Process: Samples papers and uses an LLM to propose extraction fields
Output: columns.yaml

Example columns.yaml:

columns:
  - key: sample_size
    description: "The total number of observations (N). Exclude year ranges."
    type: integer
  - key: estimation_method
    description: "The primary statistical strategy (e.g. DiD, RDD, OLS)."
    type: string
  - key: treatment_effect
    description: "The extracted coefficient for the main treatment."
    type: float
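To make the role of the `type` field concrete, here is a minimal sketch of how raw extracted values could be checked against the schema above. It is an illustration, not Papercutter's validation code: the schema is restated as a Python list to avoid a YAML dependency, and `CASTS` and `coerce_record` are hypothetical names.

```python
# The three columns from the example schema, restated as plain Python data.
COLUMNS = [
    {"key": "sample_size", "type": "integer"},
    {"key": "estimation_method", "type": "string"},
    {"key": "treatment_effect", "type": "float"},
]

# Assumed mapping from schema type names to Python constructors.
CASTS = {"integer": int, "string": str, "float": float}


def coerce_record(raw: dict, columns=COLUMNS) -> tuple[dict, list[str]]:
    """Cast each schema field to its declared type and collect any problems."""
    clean, errors = {}, []
    for col in columns:
        key, cast = col["key"], CASTS[col["type"]]
        value = raw.get(key)
        if value is None:
            clean[key] = None
            errors.append(f"{key}: missing")
            continue
        try:
            clean[key] = cast(value)
        except (TypeError, ValueError):
            # Keep the column but blank the value, and report what was seen.
            clean[key] = None
            errors.append(f"{key}: expected {col['type']}, got {value!r}")
    return clean, errors
```

A year range such as `"1990-2010"` in `sample_size` fails the integer cast and is reported rather than stored, which is the kind of mistake the column description ("Exclude year ranges") warns about.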

3. Grind (Extraction)

Executes LLM-based extraction for all papers.

papercutter grind

Process: Extracts metadata, narrative fields, and custom schema fields
Output: extractions.json

4. Report (Synthesis)

Compiles final artifacts for analysis and reading.

papercutter report

Outputs:
- matrix.csv: Flattened dataset ready for R, Stata, or pandas
- review.pdf: LaTeX document with structured summaries
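As a quick sanity check on the flattened dataset, the file can be read with Python's standard `csv` module. The sketch assumes that matrix.csv has a header row whose column names match the keys in columns.yaml (for example `estimation_method`). The README does not spell out the file layout, so treat the column name and the `summarize_matrix` helper as hypothetical.

```python
import csv
from collections import Counter
from pathlib import Path


def summarize_matrix(path: Path, column: str = "estimation_method") -> Counter:
    """Count how many rows report each non-empty value of `column`."""
    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)  # first row is taken as the header
        return Counter(row[column] for row in reader if row.get(column))
```

For example, `summarize_matrix(Path("matrix.csv")).most_common()` would list the estimation methods in the batch by frequency, under the layout assumption above.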

Condensed Mode (for appendix tables):

papercutter report --condensed

Book Summarization Pipeline

Process entire books, textbooks, or handbooks with chapter-level analysis.

# 1. Detect chapters from PDF bookmarks
papercutter book index ./book.pdf

# 2. Extract chapter text
papercutter book extract

# 3. Summarize each chapter with LLM
papercutter book grind

# 4. Generate formatted PDF report
papercutter book report

Output: output/book_summary.pdf, with one page per chapter. The report includes:

- Main thesis and unique insights
- Key evidence and counterexamples
- Key terms and definitions
- Book-level synthesis with themes and intellectual journey
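The four book commands can also be driven from a script that stops at the first failing step. The sketch below uses only the commands listed above. The `run_steps` helper is hypothetical, and it assumes the CLI signals failure through a non-zero exit code, which the README does not state.

```python
import subprocess

# The four book commands, in the order given in this section.
BOOK_STEPS = [
    ["papercutter", "book", "index", "./book.pdf"],
    ["papercutter", "book", "extract"],
    ["papercutter", "book", "grind"],
    ["papercutter", "book", "report"],
]


def run_steps(steps=BOOK_STEPS, run=subprocess.run) -> int:
    """Run each command in order; return how many completed before a failure."""
    completed = 0
    for command in steps:
        result = run(command)
        # Assumption: a non-zero exit code means the step failed.
        if result.returncode != 0:
            break
        completed += 1
    return completed
```

A return value of 4 means every step finished. Anything lower identifies the step to investigate before rerunning.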

Project Structure

my_project/
├── pdfs/                   # Raw PDF repository
├── markdown/               # Docling-converted Markdown
├── tables/                 # Extracted tables (JSON)
├── figures/                # Extracted figures
├── columns.yaml            # Extraction schema
├── inventory.json          # Processing status tracker
├── extractions.json        # Extracted data
├── matrix.csv              # Final dataset
└── review.pdf              # Compiled report

License

MIT License. Open for academic and commercial use.

Release history

- 3.1.0: Jan 12, 2026
- 3.0.2: Jan 12, 2026
- 3.0.1: Jan 12, 2026
- 3.0.0: Jan 12, 2026 (this version)
- 2.0.2: Jan 9, 2026
- 2.0.1: Jan 9, 2026
- 2.0.0: Jan 9, 2026
- 1.2.0: Jan 9, 2026
- 1.1.0: Jan 8, 2026

Download files

Source Distribution

papercutter-3.0.0.tar.gz (259.9 kB), uploaded Jan 12, 2026 (Source)

Built Distribution

papercutter-3.0.0-py3-none-any.whl (39.3 kB), uploaded Jan 12, 2026 (Python 3)

File details

Details for the file papercutter-3.0.0.tar.gz.

File metadata

Download URL: papercutter-3.0.0.tar.gz
Upload date: Jan 12, 2026
Size: 259.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.6

File hashes

Hashes for papercutter-3.0.0.tar.gz
Algorithm	Hash digest
SHA256	`3583053c440417853102f0e0085ecb632c59a1848cfb21b00d30d53b4827c068`
MD5	`f718f13399f7eab702f6294a7b65792d`
BLAKE2b-256	`04005fba5b462199848d1d4a1c751f97c4b25b45aad6922f6122b9aad92fa9f8`
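
A downloaded archive can be checked against the SHA256 digest listed above with Python's `hashlib`. This is a minimal sketch: `sha256_of` and `matches` are hypothetical helper names, and the expected value is copied from the table for papercutter-3.0.0.tar.gz.

```python
import hashlib
from pathlib import Path

# SHA256 digest for papercutter-3.0.0.tar.gz, copied from the table above.
EXPECTED_SHA256 = "3583053c440417853102f0e0085ecb632c59a1848cfb21b00d30d53b4827c068"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large archives are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """Return True when the file's digest equals the expected hex string."""
    return sha256_of(path) == expected.lower()
```

For example, `matches(Path("papercutter-3.0.0.tar.gz"))` should be true for an intact download of the source distribution.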

File details

Details for the file papercutter-3.0.0-py3-none-any.whl.

File metadata

Download URL: papercutter-3.0.0-py3-none-any.whl
Upload date: Jan 12, 2026
Size: 39.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.6

File hashes

Hashes for papercutter-3.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4b27b3a851afdd33e28a60de1e35c6672968649aa55ba3bff59b76eec758be63`
MD5	`f3884a6084c4c87016cef41bbf1141d3`
BLAKE2b-256	`c23cd17cf257e26593144743a1abc1ff79a7d09000de9030b6ddc8aa09f5968f`
