Transform academic PDFs into structured literature notes and critical-thinking canvases for Obsidian


PhD Deep Read Workflow

Transform academic PDFs into structured literature notes and critical-thinking canvases for Obsidian using AI-assisted analysis.

🎯 What is PhD Deep Read?

PhD Deep Read is a workflow that helps researchers and PhD students process academic literature. It turns raw PDFs into:

Structured literature notes following a comprehensive academic template
Critical-thinking canvases with 9 interconnected nodes for deep analysis
Extracted text and images using a smart Text-First decision tree
Claude Code-assisted analysis for high-quality note generation

Perfect for literature reviews, dissertation research, and systematic knowledge building in Obsidian.

✨ Key Features

🚀 Smart PDF Extraction

Text-First Decision Tree: Pre-scans PDFs, uses fast PyMuPDF extraction for searchable text (80%+ of academic PDFs)
Intelligent OCR Fallback: Only uses Tesseract OCR for scanned/complex pages
Image Extraction: Preserves figures, tables, and diagrams as embedded images
Metadata Tracking: Records extraction method per page for transparency

📝 AI-Assisted Note Generation

Comprehensive Template: 175+ line .clauderules template with YAML frontmatter, Dataview callouts, and academic sections
Claude Code Integration: Uses Claude's reasoning for critical analysis and synthesis
Wikilink Rich: Extensive linking of concepts, methods, proteins, and diseases
Obsidian Ready: Fully compatible with Dataview plugin and Obsidian Canvas

🧠 Critical-Thinking Canvases

9 Interconnected Nodes: Core argument, assumptions, evidence assessment, alternative explanations, methodological critique, personal relevance, future directions, critical questions, hypothesis center
Visual Analysis: Spatial arrangement facilitates deep critical thinking
JSON Format: Compatible with Obsidian Canvas plugin

⚙️ Automated Workflow

Batch Processing: Process entire directories of PDFs overnight
Quality Verification: Automated checks for format consistency and quality
Modular Commands: Separate commands for each workflow stage
Configurable: Adjust thresholds, templates, and output formats

🏗️ Architecture

graph TD
    A[PDF Input] --> B{Text-First Decision Tree}
    B --> C[Searchable?]
    C -->|Yes| D[PyMuPDF<br/>Fast Text Extraction]
    C -->|No| E[Tesseract OCR<br/>Scanned Pages]

    D --> F[Markdown + Images + Metadata]
    E --> F

    F --> G[Claude Code Analysis]
    G --> H[Structured Literature Note]
    F --> I[Canvas Template]
    I --> J[Critical-Thinking Canvas]

    H --> K[Obsidian Integration]
    J --> K

    K --> L[Verified Output]

📦 Installation

Prerequisites

Python 3.10+ and pip
Tesseract OCR (optional, for scanned PDFs)
Claude Code (for note generation)

Quick Install

# Clone the repository
git clone https://github.com/Helen-insights/phd-deepread-workflow.git
cd phd-deepread-workflow

# Install Python dependencies
pip install -r requirements.txt

# Install Tesseract OCR (optional but recommended)
# macOS:
brew install tesseract
# Ubuntu/Debian:
sudo apt install tesseract-ocr

# Install Python OCR wrapper
pip install pytesseract pillow

# Verify installation
python scripts/verify.py --extract test_paper.pdf

As a Claude Code Skill

# Copy the skill to your Claude Code skills directory
cp -r phd-deepread-workflow ~/.claude/skills/phd-deepread

# Use the skill in Claude Code
phd-deepread setup
phd-deepread extract paper.pdf

Docker Installation

For consistent environments or containerized deployment:

# Build the Docker image
docker build -t phd-deepread .

# Run a single extraction
docker run -v "$(pwd)/input:/input" -v "$(pwd)/output:/output" \
  phd-deepread extract /input/paper.pdf --output /output/

# Or use Docker Compose
docker-compose up --build

See Dockerfile and docker-compose.yml for advanced configurations.

🚀 Quick Start

Process a Single Paper

# 1. Extract text and images
phd-deepread extract paper.pdf --output markdown_output/

# 2. Generate structured literature note (requires Claude Code)
phd-deepread generate markdown_output/paper/ --template templates/.clauderules

# 3. Create critical-thinking canvas
phd-deepread canvas markdown_output/paper/ --output structured_notes/

Batch Process a Directory

# Process all PDFs in a directory
phd-deepread batch papers/ --output literature-notes/

# Limit to first N pages (for testing)
phd-deepread batch papers/ --output literature-notes/ --max-pages 3

Interactive Guide

# Show workflow guide with decision-tree visualization
phd-deepread guide

📖 Detailed Usage

Stage 1: Text-First PDF Extraction

The extraction uses a smart decision tree:

# --threshold: min characters for a page to count as "searchable"
# --percentage: use PyMuPDF if at least 80% of pages are searchable
# --lang: OCR language
# --disable-image-extraction: skip images if needed
phd-deepread extract paper.pdf \
  --output markdown_output/ \
  --threshold 100 \
  --percentage 0.8 \
  --lang eng \
  --disable-image-extraction
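The decision logic behind these two thresholds can be sketched in a few lines of Python. This is a minimal illustration, not the code in scripts/extract.py: the names `ExtractionPlan` and `plan_extraction` are hypothetical, the per-page character counts are assumed to come from a pre-scan (for example the length of the text PyMuPDF returns for each page), and falling back to OCR for every page when the document is below the percentage cutoff is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ExtractionPlan:
    """Result of the pre-scan: overall searchable fraction and a method per page."""
    searchable_fraction: float
    methods: list[str]  # one entry per page: "pymupdf" or "ocr"


def plan_extraction(page_char_counts, threshold=100, percentage=0.8):
    """Pick an extraction method per page from pre-scan character counts.

    threshold:  minimum characters for a page to count as searchable
    percentage: fraction of searchable pages needed to prefer text extraction
    """
    if not page_char_counts:
        return ExtractionPlan(0.0, [])

    searchable = [count >= threshold for count in page_char_counts]
    fraction = sum(searchable) / len(searchable)

    if fraction >= percentage:
        # Mostly searchable: extract text directly, OCR only the weak pages.
        methods = ["pymupdf" if ok else "ocr" for ok in searchable]
    else:
        # Assumption: below the cutoff, treat the whole document as scanned.
        methods = ["ocr"] * len(searchable)

    return ExtractionPlan(fraction, methods)


plan = plan_extraction([1500, 2100, 40, 1800, 1900])
print(plan.methods)  # ['pymupdf', 'pymupdf', 'ocr', 'pymupdf', 'pymupdf']
```

Keeping the decision separate from the extraction itself makes it easy to record the chosen method per page, which is what the metadata file is described as tracking.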

Output Structure:

markdown_output/paper/
├── paper.md              # Raw markdown with embedded images
├── paper_meta.json       # Metadata, extraction methods per page
├── blocks.json          # Block-level segmentation data
└── _page_*_*.png        # Extracted images
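Before moving on to note generation, it can help to confirm that an extraction produced the files listed above. The sketch below only checks file names; it assumes the markdown and metadata files are named after the output directory (as in the `paper/` example), and the helper names `missing_outputs` and `extracted_images` are hypothetical.

```python
from pathlib import Path


def missing_outputs(paper_dir):
    """Return the expected output files that are absent from paper_dir."""
    paper_dir = Path(paper_dir)
    stem = paper_dir.name  # assumption: files are named after the directory
    expected = [f"{stem}.md", f"{stem}_meta.json", "blocks.json"]
    return [name for name in expected if not (paper_dir / name).is_file()]


def extracted_images(paper_dir):
    """Return the sorted names of extracted page images (_page_*_*.png)."""
    return sorted(p.name for p in Path(paper_dir).glob("_page_*_*.png"))


if __name__ == "__main__":
    target = "markdown_output/paper"
    print("missing:", missing_outputs(target))
    print("images:", extracted_images(target))
```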

Stage 2: Structured Note Generation

Uses Claude Code with the .clauderules template:

phd-deepread generate markdown_output/paper/ \
  --template templates/.clauderules \
  --output structured_notes/ \
  --skeleton              # Generate placeholder note without Claude

Template Features:

YAML frontmatter (category, tags, citekey, status, dateread)
Dataview callouts ([!Citation], [!Synthesis], [!Metadata], [!Abstract])
7 academic sections with critical analysis
Wikilinks for concepts, methods, tools

Stage 3: Critical-Thinking Canvas

Creates 9-node JSON Canvas for deep analysis:

phd-deepread canvas markdown_output/paper/ \
  --output structured_notes/ \
  --template templates/critical-thinking.canvas

Canvas Nodes:

core-argument - Primary claim and logical chain
assumptions - Explicit, implicit, questionable assumptions
evidence-assessment - Strength of evidence
alternative-explanations - Competing hypotheses
methodological-critique - Study limitations
personal-relevance - Connections to your research
future-directions - Research goals
critical-questions-enhanced - Hypothesis testing questions
hypothesis-center - Hypothesis re-evaluation
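For orientation, here is a rough sketch of what a 9-node canvas file could look like when built programmatically. It follows the JSON Canvas layout that Obsidian reads (a top-level object with `nodes` and `edges`), but the 3x3 grid, the node sizes, the placeholder text, and the choice to link every node to `hypothesis-center` are assumptions for illustration, not the layout of templates/critical-thinking.canvas.

```python
import json

# Node ids from the list above; hypothesis-center is placed fifth so that it
# lands in the middle cell of a 3x3 grid (an illustrative layout choice).
NODE_IDS = (
    "core-argument", "assumptions", "evidence-assessment",
    "alternative-explanations", "hypothesis-center", "methodological-critique",
    "personal-relevance", "future-directions", "critical-questions-enhanced",
)


def build_canvas(node_ids=NODE_IDS, width=400, height=300, gap=80):
    """Build a JSON Canvas dict with text nodes on a grid and hub-style edges."""
    nodes = []
    for index, node_id in enumerate(node_ids):
        row, col = divmod(index, 3)
        nodes.append({
            "id": node_id,
            "type": "text",
            "text": f"## {node_id}\n\n",
            "x": col * (width + gap),
            "y": row * (height + gap),
            "width": width,
            "height": height,
        })
    edges = [
        {"id": f"edge-{node_id}", "fromNode": "hypothesis-center", "toNode": node_id}
        for node_id in node_ids
        if node_id != "hypothesis-center"
    ]
    return {"nodes": nodes, "edges": edges}


if __name__ == "__main__":
    print(json.dumps(build_canvas(), indent=2))
```

Writing the resulting JSON to a file with a `.canvas` extension inside a vault should be enough for Obsidian to open it as a canvas.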

Stage 4: Verification

Quality checks and pattern matching:

phd-deepread verify --all literature-notes/
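The kind of pattern matching involved can be illustrated with a small stand-alone check. This is a hedged sketch rather than what scripts/verify.py actually does: the function name `check_note` is hypothetical, and the required keys and callouts are simply the ones listed under Template Features.

```python
import re

REQUIRED_KEYS = ("category", "tags", "citekey", "status", "dateread")
REQUIRED_CALLOUTS = ("Citation", "Synthesis", "Metadata", "Abstract")


def check_note(text):
    """Return a list of problems found in a literature note (empty if none)."""
    problems = []

    # Frontmatter is the block between the first pair of '---' lines.
    match = re.match(r"---\n(.*?)\n---\n", text, flags=re.DOTALL)
    frontmatter = match.group(1) if match else ""
    if not match:
        problems.append("missing YAML frontmatter")

    for key in REQUIRED_KEYS:
        if not re.search(rf"^{key}:", frontmatter, flags=re.MULTILINE):
            problems.append(f"missing frontmatter key: {key}")

    for name in REQUIRED_CALLOUTS:
        if f"[!{name}]" not in text:
            problems.append(f"missing callout: [!{name}]")

    return problems
```

A batch check could call `check_note` on every `.md` file in the notes directory and report the files that return a non-empty list.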

🔧 Configuration

Custom Templates

Modify templates/.clauderules for different academic fields:

category: literaturenote
tags:
  - #{{Field}}           # e.g., #Neuroscience, #Bioinformatics
  - #{{Topic1}}
  - #{{Topic2}}
citekey: {{camelCase: FirstAuthor+FirstWordOfTitle+Year}}
# ... rest of template
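The citekey pattern in the template can be made concrete with a short helper. This is one possible reading of `camelCase: FirstAuthor+FirstWordOfTitle+Year`, assuming a lowercase first letter for the author, a capitalized first title word, and the year appended; the function name `make_citekey` is hypothetical, and when Claude fills in the template it may format keys differently.

```python
import re


def make_citekey(first_author, title, year):
    """Build a camelCase citekey: firstAuthor + FirstWordOfTitle + Year."""
    # Keep letters only so names like "O'Brien" stay valid in links.
    author = re.sub(r"[^A-Za-z]", "", first_author)
    words = re.findall(r"[A-Za-z]+", title)
    first_word = words[0] if words else ""
    return f"{author[:1].lower()}{author[1:]}{first_word.capitalize()}{year}"


print(make_citekey("Smith", "Deep learning in proteomics", 2021))  # smithDeep2021
```

If you already manage references in Zotero, reusing its citation keys instead keeps the notes and the library aligned.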

Extraction Parameters

Adjust in scripts/extract.py or via command line:

Parameter	Default	Description
`--threshold`	100	Min characters to consider page searchable
`--percentage`	0.8	Use PyMuPDF if at least this fraction of pages is searchable
`--lang`	eng	Tesseract OCR language code
`--force-ocr`	false	Force OCR for all pages
`--force-text`	false	Force PyMuPDF for all pages
`--no-ocr`	false	Disable OCR entirely
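To show how the flags and defaults in this table fit together, here is a small `argparse` sketch. It mirrors the table only; it is not the parser in scripts/extract.py, the names `build_parser` and `validate` are hypothetical, and treating `--force-ocr` as conflicting with `--force-text` and `--no-ocr` is an assumption about how the options interact.

```python
import argparse


def build_parser():
    """Argument parser mirroring the documented extraction flags and defaults."""
    parser = argparse.ArgumentParser(prog="extract-sketch")
    parser.add_argument("pdf", help="path to the input PDF")
    parser.add_argument("--threshold", type=int, default=100,
                        help="min characters to consider a page searchable")
    parser.add_argument("--percentage", type=float, default=0.8,
                        help="fraction of searchable pages needed to use PyMuPDF")
    parser.add_argument("--lang", default="eng", help="Tesseract OCR language code")
    parser.add_argument("--force-ocr", action="store_true", help="OCR all pages")
    parser.add_argument("--force-text", action="store_true", help="PyMuPDF for all pages")
    parser.add_argument("--no-ocr", action="store_true", help="disable OCR entirely")
    return parser


def validate(args):
    """Reject option combinations that contradict each other (assumed rules)."""
    if args.force_ocr and (args.force_text or args.no_ocr):
        raise ValueError("--force-ocr cannot be combined with --force-text or --no-ocr")
    if not 0.0 <= args.percentage <= 1.0:
        raise ValueError("--percentage must be between 0 and 1")
    return args


if __name__ == "__main__":
    print(validate(build_parser().parse_args()))
```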

Output Directories

Configure in config/config.yaml:

# Directory paths
directories:
  markdown_output: "./markdown_output"
  structured_notes: "./structured_literature_notes"
  canvas_templates: "./canvas_templates"
  generation_prompts: "./generation_prompts"

# Extraction thresholds
extraction:
  searchable_threshold: 100
  searchable_percentage: 0.8
  ocr_language: "eng"

See config/config.yaml for all available options.

📊 Performance

Scenario	Time per Paper	Accuracy	Best For
All searchable text	10-30 minutes	~100%	Modern digital PDFs
Mixed (80% text, 20% OCR)	30-60 minutes	~98%	Typical academic PDFs
All OCR required	60-120 minutes	~95%	Scanned documents

Typical workflow: 27-80 minutes per paper total (extraction + note generation + canvas creation)
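These per-paper figures make it easy to estimate how long a batch will take. The sketch below assumes papers are processed one after another and simply scales the 27-80 minute range; the function name `batch_hours` is hypothetical.

```python
def batch_hours(paper_count, minutes_low=27, minutes_high=80):
    """Return (low, high) estimated hours for a sequential batch."""
    return (paper_count * minutes_low / 60, paper_count * minutes_high / 60)


low, high = batch_hours(10)
print(f"10 papers: {low:.1f} to {high:.1f} hours")  # 10 papers: 4.5 to 13.3 hours
```

By this estimate, an overnight run comfortably covers around ten papers; larger collections may need to be split across several runs.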

🛠️ Troubleshooting

Common Issues

Tesseract OCR not installed:

# macOS
brew install tesseract

# Ubuntu/Debian
sudo apt install tesseract-ocr

# Python wrapper
pip install pytesseract pillow

PyMuPDF missing:

pip install PyMuPDF

Missing images in extraction:

Check that the PDF contains extractable images
Ensure --disable-image-extraction is not set
Verify that your PyMuPDF version supports image extraction

Virtual environment issues:

# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate  # Linux/macOS
# or .venv\Scripts\activate  # Windows
pip install -r requirements.txt

Claude Code integration:

Ensure Claude Code is installed and authenticated
Check that the skill is installed in ~/.claude/skills/
Verify that the skill has the necessary permissions

Debug Mode

Run with verbose output:

phd-deepread extract paper.pdf --verbose --debug

Check logs in logs/ directory (if configured).

🔄 Integration with Other Tools

Zotero

Use Zotero citation keys as citekey in frontmatter
Export PDFs from Zotero to processing directory
Import generated notes back into Zotero as linked files

Obsidian

Notes ready for Dataview queries
Canvases work with Obsidian Canvas plugin
Wikilinks connect to existing or future notes
Use with Obsidian Git for version control

Reference Managers

BibTeX export for generated citations
RIS format integration (planned)
DOI lookup and metadata fetching (planned)

📚 Examples

See the examples/ directory for:

example-output.md - Complete structured literature note
example-canvas.canvas - 9-node critical-thinking canvas
test_paper.pdf - Sample PDF for testing

Example output from the batch test is in batch_test_output/.

🧪 Testing

Run the test suite:

# Run all tests
python -m pytest tests/ -v

# Test specific component
python scripts/verify.py --extract test_paper.pdf
python scripts/verify.py --note examples/example-output.md
python scripts/verify.py --canvas examples/example-canvas.canvas

Note: For the extraction test, use your own PDF file named test_paper.pdf in the examples directory, or replace the command with a path to your test PDF.

Continuous Integration: Tests run automatically on GitHub Actions for each commit and pull request. See .github/workflows/test.yml.

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for details.

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit changes (git commit -m 'Add amazing feature')
Push to branch (git push origin feature/amazing-feature)
Open a Pull Request

Development Setup

# Clone and install development dependencies
git clone https://github.com/Helen-insights/phd-deepread-workflow.git
cd phd-deepread-workflow
pip install -r requirements-dev.txt
pre-commit install  # Install git hooks

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Claude Code for AI-assisted note generation
PyMuPDF for fast PDF text extraction
Tesseract OCR for optical character recognition
Obsidian for the excellent note-taking platform
All contributors who help improve this workflow

📞 Support

Issues: GitHub Issues
Discussions: GitHub Discussions
Email: heleninsights@gmail.com

📈 Roadmap

Web UI for configuration and monitoring
Integration with more reference managers (Mendeley, EndNote)
Advanced layout detection for complex PDFs
Multi-language support for OCR and analysis
Plugin system for custom templates and processors
Cloud processing for large PDF collections
Mobile app for on-the-go paper processing

Made with ❤️ for the academic community

If this workflow helps your research, consider giving it a ⭐ on GitHub!

Project details


Release history

0.2.0

May 6, 2026

0.1.6

Mar 6, 2026

0.1.5

Mar 6, 2026

0.1.4

Mar 6, 2026

0.1.3

Mar 6, 2026

0.1.2

Mar 6, 2026

0.1.1

Mar 5, 2026

This version

0.1.0

Mar 5, 2026

Download files

Source Distribution

phd_deepread_workflow-0.1.0.tar.gz (51.4 kB)

Uploaded Mar 5, 2026 Source

Built Distribution

phd_deepread_workflow-0.1.0-py3-none-any.whl (24.9 kB)

Uploaded Mar 5, 2026 Python 3

File details

Details for the file phd_deepread_workflow-0.1.0.tar.gz.

File metadata

Download URL: phd_deepread_workflow-0.1.0.tar.gz
Upload date: Mar 5, 2026
Size: 51.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for phd_deepread_workflow-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`ef23b13e9091ba0cf102fb9bf728223c65686ce74dbffa0d9074d7a42d02076d`
MD5	`d6159e4d50d7d8b80fe1bdc1c6046e19`
BLAKE2b-256	`0cf070b7b14916205200019ba31c80f264900a93d21076f367a4e28965a6cb72`

Provenance

The following attestation bundles were made for phd_deepread_workflow-0.1.0.tar.gz:

Publisher: publish.yml on heleninsights-dot/phd-deepread-workflow

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: phd_deepread_workflow-0.1.0.tar.gz
- Subject digest: ef23b13e9091ba0cf102fb9bf728223c65686ce74dbffa0d9074d7a42d02076d
- Sigstore transparency entry: 1043988020
- Sigstore integration time: Mar 5, 2026
Source repository:
- Permalink: heleninsights-dot/phd-deepread-workflow@35efaf2eec5e8880c2814e55cd91b58d6a6f2f69
- Branch / Tag: refs/heads/main
- Owner: https://github.com/heleninsights-dot
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@35efaf2eec5e8880c2814e55cd91b58d6a6f2f69
- Trigger Event: push

File details

Details for the file phd_deepread_workflow-0.1.0-py3-none-any.whl.

File metadata

Download URL: phd_deepread_workflow-0.1.0-py3-none-any.whl
Upload date: Mar 5, 2026
Size: 24.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for phd_deepread_workflow-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e9f2cbae9ce19675421b0362b8cff9d1a3f90ef1b98fa5a1166dc24e6000f664`
MD5	`d055c3c6546ef3181c6c1ba3d091e6d7`
BLAKE2b-256	`1f28f4fb2bd4b0079e483d5be31dcaf3331bb1e792c95b204da7340857ba8cac`

Provenance

The following attestation bundles were made for phd_deepread_workflow-0.1.0-py3-none-any.whl:

Publisher: publish.yml on heleninsights-dot/phd-deepread-workflow

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: phd_deepread_workflow-0.1.0-py3-none-any.whl
- Subject digest: e9f2cbae9ce19675421b0362b8cff9d1a3f90ef1b98fa5a1166dc24e6000f664
- Sigstore transparency entry: 1043988078
- Sigstore integration time: Mar 5, 2026
Source repository:
- Permalink: heleninsights-dot/phd-deepread-workflow@35efaf2eec5e8880c2814e55cd91b58d6a6f2f69
- Branch / Tag: refs/heads/main
- Owner: https://github.com/heleninsights-dot
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@35efaf2eec5e8880c2814e55cd91b58d6a6f2f69
- Trigger Event: push
