Transform academic PDFs into structured literature notes and critical-thinking canvases for Obsidian
PhD Deep Read Workflow
Transform academic PDFs into structured literature notes and critical-thinking canvases for Obsidian using AI-assisted analysis.
🎯 What is PhD Deep Read?
PhD Deep Read is a workflow that helps researchers and PhD students process academic literature efficiently. It transforms raw PDFs into:
- Structured literature notes following a comprehensive academic template
- Critical-thinking canvases with 9 interconnected nodes for deep analysis
- Extracted text and images using a smart Text-First decision tree
- Claude Code-assisted analysis for high-quality note generation
Perfect for literature reviews, dissertation research, and systematic knowledge building in Obsidian.
✨ Key Features
🚀 Smart PDF Extraction
- Text-First Decision Tree: Pre-scans PDFs, uses fast PyMuPDF extraction for searchable text (80%+ of academic PDFs)
- Intelligent OCR Fallback: Only uses Tesseract OCR for scanned/complex pages
- Image Extraction: Preserves figures, tables, and diagrams as embedded images
- Metadata Tracking: Records extraction method per page for transparency
📝 AI-Assisted Note Generation
- Comprehensive Template: 175+ line .clauderules template with YAML frontmatter, Dataview callouts, and academic sections
- Claude Code Integration: Uses Claude's reasoning for critical analysis and synthesis
- Wikilink Rich: Extensive linking of concepts, methods, proteins, and diseases
- Obsidian Ready: Fully compatible with Dataview plugin and Obsidian Canvas
🧠 Critical-Thinking Canvases
- 9 Interconnected Nodes: Core argument, assumptions, evidence assessment, alternative explanations, methodological critique, personal relevance, future directions, critical questions, hypothesis center
- Visual Analysis: Spatial arrangement facilitates deep critical thinking
- JSON Format: Compatible with Obsidian Canvas plugin
⚙️ Automated Workflow
- Batch Processing: Process entire directories of PDFs overnight
- Quality Verification: Automated checks for format consistency and quality
- Modular Commands: Separate commands for each workflow stage
- Configurable: Adjust thresholds, templates, and output formats
🏗️ Architecture
graph TD
A[PDF Input] --> B{Text-First Decision Tree}
B --> C[Searchable?]
C -->|Yes| D[PyMuPDF<br/>Fast Text Extraction]
C -->|No| E[Tesseract OCR<br/>Scanned Pages]
D --> F[Markdown + Images + Metadata]
E --> F
F --> G[Claude Code Analysis]
G --> H[Structured Literature Note]
F --> I[Canvas Template]
I --> J[Critical-Thinking Canvas]
H --> K[Obsidian Integration]
J --> K
K --> L[Verified Output]
📦 Installation
Prerequisites
- Python 3.10+ and pip
- Tesseract OCR (optional, for scanned PDFs)
- Claude Code (for note generation)
Quick Install
# Clone the repository
git clone https://github.com/Helen-insights/phd-deepread-workflow.git
cd phd-deepread-workflow
# Install Python dependencies
pip install -r requirements.txt
# Install Tesseract OCR (optional but recommended)
# macOS:
brew install tesseract
# Ubuntu/Debian:
sudo apt install tesseract-ocr
# Install Python OCR wrapper
pip install pytesseract pillow
# Verify installation
python scripts/verify.py --extract test_paper.pdf
As a Claude Code Skill
# Copy the skill to your Claude Code skills directory
cp -r phd-deepread-workflow ~/.claude/skills/phd-deepread
# Use the skill in Claude Code
phd-deepread setup
phd-deepread extract paper.pdf
Docker Installation
For consistent environments or containerized deployment:
# Build the Docker image
docker build -t phd-deepread .
# Run a single extraction
docker run -v $(pwd)/input:/input -v $(pwd)/output:/output \
phd-deepread extract /input/paper.pdf --output /output/
# Or use Docker Compose
docker-compose up --build
See Dockerfile and docker-compose.yml for advanced configurations.
🚀 Quick Start
Process a Single Paper
# 1. Extract text and images
phd-deepread extract paper.pdf --output markdown_output/
# 2. Generate structured literature note (requires Claude Code)
phd-deepread generate markdown_output/paper/ --template templates/.clauderules
# 3. Create critical-thinking canvas
phd-deepread canvas markdown_output/paper/ --output structured_notes/
Batch Process a Directory
# Process all PDFs in a directory
phd-deepread batch papers/ --output literature-notes/
# Limit to first N pages (for testing)
phd-deepread batch papers/ --output literature-notes/ --max-pages 3
Interactive Guide
# Show workflow guide with decision-tree visualization
phd-deepread guide
📖 Detailed Usage
Stage 1: Text-First PDF Extraction
The extraction uses a smart decision tree:
# --threshold: min characters for a page to count as "searchable"
# --percentage: use PyMuPDF if 80%+ of pages are searchable
# --lang: OCR language
# --disable-image-extraction: skip images if needed
phd-deepread extract paper.pdf \
--output markdown_output/ \
--threshold 100 \
--percentage 0.8 \
--lang eng \
--disable-image-extraction
Output Structure:
markdown_output/paper/
├── paper.md # Raw markdown with embedded images
├── paper_meta.json # Metadata, extraction methods per page
├── blocks.json # Block-level segmentation data
└── _page_*_*.png # Extracted images
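The Text-First decision in Stage 1 can be sketched in plain Python. This is a minimal illustration, not the project's actual implementation: the function names are hypothetical, and it assumes one reading of the flags, namely that a page is "searchable" when its extracted text has at least `threshold` characters, that the whole document takes the fast PyMuPDF path when the share of searchable pages reaches `percentage`, and that otherwise only the non-searchable pages fall back to OCR.

```python
from typing import List


def classify_pages(page_texts: List[str], threshold: int = 100) -> List[bool]:
    """Return True for each page whose stripped text has at least `threshold` characters."""
    return [len(text.strip()) >= threshold for text in page_texts]


def choose_methods(
    page_texts: List[str], threshold: int = 100, percentage: float = 0.8
) -> List[str]:
    """Pick an extraction method per page (hypothetical helper, see assumptions above)."""
    searchable = classify_pages(page_texts, threshold)
    if not searchable:
        return []
    fraction = sum(searchable) / len(searchable)
    if fraction >= percentage:
        # Assumption: a mostly digital PDF uses the fast text path for every page.
        return ["pymupdf"] * len(searchable)
    # Assumption: otherwise only pages below the threshold are sent to OCR.
    return ["pymupdf" if ok else "ocr" for ok in searchable]
```

In a real run the page texts would come from the PDF pre-scan; here they are passed in as strings so the rule can be tried without any PDF library installed.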
Stage 2: Structured Note Generation
Uses Claude Code with the .clauderules template:
phd-deepread generate markdown_output/paper/ \
--template templates/.clauderules \
--output structured_notes/ \
--skeleton # Generate placeholder note without Claude
Template Features:
- YAML frontmatter (category, tags, citekey, status, dateread)
- Dataview callouts ([!Citation], [!Synthesis], [!Metadata], [!Abstract])
- 7 academic sections with critical analysis
- Wikilinks for concepts, methods, tools
Stage 3: Critical-Thinking Canvas
Creates 9-node JSON Canvas for deep analysis:
phd-deepread canvas markdown_output/paper/ \
--output structured_notes/ \
--template templates/critical-thinking.canvas
Canvas Nodes:
- core-argument - Primary claim and logical chain
- assumptions - Explicit, implicit, questionable assumptions
- evidence-assessment - Strength of evidence
- alternative-explanations - Competing hypotheses
- methodological-critique - Study limitations
- personal-relevance - Connections to your research
- future-directions - Research goals
- critical-questions-enhanced - Hypothesis testing questions
- hypothesis-center - Hypothesis re-evaluation
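To show what a 9-node canvas file might look like on disk, here is a minimal sketch that builds a JSON Canvas structure from the node names listed above. The node and edge fields (`id`, `type`, `x`, `y`, `width`, `height`, `text`, `fromNode`, `toNode`) follow the JSON Canvas format; the 3x3 grid layout, the sizes, the edge pattern, and the `build_canvas` name are assumptions for illustration and may differ from templates/critical-thinking.canvas.

```python
import json

# Node ids from the list above, ordered so that hypothesis-center
# falls in the middle cell of a 3x3 grid (an assumed layout).
NODE_IDS = [
    "core-argument", "assumptions", "evidence-assessment",
    "alternative-explanations", "hypothesis-center", "methodological-critique",
    "personal-relevance", "future-directions", "critical-questions-enhanced",
]


def build_canvas(node_ids=NODE_IDS, width=400, height=300, gap=100):
    """Return a dict with JSON Canvas 'nodes' and 'edges' for the given ids."""
    nodes = []
    for index, node_id in enumerate(node_ids):
        row, col = divmod(index, 3)
        nodes.append({
            "id": node_id,
            "type": "text",
            "x": col * (width + gap),
            "y": row * (height + gap),
            "width": width,
            "height": height,
            "text": f"## {node_id}\n\n",
        })
    # Assumption: every outer node is linked from the central hypothesis node.
    edges = [
        {"id": f"edge-{node_id}", "fromNode": "hypothesis-center", "toNode": node_id}
        for node_id in node_ids
        if node_id != "hypothesis-center"
    ]
    return {"nodes": nodes, "edges": edges}


canvas_json = json.dumps(build_canvas(), indent=2)
```

Writing `canvas_json` to a file with a `.canvas` extension inside a vault should let Obsidian open it as a canvas.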
Stage 4: Verification
Quality checks and pattern matching:
phd-deepread verify --all literature-notes/
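The kind of pattern matching a verification pass could perform is sketched below. It only checks for the frontmatter keys and Dataview callouts named under Template Features; the `check_note` function and its messages are hypothetical, and the real checks in scripts/verify.py may be broader or stricter.

```python
import re

REQUIRED_KEYS = ("category", "tags", "citekey", "status", "dateread")
REQUIRED_CALLOUTS = ("Citation", "Synthesis", "Metadata", "Abstract")


def check_note(text: str) -> list:
    """Return a list of human-readable problems found in a literature note."""
    problems = []
    # Frontmatter is the block between the first two '---' lines at the top.
    match = re.match(r"^---\n(.*?)\n---\n", text, flags=re.DOTALL)
    frontmatter = match.group(1) if match else ""
    if not match:
        problems.append("missing YAML frontmatter")
    for key in REQUIRED_KEYS:
        if not re.search(rf"^{key}\s*:", frontmatter, flags=re.MULTILINE):
            problems.append(f"missing frontmatter key: {key}")
    for name in REQUIRED_CALLOUTS:
        if f"[!{name}]" not in text:
            problems.append(f"missing callout: [!{name}]")
    return problems
```

An empty list means the note passed these particular checks; anything else names what is missing.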
🔧 Configuration
Custom Templates
Modify templates/.clauderules for different academic fields:
category: literaturenote
tags:
- #{{Field}} # e.g., #Neuroscience, #Bioinformatics
- #{{Topic1}}
- #{{Topic2}}
citekey: {{camelCase: FirstAuthor+FirstWordOfTitle+Year}}
# ... rest of template
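The citekey placeholder describes a camelCase pattern built from the first author, the first word of the title, and the year. One possible reading of that pattern is sketched here; `make_citekey` is a hypothetical helper, and details such as lowercasing the author, keeping only ASCII letters, and not skipping leading articles are assumptions rather than documented behavior.

```python
import re


def make_citekey(first_author: str, title: str, year: int) -> str:
    """Build a camelCase key such as 'smithDeep2024' (assumed format)."""
    # Keep ASCII letters only; accented characters would be dropped by this sketch.
    author = re.sub(r"[^A-Za-z]", "", first_author).lower()
    words = re.findall(r"[A-Za-z]+", title)
    first_word = words[0].capitalize() if words else ""
    return f"{author}{first_word}{year}"
```

If you already use Zotero citation keys, those can be used directly instead of generating a key this way.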
Extraction Parameters
Adjust in scripts/extract.py or via command line:
| Parameter | Default | Description |
|---|---|---|
| --threshold | 100 | Min characters to consider page searchable |
| --percentage | 0.8 | Use PyMuPDF if at least this fraction of pages is searchable |
| --lang | eng | Tesseract OCR language code |
| --force-ocr | false | Force OCR for all pages |
| --force-text | false | Force PyMuPDF for all pages |
| --no-ocr | false | Disable OCR entirely |
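For readers adapting the parameters in their own scripts, the table can be mirrored with `argparse` as in the sketch below. The flag names and defaults come from the table; the parser name, the positional `pdf` argument, and treating the three override flags as mutually exclusive are assumptions, and scripts/extract.py may wire them differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Return a parser mirroring the extraction parameters (illustrative only)."""
    parser = argparse.ArgumentParser(prog="extract-sketch")
    parser.add_argument("pdf", help="path to the input PDF")
    parser.add_argument("--threshold", type=int, default=100,
                        help="min characters to consider a page searchable")
    parser.add_argument("--percentage", type=float, default=0.8,
                        help="fraction of searchable pages needed for PyMuPDF")
    parser.add_argument("--lang", default="eng",
                        help="Tesseract OCR language code")
    # Assumption: the override flags conflict, so at most one may be given.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--force-ocr", action="store_true")
    mode.add_argument("--force-text", action="store_true")
    mode.add_argument("--no-ocr", action="store_true")
    return parser
```

Calling `build_parser().parse_args(["paper.pdf", "--lang", "deu"])` yields a namespace with the defaults from the table and the language overridden.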
Output Directories
Configure in config/config.yaml:
# Directory paths
directories:
markdown_output: "./markdown_output"
structured_notes: "./structured_literature_notes"
canvas_templates: "./canvas_templates"
generation_prompts: "./generation_prompts"
# Extraction thresholds
extraction:
searchable_threshold: 100
searchable_percentage: 0.8
ocr_language: "eng"
See config/config.yaml for all available options.
📊 Performance
| Scenario | Time per Paper | Accuracy | Best For |
|---|---|---|---|
| All searchable text | 10-30 minutes | ~100% | Modern digital PDFs |
| Mixed (80% text, 20% OCR) | 30-60 minutes | ~98% | Typical academic PDFs |
| All OCR required | 60-120 minutes | ~95% | Scanned documents |
Typical workflow: 27-80 minutes per paper total (extraction + note generation + canvas creation)
🛠️ Troubleshooting
Common Issues
Tesseract OCR not installed:
# macOS
brew install tesseract
# Ubuntu/Debian
sudo apt install tesseract-ocr
# Python wrapper
pip install pytesseract pillow
PyMuPDF missing:
pip install PyMuPDF
Missing images in extraction:
- Check PDF contains extractable images
- Ensure --disable-image-extraction is not set
- Verify PyMuPDF version supports image extraction
Virtual environment issues:
# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate # Linux/macOS
# or .venv\Scripts\activate # Windows
pip install -r requirements.txt
Claude Code integration:
- Ensure Claude Code is installed and authenticated
- Check skill is properly installed in ~/.claude/skills/
- Verify skill has necessary permissions
Debug Mode
Run with verbose output:
phd-deepread extract paper.pdf --verbose --debug
Check logs in logs/ directory (if configured).
🔄 Integration with Other Tools
Zotero
- Use Zotero citation keys as citekey in frontmatter
- Export PDFs from Zotero to processing directory
- Import generated notes back into Zotero as linked files
Obsidian
- Notes ready for Dataview queries
- Canvases work with Obsidian Canvas plugin
- Wikilinks connect to existing or future notes
- Use with Obsidian Git for version control
Reference Managers
- BibTeX export for generated citations
- RIS format integration (planned)
- DOI lookup and metadata fetching (planned)
📚 Examples
See the examples/ directory for:
- example-output.md - Complete structured literature note
- example-canvas.canvas - 9-node critical-thinking canvas
- test_paper.pdf - Sample PDF for testing
Example output from the batch test is in batch_test_output/.
🧪 Testing
Run the test suite:
# Run all tests
python -m pytest tests/ -v
# Test specific component
python scripts/verify.py --extract test_paper.pdf
python scripts/verify.py --note examples/example-output.md
python scripts/verify.py --canvas examples/example-canvas.canvas
Note: For the extraction test, use your own PDF file named test_paper.pdf in the examples directory, or replace the command with a path to your test PDF.
Continuous Integration: Tests run automatically on GitHub Actions for each commit and pull request. See .github/workflows/test.yml.
🤝 Contributing
We welcome contributions! Please see CONTRIBUTING.md for details.
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit changes (git commit -m 'Add amazing feature')
- Push to branch (git push origin feature/amazing-feature)
- Open a Pull Request
Development Setup
# Clone and install development dependencies
git clone https://github.com/Helen-insights/phd-deepread-workflow.git
cd phd-deepread-workflow
pip install -r requirements-dev.txt
pre-commit install # Install git hooks
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
- Claude Code for AI-assisted note generation
- PyMuPDF for fast PDF text extraction
- Tesseract OCR for optical character recognition
- Obsidian for the excellent note-taking platform
- All contributors who help improve this workflow
📞 Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: heleninsights@gmail.com
📈 Roadmap
- Web UI for configuration and monitoring
- Integration with more reference managers (Mendeley, EndNote)
- Advanced layout detection for complex PDFs
- Multi-language support for OCR and analysis
- Plugin system for custom templates and processors
- Cloud processing for large PDF collections
- Mobile app for on-the-go paper processing
Made with ❤️ for the academic community
If this workflow helps your research, consider giving it a ⭐ on GitHub!