# ai-deepresearch-flow

From documents to deep research insight — automatically.

Workflow tools for paper extraction, review, and research automation.
## The Core Pain Points
- OCR Chaos: Raw markdown from OCR tools is often broken -- tables drift, formulas break, and references are non-clickable.
- Translation Nightmares: Translating technical papers often destroys code blocks, LaTeX formulas, and table structures.
- Information Overload: Extracting structured insights (authors, venues, summaries) from hundreds of PDFs manually is impossible.
- Context Switching: Managing PDFs, summaries, and translations in different windows kills focus.
## The Solution
DeepResearch Flow provides a unified pipeline to Repair, Translate, Extract, and Serve your research library.
## Key Features
- Smart Extraction: Turn unstructured Markdown into schema-enforced JSON (summaries, metadata, Q&A) using LLMs (OpenAI, Claude, Gemini, etc.).
- Precision Translation: Translate OCR Markdown to Chinese/Japanese (`.zh.md`, `.ja.md`) while freezing formulas, code, tables, and references. No more broken layout.
- Local Knowledge DB: A high-performance local Web UI to browse papers with Split View (Source vs. Translated vs. Summary), full-text search, and multi-dimensional filtering.
- Snapshot + API Serve: Build a production-ready SQLite snapshot with static assets, then serve a read-only JSON API for a separate frontend.
- Coverage Compare: Compare JSON/PDF/Markdown/Translated datasets to find missing artifacts and export CSV reports.
- Matched Export: Extract matched JSON or translated Markdown after coverage checks.
- OCR Post-Processing: Automatically fix broken references (`[1]` -> `[^1]`), merge split paragraphs, and standardize layouts.
## Quick Start

### 1) Installation

```shell
# Recommended: using uv for speed
uv pip install deepresearch-flow

# Or standard pip
pip install deepresearch-flow
```
### 2) Configuration

Set up your LLM providers. We support OpenAI, Claude, Gemini, Ollama, and more.

```shell
cp config.example.toml config.toml
# Edit config.toml to add your API keys (e.g., env:OPENAI_API_KEY)
```
Multiple keys per provider are supported. Keys rotate per request and enter a short cooldown on retryable errors. You can also provide quota metadata per key:
```toml
api_keys = [
    "env:OPENAI_API_KEY",
    { key = "env:OPENAI_API_KEY_2", quota_duration = 18000, reset_time = "2026-01-23 18:04:25 +0800 CST", quota_error_tokens = ["exceed", "quota"] },
]
```
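The rotation-with-cooldown behavior described above can be sketched roughly as follows. This is an illustrative sketch only, not the library's actual implementation; `KeyRotator` and its method names are hypothetical:

```python
import time

class KeyRotator:
    """Round-robin over API keys, skipping keys in cooldown (illustrative sketch)."""

    def __init__(self, keys, cooldown_seconds=30.0):
        self.keys = list(keys)
        self.cooldown_seconds = cooldown_seconds
        self.cooldown_until = {k: 0.0 for k in self.keys}  # key -> usable-again timestamp
        self.index = 0

    def next_key(self):
        """Return the next usable key, rotating once per request."""
        for _ in range(len(self.keys)):
            key = self.keys[self.index % len(self.keys)]
            self.index += 1
            if time.monotonic() >= self.cooldown_until[key]:
                return key
        raise RuntimeError("all keys are cooling down")

    def report_retryable_error(self, key):
        """Put a key into a short cooldown after a retryable error (e.g., quota hit)."""
        self.cooldown_until[key] = time.monotonic() + self.cooldown_seconds
```

A key that reports a retryable error is skipped until its cooldown expires, while the remaining keys keep serving requests.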
### 3) The "Zero to Hero" Workflow

#### Step 1: Extract Insights

Scan a folder of markdown files and extract structured summaries.

```shell
uv run deepresearch-flow paper extract \
  --input ./docs \
  --model openai/gpt-4o-mini \
  --prompt-template deep_read
```
#### Step 1.1: Verify & Retry Missing Fields

Validate extracted JSON against the template schema and retry only the missing items.

```shell
uv run deepresearch-flow paper db verify \
  --input-json ./paper_infos.json \
  --prompt-template deep_read \
  --output-json ./paper_verify.json

uv run deepresearch-flow paper extract \
  --input ./docs \
  --model openai/gpt-4o-mini \
  --prompt-template deep_read \
  --retry-list-json ./paper_verify.json
```
#### Step 2: Translate Safely

Translate papers to Chinese, protecting LaTeX and tables.

```shell
uv run deepresearch-flow translator translate \
  --input ./docs \
  --target-lang zh \
  --model openai/gpt-4o-mini \
  --fix-level moderate
```
#### Step 2.5: Run OCR on PDFs/Images (Optional)

If your source documents are PDFs or scanned images, run OCR first to produce markdown:

```shell
# 1) Copy and edit the OCR config
cp ocr.example.toml ocr.toml
# Set your PaddleOCR token: export PADDLE_OCR_TOKEN=xxx

# 2) Run OCR on a directory of PDFs
uv run deepresearch-flow recognize ocr ./pdfs --config ocr.toml --output-dir ./ocr_output
```
Output follows the mineru layout (`full.md` + `images/` per document), compatible with the repair steps below.

See `ocr.example.toml` for backend configuration. Currently supports PaddleOCR; more backends planned.
#### Step 3: Repair OCR Outputs (Recommended)

Recommended sequence to stabilize markdown before serving:

```shell
# 1) Fix OCR markdown (auto-detects JSON if inputs are .json)
uv run deepresearch-flow recognize fix \
  --input ./docs \
  --in-place

# 2) Fix LaTeX formulas
uv run deepresearch-flow recognize fix-math \
  --input ./docs \
  --model openai/gpt-4o-mini \
  --in-place

# 3) Fix Mermaid diagrams
uv run deepresearch-flow recognize fix-mermaid \
  --input ./paper_outputs \
  --json \
  --model openai/gpt-4o-mini \
  --in-place

# (optional) Retry failed formulas/diagrams only
uv run deepresearch-flow recognize fix-math \
  --input ./docs \
  --model openai/gpt-4o-mini \
  --retry-failed
uv run deepresearch-flow recognize fix-mermaid \
  --input ./paper_outputs \
  --json \
  --model openai/gpt-4o-mini \
  --retry-failed

# 4) Fix again to normalize formatting
uv run deepresearch-flow recognize fix \
  --input ./docs \
  --in-place
```
#### Step 4: Serve Your Database

Launch a local UI to read and manage your papers.

```shell
uv run deepresearch-flow paper db serve \
  --input paper_infos.json \
  --md-root ./docs \
  --md-translated-root ./docs \
  --host 127.0.0.1
```
#### Step 4.5: Build Snapshot + Serve API + Frontend (Recommended)

Build a production snapshot (SQLite + static assets), serve a read-only API, and run the frontend.

```shell
# 1) Build snapshot + static export
uv run deepresearch-flow paper db snapshot build \
  --input ./paper_infos.json \
  --bibtex ./papers.bib \
  --md-root ./docs \
  --md-translated-root ./docs \
  --pdf-root ./pdfs \
  --output-db ./dist/paper_snapshot.db \
  --static-export-dir ./dist/paper-static

# 2) Serve static assets (CORS required for ZIP export)
npx http-server ./dist/paper-static -p 8002 --cors

# 3) Serve API (read-only)
PAPER_DB_STATIC_BASE_URL=http://127.0.0.1:8002 \
uv run deepresearch-flow paper db api serve \
  --snapshot-db ./dist/paper_snapshot.db \
  --cors-origin http://127.0.0.1:5173 \
  --host 127.0.0.1 --port 8001

# 4) Run frontend
cd frontend
npm install
VITE_PAPER_DB_API_BASE=http://127.0.0.1:8001/api/v1 \
VITE_PAPER_DB_STATIC_BASE=http://127.0.0.1:8002 \
npm run dev
```
#### Step 4.6: Supplement Missing Templates (Optional)

If some papers are missing specific templates (e.g., `deep_read`), you can identify the gaps and extract the missing templates to fill them:

```shell
# 1) Check missing templates in snapshot
uv run deepresearch-flow paper db snapshot show-missing \
  --snapshot-db ./dist/paper_snapshot.db

# 2) Export papers missing a specific template (with file paths for extraction)
uv run deepresearch-flow paper db snapshot export-missing \
  --snapshot-db ./dist/paper_snapshot.db \
  --type template \
  --template deep_read \
  --static-export-dir ./dist/paper-static \
  --output ./missing_deep_read.json \
  --txt-output ./missing_ids.txt \
  --output-paths ./extractable_paths.txt

# 3) Extract missing templates (only for papers with source markdown)
uv run deepresearch-flow paper extract \
  --model openai/gpt-4o-mini \
  --prompt-template deep_read \
  --input-list ./extractable_paths.txt \
  --output ./deep_read_supplement.json

# 4) Merge with existing paper_infos.json
uv run deepresearch-flow paper db merge library \
  --inputs ./paper_infos.json \
  --inputs ./deep_read_supplement.json \
  --output ./paper_infos_complete.json

# 5) Rebuild snapshot with complete data
uv run deepresearch-flow paper db snapshot build \
  --input ./paper_infos_complete.json \
  --bibtex ./papers.bib \
  --md-root ./docs \
  --md-translated-root ./docs \
  --pdf-root ./pdfs \
  --output-db ./dist/paper_snapshot_complete.db \
  --static-export-dir ./dist/paper-static-complete
```
#### Alternative 1: Supplement Missing Content (Templates/Translations)

If existing papers are missing templates or translations, supplement them without rebuilding:

```shell
# Supplement missing templates for existing papers (in-place)
uv run deepresearch-flow paper db snapshot supplement \
  --snapshot-db ./dist/paper_snapshot.db \
  --static-export-dir ./dist/paper-static \
  -i ./deep_read_supplement.json \
  --in-place

# Or output to a new location
uv run deepresearch-flow paper db snapshot supplement \
  --snapshot-db ./dist/paper_snapshot.db \
  --static-export-dir ./dist/paper-static \
  -i ./deep_read_supplement.json \
  --output-db ./dist/paper_snapshot_supplemented.db \
  --output-static-dir ./dist/paper-static-supplemented
```
Notes:
- `--md-root` and `--md-translated-root` are optional for `snapshot supplement`.
- Use them only when you want to resolve/copy markdown files from local source directories.
#### Alternative 2: Add New Papers

If you have completely new papers to add to the snapshot:

```shell
# Add new papers to existing snapshot (in-place)
uv run deepresearch-flow paper db snapshot update \
  --snapshot-db ./dist/paper_snapshot.db \
  --static-export-dir ./dist/paper-static \
  -i ./new_papers.json \
  -b ./new_papers.bib \
  --md-root ./docs \
  --md-translated-root ./docs_translated \
  --pdf-root ./pdfs \
  --in-place

# Or output to a new location
uv run deepresearch-flow paper db snapshot update \
  --snapshot-db ./dist/paper_snapshot.db \
  --static-export-dir ./dist/paper-static \
  -i ./new_papers.json \
  -b ./new_papers.bib \
  --md-root ./docs \
  --output-db ./dist/paper_snapshot_updated.db \
  --output-static-dir ./dist/paper-static-updated
```
Differences:
- `supplement`: only adds missing templates/translations for existing papers (skips new papers).
- `update`: only adds completely new papers (skips existing papers).
## Upgrade Legacy Snapshot Schema (DOI/BibTeX)

### Recommended: Migrate Schema In-Place (No Data Loss)

If your existing snapshot was built before DOI/BibTeX support, use the `migrate` command to upgrade the schema without losing any papers:

```shell
# In-place migration with timestamped backup
uv run deepresearch-flow paper db snapshot migrate \
  --snapshot-db ./dist/paper_snapshot.db \
  --bibtex ./papers.bib \
  --static-export-dir ./dist/paper-static \
  --in-place

# Or copy to a new location
uv run deepresearch-flow paper db snapshot migrate \
  --snapshot-db ./dist/paper_snapshot.db \
  --bibtex ./papers.bib \
  --static-export-dir ./dist/paper-static \
  --output-db ./dist/paper_snapshot_v2.db

# Schema-only migration (no BibTeX enrichment)
uv run deepresearch-flow paper db snapshot migrate \
  --snapshot-db ./dist/paper_snapshot.db \
  --in-place
```
Features:
- No data loss: uses `ALTER TABLE` to upgrade the schema, preserving all papers.
- Timestamped backups: creates `.bak_YYYYMMDD_HHMMSS` backup files automatically.
- BibTeX enrichment: matches papers with BibTeX and extracts DOI metadata.
- Static export update: updates `paper_index.json` with DOI/BibTeX references.
- Beautiful output: rich tables showing schema changes and match statistics.
The `migrate` command will:
- Create a timestamped backup (unless `--no-backup` is used)
- Add a `doi` column to the `paper` table (if missing)
- Create the `paper_bibtex` table (if missing)
- Match papers with BibTeX entries and populate DOI/BibTeX data
- Update the static export index with the new metadata
### Alternative: Rebuild with Previous Snapshot

If you need to rebuild from scratch while preserving identity continuity:

```shell
uv run deepresearch-flow paper db snapshot build \
  --input ./paper_infos_complete.json \
  --bibtex ./papers.bib \
  --output-db ./dist/paper_snapshot_v2.db \
  --static-export-dir ./dist/paper-static-v2 \
  --previous-snapshot-db ./dist/paper_snapshot.db
```
Notes:
- `--md-root`, `--md-translated-root`, and `--pdf-root` are optional for this rebuild.
- If a paper in the current inputs already has DOI/BibTeX, the current input wins; otherwise the data can be inherited from `--previous-snapshot-db`.
- Warning: this approach only includes papers from the input JSON files, so ensure all papers are included to avoid data loss.
## Supplement Missing Translations

If some papers are missing translations (e.g., `zh`), you can export and translate them:

```shell
# 1) Export papers missing Chinese translation (with file paths)
uv run deepresearch-flow paper db snapshot export-missing \
  --snapshot-db ./dist/paper_snapshot.db \
  --type translation \
  --lang zh \
  --static-export-dir ./dist/paper-static \
  --output-paths ./to_translate_paths.txt

# 2) Translate missing papers
uv run deepresearch-flow translator translate \
  --input ./docs \
  --target-lang zh \
  --model openai/gpt-4o-mini \
  --input-list ./to_translate_paths.txt \
  --output-dir ./docs_translated

# 3) Rebuild or supplement snapshot with new translations
uv run deepresearch-flow paper db snapshot build ...
# Or use snapshot supplement if only adding translations
```
Other useful export types:
- `--type source_md`: papers without source markdown
- `--type pdf`: papers without PDF
- `--type translation --lang zh`: papers without Chinese translation
## Incremental PDF Library Workflow

This workflow keeps a growing PDF library in sync without reprocessing everything.

```shell
# 1) Compare processed JSON vs new PDF library to find missing PDFs
uv run deepresearch-flow paper db compare \
  --input-a ./paper_infos.json \
  --pdf-root-b ./pdfs_new \
  --output-only-in-b ./pdfs_todo.txt

# 2) Stage the missing PDFs for OCR
uv run deepresearch-flow paper db transfer-pdfs \
  --input-list ./pdfs_todo.txt \
  --output-dir ./pdfs_todo \
  --copy
# (optional) use --move instead of --copy
# uv run deepresearch-flow paper db transfer-pdfs --input-list ./pdfs_todo.txt --output-dir ./pdfs_todo --move

# 3) OCR the missing PDFs (use your OCR tool; write markdowns to ./md_todo)

# 4) Export matched existing assets against the new PDF library
uv run deepresearch-flow paper db extract \
  --input-json ./paper_infos.json \
  --pdf-root ./pdfs_new \
  --output-json ./paper_infos_matched.json
uv run deepresearch-flow paper db extract \
  --md-source-root ./mds \
  --output-md-root ./mds_matched \
  --pdf-root ./pdfs_new
uv run deepresearch-flow paper db extract \
  --md-translated-root ./translated \
  --output-md-translated-root ./translated_matched \
  --pdf-root ./pdfs_new \
  --lang zh

# 5) Translate + extract summaries for the new OCR markdowns
uv run deepresearch-flow translator translate \
  --input ./md_todo \
  --target-lang zh \
  --model openai/gpt-4o-mini
uv run deepresearch-flow paper extract \
  --input ./md_todo \
  --model openai/gpt-4o-mini

# 6) Merge and serve the new library (multi-input)
uv run deepresearch-flow paper db serve \
  --input ./paper_infos_matched.json \
  --input ./paper_infos_new.json \
  --md-root ./mds_matched \
  --md-root ./md_todo \
  --md-translated-root ./translated_matched \
  --md-translated-root ./md_todo \
  --pdf-root ./pdfs_new
```
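The core of step 1 — diffing the processed library against a PDF folder — amounts to a set difference keyed by file name. A minimal sketch, assuming matching is by file stem (the real `compare` command also uses match metadata; `find_missing_pdfs` is a hypothetical helper):

```python
from pathlib import Path

def find_missing_pdfs(processed_stems, pdf_root):
    """Return PDF paths whose file stem has no processed entry (illustrative sketch).

    processed_stems: iterable of stems already present in the processed JSON.
    pdf_root: directory scanned recursively for *.pdf files.
    """
    processed = {stem.lower() for stem in processed_stems}
    return sorted(
        pdf for pdf in Path(pdf_root).rglob("*.pdf")
        if pdf.stem.lower() not in processed
    )
```

The resulting list plays the role of `pdfs_todo.txt`: only these files need OCR and extraction.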
## Merge Paper JSONs

```shell
# Merge multiple libraries using the same template
uv run deepresearch-flow paper db merge library \
  --inputs ./paper_infos_a.json \
  --inputs ./paper_infos_b.json \
  --output ./paper_infos_merged.json

# Merge multiple templates from the same library (first input wins on shared fields)
uv run deepresearch-flow paper db merge templates \
  --inputs ./simple.json \
  --inputs ./deep_read.json \
  --output ./paper_infos_templates.json
```
Note: `paper db merge` is now split into `merge library` and `merge templates`.
## Merge multiple databases (PDF + Markdown + BibTeX)

```shell
# 1) Copy PDFs into a single folder
rsync -av ./pdfs_a/ ./pdfs_merged/
rsync -av ./pdfs_b/ ./pdfs_merged/

# 2) Copy Markdown folders into a single folder
rsync -av ./md_a/ ./md_merged/
rsync -av ./md_b/ ./md_merged/

# 3) Merge JSON libraries
uv run deepresearch-flow paper db merge library \
  --inputs ./paper_infos_a.json \
  --inputs ./paper_infos_b.json \
  --output ./paper_infos_merged.json

# 4) Merge BibTeX files
uv run deepresearch-flow paper db merge bibtex \
  -i ./library_a.bib \
  -i ./library_b.bib \
  -o ./library_merged.bib
```
## Merge BibTeX files

```shell
uv run deepresearch-flow paper db merge bibtex \
  -i ./library_a.bib \
  -i ./library_b.bib \
  -o ./library_merged.bib
```
Duplicate keys keep the entry with the most fields; on ties, the entry from the earlier input wins.
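That deduplication rule can be illustrated as follows. This is a sketch of the documented policy, not the tool's code; entries are modeled as plain dicts keyed by their citation key:

```python
def merge_bibtex_entries(inputs):
    """Merge ordered lists of BibTeX-like entries (illustrative sketch).

    Policy from the docs: on duplicate keys, keep the entry with the most
    fields; on ties, keep the entry from the earliest input.
    """
    merged = {}
    for entries in inputs:  # inputs ordered as passed on the command line
        for entry in entries:
            key = entry["key"]
            current = merged.get(key)
            # Strictly more fields wins; an equal count keeps the earlier entry.
            if current is None or len(entry) > len(current):
                merged[key] = entry
    return list(merged.values())
```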
## Recommended: Merge templates then filter by BibTeX

```shell
# 1) Merge templates for the same library
uv run deepresearch-flow paper db merge templates \
  --inputs ./deep_read.json \
  --inputs ./simple.json \
  --output ./all.json

# 2) Filter the merged set with BibTeX
uv run deepresearch-flow paper db extract \
  --input-bibtex ./library.bib \
  --json ./all.json \
  --output-json ./library_filtered.json \
  --output-csv ./library_filtered.csv
```
## Deployment (Static CDN)
The recommended production setup is front/back separation:
- Static CDN hosts PDFs/Markdown/images/summaries.
- API server serves a read-only snapshot DB.
- Frontend is a separate static app (Vite build or any static host).
### 1) Build snapshot + static export

```shell
uv run deepresearch-flow paper db snapshot build \
  --input ./paper_infos.json \
  --bibtex ./papers.bib \
  --md-root ./docs \
  --md-translated-root ./docs \
  --pdf-root ./pdfs \
  --output-db ./dist/paper_snapshot.db \
  --static-export-dir /data/paper-static
```
Notes:
- The build host must be able to read the original PDF/Markdown roots.
- The CDN only needs the exported directory (e.g. `/data/paper-static`).
2) Serve static assets with CORS + cache headers (Caddy example)
:8002 {
root * /data/paper-static
encode zstd gzip
@static path /pdf/* /md/* /md_translate/* /images/*
header @static {
Access-Control-Allow-Origin *
Access-Control-Allow-Methods GET,HEAD,OPTIONS
Access-Control-Allow-Headers *
Cache-Control "public, max-age=31536000, immutable"
}
@options method OPTIONS
respond @options 204
file_server
}
### 2.1) Nginx example (API + frontend on one domain, static on another)

```nginx
# Frontend + API (same domain)
server {
    listen 80;
    server_name frontend.example.com;

    root /var/www/paper-frontend;
    index index.html;

    location / {
        try_files $uri /index.html;
    }

    location /api/ {
        proxy_pass http://127.0.0.1:8001;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }

    location ^~ /mcp {
        proxy_pass http://127.0.0.1:8001;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }

    # SSE transport for MCP clients that require Server-Sent Events
    location ^~ /mcp-sse {
        proxy_pass http://127.0.0.1:8001;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_buffering off;
        proxy_cache off;
        proxy_read_timeout 3600s;
        proxy_send_timeout 3600s;
        chunked_transfer_encoding off;
        add_header X-Accel-Buffering no;
    }
}

# Static assets (separate domain)
server {
    listen 80;
    server_name static.example.com;

    root /data/paper-static;

    location / {
        add_header Access-Control-Allow-Origin *;
        add_header Access-Control-Allow-Methods "GET,HEAD,OPTIONS";
        add_header Access-Control-Allow-Headers "*";
        add_header Cache-Control "public, max-age=31536000, immutable";
        try_files $uri =404;
    }
}
```
### 3) Start the API server (read-only)

```shell
export PAPER_DB_STATIC_BASE_URL="https://static.example.com"

uv run deepresearch-flow paper db api serve \
  --snapshot-db /data/paper_snapshot.db \
  --cors-origin https://frontend.example.com \
  --host 0.0.0.0 --port 8001
```
BibTeX metadata endpoint: `GET /api/v1/papers/{paper_id}/bibtex`
- Success payload: `{ paper_id, doi, bibtex_raw, bibtex_key, entry_type }`
- Error codes: `paper_not_found`, `bibtex_not_found`
### 3.1) Admin API (Optional)

Enable the admin API to add or delete papers remotely via Bearer token authentication.

```shell
# Start API server with admin enabled
PAPER_DB_ADMIN_TOKEN=your-secret-token \
uv run deepresearch-flow paper db api serve \
  --snapshot-db /data/paper_snapshot.db \
  --cors-origin https://frontend.example.com \
  --host 0.0.0.0 --port 8001
```
Or pass the token via CLI flag: `--admin-token your-secret-token`

Endpoints (all require an `Authorization: Bearer <token>` header):
- `POST /api/v1/admin/papers` — Batch add papers (up to 200 per request)

  ```shell
  curl -X POST https://api.example.com/api/v1/admin/papers \
    -H "Authorization: Bearer your-secret-token" \
    -H "Content-Type: application/json" \
    -d '{"papers": [{"paper_title": "...", "paper_authors": [...], ...}]}'
  ```

  Response: `{ added, skipped, errors, paper_ids }`

- `DELETE /api/v1/admin/papers/{paper_id}` — Delete a paper and all its relations

  ```shell
  curl -X DELETE https://api.example.com/api/v1/admin/papers/{paper_id} \
    -H "Authorization: Bearer your-secret-token"
  ```

  Response: `{ deleted: true, paper_id }`
The paper JSON format is the same as the `snapshot update` input. Static files (PDF, markdown, images) are not handled by the API — upload them to your CDN separately.
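Because the batch endpoint caps each request at 200 papers, a client should chunk its list before posting. A minimal Python sketch: only the endpoint path, header, and payload shape come from the docs above; `chunk` and `post_batch` are hypothetical helper names, and real code would add retries and error handling:

```python
import json
import urllib.request

def chunk(items, size=200):
    """Split a list into batches no larger than the API's per-request cap."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def post_batch(api_base, token, papers):
    """POST one batch of papers to the admin endpoint (sketch; no retries)."""
    req = urllib.request.Request(
        f"{api_base}/api/v1/admin/papers",
        data=json.dumps({"papers": papers}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        # Documented response shape: { added, skipped, errors, paper_ids }
        return json.load(resp)
```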
### Push from Local DB to Remote

Use `api push` to merge a locally built snapshot DB into a remote deployment:

```toml
# remote.toml
[remote]
api_base_url = "https://api.example.com"
admin_token = "env:PAPER_DB_ADMIN_TOKEN"
batch_size = 100
```

```shell
# Preview what will be pushed
uv run deepresearch-flow paper db api push \
  --snapshot-db ./dist/paper_snapshot.db \
  --static-export-dir ./dist/paper-static \
  --config remote.toml \
  --dry-run

# Push to remote
uv run deepresearch-flow paper db api push \
  --snapshot-db ./dist/paper_snapshot.db \
  --static-export-dir ./dist/paper-static \
  --config remote.toml
```
- `--static-export-dir` is optional — when provided, summary JSON payloads are included so the remote side can build FTS indexes and preview text.
- Duplicate papers (same `paper_id`) are automatically skipped.
- Static files (PDF, markdown, images) are not pushed — sync them to your CDN separately (e.g., `rsync`, `aws s3 sync`).
### 3.2) MCP (FastMCP Streamable HTTP + SSE)

This project exposes MCP servers mounted on the snapshot API:

- Streamable HTTP endpoint: `http://<host>:8001/mcp`
- SSE endpoint: `http://<host>:8001/mcp-sse`
- Transport behavior:
  - `/mcp`: Streamable HTTP via `POST` only (`GET` returns 405)
  - `/mcp-sse`: SSE-enabled transport (supports a `GET` handshake)
- Protocol header: optional `mcp-protocol-version` (`2025-03-26` or `2025-06-18`)
- Static reads: summary/source/translation are served as text content by reading snapshot static assets (local-first via `PAPER_DB_STATIC_EXPORT_DIR`, HTTP fallback via `PAPER_DB_STATIC_BASE`/`PAPER_DB_STATIC_BASE_URL`)
Optional (avoid HTTP fetch by reading exported assets directly on the API host):

```shell
export PAPER_DB_STATIC_EXPORT_DIR=/data/paper-static
```
#### MCP Tools (API functions)

`search_papers(query, limit=10)` — full-text search (relevance-ranked)
- Args:
  - `query` (str): keywords / topic query
  - `limit` (int): number of results (clamped to API max page size)
- Returns: list of `{ paper_id, title, year, venue, snippet_markdown }`

`search_papers_by_keyword(keyword, limit=10)` — facet keyword search
- Args:
  - `keyword` (str): keyword substring
  - `limit` (int): number of results (clamped)
- Returns: list of `{ paper_id, title, year, venue, snippet_markdown }`

`get_paper_metadata(paper_id)` — metadata + available summary templates
- Args:
  - `paper_id` (str)
- Returns: dict with:
  - `paper_id`, `title`, `year`, `venue`
  - `doi`, `arxiv_id`, `openreview_id`, `paper_pw_url`
  - `has_bibtex`
  - `preferred_summary_template`, `available_summary_templates`

`get_paper_bibtex(paper_id)` — persisted BibTeX payload
- Args:
  - `paper_id` (str)
- Returns: dict with:
  - `paper_id`, `doi`, `bibtex_raw`, `bibtex_key`, `entry_type`
- Errors: `paper_not_found`, `bibtex_not_found`

`get_paper_summary(paper_id, template=None, max_chars=None)` — summary JSON as raw text
- Notes:
  - Uses `preferred_summary_template` if `template` is omitted
  - Returns the full JSON content (not a URL)
- Args:
  - `paper_id` (str)
  - `template` (str | null)
  - `max_chars` (int | null): truncation limit
- Returns: JSON string (may include a `[truncated: ...]` marker)

`get_paper_source(paper_id, max_chars=None)` — source markdown as raw text
- Args:
  - `paper_id` (str)
  - `max_chars` (int | null): truncation limit
- Returns: markdown string (may include a `[truncated: ...]` marker)

`get_database_stats()` — snapshot-level stats
- Returns:
  - `total`
  - `years`, `months`: list of `{ value, paper_count }`
  - `authors`, `venues`, `institutions`, `keywords`, `tags`: top lists of `{ value, paper_count }`

`list_top_facets(category, limit=20)` — top values for one facet
- Args:
  - `category`: `author | venue | keyword | institution | tag`
  - `limit` (int)
- Returns: list of `{ value, paper_count }`

`filter_papers(author=None, venue=None, year=None, keyword=None, tag=None, limit=10)` — structured filtering
- Args (all optional except `limit`):
  - `author`, `venue`, `keyword`, `tag`: substring match
  - `year`: exact match
  - `limit` (int): number of results (clamped)
- Returns: list of `{ paper_id, title, year, venue }`
#### MCP Resources (URI access)

- `paper://{paper_id}/metadata` — metadata JSON. Returns the same content as `get_paper_metadata(paper_id)` (as a JSON string).
- `paper://{paper_id}/summary` — preferred summary JSON. Returns the same content as `get_paper_summary(paper_id)` (preferred template; JSON string).
- `paper://{paper_id}/summary/{template}` — summary JSON for a template. Returns the same content as `get_paper_summary(paper_id, template=template)` (JSON string).
- `paper://{paper_id}/source` — source markdown. Returns the same content as `get_paper_source(paper_id)` (markdown string).
- `paper://{paper_id}/translation/{lang}` — translated markdown for `lang` (e.g. `zh`, `ja`) when available.
### 4) Frontend (static build or dev)

```shell
cd frontend
npm install

# Dev
VITE_PAPER_DB_API_BASE=https://api.example.com/api/v1 \
VITE_PAPER_DB_STATIC_BASE=https://static.example.com \
npm run dev

# Build for static hosting
VITE_PAPER_DB_API_BASE=https://api.example.com/api/v1 \
VITE_PAPER_DB_STATIC_BASE=https://static.example.com \
npm run build
```
## Comprehensive Guide

### 1. Translator: OCR-Safe Translation
The translator module is built for scientific documents. It uses a node-based architecture to ensure stability.
- Structure Protection: automatically detects and "freezes" code blocks, LaTeX (`$$...$$`), HTML tables, and images before sending text to the LLM.
- OCR Repair: use `--fix-level` to merge broken paragraphs and convert text references (`[1]`) to clickable Markdown footnotes (`[^1]`).
- Context-Aware: supports retries for failed chunks and falls back gracefully.
- Group Concurrency: use `--group-concurrency` to run multiple translation groups in parallel per document.
```shell
# Translate with structure protection and OCR repairs
uv run deepresearch-flow translator translate \
  --input ./paper.md \
  --target-lang ja \
  --fix-level aggressive \
  --group-concurrency 4 \
  --model claude/claude-3-5-sonnet-20240620
```
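The "freezing" idea can be sketched as a placeholder-substitution pass: mask protected spans before the LLM sees the text, then restore them afterward. This is illustrative only — the actual translator uses a node-based parser rather than regexes, and `freeze`/`thaw` are hypothetical names:

```python
import re

# Matches fenced code blocks, display LaTeX, and markdown images.
PROTECTED = re.compile(
    r"`{3}.*?`{3}"             # fenced code blocks
    r"|\$\$.*?\$\$"            # display LaTeX ($$...$$)
    r"|!\[[^\]]*\]\([^)]*\)",  # images
    re.DOTALL,
)

def freeze(text):
    """Replace protected spans with opaque placeholders before translation."""
    frozen = {}
    def stash(match):
        key = f"\u27e6BLOCK{len(frozen)}\u27e7"  # e.g. ⟦BLOCK0⟧, unlikely in prose
        frozen[key] = match.group(0)
        return key
    return PROTECTED.sub(stash, text), frozen

def thaw(text, frozen):
    """Restore protected spans after the translated text comes back."""
    for key, block in frozen.items():
        text = text.replace(key, block)
    return text
```

Since the LLM never sees the protected spans, it cannot mangle them; translation quality only affects the surrounding prose.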
### 2. Paper Extract: Structured Knowledge
Turn loose markdown files into a queryable database.
- Templates: built-in prompts like `simple`, `eight_questions`, and `deep_read` guide the LLM to extract specific insights.
- Async and throttled: precise control over concurrency (`--max-concurrency`), rate limits (`--sleep-every`), and request timeout (`--timeout`).
- Incremental: skips already processed files; resumes from where you left off.
- Stage resume: multi-stage templates persist per-module outputs; use `--force-stage <name>` to rerun a module.
- Stage DAG: enable `--stage-dag` (or `extract.stage_dag = true`) for dependency-aware parallelism; DAG mode only passes dependency outputs to a stage, and `--dry-run` prints the per-stage plan.
- Diagram hints: `deep_read` can emit inferred diagrams labeled `[Inferred]`; use `recognize fix-mermaid` on rendered markdown if needed.
- Stage focus: multi-stage runs emphasize the active module and summarize others to reduce context overload.
- Range filter: use `--start-idx`/`--end-idx` to slice inputs; the range applies before `--retry-failed`/`--retry-failed-stages` (`--end-idx -1` = last item).
- Retry failed stages: use `--retry-failed-stages` to re-run only failed stages (multi-stage templates); missing stages are forced to run. Retry runs keep existing results and only update retried items.
```shell
uv run deepresearch-flow paper extract \
  --input ./library \
  --output paper_data.json \
  --template-dir ./my-custom-prompts \
  --max-concurrency 10 \
  --timeout 180

# Extract items 0..99, then retry only failed ones from that range
uv run deepresearch-flow paper extract \
  --input ./library \
  --start-idx 0 \
  --end-idx 100 \
  --retry-failed \
  --model claude/claude-3-5-sonnet-20240620

# Retry only failed stages in multi-stage templates
uv run deepresearch-flow paper extract \
  --input ./library \
  --retry-failed-stages \
  --model claude/claude-3-5-sonnet-20240620
```
### 3. Recognize Fix: Repair Math and Mermaid

Fix broken LaTeX formulas and Mermaid diagrams in markdown or JSON outputs.

- Retry Failed: use `--retry-failed` with the prior `--report` output to reprocess only failed formulas/diagrams.
```shell
uv run deepresearch-flow recognize fix-math \
  --input ./docs \
  --in-place \
  --model claude/claude-3-5-sonnet-20240620 \
  --report ./fix-math-errors.json \
  --retry-failed

uv run deepresearch-flow recognize fix-mermaid \
  --input ./docs \
  --in-place \
  --model claude/claude-3-5-sonnet-20240620 \
  --report ./fix-mermaid-errors.json \
  --retry-failed
```
### 4. Database and UI: Your Personal ArXiv

The `db serve` command creates a local research station.
- Split View: read the original PDF/Markdown on the left and the Summary/Translation on the right.
- Full Text Search: search by title, author, year, or content tags (`tag:fpga year:2023..2024`).
- Stats: visualize publication trends and keyword frequencies.
- PDF Viewer: built-in PDF.js viewer prevents cross-origin issues with local files.
```shell
uv run deepresearch-flow paper db serve \
  --input paper_infos.json \
  --pdf-root ./pdfs \
  --cache-dir .cache/db
```
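A query like `tag:fpga year:2023..2024` mixes free-text terms with `field:value` filters, where `..` denotes a year range. A tiny parser for that shape might look like the following — purely illustrative, since the tool's actual query grammar may differ:

```python
def parse_query(query):
    """Split a search query into free-text terms and field filters (sketch).

    Tokens containing ':' become filters; a value with '..' is parsed as an
    inclusive integer range, e.g. year:2023..2024 -> (2023, 2024).
    """
    terms, filters = [], {}
    for token in query.split():
        if ":" in token:
            field, _, value = token.partition(":")
            if ".." in value:
                lo, _, hi = value.partition("..")
                filters[field] = (int(lo), int(hi))
            else:
                filters[field] = value
        else:
            terms.append(token)
    return terms, filters
```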
### 5. Paper DB Compare: Coverage Audit
Compare two datasets (A/B) to find missing PDFs, markdowns, translations, or JSON items, with match metadata.
```shell
uv run deepresearch-flow paper db compare \
  --input-a ./a.json \
  --md-root-b ./md_root \
  --output-csv ./compare.csv

# Compare translated markdowns by language
uv run deepresearch-flow paper db compare \
  --md-translated-root-a ./translated_a \
  --md-translated-root-b ./translated_b \
  --lang zh
```
### 6. Paper DB Extract: Matched Export
Extract matched JSON entries or translated Markdown after coverage comparison.
```shell
uv run deepresearch-flow paper db extract \
  --json ./processed.json \
  --input-bibtex ./refs.bib \
  --pdf-root ./pdfs \
  --output-json ./matched.json \
  --output-csv ./extract.csv

# Use a JSON reference list to filter the target JSON
uv run deepresearch-flow paper db extract \
  --json ./processed.json \
  --input-json ./reference.json \
  --pdf-root ./pdfs \
  --output-json ./matched.json \
  --output-csv ./extract.csv

# Extract translated markdowns by language
uv run deepresearch-flow paper db extract \
  --md-root ./md_root \
  --md-translated-root ./translated \
  --lang zh \
  --output-md-translated-root ./translated_matched \
  --output-csv ./extract.csv
```
### 7. Recognize: OCR Post-Processing
Tools to clean up raw outputs from OCR engines like MinerU.
- Embed Images: convert local image links to Base64 for a portable single-file Markdown.
- Unpack Images: extract Base64 images back to files.
- Organize: flatten nested OCR output directories.
- Fix: apply OCR fixes and rumdl formatting during organize, or as a standalone step.
- Fix JSON: apply the same fixes to markdown fields inside paper JSON outputs.
- Fix Math: validate and repair LaTeX formulas with optional LLM assistance.
- Fix Mermaid: validate and repair Mermaid diagrams (requires `mmdc` from mermaid-cli).
- Recommended order: `fix` -> `fix-math` -> `fix-mermaid` -> `fix`.
```shell
uv run deepresearch-flow recognize md embed --input ./raw_ocr --output ./clean_md

# Organize MinerU output and apply OCR fixes
uv run deepresearch-flow recognize organize \
  --input ./mineru_outputs \
  --output-simple ./ocr_md \
  --fix

# Fix and format existing markdown outputs
uv run deepresearch-flow recognize fix \
  --input ./ocr_md \
  --output ./ocr_md_fixed

# Fix in place
uv run deepresearch-flow recognize fix \
  --input ./ocr_md \
  --in-place

# Fix JSON outputs in place
uv run deepresearch-flow recognize fix \
  --json \
  --input ./paper_outputs \
  --in-place

# Fix LaTeX formulas in markdown
uv run deepresearch-flow recognize fix-math \
  --input ./docs \
  --model openai/gpt-4o-mini \
  --in-place

# Fix Mermaid diagrams in JSON outputs
uv run deepresearch-flow recognize fix-mermaid \
  --json \
  --input ./paper_outputs \
  --model openai/gpt-4o-mini \
  --in-place
```
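The reference repair mentioned earlier (`[1]` -> `[^1]`) boils down to a substitution that leaves markdown links and existing footnotes alone. A regex sketch of the idea, not the tool's implementation (`fix_references` is a hypothetical name):

```python
import re

def fix_references(markdown):
    """Convert bare numeric references [1] to footnote syntax [^1] (sketch).

    The lookahead skips [1](url)-style links; spans that already start with
    '^' never match because the pattern requires a digit right after '['.
    """
    return re.sub(
        r"\[(\d+)\](?!\()",  # [digits] not immediately followed by '('
        r"[^\1]",
        markdown,
    )
```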
## Docker Support

Don't want to manage Python environments?

```shell
docker run --rm -v $(pwd):/app -it ghcr.io/nerdneilsfield/deepresearch-flow:latest --help
```
Deploy image (API + frontend via nginx):

```shell
docker run --rm -p 8899:8899 \
  -v $(pwd)/paper_snapshot.db:/db/papers.db \
  -v $(pwd)/paper-static:/static \
  ghcr.io/nerdneilsfield/deepresearch-flow:deploy-latest
```
Notes:
- nginx listens on 8899 and proxies `/api`, `/mcp`, and `/mcp-sse` to the internal API at `127.0.0.1:8000`.
- Mount your snapshot DB to `/db/papers.db` inside the container.
- Mount snapshot static assets to `/static` when serving assets from this container (default `PAPER_DB_STATIC_BASE` is `/static`).
- If `PAPER_DB_STATIC_BASE` is a full URL (e.g. `https://static.example.com`), nginx still serves the frontend locally, while API responses use that external static base for asset links.
Docker Compose example (two modes):
docker compose -f scripts/docker/docker-compose.example.yml --profile local-static up
# or
docker compose -f scripts/docker/docker-compose.example.yml --profile external-static up
External static assets example:
docker run --rm -p 8899:8899 \
-v $(pwd)/paper_snapshot.db:/db/papers.db \
-e PAPER_DB_STATIC_BASE=https://static.example.com \
ghcr.io/nerdneilsfield/deepresearch-flow:deploy-latest
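The two compose profiles map onto the `docker run` commands above. The sketch below is illustrative only: the service name and exact layout are assumptions, and the authoritative file is `scripts/docker/docker-compose.example.yml`.

```yaml
# Illustrative sketch -- see scripts/docker/docker-compose.example.yml for the real file
services:
  paper-db:
    image: ghcr.io/nerdneilsfield/deepresearch-flow:deploy-latest
    ports:
      - "8899:8899"
    volumes:
      - ./paper_snapshot.db:/db/papers.db
      - ./paper-static:/static          # local-static mode only
    profiles: [local-static]
  paper-db-external:
    image: ghcr.io/nerdneilsfield/deepresearch-flow:deploy-latest
    ports:
      - "8899:8899"
    volumes:
      - ./paper_snapshot.db:/db/papers.db
    environment:
      PAPER_DB_STATIC_BASE: https://static.example.com
    profiles: [external-static]
```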
Configuration
The config.toml is your control center. It supports:
- Multiple Providers: mix and match OpenAI, DeepSeek (DashScope), Gemini, Claude, and Ollama.
- Model Routing: explicit routing to specific models (`--model provider/model_name`).
- Environment Variables: keep secrets safe using `env:VAR_NAME` syntax.
See config.example.toml for a full reference.
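As a rough illustration of how these pieces fit together, here is a minimal provider sketch. The `env:VAR_NAME` syntax is from the docs above, but the table and field names are assumptions; the actual schema is defined in `config.example.toml`.

```toml
# Illustrative sketch only -- field names are assumptions, see config.example.toml
[[providers]]
name    = "openai"
api_key = "env:OPENAI_API_KEY"   # env:VAR_NAME resolves the secret from the environment
models  = ["gpt-4o-mini"]

[[providers]]
name     = "ollama"
base_url = "http://localhost:11434"
```

A configured model would then be addressed on the command line as `--model openai/gpt-4o-mini`.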
Built with love for the Open Science community.