Workflow tools for paper extraction, review, and research automation.

ai-deepresearch-flow

From documents to deep research insight — automatically.

English | 中文


The Core Pain Points

  • OCR Chaos: Raw markdown from OCR tools is often broken -- tables drift, formulas break, and references are non-clickable.
  • Translation Nightmares: Translating technical papers often destroys code blocks, LaTeX formulas, and table structures.
  • Information Overload: Extracting structured insights (authors, venues, summaries) from hundreds of PDFs manually is impossible.
  • Context Switching: Managing PDFs, summaries, and translations in different windows kills focus.

The Solution

DeepResearch Flow provides a unified pipeline to Repair, Translate, Extract, and Serve your research library.

Key Features

  • Smart Extraction: Turn unstructured Markdown into schema-enforced JSON (summaries, metadata, Q&A) using LLMs (OpenAI, Claude, Gemini, etc.).
  • Precision Translation: Translate OCR Markdown to Chinese/Japanese (.zh.md, .ja.md) while freezing formulas, code, tables, and references. No more broken layout.
  • Local Knowledge DB: A high-performance local Web UI to browse papers with Split View (Source vs. Translated vs. Summary), full-text search, and multi-dimensional filtering.
  • Snapshot + API Serve: Build a production-ready SQLite snapshot with static assets, then serve a read-only JSON API for a separate frontend.
  • Coverage Compare: Compare JSON/PDF/Markdown/Translated datasets to find missing artifacts and export CSV reports.
  • Matched Export: Extract matched JSON or translated Markdown after coverage checks.
  • OCR Post-Processing: Automatically fix broken references ([1] -> [^1]), merge split paragraphs, and standardize layouts.
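
The reference fix described above can be pictured as a single regex pass. A minimal sketch of the idea, assuming plain `[1]`-style citations; this is not the tool's actual implementation:

```python
import re

def fix_references(markdown: str) -> str:
    """Convert bare numeric citations like [1] into Markdown
    footnote references like [^1]."""
    # The negative lookahead avoids touching [1](...) link syntax.
    return re.sub(r"\[(\d+)\](?!\()", r"[^\1]", markdown)

print(fix_references("As shown in [1] and [12]."))
# -> As shown in [^1] and [^12].
```

A real implementation also has to convert the matching reference-list entries into footnote definitions (`[^1]: ...`), which the sketch omits.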

Quick Start

1) Installation

# Recommended: using uv for speed
uv pip install deepresearch-flow

# Or standard pip
pip install deepresearch-flow

2) Configuration

Set up your LLM providers. We support OpenAI, Claude, Gemini, Ollama, and more.

cp config.example.toml config.toml
# Edit config.toml to add your weighted providers, keys, and models

Breaking change: the old api_keys, model_list, and structured_mode fields are no longer accepted. The new config uses:

  • top-level main_model for weighted model routing
  • providers[].base[] for weighted URL routing
  • providers[].base[].key[] for weighted key routing
  • providers[].models[] for model capability declarations

Missing env:VAR_NAME references now fail explicitly during config load.

Per-key quota metadata still lives on the key object:

main_model = [
  { model = "openai/gpt-4o-mini", weight = 4 },
  { model = "claude/claude-sonnet-4-5-20250929", weight = 1 }
]

[[providers]]
name = "openai"
type = "openai_compatible"
base = [
  { url = "https://api.openai.com/v1", weight = 1, key = [
    { value = "env:OPENAI_API_KEY", weight = 4 },
    { value = "env:OPENAI_API_KEY_2", weight = 1, quota_duration = 18000, reset_time = "2026-01-23 18:04:25 +0800 CST", quota_error_tokens = ["exceed", "quota"] }
  ] }
]
models = [
  { model_name = "gpt-4o-mini", is_stream = true, is_support_json_schema = true, is_support_json_object = true }
]
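
Conceptually, weighted routing at each level (model, base URL, key) is weighted random selection, and `env:` references resolve with fail-fast semantics. A minimal sketch of the idea, not the actual implementation:

```python
import os
import random

def resolve(value: str) -> str:
    """Resolve env:VAR_NAME references, failing fast when missing."""
    if value.startswith("env:"):
        var = value[4:]
        if var not in os.environ:
            raise RuntimeError(f"missing environment variable: {var}")
        return os.environ[var]
    return value

def pick(items: list[dict], field: str):
    """Weighted random choice over routing entries."""
    weights = [item.get("weight", 1) for item in items]
    return random.choices(items, weights=weights, k=1)[0][field]

main_model = [
    {"model": "openai/gpt-4o-mini", "weight": 4},
    {"model": "claude/claude-sonnet-4-5-20250929", "weight": 1},
]
print(pick(main_model, "model"))  # usually "openai/gpt-4o-mini" (weight 4 of 5)
```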

3) The "Zero to Hero" Workflow

Step 1: Extract Insights

Scan a folder of markdown files and extract structured summaries.

uv run deepresearch-flow paper extract \
  --input ./docs \
  --model openai/gpt-4o-mini \
  --prompt-template deep_read

Step 1.1: Verify & Retry Missing Fields

Validate extracted JSON against the template schema and retry only the missing items.

uv run deepresearch-flow paper db verify \
  --input-json ./paper_infos.json \
  --prompt-template deep_read \
  --output-json ./paper_verify.json

uv run deepresearch-flow paper extract \
  --input ./docs \
  --model openai/gpt-4o-mini \
  --prompt-template deep_read \
  --retry-list-json ./paper_verify.json

Step 2: Translate Safely

Translate papers to Chinese, protecting LaTeX and tables.

uv run deepresearch-flow translator translate \
  --input ./docs \
  --target-lang zh \
  --model openai/gpt-4o-mini \
  --fix-level moderate

Step 2.5: Run OCR on PDFs/Images (Optional)

If your source documents are PDFs or scanned images, run OCR first to produce markdown:

# 1) Copy and edit the OCR config
cp ocr.example.toml ocr.toml
# Set your PaddleOCR token: export PADDLE_OCR_TOKEN=xxx

# 2) Run OCR on a directory of PDFs
uv run deepresearch-flow recognize ocr ./pdfs --config ocr.toml --output-dir ./ocr_output

Output follows the mineru layout (full.md + images/ per document), compatible with the repair steps below.

See ocr.example.toml for backend configuration. Currently supports PaddleOCR; more backends planned.

Step 3: Repair OCR Outputs (Recommended)

Recommended sequence to stabilize markdown before serving:

# 1) Fix OCR markdown (auto-detects JSON if inputs are .json)
uv run deepresearch-flow recognize fix \
  --input ./docs \
  --in-place

# 2) Fix LaTeX formulas
uv run deepresearch-flow recognize fix-math \
  --input ./docs \
  --model openai/gpt-4o-mini \
  --in-place

# 3) Fix Mermaid diagrams
uv run deepresearch-flow recognize fix-mermaid \
  --input ./paper_outputs \
  --json \
  --model openai/gpt-4o-mini \
  --in-place

# (optional) Retry failed formulas/diagrams only
uv run deepresearch-flow recognize fix-math \
  --input ./docs \
  --model openai/gpt-4o-mini \
  --retry-failed

uv run deepresearch-flow recognize fix-mermaid \
  --input ./paper_outputs \
  --json \
  --model openai/gpt-4o-mini \
  --retry-failed

# 4) Fix again to normalize formatting
uv run deepresearch-flow recognize fix \
  --input ./docs \
  --in-place

Step 4: Serve Your Database

Launch a local UI to read and manage your papers.

uv run deepresearch-flow paper db serve \
  --input paper_infos.json \
  --md-root ./docs \
  --md-translated-root ./docs \
  --host 127.0.0.1

Step 4.5: Build Snapshot + Serve API + Frontend (Recommended)

Build a production snapshot (SQLite + static assets), serve a read-only API, and run the frontend.

# 1) Build snapshot + static export
uv run deepresearch-flow paper db snapshot build \
  --input ./paper_infos.json \
  --bibtex ./papers.bib \
  --md-root ./docs \
  --md-translated-root ./docs \
  --pdf-root ./pdfs \
  --output-db ./dist/paper_snapshot.db \
  --static-export-dir ./dist/paper-static

# 2) Serve static assets (CORS required for ZIP export)
npx http-server ./dist/paper-static -p 8002 --cors

# 3) Serve API (read-only)
PAPER_DB_STATIC_BASE_URL=http://127.0.0.1:8002 \
uv run deepresearch-flow paper db api serve \
  --snapshot-db ./dist/paper_snapshot.db \
  --cors-origin http://127.0.0.1:5173 \
  --host 127.0.0.1 --port 8001

# 4) Run frontend
cd frontend
npm install
VITE_PAPER_DB_API_BASE=http://127.0.0.1:8001/api/v1 \
VITE_PAPER_DB_STATIC_BASE=http://127.0.0.1:8002 \
npm run dev

Step 4.6: Supplement Missing Templates (Optional)

If some papers are missing specific templates (e.g., deep_read), you can identify the gaps and run a supplemental extraction:

# 1) Check missing templates in snapshot
uv run deepresearch-flow paper db snapshot show-missing \
  --snapshot-db ./dist/paper_snapshot.db

# 2) Export papers missing specific template (with file paths for extraction)
uv run deepresearch-flow paper db snapshot export-missing \
  --snapshot-db ./dist/paper_snapshot.db \
  --type template \
  --template deep_read \
  --static-export-dir ./dist/paper-static \
  --output ./missing_deep_read.json \
  --txt-output ./missing_ids.txt \
  --output-paths ./extractable_paths.txt

# 3) Extract missing templates (only for papers with source markdown)
uv run deepresearch-flow paper extract \
  --model openai/gpt-4o-mini \
  --prompt-template deep_read \
  --input-list ./extractable_paths.txt \
  --output ./deep_read_supplement.json

# 4) Merge with existing paper_infos.json
uv run deepresearch-flow paper db merge library \
  --inputs ./paper_infos.json \
  --inputs ./deep_read_supplement.json \
  --output ./paper_infos_complete.json

# 5) Rebuild snapshot with complete data
uv run deepresearch-flow paper db snapshot build \
  --input ./paper_infos_complete.json \
  --bibtex ./papers.bib \
  --md-root ./docs \
  --md-translated-root ./docs \
  --pdf-root ./pdfs \
  --output-db ./dist/paper_snapshot_complete.db \
  --static-export-dir ./dist/paper-static-complete

Alternative 1: Supplement Missing Content (Templates/Translations)

If existing papers are missing templates or translations, supplement them without rebuilding:

# Supplement missing templates for existing papers (in-place)
uv run deepresearch-flow paper db snapshot supplement \
  --snapshot-db ./dist/paper_snapshot.db \
  --static-export-dir ./dist/paper-static \
  -i ./deep_read_supplement.json \
  --in-place

# Or output to new location
uv run deepresearch-flow paper db snapshot supplement \
  --snapshot-db ./dist/paper_snapshot.db \
  --static-export-dir ./dist/paper-static \
  -i ./deep_read_supplement.json \
  --output-db ./dist/paper_snapshot_supplemented.db \
  --output-static-dir ./dist/paper-static-supplemented

Notes:

  • --md-root and --md-translated-root are optional for snapshot supplement.
  • Use them only when you want to resolve/copy markdown files from local source directories.

Alternative 2: Add New Papers

If you have completely new papers to add to the snapshot:

# Add new papers to existing snapshot (in-place)
uv run deepresearch-flow paper db snapshot update \
  --snapshot-db ./dist/paper_snapshot.db \
  --static-export-dir ./dist/paper-static \
  -i ./new_papers.json \
  -b ./new_papers.bib \
  --md-root ./docs \
  --md-translated-root ./docs_translated \
  --pdf-root ./pdfs \
  --in-place

# Or output to new location
uv run deepresearch-flow paper db snapshot update \
  --snapshot-db ./dist/paper_snapshot.db \
  --static-export-dir ./dist/paper-static \
  -i ./new_papers.json \
  -b ./new_papers.bib \
  --md-root ./docs \
  --output-db ./dist/paper_snapshot_updated.db \
  --output-static-dir ./dist/paper-static-updated

Differences:

  • supplement: Only adds missing templates/translations for existing papers (skips new papers)
  • update: Only adds completely new papers (skips existing papers)

Upgrade Legacy Snapshot Schema (DOI/BibTeX)

Recommended: Migrate Schema In-Place (No Data Loss)

If your existing snapshot was built before DOI/BibTeX support, use the migrate command to upgrade the schema without losing any papers:

# In-place migration with timestamped backup
uv run deepresearch-flow paper db snapshot migrate \
  --snapshot-db ./dist/paper_snapshot.db \
  --bibtex ./papers.bib \
  --static-export-dir ./dist/paper-static \
  --in-place

# Or copy to new location
uv run deepresearch-flow paper db snapshot migrate \
  --snapshot-db ./dist/paper_snapshot.db \
  --bibtex ./papers.bib \
  --static-export-dir ./dist/paper-static \
  --output-db ./dist/paper_snapshot_v2.db

# Schema-only migration (no BibTeX enrichment)
uv run deepresearch-flow paper db snapshot migrate \
  --snapshot-db ./dist/paper_snapshot.db \
  --in-place

Features:

  • No data loss: Uses ALTER TABLE to upgrade schema, preserving all papers
  • Timestamped backups: Creates .bak_YYYYMMDD_HHMMSS backup files automatically
  • BibTeX enrichment: Matches papers with BibTeX and extracts DOI metadata
  • Static export update: Updates paper_index.json with DOI/BibTeX references
  • Beautiful output: Rich tables showing schema changes and match statistics

The migrate command will:

  1. Create a timestamped backup (unless --no-backup is used)
  2. Add doi column to the paper table (if missing)
  3. Create paper_bibtex table (if missing)
  4. Match papers with BibTeX entries and populate DOI/BibTeX data
  5. Update static export index with new metadata
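
Steps 2–3 use the standard idempotent pattern for SQLite schema upgrades: inspect the current schema, then only add what is missing. A minimal sketch with table and column names matching the description above; the tool's actual migration logic may differ:

```python
import sqlite3

def migrate(conn: sqlite3.Connection) -> None:
    """Idempotent schema upgrade: add a column and a table only if missing."""
    cols = {row[1] for row in conn.execute("PRAGMA table_info(paper)")}
    if "doi" not in cols:
        conn.execute("ALTER TABLE paper ADD COLUMN doi TEXT")
    conn.execute(
        """CREATE TABLE IF NOT EXISTS paper_bibtex (
               paper_id TEXT PRIMARY KEY,
               bibtex_key TEXT,
               entry_type TEXT,
               bibtex_raw TEXT
           )"""
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE paper (paper_id TEXT PRIMARY KEY, title TEXT)")
migrate(conn)
migrate(conn)  # safe to run twice: the second call is a no-op
```

Because every change is guarded, rerunning the migration never duplicates columns or tables, which is why no data is lost.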

Alternative: Rebuild with Previous Snapshot

If you need to rebuild from scratch while preserving identity continuity:

uv run deepresearch-flow paper db snapshot build \
  --input ./paper_infos_complete.json \
  --bibtex ./papers.bib \
  --output-db ./dist/paper_snapshot_v2.db \
  --static-export-dir ./dist/paper-static-v2 \
  --previous-snapshot-db ./dist/paper_snapshot.db

Notes:

  • --md-root, --md-translated-root, and --pdf-root are optional for this rebuild.
  • If a paper in current inputs already has DOI/BibTeX, current input wins; otherwise data can be inherited from --previous-snapshot-db.
  • Warning: This approach only includes papers from the input JSON files, so ensure all papers are included to avoid data loss.

Supplement Missing Translations

If some papers are missing translations (e.g., zh), you can export and translate them:

# 1) Export papers missing Chinese translation (with file paths)
uv run deepresearch-flow paper db snapshot export-missing \
  --snapshot-db ./dist/paper_snapshot.db \
  --type translation \
  --lang zh \
  --static-export-dir ./dist/paper-static \
  --output-paths ./to_translate_paths.txt

# 2) Translate missing papers
uv run deepresearch-flow translator translate \
  --input ./docs \
  --target-lang zh \
  --model openai/gpt-4o-mini \
  --input-list ./to_translate_paths.txt \
  --output-dir ./docs_translated

# 3) Rebuild or supplement snapshot with new translations
uv run deepresearch-flow paper db snapshot build ...
# Or use snapshot supplement if only adding translations

Other useful export types:

  • --type source_md - Papers without source markdown
  • --type pdf - Papers without PDF
  • --type translation --lang zh - Papers without Chinese translation

Incremental PDF Library Workflow

This workflow keeps a growing PDF library in sync without reprocessing everything.

# 1) Compare processed JSON vs new PDF library to find missing PDFs
uv run deepresearch-flow paper db compare \
  --input-a ./paper_infos.json \
  --pdf-root-b ./pdfs_new \
  --output-only-in-b ./pdfs_todo.txt

# 2) Stage the missing PDFs for OCR
uv run deepresearch-flow paper db transfer-pdfs \
  --input-list ./pdfs_todo.txt \
  --output-dir ./pdfs_todo \
  --copy

# (optional) use --move instead of --copy
# uv run deepresearch-flow paper db transfer-pdfs --input-list ./pdfs_todo.txt --output-dir ./pdfs_todo --move

# 3) OCR the missing PDFs (use your OCR tool; write markdowns to ./md_todo)

# 4) Export matched existing assets against the new PDF library
uv run deepresearch-flow paper db extract \
  --input-json ./paper_infos.json \
  --pdf-root ./pdfs_new \
  --output-json ./paper_infos_matched.json

uv run deepresearch-flow paper db extract \
  --md-source-root ./mds \
  --output-md-root ./mds_matched \
  --pdf-root ./pdfs_new

uv run deepresearch-flow paper db extract \
  --md-translated-root ./translated \
  --output-md-translated-root ./translated_matched \
  --pdf-root ./pdfs_new \
  --lang zh

# 5) Translate + extract summaries for the new OCR markdowns
uv run deepresearch-flow translator translate \
  --input ./md_todo \
  --target-lang zh \
  --model openai/gpt-4o-mini

uv run deepresearch-flow paper extract \
  --input ./md_todo \
  --model openai/gpt-4o-mini

# 6) Merge and serve the new library (multi-input)
uv run deepresearch-flow paper db serve \
  --input ./paper_infos_matched.json \
  --input ./paper_infos_new.json \
  --md-root ./mds_matched \
  --md-root ./md_todo \
  --md-translated-root ./translated_matched \
  --md-translated-root ./md_todo \
  --pdf-root ./pdfs_new

Merge Paper JSONs

# Merge multiple libraries using the same template
uv run deepresearch-flow paper db merge library \
  --inputs ./paper_infos_a.json \
  --inputs ./paper_infos_b.json \
  --output ./paper_infos_merged.json

# Merge multiple templates from the same library (first input wins on shared fields)
uv run deepresearch-flow paper db merge templates \
  --inputs ./simple.json \
  --inputs ./deep_read.json \
  --output ./paper_infos_templates.json

Note: paper db merge is now split into merge library and merge templates.

Merge multiple databases (PDF + Markdown + BibTeX)

# 1) Copy PDFs into a single folder
rsync -av ./pdfs_a/ ./pdfs_merged/
rsync -av ./pdfs_b/ ./pdfs_merged/

# 2) Copy Markdown folders into a single folder
rsync -av ./md_a/ ./md_merged/
rsync -av ./md_b/ ./md_merged/

# 3) Merge JSON libraries
uv run deepresearch-flow paper db merge library \
  --inputs ./paper_infos_a.json \
  --inputs ./paper_infos_b.json \
  --output ./paper_infos_merged.json

# 4) Merge BibTeX files
uv run deepresearch-flow paper db merge bibtex \
  -i ./library_a.bib \
  -i ./library_b.bib \
  -o ./library_merged.bib

Merge BibTeX files

uv run deepresearch-flow paper db merge bibtex \
  -i ./library_a.bib \
  -i ./library_b.bib \
  -o ./library_merged.bib

Duplicate keys keep the entry with the most fields; ties keep the first input order.
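
The duplicate-key rule can be sketched as a fold that keeps the richer entry. A minimal illustration with entries modeled as field dicts; this is not the tool's BibTeX parser:

```python
def merge_bibtex(libraries: list[dict[str, dict]]) -> dict[str, dict]:
    """Merge BibTeX libraries keyed by citation key.

    Duplicates keep the entry with the most fields; ties keep
    whichever entry appeared first in input order."""
    merged: dict[str, dict] = {}
    for lib in libraries:
        for key, entry in lib.items():
            current = merged.get(key)
            # Strictly more fields wins; ties preserve the earlier entry.
            if current is None or len(entry) > len(current):
                merged[key] = entry
    return merged

a = {"smith2024": {"title": "T", "year": "2024"}}
b = {"smith2024": {"title": "T", "year": "2024", "doi": "10.1/x"}}
print(merge_bibtex([a, b])["smith2024"]["doi"])  # -> 10.1/x
```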

Recommended: Merge templates then filter by BibTeX

# 1) Merge templates for the same library
uv run deepresearch-flow paper db merge templates \
  --inputs ./deep_read.json \
  --inputs ./simple.json \
  --output ./all.json

# 2) Filter the merged set with BibTeX
uv run deepresearch-flow paper db extract \
  --input-bibtex ./library.bib \
  --json ./all.json \
  --output-json ./library_filtered.json \
  --output-csv ./library_filtered.csv

Deployment (Static CDN)

The recommended production setup is front/back separation:

  • Static CDN hosts PDFs/Markdown/images/summaries.
  • API server serves a read-only snapshot DB.
  • Frontend is a separate static app (Vite build or any static host).

1) Build snapshot + static export

uv run deepresearch-flow paper db snapshot build \
  --input ./paper_infos.json \
  --bibtex ./papers.bib \
  --md-root ./docs \
  --md-translated-root ./docs \
  --pdf-root ./pdfs \
  --output-db ./dist/paper_snapshot.db \
  --static-export-dir /data/paper-static

Notes:

  • The build host must be able to read the original PDF/Markdown roots.
  • The CDN only needs the exported directory (e.g. /data/paper-static).

2) Serve static assets with CORS + cache headers (Caddy example)

:8002 {
  root * /data/paper-static
  encode zstd gzip

  @static path /pdf/* /md/* /md_translate/* /images/*
  header @static {
    Access-Control-Allow-Origin *
    Access-Control-Allow-Methods GET,HEAD,OPTIONS
    Access-Control-Allow-Headers *
    Cache-Control "public, max-age=31536000, immutable"
  }

  @options method OPTIONS
  respond @options 204

  file_server
}

2.1) Nginx example (API + frontend on one domain, static on another)

# Frontend + API (same domain)
server {
  listen 80;
  server_name frontend.example.com;

  root /var/www/paper-frontend;
  index index.html;

  location / {
    try_files $uri /index.html;
  }

  location /api/ {
    proxy_pass http://127.0.0.1:8001;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
  }

  location ^~ /mcp {
    proxy_pass http://127.0.0.1:8001;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
  }

  # SSE transport for MCP clients that require Server-Sent Events
  location ^~ /mcp-sse {
    proxy_pass http://127.0.0.1:8001;
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_buffering off;
    proxy_cache off;
    proxy_read_timeout 3600s;
    proxy_send_timeout 3600s;
    chunked_transfer_encoding off;
    add_header X-Accel-Buffering no;
  }
}

# Static assets (separate domain)
server {
  listen 80;
  server_name static.example.com;

  root /data/paper-static;

  location / {
    add_header Access-Control-Allow-Origin *;
    add_header Access-Control-Allow-Methods "GET,HEAD,OPTIONS";
    add_header Access-Control-Allow-Headers "*";
    add_header Cache-Control "public, max-age=31536000, immutable";
    try_files $uri =404;
  }
}

3) Start the API server (read-only)

export PAPER_DB_STATIC_BASE_URL="https://static.example.com"

uv run deepresearch-flow paper db api serve \
  --snapshot-db /data/paper_snapshot.db \
  --cors-origin https://frontend.example.com \
  --host 0.0.0.0 --port 8001

BibTeX metadata endpoint:

  • GET /api/v1/papers/{paper_id}/bibtex
  • Success payload: { paper_id, doi, bibtex_raw, bibtex_key, entry_type }
  • Error codes:
    • paper_not_found
    • bibtex_not_found

3.1) Admin API (Optional)

Enable the admin API to add or delete papers remotely via Bearer token authentication.

# Start API server with admin enabled
PAPER_DB_ADMIN_TOKEN=your-secret-token \
uv run deepresearch-flow paper db api serve \
  --snapshot-db /data/paper_snapshot.db \
  --cors-origin https://frontend.example.com \
  --host 0.0.0.0 --port 8001

Or pass the token via CLI flag: --admin-token your-secret-token

Endpoints (all require Authorization: Bearer <token> header):

  • POST /api/v1/admin/papers — Batch add papers (up to 200 per request)

    curl -X POST https://api.example.com/api/v1/admin/papers \
      -H "Authorization: Bearer your-secret-token" \
      -H "Content-Type: application/json" \
      -d '{"papers": [{"paper_title": "...", "paper_authors": [...], ...}]}'
    

    Response: { added, skipped, errors, paper_ids }

  • DELETE /api/v1/admin/papers/{paper_id} — Delete a paper and all its relations

    curl -X DELETE https://api.example.com/api/v1/admin/papers/{paper_id} \
      -H "Authorization: Bearer your-secret-token"
    

    Response: { deleted: true, paper_id }

The paper JSON format is the same as snapshot update input. The admin API handles metadata insertion; static files can be pushed separately through api push when remote.storage is configured.

Push from Local DB to Remote

Use api push to merge a locally-built snapshot DB into a remote deployment:

# remote.toml
[remote]
api_base_url = "https://api.example.com"
admin_token = "env:PAPER_DB_ADMIN_TOKEN"
batch_size = 10

[remote.storage]
type = "webdav"
url = "https://cdn.example.com/paper-static"
username = "deploy"
password = "env:PAPER_DB_WEBDAV_PASSWORD"

# Preview what will be pushed
uv run deepresearch-flow paper db api push \
  --snapshot-db ./dist/paper_snapshot.db \
  --static-export-dir ./dist/paper-static \
  --config remote.toml \
  --dry-run

# Push to remote
uv run deepresearch-flow paper db api push \
  --snapshot-db ./dist/paper_snapshot.db \
  --static-export-dir ./dist/paper-static \
  --config remote.toml

# Push only the admin API metadata
uv run deepresearch-flow paper db api push \
  --snapshot-db ./dist/paper_snapshot.db \
  --static-export-dir ./dist/paper-static \
  --config remote.toml \
  --only-api

# Push only static storage assets
uv run deepresearch-flow paper db api push \
  --snapshot-db ./dist/paper_snapshot.db \
  --static-export-dir ./dist/paper-static \
  --config remote.toml \
  --only-storage \
  --storage-concurrency 8

# Retry only failed static files from the last push
uv run deepresearch-flow paper db api push \
  --snapshot-db ./dist/paper_snapshot.db \
  --static-export-dir ./dist/paper-static \
  --config remote.toml \
  --retry-failed push-static-errors.json

Notes:

  • --static-export-dir is optional — when provided, summary JSON payloads are included so the remote side can build FTS indexes and preview text.
  • Duplicate papers (same paper_id) are automatically skipped.
  • When [remote.storage] is configured, static files under the export dir are pushed after the metadata API sync.
  • The currently supported storage backend is webdav.
  • Static file push prints per-file status logs: uploaded, skipped, and failed.
  • If some static uploads fail, a push-static-errors.json report is written and can be retried with --retry-failed.
  • --only-api pushes only the admin API metadata and skips static storage.
  • --only-storage pushes only static storage and skips the admin API metadata step.
  • --storage-concurrency controls the number of concurrent workers used for static storage push.
  • --only-api and --only-storage are mutually exclusive.
  • --dry-run cannot be combined with --only-storage.
  • --retry-failed applies only to static storage and cannot be combined with --only-api.
  • If updated summary / manifest JSON behaves differently in one browser only, try a hard refresh or clear that browser's site cache first; stale browser cache can make static JSON appear inconsistent after a push.

3.2) MCP (FastMCP Streamable HTTP + SSE)

This project exposes MCP servers mounted on the snapshot API:

  • Streamable HTTP endpoint: http://<host>:8001/mcp
  • SSE endpoint: http://<host>:8001/mcp-sse
  • Transport behavior:
    • /mcp: Streamable HTTP via POST only (GET returns 405)
    • /mcp-sse: SSE-enabled transport (supports GET handshake)
  • Protocol header: optional mcp-protocol-version (2025-03-26 or 2025-06-18)
  • Static reads: summary/source/translation are served as text content by reading snapshot static assets (local-first via PAPER_DB_STATIC_EXPORT_DIR, HTTP fallback via PAPER_DB_STATIC_BASE / PAPER_DB_STATIC_BASE_URL)

Optional (avoid HTTP fetch by reading exported assets directly on the API host):

export PAPER_DB_STATIC_EXPORT_DIR=/data/paper-static

MCP Tools (API functions)

search_papers(query, limit=10) — full-text search (relevance-ranked)
  • Args:
    • query (str): keywords / topic query
    • limit (int): number of results (clamped to API max page size)
  • Returns: list of { paper_id, title, year, venue, snippet_markdown }
search_papers_by_keyword(keyword, limit=10) — facet keyword search
  • Args:
    • keyword (str): keyword substring
    • limit (int): number of results (clamped)
  • Returns: list of { paper_id, title, year, venue, snippet_markdown }
get_paper_metadata(paper_id) — metadata + available summary templates
  • Args:
    • paper_id (str)
  • Returns: dict with:
    • paper_id, title, year, venue
    • doi, arxiv_id, openreview_id, paper_pw_url
    • has_bibtex
    • preferred_summary_template, available_summary_templates
get_paper_bibtex(paper_id) — persisted BibTeX payload
  • Args:
    • paper_id (str)
  • Returns: dict with:
    • paper_id, doi, bibtex_raw, bibtex_key, entry_type
  • Errors:
    • paper_not_found
    • bibtex_not_found
get_paper_summary(paper_id, template=None, max_chars=None) — summary JSON as raw text
  • Notes:
    • Uses preferred_summary_template if template is omitted
    • Returns the full JSON content (not a URL)
  • Args:
    • paper_id (str)
    • template (str | null)
    • max_chars (int | null): truncation limit
  • Returns: JSON string (may include a [truncated: ...] marker)
get_paper_source(paper_id, max_chars=None) — source markdown as raw text
  • Args:
    • paper_id (str)
    • max_chars (int | null): truncation limit
  • Returns: markdown string (may include a [truncated: ...] marker)
get_database_stats() — snapshot-level stats
  • Returns:
    • total
    • years, months: list of { value, paper_count }
    • authors, venues, institutions, keywords, tags: top lists of { value, paper_count }
list_top_facets(category, limit=20) — top values for one facet
  • Args:
    • category: author | venue | keyword | institution | tag
    • limit (int)
  • Returns: list of { value, paper_count }
filter_papers(author=None, venue=None, year=None, keyword=None, tag=None, limit=10) — structured filtering
  • Args (all optional except limit):
    • author, venue, keyword, tag: substring match
    • year: exact match
    • limit (int): number of results (clamped)
  • Returns: list of { paper_id, title, year, venue }

MCP Resources (URI access)

paper://{paper_id}/metadata — metadata JSON

Returns the same content as get_paper_metadata(paper_id) (as a JSON string).

paper://{paper_id}/summary — preferred summary JSON

Returns the same content as get_paper_summary(paper_id) (preferred template; JSON string).

paper://{paper_id}/summary/{template} — summary JSON for template

Returns the same content as get_paper_summary(paper_id, template=template) (JSON string).

paper://{paper_id}/source — source markdown

Returns the same content as get_paper_source(paper_id) (markdown string).

paper://{paper_id}/translation/{lang} — translated markdown

Returns translated markdown for lang (e.g. zh, ja) when available.

4) Frontend (static build or dev)

cd frontend
npm install

# Dev
VITE_PAPER_DB_API_BASE=https://api.example.com/api/v1 \
VITE_PAPER_DB_STATIC_BASE=https://static.example.com \
npm run dev

# Build for static hosting
VITE_PAPER_DB_API_BASE=https://api.example.com/api/v1 \
VITE_PAPER_DB_STATIC_BASE=https://static.example.com \
npm run build

Comprehensive Guide

1. Translator: OCR-Safe Translation

The translator module is built for scientific documents. It uses a node-based architecture to ensure stability.

  • Structure Protection: automatically detects and "freezes" code blocks, LaTeX ($$...$$), HTML tables, and images before sending text to the LLM.
  • OCR Repair: use --fix-level to merge broken paragraphs and convert text references ([1]) to clickable Markdown footnotes ([^1]).
  • Context-Aware: supports retries for failed chunks and falls back gracefully.
  • Group Concurrency: use --group-concurrency to run multiple translation groups in parallel per document.

# Translate with structure protection and OCR repairs
uv run deepresearch-flow translator translate \
  --input ./paper.md \
  --target-lang ja \
  --fix-level aggressive \
  --group-concurrency 4 \
  --model claude/claude-3-5-sonnet-20240620
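
The "freeze" step above can be pictured as swapping protected spans for opaque placeholders before translation, then restoring them afterwards. A minimal sketch of the idea covering only code fences and display math; the translator's actual node-based implementation is more thorough:

```python
import re

# Protect fenced code blocks and display math ($$...$$).
PROTECTED = re.compile(r"```.*?```|\$\$.*?\$\$", re.DOTALL)

def freeze(text: str) -> tuple[str, list[str]]:
    """Replace protected spans with numbered placeholders."""
    frozen: list[str] = []
    def stash(match: re.Match) -> str:
        frozen.append(match.group(0))
        return f"⟦{len(frozen) - 1}⟧"  # token the LLM is told to leave untouched
    return PROTECTED.sub(stash, text), frozen

def thaw(text: str, frozen: list[str]) -> str:
    """Restore placeholders after the LLM has translated the prose."""
    for i, block in enumerate(frozen):
        text = text.replace(f"⟦{i}⟧", block)
    return text

doc = "Energy: $$E = mc^2$$ stays intact."
masked, blocks = freeze(doc)
assert thaw(masked, blocks) == doc  # the round trip is lossless
```

Because the LLM never sees the protected spans, it cannot corrupt them, which is what keeps formulas and tables stable through translation.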

2. Paper Extract: Structured Knowledge

Turn loose markdown files into a queryable database.

  • Templates: built-in prompts like simple, eight_questions, and deep_read guide the LLM to extract specific insights.
  • Async and throttled: precise control over concurrency (--max-concurrency), rate limits (--sleep-every), and request timeout (--timeout).
  • Incremental: skips already processed files; resumes from where you left off.
  • Stage resume: multi-stage templates persist per-module outputs; use --force-stage <name> to rerun a module.
  • Stage DAG: enable --stage-dag (or extract.stage_dag = true) for dependency-aware parallelism; DAG mode only passes dependency outputs to a stage and --dry-run prints the per-stage plan.
  • Diagram hints: deep_read can emit inferred diagrams labeled [Inferred]; use recognize fix-mermaid on rendered markdown if needed.
  • Stage focus: multi-stage runs emphasize the active module and summarize others to reduce context overload.
  • Range filter: use --start-idx/--end-idx to slice inputs; range applies before --retry-failed/--retry-failed-stages (--end-idx -1 = last item).
  • Retry failed stages: use --retry-failed-stages to re-run only failed stages (multi-stage templates); missing stages are forced to run. Sequential retry plans enqueue only stages that still need execution, and the final paper_infos.json stays aligned with the final errors.json (documents with unresolved errors are omitted from output until fixed).

uv run deepresearch-flow paper extract \
  --input ./library \
  --output paper_data.json \
  --template-dir ./my-custom-prompts \
  --max-concurrency 10 \
  --timeout 180

# Extract items 0..99, then retry only failed ones from that range
uv run deepresearch-flow paper extract \
  --input ./library \
  --start-idx 0 \
  --end-idx 100 \
  --retry-failed \
  --model claude/claude-3-5-sonnet-20240620

# Retry only failed stages in multi-stage templates
uv run deepresearch-flow paper extract \
  --input ./library \
  --retry-failed-stages \
  --model claude/claude-3-5-sonnet-20240620

3. Recognize Fix: Repair Math and Mermaid

Fix broken LaTeX formulas and Mermaid diagrams in markdown or JSON outputs.

  • Retry Failed: use --retry-failed with the prior --report output to reprocess only failed formulas/diagrams.

uv run deepresearch-flow recognize fix-math \
  --input ./docs \
  --in-place \
  --model claude/claude-3-5-sonnet-20240620 \
  --report ./fix-math-errors.json \
  --retry-failed

uv run deepresearch-flow recognize fix-mermaid \
  --input ./docs \
  --in-place \
  --model claude/claude-3-5-sonnet-20240620 \
  --report ./fix-mermaid-errors.json \
  --retry-failed
5. Database and UI: Your Personal ArXiv

The db serve command creates a local research station.

  • Split View: read the original PDF/Markdown on the left and the Summary/Translation on the right.
  • Full Text Search: search by title, author, year, or content tags (tag:fpga year:2023..2024).
  • Stats: visualize publication trends and keyword frequencies.
  • PDF Viewer: built-in PDF.js viewer prevents cross-origin issues with local files.
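The filter syntax shown above (tag:fpga year:2023..2024) mixes free-text words with key:value terms; a minimal, hypothetical parser for that shape could look like this (it is not the actual db serve parser):

```python
def parse_query(query):
    """Split a search query into free-text words and key:value filters.
    Range values like 2023..2024 become (low, high) tuples."""
    text, filters = [], {}
    for token in query.split():
        if ":" in token:
            key, _, value = token.partition(":")
            if ".." in value:
                low, high = value.split("..", 1)
                filters[key] = (int(low), int(high))
            else:
                filters[key] = value
        else:
            text.append(token)
    return " ".join(text), filters

print(parse_query("fpga routing tag:fpga year:2023..2024"))
# → ('fpga routing', {'tag': 'fpga', 'year': (2023, 2024)})
```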
uv run deepresearch-flow paper db serve \
  --input paper_infos.json \
  --pdf-root ./pdfs \
  --cache-dir .cache/db
6. Paper DB Compare: Coverage Audit

Compare two datasets (A/B) to find missing PDFs, markdowns, translations, or JSON items, with match metadata.
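Conceptually, a coverage audit matches artifacts across the two sides by a normalized key and reports what is missing on each side. A simplified sketch of that idea (the key normalization and function names here are hypothetical, not the tool's actual matching logic):

```python
from pathlib import Path

def coverage(side_a, side_b):
    """Compare two artifact lists by normalized filename stem and
    report which items are missing on each side."""
    norm = lambda p: Path(p).stem.lower().replace(" ", "_")
    keys_a = {norm(p) for p in side_a}
    keys_b = {norm(p) for p in side_b}
    return {"missing_in_b": sorted(keys_a - keys_b),
            "missing_in_a": sorted(keys_b - keys_a)}

report = coverage(["Paper One.pdf", "paper_two.pdf"],
                  ["paper_one.md"])
print(report)
# → {'missing_in_b': ['paper_two'], 'missing_in_a': []}
```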

uv run deepresearch-flow paper db compare \
  --input-a ./a.json \
  --md-root-b ./md_root \
  --output-csv ./compare.csv

# Compare translated markdowns by language
uv run deepresearch-flow paper db compare \
  --md-translated-root-a ./translated_a \
  --md-translated-root-b ./translated_b \
  --lang zh
7. Paper DB Extract: Matched Export

Extract matched JSON entries or translated Markdown after coverage comparison.

uv run deepresearch-flow paper db extract \
  --json ./processed.json \
  --input-bibtex ./refs.bib \
  --pdf-root ./pdfs \
  --output-json ./matched.json \
  --output-csv ./extract.csv

# Use a JSON reference list to filter the target JSON
uv run deepresearch-flow paper db extract \
  --json ./processed.json \
  --input-json ./reference.json \
  --pdf-root ./pdfs \
  --output-json ./matched.json \
  --output-csv ./extract.csv

# Extract translated markdowns by language
uv run deepresearch-flow paper db extract \
  --md-root ./md_root \
  --md-translated-root ./translated \
  --lang zh \
  --output-md-translated-root ./translated_matched \
  --output-csv ./extract.csv
8. Recognize: OCR Post-Processing

Tools to clean up raw outputs from OCR engines like MinerU.

  • Embed Images: convert local image links to Base64 for a portable single-file Markdown.
  • Unpack Images: extract Base64 images back to files.
  • Organize: flatten nested OCR output directories.
  • Fix: apply OCR fixes and rumdl formatting during organize, or as a standalone step.
  • Fix JSON: apply the same fixes to markdown fields inside paper JSON outputs.
  • Fix Math: validate and repair LaTeX formulas with optional LLM assistance.
  • Fix Mermaid: validate and repair Mermaid diagrams (requires mmdc from mermaid-cli).
  • Recommended order: fix -> fix-math -> fix-mermaid -> fix.
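One of the reference fixes described earlier turns plain bracket citations into clickable footnotes ([1] -> [^1]). A minimal sketch of that kind of transform (illustrative only, not the actual fix implementation):

```python
import re

def fix_references(markdown):
    """Rewrite citation brackets like [1] as footnote references [^1],
    skipping brackets that are already footnotes or markdown links."""
    # [digits] not followed by '(' (which would be a markdown link)
    return re.sub(r"\[(\d+)\](?!\()", r"[^\1]", markdown)

print(fix_references("As shown in [1] and [12], see [link](http://x)."))
# → As shown in [^1] and [^12], see [link](http://x).
```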
uv run deepresearch-flow recognize md embed --input ./raw_ocr --output ./clean_md
# Organize MinerU output and apply OCR fixes
uv run deepresearch-flow recognize organize \
  --input ./mineru_outputs \
  --output-simple ./ocr_md \
  --fix

# Fix and format existing markdown outputs
uv run deepresearch-flow recognize fix \
  --input ./ocr_md \
  --output ./ocr_md_fixed

# Fix in place
uv run deepresearch-flow recognize fix \
  --input ./ocr_md \
  --in-place

# Fix JSON outputs in place
uv run deepresearch-flow recognize fix \
  --json \
  --input ./paper_outputs \
  --in-place

# Fix LaTeX formulas in markdown
uv run deepresearch-flow recognize fix-math \
  --input ./docs \
  --model openai/gpt-4o-mini \
  --in-place

# Fix Mermaid diagrams in JSON outputs
uv run deepresearch-flow recognize fix-mermaid \
  --json \
  --input ./paper_outputs \
  --model openai/gpt-4o-mini \
  --in-place

Docker Support

Don't want to manage Python environments?

docker run --rm -v $(pwd):/app -it ghcr.io/nerdneilsfield/deepresearch-flow:latest --help

Deploy image (API + frontend via nginx):

docker run --rm -p 8899:8899 \
  -v $(pwd)/paper_snapshot.db:/db/papers.db \
  -v $(pwd)/paper-static:/static \
  ghcr.io/nerdneilsfield/deepresearch-flow:deploy-latest

Notes:

  • nginx listens on 8899 and proxies /api, /mcp, and /mcp-sse to the internal API at 127.0.0.1:8000.
  • Mount your snapshot DB to /db/papers.db inside the container.
  • Mount snapshot static assets to /static when serving assets from this container (default PAPER_DB_STATIC_BASE is /static).
  • If PAPER_DB_STATIC_BASE is a full URL (e.g. https://static.example.com), nginx still serves the frontend locally, while API responses use that external static base for asset links.

Docker Compose example (two modes):

docker compose -f scripts/docker/docker-compose.example.yml --profile local-static up
# or
docker compose -f scripts/docker/docker-compose.example.yml --profile external-static up

External static assets example:

docker run --rm -p 8899:8899 \
  -v $(pwd)/paper_snapshot.db:/db/papers.db \
  -e PAPER_DB_STATIC_BASE=https://static.example.com \
  ghcr.io/nerdneilsfield/deepresearch-flow:deploy-latest

Configuration

The config.toml is your control center. It supports:

  • Multiple Providers: mix and match OpenAI, DeepSeek (DashScope), Gemini, Claude, and Ollama.
  • Weighted model routing via main_model, weighted URL routing via providers[].base[], and weighted key routing via providers[].base[].key[].
  • Request-time route pooling: real LLM requests pull routes from a shared runtime pool, so weighted model -> base -> key selection happens per request, not just once at process startup.
  • Model Routing: --model accepts a single provider/model, an inline JSON model pool, or an @file JSON model pool. If omitted in paper extract, the command falls back to config.toml main_model.
  • Environment Variables: keep secrets safe using env:VAR_NAME syntax.
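A minimal config.toml sketch matching the bullets above; the exact field names are assumptions inferred from this description, so treat config.example.toml as the authoritative schema:

```toml
# Hypothetical shape -- see config.example.toml for the real schema.
main_model = [
  { model = "openai/gpt-4o-mini", weight = 4 },
  { model = "claude/claude-sonnet-4-5-20250929", weight = 1 },
]

[[providers]]
name = "openai"

  [[providers.base]]
  url = "https://api.openai.com/v1"
  weight = 1

    [[providers.base.key]]
    value = "env:OPENAI_API_KEY"   # secret resolved from the environment
    weight = 1
```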

Examples:

# Use config.toml main_model
uv run deepresearch-flow paper extract --input ./docs

# Fixed model
uv run deepresearch-flow paper extract --input ./docs --model openai/gpt-4o-mini

# Inline weighted main_model override
uv run deepresearch-flow paper extract \
  --input ./docs \
  --model '[{"model":"openai/gpt-4o-mini","weight":4},{"model":"claude/claude-sonnet-4-5-20250929","weight":1}]'

# File-based weighted main_model override
uv run deepresearch-flow paper extract \
  --input ./docs \
  --model @main_model.json
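
The @file form reads the same JSON pool from disk, so main_model.json would contain, for example:

```json
[
  { "model": "openai/gpt-4o-mini", "weight": 4 },
  { "model": "claude/claude-sonnet-4-5-20250929", "weight": 1 }
]
```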

Mode probing:

# Report only
uv run deepresearch-flow utils test-mode \
  --config ./config.toml \
  --model openai/gpt-4o-mini

# Write probe results back to config
uv run deepresearch-flow utils test-mode \
  --config ./config.toml \
  --model openai/gpt-4o-mini \
  --write-back

utils test-mode probes one weighted base + key path per requested model. Normal extraction, translation, recognize repair, and tag-generation flows now select routes from the runtime pool per request.
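The per-request model -> base -> key selection described above boils down to weighted random sampling at each level. A sketch of the routing idea (an illustration, not the project's code):

```python
import random

def pick(routes):
    """Weighted random choice over [{'name': ..., 'weight': ...}, ...]."""
    weights = [r["weight"] for r in routes]
    return random.choices(routes, weights=weights, k=1)[0]

models = [{"name": "openai/gpt-4o-mini", "weight": 4},
          {"name": "claude/claude-sonnet-4-5-20250929", "weight": 1}]
bases = [{"name": "https://api.openai.com/v1", "weight": 1}]
keys = [{"name": "env:OPENAI_API_KEY", "weight": 1}]

# Each real request draws model, then base, then key from the pool,
# so the weighted split happens per request, not once at startup.
route = (pick(models)["name"], pick(bases)["name"], pick(keys)["name"])
print(route)
```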

See config.example.toml for a full reference.


Built with love for the Open Science community.
