Workflow tools for paper extraction, review, and research automation.

ai-deepresearch-flow

From documents to deep research insight — automatically.

English | 中文


The Core Pain Points

  • OCR Chaos: Raw markdown from OCR tools is often broken: tables drift, formulas break, and references are not clickable.
  • Translation Nightmares: Translating technical papers often destroys code blocks, LaTeX formulas, and table structures.
  • Information Overload: Extracting structured insights (authors, venues, summaries) from hundreds of PDFs manually is impossible.
  • Context Switching: Managing PDFs, summaries, and translations in different windows kills focus.

The Solution

DeepResearch Flow provides a unified pipeline to Repair, Translate, Extract, and Serve your research library.

Key Features

  • Smart Extraction: Turn unstructured Markdown into schema-enforced JSON (summaries, metadata, Q&A) using LLMs (OpenAI, Claude, Gemini, etc.).
  • Precision Translation: Translate OCR Markdown to Chinese/Japanese (.zh.md, .ja.md) while freezing formulas, code, tables, and references. No more broken layout.
  • Local Knowledge DB: A high-performance local Web UI to browse papers with Split View (Source vs. Translated vs. Summary), full-text search, and multi-dimensional filtering.
  • Snapshot + API Serve: Build a production-ready SQLite snapshot with static assets, then serve a read-only JSON API for a separate frontend.
  • Coverage Compare: Compare JSON/PDF/Markdown/Translated datasets to find missing artifacts and export CSV reports.
  • Matched Export: Extract matched JSON or translated Markdown after coverage checks.
  • OCR Post-Processing: Automatically fix broken references ([1] -> [^1]), merge split paragraphs, and standardize layouts.

Quick Start

1) Installation

# Recommended: using uv for speed
uv pip install deepresearch-flow

# Or standard pip
pip install deepresearch-flow

2) Configuration

Set up your LLM providers. We support OpenAI, Claude, Gemini, Ollama, and more.

cp config.example.toml config.toml
# Edit config.toml to add your API keys (e.g., env:OPENAI_API_KEY)

Multiple keys per provider are supported. Keys rotate per request and enter a short cooldown on retryable errors. You can also provide quota metadata per key:

api_keys = [
  "env:OPENAI_API_KEY",
  { key = "env:OPENAI_API_KEY_2", quota_duration = 18000, reset_time = "2026-01-23 18:04:25 +0800 CST", quota_error_tokens = ["exceed", "quota"] }
]
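
The rotation behavior described above can be sketched as follows (illustrative only; the class and method names are hypothetical, not the package's actual API):

```python
import time
from itertools import cycle

class KeyRotator:
    """Round-robin API keys, skipping keys in cooldown after retryable errors."""

    def __init__(self, keys, cooldown_seconds=60.0):
        self._keys = list(keys)
        self._cycle = cycle(self._keys)
        self._cooldown_until = {k: 0.0 for k in self._keys}
        self._cooldown_seconds = cooldown_seconds

    def next_key(self):
        # Try each key at most once per call; skip keys still cooling down.
        for _ in range(len(self._keys)):
            key = next(self._cycle)
            if time.monotonic() >= self._cooldown_until[key]:
                return key
        raise RuntimeError("all keys are in cooldown")

    def report_retryable_error(self, key):
        # Put the failing key on a short cooldown before it is reused.
        self._cooldown_until[key] = time.monotonic() + self._cooldown_seconds
```

Quota metadata (quota_duration, reset_time, quota_error_tokens) would extend the cooldown logic per key.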

3) The "Zero to Hero" Workflow

Step 1: Extract Insights

Scan a folder of markdown files and extract structured summaries.

uv run deepresearch-flow paper extract \
  --input ./docs \
  --model openai/gpt-4o-mini \
  --prompt-template deep_read


Step 1.1: Verify & Retry Missing Fields

Validate extracted JSON against the template schema and retry only the missing items.

uv run deepresearch-flow paper db verify \
  --input-json ./paper_infos.json \
  --prompt-template deep_read \
  --output-json ./paper_verify.json

uv run deepresearch-flow paper extract \
  --input ./docs \
  --model openai/gpt-4o-mini \
  --prompt-template deep_read \
  --retry-list-json ./paper_verify.json


Step 2: Translate Safely

Translate papers to Chinese, protecting LaTeX and tables.

uv run deepresearch-flow translator translate \
  --input ./docs \
  --target-lang zh \
  --model openai/gpt-4o-mini \
  --fix-level moderate

Step 3: Repair OCR Outputs (Recommended)

Recommended sequence to stabilize markdown before serving:

# 1) Fix OCR markdown (auto-detects JSON if inputs are .json)
uv run deepresearch-flow recognize fix \
  --input ./docs \
  --in-place


# 2) Fix LaTeX formulas
uv run deepresearch-flow recognize fix-math \
  --input ./docs \
  --model openai/gpt-4o-mini \
  --in-place


# 3) Fix Mermaid diagrams
uv run deepresearch-flow recognize fix-mermaid \
  --input ./paper_outputs \
  --json \
  --model openai/gpt-4o-mini \
  --in-place


# (optional) Retry failed formulas/diagrams only
uv run deepresearch-flow recognize fix-math \
  --input ./docs \
  --model openai/gpt-4o-mini \
  --retry-failed

uv run deepresearch-flow recognize fix-mermaid \
  --input ./paper_outputs \
  --json \
  --model openai/gpt-4o-mini \
  --retry-failed


# 4) Fix again to normalize formatting
uv run deepresearch-flow recognize fix \
  --input ./docs \
  --in-place

Step 4: Serve Your Database

Launch a local UI to read and manage your papers.

uv run deepresearch-flow paper db serve \
  --input paper_infos.json \
  --md-root ./docs \
  --md-translated-root ./docs \
  --host 127.0.0.1

Step 4.5: Build Snapshot + Serve API + Frontend (Recommended)

Build a production snapshot (SQLite + static assets), serve a read-only API, and run the frontend.

# 1) Build snapshot + static export
uv run deepresearch-flow paper db snapshot build \
  --input ./paper_infos.json \
  --bibtex ./papers.bib \
  --md-root ./docs \
  --md-translated-root ./docs \
  --pdf-root ./pdfs \
  --output-db ./dist/paper_snapshot.db \
  --static-export-dir ./dist/paper-static

# 2) Serve static assets (CORS required for ZIP export)
npx http-server ./dist/paper-static -p 8002 --cors

# 3) Serve API (read-only)
PAPER_DB_STATIC_BASE_URL=http://127.0.0.1:8002 \
uv run deepresearch-flow paper db api serve \
  --snapshot-db ./dist/paper_snapshot.db \
  --cors-origin http://127.0.0.1:5173 \
  --host 127.0.0.1 --port 8001

# 4) Run frontend
cd frontend
npm install
VITE_PAPER_DB_API_BASE=http://127.0.0.1:8001/api/v1 \
VITE_PAPER_DB_STATIC_BASE=http://127.0.0.1:8002 \
npm run dev

Step 4.6: Supplement Missing Templates (Optional)

If some papers are missing specific templates (e.g., deep_read), you can identify the gaps and run a supplemental extraction:

# 1) Check missing templates in snapshot
uv run deepresearch-flow paper db snapshot show-missing \
  --snapshot-db ./dist/paper_snapshot.db

# 2) Export papers missing specific template (with file paths for extraction)
uv run deepresearch-flow paper db snapshot export-missing \
  --snapshot-db ./dist/paper_snapshot.db \
  --type template \
  --template deep_read \
  --static-export-dir ./dist/paper-static \
  --output ./missing_deep_read.json \
  --txt-output ./missing_ids.txt \
  --output-paths ./extractable_paths.txt

# 3) Extract missing templates (only for papers with source markdown)
uv run deepresearch-flow paper extract \
  --model openai/gpt-4o-mini \
  --prompt-template deep_read \
  --input-list ./extractable_paths.txt \
  --output ./deep_read_supplement.json

# 4) Merge with existing paper_infos.json
uv run deepresearch-flow paper db merge library \
  --inputs ./paper_infos.json \
  --inputs ./deep_read_supplement.json \
  --output ./paper_infos_complete.json

# 5) Rebuild snapshot with complete data
uv run deepresearch-flow paper db snapshot build \
  --input ./paper_infos_complete.json \
  --bibtex ./papers.bib \
  --md-root ./docs \
  --md-translated-root ./docs \
  --pdf-root ./pdfs \
  --output-db ./dist/paper_snapshot_complete.db \
  --static-export-dir ./dist/paper-static-complete

Alternative 1: Supplement Missing Content (Templates/Translations)

If existing papers are missing templates or translations, supplement them without rebuilding:

# Supplement missing templates for existing papers (in-place)
uv run deepresearch-flow paper db snapshot supplement \
  --snapshot-db ./dist/paper_snapshot.db \
  --static-export-dir ./dist/paper-static \
  -i ./deep_read_supplement.json \
  --in-place

# Or output to new location
uv run deepresearch-flow paper db snapshot supplement \
  --snapshot-db ./dist/paper_snapshot.db \
  --static-export-dir ./dist/paper-static \
  -i ./deep_read_supplement.json \
  --output-db ./dist/paper_snapshot_supplemented.db \
  --output-static-dir ./dist/paper-static-supplemented

Notes:

  • --md-root and --md-translated-root are optional for snapshot supplement.
  • Use them only when you want to resolve/copy markdown files from local source directories.

Alternative 2: Add New Papers

If you have completely new papers to add to the snapshot:

# Add new papers to existing snapshot (in-place)
uv run deepresearch-flow paper db snapshot update \
  --snapshot-db ./dist/paper_snapshot.db \
  --static-export-dir ./dist/paper-static \
  -i ./new_papers.json \
  -b ./new_papers.bib \
  --md-root ./docs \
  --md-translated-root ./docs_translated \
  --pdf-root ./pdfs \
  --in-place

# Or output to new location
uv run deepresearch-flow paper db snapshot update \
  --snapshot-db ./dist/paper_snapshot.db \
  --static-export-dir ./dist/paper-static \
  -i ./new_papers.json \
  -b ./new_papers.bib \
  --md-root ./docs \
  --output-db ./dist/paper_snapshot_updated.db \
  --output-static-dir ./dist/paper-static-updated

Differences:

  • supplement: Only adds missing templates/translations for existing papers (skips new papers)
  • update: Only adds completely new papers (skips existing papers)
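
In set terms, the two modes partition incoming papers by whether they already exist in the snapshot; a sketch of the selection logic (illustrative only, not the package's code):

```python
def select_papers(existing_ids, incoming_ids, mode):
    """Which incoming paper IDs each snapshot mode will touch."""
    existing = set(existing_ids)
    incoming = set(incoming_ids)
    if mode == "supplement":
        return incoming & existing      # only papers already in the snapshot
    if mode == "update":
        return incoming - existing      # only papers not yet in the snapshot
    raise ValueError(f"unknown mode: {mode}")
```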

Upgrade Legacy Snapshot Schema (DOI/BibTeX)

Recommended: Migrate Schema In-Place (No Data Loss)

If your existing snapshot was built before DOI/BibTeX support, use the migrate command to upgrade the schema without losing any papers:

# In-place migration with timestamped backup
uv run deepresearch-flow paper db snapshot migrate \
  --snapshot-db ./dist/paper_snapshot.db \
  --bibtex ./papers.bib \
  --static-export-dir ./dist/paper-static \
  --in-place

# Or copy to new location
uv run deepresearch-flow paper db snapshot migrate \
  --snapshot-db ./dist/paper_snapshot.db \
  --bibtex ./papers.bib \
  --static-export-dir ./dist/paper-static \
  --output-db ./dist/paper_snapshot_v2.db

# Schema-only migration (no BibTeX enrichment)
uv run deepresearch-flow paper db snapshot migrate \
  --snapshot-db ./dist/paper_snapshot.db \
  --in-place

Features:

  • No data loss: Uses ALTER TABLE to upgrade schema, preserving all papers
  • Timestamped backups: Creates .bak_YYYYMMDD_HHMMSS backup files automatically
  • BibTeX enrichment: Matches papers with BibTeX and extracts DOI metadata
  • Static export update: Updates paper_index.json with DOI/BibTeX references
  • Beautiful output: Rich tables showing schema changes and match statistics

The migrate command will:

  1. Create a timestamped backup (unless --no-backup is used)
  2. Add doi column to the paper table (if missing)
  3. Create paper_bibtex table (if missing)
  4. Match papers with BibTeX entries and populate DOI/BibTeX data
  5. Update static export index with new metadata
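
At the schema level, steps 2 and 3 amount to something like the following sketch (the paper table, doi column, and paper_bibtex table names come from the description above; all other schema details are assumptions):

```python
import sqlite3

def migrate_schema(conn: sqlite3.Connection) -> None:
    """Add the doi column and paper_bibtex table if they are missing."""
    cols = {row[1] for row in conn.execute("PRAGMA table_info(paper)")}
    if "doi" not in cols:
        # ALTER TABLE preserves all existing rows (no data loss).
        conn.execute("ALTER TABLE paper ADD COLUMN doi TEXT")
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS paper_bibtex (
            paper_id   TEXT PRIMARY KEY,
            doi        TEXT,
            bibtex_raw TEXT,
            bibtex_key TEXT,
            entry_type TEXT
        )
        """
    )
    conn.commit()
```

Both operations are idempotent, which is why the migration is safe to re-run.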

Alternative: Rebuild with Previous Snapshot

If you need to rebuild from scratch while preserving identity continuity:

uv run deepresearch-flow paper db snapshot build \
  --input ./paper_infos_complete.json \
  --bibtex ./papers.bib \
  --output-db ./dist/paper_snapshot_v2.db \
  --static-export-dir ./dist/paper-static-v2 \
  --previous-snapshot-db ./dist/paper_snapshot.db

Notes:

  • --md-root, --md-translated-root, and --pdf-root are optional for this rebuild.
  • If a paper in current inputs already has DOI/BibTeX, current input wins; otherwise data can be inherited from --previous-snapshot-db.
  • Warning: This approach only includes papers from the input JSON files, so ensure all papers are included to avoid data loss.

Supplement Missing Translations

If some papers are missing translations (e.g., zh), you can export and translate them:

# 1) Export papers missing Chinese translation (with file paths)
uv run deepresearch-flow paper db snapshot export-missing \
  --snapshot-db ./dist/paper_snapshot.db \
  --type translation \
  --lang zh \
  --static-export-dir ./dist/paper-static \
  --output-paths ./to_translate_paths.txt

# 2) Translate missing papers
uv run deepresearch-flow translator translate \
  --input ./docs \
  --target-lang zh \
  --model openai/gpt-4o-mini \
  --input-list ./to_translate_paths.txt \
  --output-dir ./docs_translated

# 3) Rebuild or supplement snapshot with new translations
uv run deepresearch-flow paper db snapshot build ...
# Or use snapshot supplement if only adding translations

Other useful export types:

  • --type source_md - Papers without source markdown
  • --type pdf - Papers without PDF
  • --type translation --lang zh - Papers without Chinese translation

Incremental PDF Library Workflow

This workflow keeps a growing PDF library in sync without reprocessing everything.

# 1) Compare processed JSON vs new PDF library to find missing PDFs
uv run deepresearch-flow paper db compare \
  --input-a ./paper_infos.json \
  --pdf-root-b ./pdfs_new \
  --output-only-in-b ./pdfs_todo.txt

# 2) Stage the missing PDFs for OCR
uv run deepresearch-flow paper db transfer-pdfs \
  --input-list ./pdfs_todo.txt \
  --output-dir ./pdfs_todo \
  --copy

# (optional) use --move instead of --copy
# uv run deepresearch-flow paper db transfer-pdfs --input-list ./pdfs_todo.txt --output-dir ./pdfs_todo --move

# 3) OCR the missing PDFs (use your OCR tool; write markdowns to ./md_todo)

# 4) Export matched existing assets against the new PDF library
uv run deepresearch-flow paper db extract \
  --input-json ./paper_infos.json \
  --pdf-root ./pdfs_new \
  --output-json ./paper_infos_matched.json

uv run deepresearch-flow paper db extract \
  --md-source-root ./mds \
  --output-md-root ./mds_matched \
  --pdf-root ./pdfs_new

uv run deepresearch-flow paper db extract \
  --md-translated-root ./translated \
  --output-md-translated-root ./translated_matched \
  --pdf-root ./pdfs_new \
  --lang zh

# 5) Translate + extract summaries for the new OCR markdowns
uv run deepresearch-flow translator translate \
  --input ./md_todo \
  --target-lang zh \
  --model openai/gpt-4o-mini

uv run deepresearch-flow paper extract \
  --input ./md_todo \
  --model openai/gpt-4o-mini

# 6) Merge and serve the new library (multi-input)
uv run deepresearch-flow paper db serve \
  --input ./paper_infos_matched.json \
  --input ./paper_infos_new.json \
  --md-root ./mds_matched \
  --md-root ./md_todo \
  --md-translated-root ./translated_matched \
  --md-translated-root ./md_todo \
  --pdf-root ./pdfs_new

Merge Paper JSONs

# Merge multiple libraries using the same template
uv run deepresearch-flow paper db merge library \
  --inputs ./paper_infos_a.json \
  --inputs ./paper_infos_b.json \
  --output ./paper_infos_merged.json

# Merge multiple templates from the same library (first input wins on shared fields)
uv run deepresearch-flow paper db merge templates \
  --inputs ./simple.json \
  --inputs ./deep_read.json \
  --output ./paper_infos_templates.json

Note: paper db merge is now split into merge library and merge templates.

Merge multiple databases (PDF + Markdown + BibTeX)

# 1) Copy PDFs into a single folder
rsync -av ./pdfs_a/ ./pdfs_merged/
rsync -av ./pdfs_b/ ./pdfs_merged/

# 2) Copy Markdown folders into a single folder
rsync -av ./md_a/ ./md_merged/
rsync -av ./md_b/ ./md_merged/

# 3) Merge JSON libraries
uv run deepresearch-flow paper db merge library \
  --inputs ./paper_infos_a.json \
  --inputs ./paper_infos_b.json \
  --output ./paper_infos_merged.json

# 4) Merge BibTeX files
uv run deepresearch-flow paper db merge bibtex \
  -i ./library_a.bib \
  -i ./library_b.bib \
  -o ./library_merged.bib

Merge BibTeX files

uv run deepresearch-flow paper db merge bibtex \
  -i ./library_a.bib \
  -i ./library_b.bib \
  -o ./library_merged.bib

For duplicate keys, the entry with the most fields is kept; ties are resolved in input order.
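
The duplicate-key rule can be sketched as follows (entries modeled as plain dicts of BibTeX fields; a hypothetical helper, not the package's code):

```python
def merge_bibtex_entries(inputs):
    """Merge lists of (key, fields) entries from multiple .bib files.

    Duplicate keys keep the entry with the most fields; on a tie,
    the entry from the earlier input wins.
    """
    merged = {}
    for entries in inputs:            # inputs are in priority order
        for key, fields in entries:
            best = merged.get(key)
            if best is None or len(fields) > len(best):
                # Strictly more fields replaces; a tie keeps the earlier one.
                merged[key] = fields
    return merged
```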

Recommended: Merge templates then filter by BibTeX

# 1) Merge templates for the same library
uv run deepresearch-flow paper db merge templates \
  --inputs ./deep_read.json \
  --inputs ./simple.json \
  --output ./all.json

# 2) Filter the merged set with BibTeX
uv run deepresearch-flow paper db extract \
  --input-bibtex ./library.bib \
  --json ./all.json \
  --output-json ./library_filtered.json \
  --output-csv ./library_filtered.csv

Deployment (Static CDN)

The recommended production setup is front/back separation:

  • Static CDN hosts PDFs/Markdown/images/summaries.
  • API server serves a read-only snapshot DB.
  • Frontend is a separate static app (Vite build or any static host).


1) Build snapshot + static export

uv run deepresearch-flow paper db snapshot build \
  --input ./paper_infos.json \
  --bibtex ./papers.bib \
  --md-root ./docs \
  --md-translated-root ./docs \
  --pdf-root ./pdfs \
  --output-db ./dist/paper_snapshot.db \
  --static-export-dir /data/paper-static

Notes:

  • The build host must be able to read the original PDF/Markdown roots.
  • The CDN only needs the exported directory (e.g. /data/paper-static).

2) Serve static assets with CORS + cache headers (Caddy example)

:8002 {
  root * /data/paper-static
  encode zstd gzip

  @static path /pdf/* /md/* /md_translate/* /images/*
  header @static {
    Access-Control-Allow-Origin *
    Access-Control-Allow-Methods GET,HEAD,OPTIONS
    Access-Control-Allow-Headers *
    Cache-Control "public, max-age=31536000, immutable"
  }

  @options method OPTIONS
  respond @options 204

  file_server
}

2.1) Nginx example (API + frontend on one domain, static on another)

# Frontend + API (same domain)
server {
  listen 80;
  server_name frontend.example.com;

  root /var/www/paper-frontend;
  index index.html;

  location / {
    try_files $uri /index.html;
  }

  location /api/ {
    proxy_pass http://127.0.0.1:8001;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
  }

  location ^~ /mcp {
    proxy_pass http://127.0.0.1:8001;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
  }

  # SSE transport for MCP clients that require Server-Sent Events
  location ^~ /mcp-sse {
    proxy_pass http://127.0.0.1:8001;
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_buffering off;
    proxy_cache off;
    proxy_read_timeout 3600s;
    proxy_send_timeout 3600s;
    chunked_transfer_encoding off;
    add_header X-Accel-Buffering no;
  }
}

# Static assets (separate domain)
server {
  listen 80;
  server_name static.example.com;

  root /data/paper-static;

  location / {
    add_header Access-Control-Allow-Origin *;
    add_header Access-Control-Allow-Methods "GET,HEAD,OPTIONS";
    add_header Access-Control-Allow-Headers "*";
    add_header Cache-Control "public, max-age=31536000, immutable";
    try_files $uri =404;
  }
}

3) Start the API server (read-only)

export PAPER_DB_STATIC_BASE_URL="https://static.example.com"

uv run deepresearch-flow paper db api serve \
  --snapshot-db /data/paper_snapshot.db \
  --cors-origin https://frontend.example.com \
  --host 0.0.0.0 --port 8001

BibTeX metadata endpoint:

  • GET /api/v1/papers/{paper_id}/bibtex
  • Success payload: { paper_id, doi, bibtex_raw, bibtex_key, entry_type }
  • Error codes:
    • paper_not_found
    • bibtex_not_found
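
A minimal client for this endpoint might look like the sketch below. It uses only the URL and success payload shape documented above; the error payload carrying a code field is an assumption:

```python
import json
from urllib import request, error

def bibtex_url(api_base: str, paper_id: str) -> str:
    """Build the BibTeX metadata endpoint URL for a paper."""
    return f"{api_base.rstrip('/')}/papers/{paper_id}/bibtex"

def fetch_bibtex(api_base: str, paper_id: str) -> dict:
    """Fetch { paper_id, doi, bibtex_raw, bibtex_key, entry_type } or raise."""
    url = bibtex_url(api_base, paper_id)
    try:
        with request.urlopen(url) as resp:
            return json.load(resp)
    except error.HTTPError as exc:
        body = json.loads(exc.read() or b"{}")
        # Expected error codes: paper_not_found, bibtex_not_found
        # (assumes the error body exposes them under a "code" field).
        raise LookupError(body.get("code", "unknown_error")) from exc
```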

3.1) MCP (FastMCP Streamable HTTP + SSE)

This project exposes MCP servers mounted on the snapshot API:

  • Streamable HTTP endpoint: http://<host>:8001/mcp
  • SSE endpoint: http://<host>:8001/mcp-sse
  • Transport behavior:
    • /mcp: Streamable HTTP via POST only (GET returns 405)
    • /mcp-sse: SSE-enabled transport (supports GET handshake)
  • Protocol header: optional mcp-protocol-version (2025-03-26 or 2025-06-18)
  • Static reads: summary/source/translation are served as text content by reading snapshot static assets (local-first via PAPER_DB_STATIC_EXPORT_DIR, HTTP fallback via PAPER_DB_STATIC_BASE / PAPER_DB_STATIC_BASE_URL)

Optional (avoid HTTP fetch by reading exported assets directly on the API host):

export PAPER_DB_STATIC_EXPORT_DIR=/data/paper-static

MCP Tools (API functions)

search_papers(query, limit=10) — full-text search (relevance-ranked)
  • Args:
    • query (str): keywords / topic query
    • limit (int): number of results (clamped to API max page size)
  • Returns: list of { paper_id, title, year, venue, snippet_markdown }
search_papers_by_keyword(keyword, limit=10) — facet keyword search
  • Args:
    • keyword (str): keyword substring
    • limit (int): number of results (clamped)
  • Returns: list of { paper_id, title, year, venue, snippet_markdown }
get_paper_metadata(paper_id) — metadata + available summary templates
  • Args:
    • paper_id (str)
  • Returns: dict with:
    • paper_id, title, year, venue
    • doi, arxiv_id, openreview_id, paper_pw_url
    • has_bibtex
    • preferred_summary_template, available_summary_templates
get_paper_bibtex(paper_id) — persisted BibTeX payload
  • Args:
    • paper_id (str)
  • Returns: dict with:
    • paper_id, doi, bibtex_raw, bibtex_key, entry_type
  • Errors:
    • paper_not_found
    • bibtex_not_found
get_paper_summary(paper_id, template=None, max_chars=None) — summary JSON as raw text
  • Notes:
    • Uses preferred_summary_template if template is omitted
    • Returns the full JSON content (not a URL)
  • Args:
    • paper_id (str)
    • template (str | null)
    • max_chars (int | null): truncation limit
  • Returns: JSON string (may include a [truncated: ...] marker)
get_paper_source(paper_id, max_chars=None) — source markdown as raw text
  • Args:
    • paper_id (str)
    • max_chars (int | null): truncation limit
  • Returns: markdown string (may include a [truncated: ...] marker)
get_database_stats() — snapshot-level stats
  • Returns:
    • total
    • years, months: list of { value, paper_count }
    • authors, venues, institutions, keywords, tags: top lists of { value, paper_count }
list_top_facets(category, limit=20) — top values for one facet
  • Args:
    • category: author | venue | keyword | institution | tag
    • limit (int)
  • Returns: list of { value, paper_count }
filter_papers(author=None, venue=None, year=None, keyword=None, tag=None, limit=10) — structured filtering
  • Args (all optional except limit):
    • author, venue, keyword, tag: substring match
    • year: exact match
    • limit (int): number of results (clamped)
  • Returns: list of { paper_id, title, year, venue }
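
On the wire, MCP tool invocations are JSON-RPC 2.0 messages; a tools/call request for search_papers can be built as in this sketch of the message shape per the MCP specification (session headers and the HTTP POST itself are omitted):

```python
import json

def build_tool_call(name: str, arguments: dict, request_id: int = 1) -> str:
    """Serialize an MCP tools/call request (JSON-RPC 2.0) for POST /mcp."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })

# Example body for a relevance-ranked search:
body = build_tool_call("search_papers", {"query": "sparse attention", "limit": 5})
```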

MCP Resources (URI access)

paper://{paper_id}/metadata — metadata JSON

Returns the same content as get_paper_metadata(paper_id) (as a JSON string).

paper://{paper_id}/summary — preferred summary JSON

Returns the same content as get_paper_summary(paper_id) (preferred template; JSON string).

paper://{paper_id}/summary/{template} — summary JSON for template

Returns the same content as get_paper_summary(paper_id, template=template) (JSON string).

paper://{paper_id}/source — source markdown

Returns the same content as get_paper_source(paper_id) (markdown string).

paper://{paper_id}/translation/{lang} — translated markdown

Returns translated markdown for lang (e.g. zh, ja) when available.
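
The resource URIs above follow a small, regular scheme; a parser can be sketched as follows (illustrative only, not part of the package):

```python
def parse_paper_uri(uri: str) -> dict:
    """Split paper://{paper_id}/{kind}[/{arg}] into its components."""
    prefix = "paper://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a paper URI: {uri}")
    parts = uri[len(prefix):].split("/")
    if len(parts) < 2:
        raise ValueError(f"missing resource kind: {uri}")
    paper_id, kind = parts[0], parts[1]
    arg = parts[2] if len(parts) > 2 else None  # summary template or lang
    return {"paper_id": paper_id, "kind": kind, "arg": arg}
```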

4) Frontend (static build or dev)

cd frontend
npm install

# Dev
VITE_PAPER_DB_API_BASE=https://api.example.com/api/v1 \
VITE_PAPER_DB_STATIC_BASE=https://static.example.com \
npm run dev

# Build for static hosting
VITE_PAPER_DB_API_BASE=https://api.example.com/api/v1 \
VITE_PAPER_DB_STATIC_BASE=https://static.example.com \
npm run build

Comprehensive Guide

1. Translator: OCR-Safe Translation

The translator module is built for scientific documents. It uses a node-based architecture to ensure stability.

  • Structure Protection: automatically detects and "freezes" code blocks, LaTeX ($$...$$), HTML tables, and images before sending text to the LLM.
  • OCR Repair: use --fix-level to merge broken paragraphs and convert text references ([1]) to clickable Markdown footnotes ([^1]).
  • Context-Aware: supports retries for failed chunks and falls back gracefully.
  • Group Concurrency: use --group-concurrency to run multiple translation groups in parallel per document.

# Translate with structure protection and OCR repairs
uv run deepresearch-flow translator translate \
  --input ./paper.md \
  --target-lang ja \
  --fix-level aggressive \
  --group-concurrency 4 \
  --model claude/claude-3-5-sonnet-20240620
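
The "freeze" step can be illustrated with a minimal sketch. The real implementation is node-based and more robust; the regexes and placeholder format here are assumptions for illustration:

```python
import re

# Fragments that must not be altered by the LLM: fenced code,
# display LaTeX, and image links.
PROTECTED = re.compile(
    r"(```.*?```|\$\$.*?\$\$|!\[[^\]]*\]\([^)]*\))",
    re.DOTALL,
)

def freeze(text):
    """Replace protected fragments with placeholder tokens before translation."""
    frozen = {}

    def _stash(match):
        token = f"\u27e6BLOCK{len(frozen)}\u27e7"   # e.g. ⟦BLOCK0⟧
        frozen[token] = match.group(0)
        return token

    return PROTECTED.sub(_stash, text), frozen

def restore(text, frozen):
    """Put the original fragments back after translation."""
    for token, original in frozen.items():
        text = text.replace(token, original)
    return text
```

Only the text between placeholders reaches the model, so formulas, code, and tables survive translation byte-for-byte.
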

2. Paper Extract: Structured Knowledge

Turn loose markdown files into a queryable database.

  • Templates: built-in prompts like simple, eight_questions, and deep_read guide the LLM to extract specific insights.
  • Async and throttled: precise control over concurrency (--max-concurrency), rate limits (--sleep-every), and request timeout (--timeout).
  • Incremental: skips already processed files; resumes from where you left off.
  • Stage resume: multi-stage templates persist per-module outputs; use --force-stage <name> to rerun a module.
  • Stage DAG: enable --stage-dag (or extract.stage_dag = true) for dependency-aware parallelism; DAG mode only passes dependency outputs to a stage and --dry-run prints the per-stage plan.
  • Diagram hints: deep_read can emit inferred diagrams labeled [Inferred]; use recognize fix-mermaid on rendered markdown if needed.
  • Stage focus: multi-stage runs emphasize the active module and summarize others to reduce context overload.
  • Range filter: use --start-idx/--end-idx to slice inputs; range applies before --retry-failed/--retry-failed-stages (--end-idx -1 = last item).
  • Retry failed stages: use --retry-failed-stages to re-run only failed stages (multi-stage templates); missing stages are forced to run. Retry runs keep existing results and only update retried items.

uv run deepresearch-flow paper extract \
  --input ./library \
  --output paper_data.json \
  --template-dir ./my-custom-prompts \
  --max-concurrency 10 \
  --timeout 180

# Extract items 0..99, then retry only failed ones from that range
uv run deepresearch-flow paper extract \
  --input ./library \
  --start-idx 0 \
  --end-idx 100 \
  --retry-failed \
  --model claude/claude-3-5-sonnet-20240620

# Retry only failed stages in multi-stage templates
uv run deepresearch-flow paper extract \
  --input ./library \
  --retry-failed-stages \
  --model claude/claude-3-5-sonnet-20240620

3. Recognize Fix: Repair Math and Mermaid

Fix broken LaTeX formulas and Mermaid diagrams in markdown or JSON outputs.

  • Retry Failed: use --retry-failed with the prior --report output to reprocess only failed formulas/diagrams.

uv run deepresearch-flow recognize fix-math \
  --input ./docs \
  --in-place \
  --model claude/claude-3-5-sonnet-20240620 \
  --report ./fix-math-errors.json \
  --retry-failed

uv run deepresearch-flow recognize fix-mermaid \
  --input ./docs \
  --in-place \
  --model claude/claude-3-5-sonnet-20240620 \
  --report ./fix-mermaid-errors.json \
  --retry-failed

4. Database and UI: Your Personal ArXiv

The db serve command creates a local research station.

  • Split View: read the original PDF/Markdown on the left and the Summary/Translation on the right.
  • Full Text Search: search by title, author, year, or content tags (tag:fpga year:2023..2024).
  • Stats: visualize publication trends and keyword frequencies.
  • PDF Viewer: built-in PDF.js viewer prevents cross-origin issues with local files.

uv run deepresearch-flow paper db serve \
  --input paper_infos.json \
  --pdf-root ./pdfs \
  --cache-dir .cache/db
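
The query syntax shown above (tag:fpga year:2023..2024) can be modeled as free-text terms plus key:value filters; a sketch, assuming the grammar is limited to what the example shows (the real parser may support more operators):

```python
def parse_query(q: str):
    """Split a search query into free-text terms and key:value filters.

    Ranges like year:2023..2024 become (lo, hi) tuples.
    """
    terms, filters = [], {}
    for token in q.split():
        if ":" in token:
            key, _, value = token.partition(":")
            if ".." in value:
                lo, _, hi = value.partition("..")
                filters[key] = (int(lo), int(hi))
            else:
                filters[key] = value
        else:
            terms.append(token)
    return terms, filters
```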

5. Paper DB Compare: Coverage Audit

Compare two datasets (A/B) to find missing PDFs, markdowns, translations, or JSON items, with match metadata.

uv run deepresearch-flow paper db compare \
  --input-a ./a.json \
  --md-root-b ./md_root \
  --output-csv ./compare.csv

# Compare translated markdowns by language
uv run deepresearch-flow paper db compare \
  --md-translated-root-a ./translated_a \
  --md-translated-root-b ./translated_b \
  --lang zh

6. Paper DB Extract: Matched Export

Extract matched JSON entries or translated Markdown after coverage comparison.

uv run deepresearch-flow paper db extract \
  --json ./processed.json \
  --input-bibtex ./refs.bib \
  --pdf-root ./pdfs \
  --output-json ./matched.json \
  --output-csv ./extract.csv

# Use a JSON reference list to filter the target JSON
uv run deepresearch-flow paper db extract \
  --json ./processed.json \
  --input-json ./reference.json \
  --pdf-root ./pdfs \
  --output-json ./matched.json \
  --output-csv ./extract.csv

# Extract translated markdowns by language
uv run deepresearch-flow paper db extract \
  --md-root ./md_root \
  --md-translated-root ./translated \
  --lang zh \
  --output-md-translated-root ./translated_matched \
  --output-csv ./extract.csv

7. Recognize: OCR Post-Processing

Tools to clean up raw outputs from OCR engines like MinerU.

  • Embed Images: convert local image links to Base64 for a portable single-file Markdown.
  • Unpack Images: extract Base64 images back to files.
  • Organize: flatten nested OCR output directories.
  • Fix: apply OCR fixes and rumdl formatting during organize, or as a standalone step.
  • Fix JSON: apply the same fixes to markdown fields inside paper JSON outputs.
  • Fix Math: validate and repair LaTeX formulas with optional LLM assistance.
  • Fix Mermaid: validate and repair Mermaid diagrams (requires mmdc from mermaid-cli).
  • Recommended order: fix -> fix-math -> fix-mermaid -> fix.

uv run deepresearch-flow recognize md embed --input ./raw_ocr --output ./clean_md

# Organize MinerU output and apply OCR fixes
uv run deepresearch-flow recognize organize \
  --input ./mineru_outputs \
  --output-simple ./ocr_md \
  --fix

# Fix and format existing markdown outputs
uv run deepresearch-flow recognize fix \
  --input ./ocr_md \
  --output ./ocr_md_fixed

# Fix in place
uv run deepresearch-flow recognize fix \
  --input ./ocr_md \
  --in-place

# Fix JSON outputs in place
uv run deepresearch-flow recognize fix \
  --json \
  --input ./paper_outputs \
  --in-place

# Fix LaTeX formulas in markdown
uv run deepresearch-flow recognize fix-math \
  --input ./docs \
  --model openai/gpt-4o-mini \
  --in-place

# Fix Mermaid diagrams in JSON outputs
uv run deepresearch-flow recognize fix-mermaid \
  --json \
  --input ./paper_outputs \
  --model openai/gpt-4o-mini \
  --in-place

Docker Support

Don't want to manage Python environments?

docker run --rm -v $(pwd):/app -it ghcr.io/nerdneilsfield/deepresearch-flow:latest --help

Deploy image (API + frontend via nginx):

docker run --rm -p 8899:8899 \
  -v $(pwd)/paper_snapshot.db:/db/papers.db \
  -v $(pwd)/paper-static:/static \
  ghcr.io/nerdneilsfield/deepresearch-flow:deploy-latest

Notes:

  • nginx listens on 8899 and proxies /api, /mcp, and /mcp-sse to the internal API at 127.0.0.1:8000.
  • Mount your snapshot DB to /db/papers.db inside the container.
  • Mount snapshot static assets to /static when serving assets from this container (default PAPER_DB_STATIC_BASE is /static).
  • If PAPER_DB_STATIC_BASE is a full URL (e.g. https://static.example.com), nginx still serves the frontend locally, while API responses use that external static base for asset links.

Docker Compose example (two modes):

docker compose -f scripts/docker/docker-compose.example.yml --profile local-static up
# or
docker compose -f scripts/docker/docker-compose.example.yml --profile external-static up

External static assets example:

docker run --rm -p 8899:8899 \
  -v $(pwd)/paper_snapshot.db:/db/papers.db \
  -e PAPER_DB_STATIC_BASE=https://static.example.com \
  ghcr.io/nerdneilsfield/deepresearch-flow:deploy-latest

Configuration

The config.toml is your control center. It supports:

  • Multiple Providers: mix and match OpenAI, DeepSeek (DashScope), Gemini, Claude, and Ollama.
  • Model Routing: explicit routing to specific models (--model provider/model_name).
  • Environment Variables: keep secrets safe using env:VAR_NAME syntax.

See config.example.toml for a full reference.


Built with love for the Open Science community.
