# ai-deepresearch-flow

From documents to deep research insight — automatically.

Workflow tools for paper extraction, review, and research automation.
## The Core Pain Points
- OCR Chaos: Raw markdown from OCR tools is often broken -- tables drift, formulas break, and references are non-clickable.
- Translation Nightmares: Translating technical papers often destroys code blocks, LaTeX formulas, and table structures.
- Information Overload: Extracting structured insights (authors, venues, summaries) from hundreds of PDFs manually is impossible.
- Context Switching: Managing PDFs, summaries, and translations in different windows kills focus.
## The Solution
DeepResearch Flow provides a unified pipeline to Repair, Translate, Extract, and Serve your research library.
## Key Features
- Smart Extraction: Turn unstructured Markdown into schema-enforced JSON (summaries, metadata, Q&A) using LLMs (OpenAI, Claude, Gemini, etc.).
- Precision Translation: Translate OCR Markdown to Chinese/Japanese (`.zh.md`, `.ja.md`) while freezing formulas, code, tables, and references. No more broken layout.
- Local Knowledge DB: A high-performance local Web UI to browse papers with Split View (Source vs. Translated vs. Summary), full-text search, and multi-dimensional filtering.
- Snapshot + API Serve: Build a production-ready SQLite snapshot with static assets, then serve a read-only JSON API for a separate frontend.
- Coverage Compare: Compare JSON/PDF/Markdown/Translated datasets to find missing artifacts and export CSV reports.
- Matched Export: Extract matched JSON or translated Markdown after coverage checks.
- OCR Post-Processing: Automatically fix broken references (`[1]` -> `[^1]`), merge split paragraphs, and standardize layouts.
## Quick Start

### 1) Installation

```shell
# Recommended: using uv for speed
uv pip install deepresearch-flow

# Or standard pip
pip install deepresearch-flow
```
### 2) Configuration

Set up your LLM providers. We support OpenAI, Claude, Gemini, Ollama, and more.

```shell
cp config.example.toml config.toml
# Edit config.toml to add your API keys (e.g., env:OPENAI_API_KEY)
```
Multiple keys per provider are supported. Keys rotate per request and enter a short cooldown on retryable errors. You can also provide quota metadata per key:
```toml
api_keys = [
    "env:OPENAI_API_KEY",
    { key = "env:OPENAI_API_KEY_2", quota_duration = 18000, reset_time = "2026-01-23 18:04:25 +0800 CST", quota_error_tokens = ["exceed", "quota"] },
]
```
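The rotation-with-cooldown behavior described above can be sketched roughly as follows. This is an illustrative sketch only, not the library's actual implementation; `KeyRotator` and its method names are hypothetical:

```python
import time

class KeyRotator:
    """Round-robin over API keys, skipping keys in cooldown (illustrative sketch)."""

    def __init__(self, keys, cooldown_seconds=30.0):
        self.keys = list(keys)
        self.cooldown_seconds = cooldown_seconds
        self.cooldown_until = {k: 0.0 for k in self.keys}  # key -> usable-again timestamp
        self.index = 0

    def next_key(self):
        """Return the next usable key, rotating once per request."""
        for _ in range(len(self.keys)):
            key = self.keys[self.index % len(self.keys)]
            self.index += 1
            if time.monotonic() >= self.cooldown_until[key]:
                return key
        raise RuntimeError("all keys are cooling down")

    def report_retryable_error(self, key):
        """Put a key into a short cooldown after a retryable error (e.g., quota hit)."""
        self.cooldown_until[key] = time.monotonic() + self.cooldown_seconds
```

A key that reports a retryable error is skipped until its cooldown expires, while the remaining keys keep serving requests.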
### 3) The "Zero to Hero" Workflow

#### Step 1: Extract Insights

Scan a folder of markdown files and extract structured summaries.

```shell
uv run deepresearch-flow paper extract \
  --input ./docs \
  --model openai/gpt-4o-mini \
  --prompt-template deep_read
```
#### Step 1.1: Verify & Retry Missing Fields

Validate extracted JSON against the template schema and retry only the missing items.

```shell
uv run deepresearch-flow paper db verify \
  --input-json ./paper_infos.json \
  --prompt-template deep_read \
  --output-json ./paper_verify.json

uv run deepresearch-flow paper extract \
  --input ./docs \
  --model openai/gpt-4o-mini \
  --prompt-template deep_read \
  --retry-list-json ./paper_verify.json
```
#### Step 2: Translate Safely

Translate papers to Chinese, protecting LaTeX and tables.

```shell
uv run deepresearch-flow translator translate \
  --input ./docs \
  --target-lang zh \
  --model openai/gpt-4o-mini \
  --fix-level moderate
```
#### Step 2.5: Run OCR on PDFs/Images (Optional)

If your source documents are PDFs or scanned images, run OCR first to produce markdown:

```shell
# 1) Copy and edit the OCR config
cp ocr.example.toml ocr.toml
# Set your PaddleOCR token: export PADDLE_OCR_TOKEN=xxx

# 2) Run OCR on a directory of PDFs
uv run deepresearch-flow recognize ocr ./pdfs --config ocr.toml --output-dir ./ocr_output
```
Output follows the mineru layout (`full.md` + `images/` per document), compatible with the repair steps below.

See `ocr.example.toml` for backend configuration. Currently supports PaddleOCR; more backends planned.
#### Step 3: Repair OCR Outputs (Recommended)

Recommended sequence to stabilize markdown before serving:

```shell
# 1) Fix OCR markdown (auto-detects JSON if inputs are .json)
uv run deepresearch-flow recognize fix \
  --input ./docs \
  --in-place

# 2) Fix LaTeX formulas
uv run deepresearch-flow recognize fix-math \
  --input ./docs \
  --model openai/gpt-4o-mini \
  --in-place

# 3) Fix Mermaid diagrams
uv run deepresearch-flow recognize fix-mermaid \
  --input ./paper_outputs \
  --json \
  --model openai/gpt-4o-mini \
  --in-place

# (optional) Retry failed formulas/diagrams only
uv run deepresearch-flow recognize fix-math \
  --input ./docs \
  --model openai/gpt-4o-mini \
  --retry-failed
uv run deepresearch-flow recognize fix-mermaid \
  --input ./paper_outputs \
  --json \
  --model openai/gpt-4o-mini \
  --retry-failed

# 4) Fix again to normalize formatting
uv run deepresearch-flow recognize fix \
  --input ./docs \
  --in-place
```
#### Step 4: Serve Your Database

Launch a local UI to read and manage your papers.

```shell
uv run deepresearch-flow paper db serve \
  --input paper_infos.json \
  --md-root ./docs \
  --md-translated-root ./docs \
  --host 127.0.0.1
```
#### Step 4.5: Build Snapshot + Serve API + Frontend (Recommended)

Build a production snapshot (SQLite + static assets), serve a read-only API, and run the frontend.

```shell
# 1) Build snapshot + static export
uv run deepresearch-flow paper db snapshot build \
  --input ./paper_infos.json \
  --bibtex ./papers.bib \
  --md-root ./docs \
  --md-translated-root ./docs \
  --pdf-root ./pdfs \
  --output-db ./dist/paper_snapshot.db \
  --static-export-dir ./dist/paper-static

# 2) Serve static assets (CORS required for ZIP export)
npx http-server ./dist/paper-static -p 8002 --cors

# 3) Serve API (read-only)
PAPER_DB_STATIC_BASE_URL=http://127.0.0.1:8002 \
uv run deepresearch-flow paper db api serve \
  --snapshot-db ./dist/paper_snapshot.db \
  --cors-origin http://127.0.0.1:5173 \
  --host 127.0.0.1 --port 8001

# 4) Run frontend
cd frontend
npm install
VITE_PAPER_DB_API_BASE=http://127.0.0.1:8001/api/v1 \
VITE_PAPER_DB_STATIC_BASE=http://127.0.0.1:8002 \
npm run dev
```
#### Step 4.6: Supplement Missing Templates (Optional)

If some papers are missing specific templates (e.g., `deep_read`), you can identify the gaps and extract the missing templates to fill them:

```shell
# 1) Check missing templates in snapshot
uv run deepresearch-flow paper db snapshot show-missing \
  --snapshot-db ./dist/paper_snapshot.db

# 2) Export papers missing a specific template (with file paths for extraction)
uv run deepresearch-flow paper db snapshot export-missing \
  --snapshot-db ./dist/paper_snapshot.db \
  --type template \
  --template deep_read \
  --static-export-dir ./dist/paper-static \
  --output ./missing_deep_read.json \
  --txt-output ./missing_ids.txt \
  --output-paths ./extractable_paths.txt

# 3) Extract missing templates (only for papers with source markdown)
uv run deepresearch-flow paper extract \
  --model openai/gpt-4o-mini \
  --prompt-template deep_read \
  --input-list ./extractable_paths.txt \
  --output ./deep_read_supplement.json

# 4) Merge with existing paper_infos.json
uv run deepresearch-flow paper db merge library \
  --inputs ./paper_infos.json \
  --inputs ./deep_read_supplement.json \
  --output ./paper_infos_complete.json

# 5) Rebuild snapshot with complete data
uv run deepresearch-flow paper db snapshot build \
  --input ./paper_infos_complete.json \
  --bibtex ./papers.bib \
  --md-root ./docs \
  --md-translated-root ./docs \
  --pdf-root ./pdfs \
  --output-db ./dist/paper_snapshot_complete.db \
  --static-export-dir ./dist/paper-static-complete
```
#### Alternative 1: Supplement Missing Content (Templates/Translations)

If existing papers are missing templates or translations, supplement them without rebuilding:

```shell
# Supplement missing templates for existing papers (in-place)
uv run deepresearch-flow paper db snapshot supplement \
  --snapshot-db ./dist/paper_snapshot.db \
  --static-export-dir ./dist/paper-static \
  -i ./deep_read_supplement.json \
  --in-place

# Or output to a new location
uv run deepresearch-flow paper db snapshot supplement \
  --snapshot-db ./dist/paper_snapshot.db \
  --static-export-dir ./dist/paper-static \
  -i ./deep_read_supplement.json \
  --output-db ./dist/paper_snapshot_supplemented.db \
  --output-static-dir ./dist/paper-static-supplemented
```
Notes:
- `--md-root` and `--md-translated-root` are optional for `snapshot supplement`.
- Use them only when you want to resolve/copy markdown files from local source directories.
#### Alternative 2: Add New Papers

If you have completely new papers to add to the snapshot:

```shell
# Add new papers to existing snapshot (in-place)
uv run deepresearch-flow paper db snapshot update \
  --snapshot-db ./dist/paper_snapshot.db \
  --static-export-dir ./dist/paper-static \
  -i ./new_papers.json \
  -b ./new_papers.bib \
  --md-root ./docs \
  --md-translated-root ./docs_translated \
  --pdf-root ./pdfs \
  --in-place

# Or output to a new location
uv run deepresearch-flow paper db snapshot update \
  --snapshot-db ./dist/paper_snapshot.db \
  --static-export-dir ./dist/paper-static \
  -i ./new_papers.json \
  -b ./new_papers.bib \
  --md-root ./docs \
  --output-db ./dist/paper_snapshot_updated.db \
  --output-static-dir ./dist/paper-static-updated
```
Differences:
- `supplement`: only adds missing templates/translations for existing papers (skips new papers).
- `update`: only adds completely new papers (skips existing papers).
## Upgrade Legacy Snapshot Schema (DOI/BibTeX)

### Recommended: Migrate Schema In-Place (No Data Loss)

If your existing snapshot was built before DOI/BibTeX support, use the `migrate` command to upgrade the schema without losing any papers:

```shell
# In-place migration with timestamped backup
uv run deepresearch-flow paper db snapshot migrate \
  --snapshot-db ./dist/paper_snapshot.db \
  --bibtex ./papers.bib \
  --static-export-dir ./dist/paper-static \
  --in-place

# Or copy to a new location
uv run deepresearch-flow paper db snapshot migrate \
  --snapshot-db ./dist/paper_snapshot.db \
  --bibtex ./papers.bib \
  --static-export-dir ./dist/paper-static \
  --output-db ./dist/paper_snapshot_v2.db

# Schema-only migration (no BibTeX enrichment)
uv run deepresearch-flow paper db snapshot migrate \
  --snapshot-db ./dist/paper_snapshot.db \
  --in-place
```
Features:
- No data loss: uses `ALTER TABLE` to upgrade the schema, preserving all papers.
- Timestamped backups: creates `.bak_YYYYMMDD_HHMMSS` backup files automatically.
- BibTeX enrichment: matches papers with BibTeX and extracts DOI metadata.
- Static export update: updates `paper_index.json` with DOI/BibTeX references.
- Beautiful output: rich tables showing schema changes and match statistics.
The `migrate` command will:
- Create a timestamped backup (unless `--no-backup` is used)
- Add a `doi` column to the `paper` table (if missing)
- Create the `paper_bibtex` table (if missing)
- Match papers with BibTeX entries and populate DOI/BibTeX data
- Update the static export index with the new metadata
### Alternative: Rebuild with Previous Snapshot

If you need to rebuild from scratch while preserving identity continuity:

```shell
uv run deepresearch-flow paper db snapshot build \
  --input ./paper_infos_complete.json \
  --bibtex ./papers.bib \
  --output-db ./dist/paper_snapshot_v2.db \
  --static-export-dir ./dist/paper-static-v2 \
  --previous-snapshot-db ./dist/paper_snapshot.db
```
Notes:
- `--md-root`, `--md-translated-root`, and `--pdf-root` are optional for this rebuild.
- If a paper in the current inputs already has DOI/BibTeX, the current input wins; otherwise the data can be inherited from `--previous-snapshot-db`.
- Warning: this approach only includes papers from the input JSON files, so ensure all papers are included to avoid data loss.
## Supplement Missing Translations

If some papers are missing translations (e.g., `zh`), you can export and translate them:

```shell
# 1) Export papers missing Chinese translation (with file paths)
uv run deepresearch-flow paper db snapshot export-missing \
  --snapshot-db ./dist/paper_snapshot.db \
  --type translation \
  --lang zh \
  --static-export-dir ./dist/paper-static \
  --output-paths ./to_translate_paths.txt

# 2) Translate missing papers
uv run deepresearch-flow translator translate \
  --input ./docs \
  --target-lang zh \
  --model openai/gpt-4o-mini \
  --input-list ./to_translate_paths.txt \
  --output-dir ./docs_translated

# 3) Rebuild or supplement snapshot with new translations
uv run deepresearch-flow paper db snapshot build ...
# Or use snapshot supplement if only adding translations
```
Other useful export types:
- `--type source_md`: papers without source markdown
- `--type pdf`: papers without PDF
- `--type translation --lang zh`: papers without Chinese translation
## Incremental PDF Library Workflow

This workflow keeps a growing PDF library in sync without reprocessing everything.

```shell
# 1) Compare processed JSON vs new PDF library to find missing PDFs
uv run deepresearch-flow paper db compare \
  --input-a ./paper_infos.json \
  --pdf-root-b ./pdfs_new \
  --output-only-in-b ./pdfs_todo.txt

# 2) Stage the missing PDFs for OCR
uv run deepresearch-flow paper db transfer-pdfs \
  --input-list ./pdfs_todo.txt \
  --output-dir ./pdfs_todo \
  --copy
# (optional) use --move instead of --copy
# uv run deepresearch-flow paper db transfer-pdfs --input-list ./pdfs_todo.txt --output-dir ./pdfs_todo --move

# 3) OCR the missing PDFs (use your OCR tool; write markdowns to ./md_todo)

# 4) Export matched existing assets against the new PDF library
uv run deepresearch-flow paper db extract \
  --input-json ./paper_infos.json \
  --pdf-root ./pdfs_new \
  --output-json ./paper_infos_matched.json
uv run deepresearch-flow paper db extract \
  --md-source-root ./mds \
  --output-md-root ./mds_matched \
  --pdf-root ./pdfs_new
uv run deepresearch-flow paper db extract \
  --md-translated-root ./translated \
  --output-md-translated-root ./translated_matched \
  --pdf-root ./pdfs_new \
  --lang zh

# 5) Translate + extract summaries for the new OCR markdowns
uv run deepresearch-flow translator translate \
  --input ./md_todo \
  --target-lang zh \
  --model openai/gpt-4o-mini
uv run deepresearch-flow paper extract \
  --input ./md_todo \
  --model openai/gpt-4o-mini

# 6) Merge and serve the new library (multi-input)
uv run deepresearch-flow paper db serve \
  --input ./paper_infos_matched.json \
  --input ./paper_infos_new.json \
  --md-root ./mds_matched \
  --md-root ./md_todo \
  --md-translated-root ./translated_matched \
  --md-translated-root ./md_todo \
  --pdf-root ./pdfs_new
```
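The core of step 1 — diffing the processed library against a PDF folder — amounts to a set difference keyed by file name. A minimal sketch, assuming matching is by file stem (the real `compare` command also uses match metadata; `find_missing_pdfs` is a hypothetical helper):

```python
from pathlib import Path

def find_missing_pdfs(processed_stems, pdf_root):
    """Return PDF paths whose file stem has no processed entry (illustrative sketch).

    processed_stems: iterable of stems already present in the processed JSON.
    pdf_root: directory scanned recursively for *.pdf files.
    """
    processed = {stem.lower() for stem in processed_stems}
    return sorted(
        pdf for pdf in Path(pdf_root).rglob("*.pdf")
        if pdf.stem.lower() not in processed
    )
```

The resulting list plays the role of `pdfs_todo.txt`: only these files need OCR and extraction.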
## Merge Paper JSONs

```shell
# Merge multiple libraries using the same template
uv run deepresearch-flow paper db merge library \
  --inputs ./paper_infos_a.json \
  --inputs ./paper_infos_b.json \
  --output ./paper_infos_merged.json

# Merge multiple templates from the same library (first input wins on shared fields)
uv run deepresearch-flow paper db merge templates \
  --inputs ./simple.json \
  --inputs ./deep_read.json \
  --output ./paper_infos_templates.json
```
Note: `paper db merge` is now split into `merge library` and `merge templates`.
## Merge multiple databases (PDF + Markdown + BibTeX)

```shell
# 1) Copy PDFs into a single folder
rsync -av ./pdfs_a/ ./pdfs_merged/
rsync -av ./pdfs_b/ ./pdfs_merged/

# 2) Copy Markdown folders into a single folder
rsync -av ./md_a/ ./md_merged/
rsync -av ./md_b/ ./md_merged/

# 3) Merge JSON libraries
uv run deepresearch-flow paper db merge library \
  --inputs ./paper_infos_a.json \
  --inputs ./paper_infos_b.json \
  --output ./paper_infos_merged.json

# 4) Merge BibTeX files
uv run deepresearch-flow paper db merge bibtex \
  -i ./library_a.bib \
  -i ./library_b.bib \
  -o ./library_merged.bib
```
## Merge BibTeX files

```shell
uv run deepresearch-flow paper db merge bibtex \
  -i ./library_a.bib \
  -i ./library_b.bib \
  -o ./library_merged.bib
```
Duplicate keys keep the entry with the most fields; on ties, the entry from the earlier input wins.
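That deduplication rule can be illustrated as follows. This is a sketch of the documented policy, not the tool's code; entries are modeled as plain dicts keyed by their citation key:

```python
def merge_bibtex_entries(inputs):
    """Merge ordered lists of BibTeX-like entries (illustrative sketch).

    Policy from the docs: on duplicate keys, keep the entry with the most
    fields; on ties, keep the entry from the earliest input.
    """
    merged = {}
    for entries in inputs:  # inputs ordered as passed on the command line
        for entry in entries:
            key = entry["key"]
            current = merged.get(key)
            # Strictly more fields wins; an equal count keeps the earlier entry.
            if current is None or len(entry) > len(current):
                merged[key] = entry
    return list(merged.values())
```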
## Recommended: Merge templates then filter by BibTeX

```shell
# 1) Merge templates for the same library
uv run deepresearch-flow paper db merge templates \
  --inputs ./deep_read.json \
  --inputs ./simple.json \
  --output ./all.json

# 2) Filter the merged set with BibTeX
uv run deepresearch-flow paper db extract \
  --input-bibtex ./library.bib \
  --json ./all.json \
  --output-json ./library_filtered.json \
  --output-csv ./library_filtered.csv
```
## Deployment (Static CDN)
The recommended production setup is front/back separation:
- Static CDN hosts PDFs/Markdown/images/summaries.
- API server serves a read-only snapshot DB.
- Frontend is a separate static app (Vite build or any static host).
### 1) Build snapshot + static export

```shell
uv run deepresearch-flow paper db snapshot build \
  --input ./paper_infos.json \
  --bibtex ./papers.bib \
  --md-root ./docs \
  --md-translated-root ./docs \
  --pdf-root ./pdfs \
  --output-db ./dist/paper_snapshot.db \
  --static-export-dir /data/paper-static
```
Notes:
- The build host must be able to read the original PDF/Markdown roots.
- The CDN only needs the exported directory (e.g. `/data/paper-static`).
2) Serve static assets with CORS + cache headers (Caddy example)
:8002 {
root * /data/paper-static
encode zstd gzip
@static path /pdf/* /md/* /md_translate/* /images/*
header @static {
Access-Control-Allow-Origin *
Access-Control-Allow-Methods GET,HEAD,OPTIONS
Access-Control-Allow-Headers *
Cache-Control "public, max-age=31536000, immutable"
}
@options method OPTIONS
respond @options 204
file_server
}
### 2.1) Nginx example (API + frontend on one domain, static on another)

```nginx
# Frontend + API (same domain)
server {
    listen 80;
    server_name frontend.example.com;

    root /var/www/paper-frontend;
    index index.html;

    location / {
        try_files $uri /index.html;
    }

    location /api/ {
        proxy_pass http://127.0.0.1:8001;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }

    location ^~ /mcp {
        proxy_pass http://127.0.0.1:8001;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }

    # SSE transport for MCP clients that require Server-Sent Events
    location ^~ /mcp-sse {
        proxy_pass http://127.0.0.1:8001;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_buffering off;
        proxy_cache off;
        proxy_read_timeout 3600s;
        proxy_send_timeout 3600s;
        chunked_transfer_encoding off;
        add_header X-Accel-Buffering no;
    }
}

# Static assets (separate domain)
server {
    listen 80;
    server_name static.example.com;

    root /data/paper-static;

    location / {
        add_header Access-Control-Allow-Origin *;
        add_header Access-Control-Allow-Methods "GET,HEAD,OPTIONS";
        add_header Access-Control-Allow-Headers "*";
        add_header Cache-Control "public, max-age=31536000, immutable";
        try_files $uri =404;
    }
}
```
### 3) Start the API server (read-only)

```shell
export PAPER_DB_STATIC_BASE_URL="https://static.example.com"

uv run deepresearch-flow paper db api serve \
  --snapshot-db /data/paper_snapshot.db \
  --cors-origin https://frontend.example.com \
  --host 0.0.0.0 --port 8001
```
BibTeX metadata endpoint: `GET /api/v1/papers/{paper_id}/bibtex`
- Success payload: `{ paper_id, doi, bibtex_raw, bibtex_key, entry_type }`
- Error codes: `paper_not_found`, `bibtex_not_found`
### 3.1) Admin API (Optional)

Enable the admin API to add or delete papers remotely via Bearer token authentication.

```shell
# Start API server with admin enabled
PAPER_DB_ADMIN_TOKEN=your-secret-token \
uv run deepresearch-flow paper db api serve \
  --snapshot-db /data/paper_snapshot.db \
  --cors-origin https://frontend.example.com \
  --host 0.0.0.0 --port 8001
```
Or pass the token via CLI flag: `--admin-token your-secret-token`

Endpoints (all require an `Authorization: Bearer <token>` header):
- `POST /api/v1/admin/papers` — Batch add papers (up to 200 per request)

  ```shell
  curl -X POST https://api.example.com/api/v1/admin/papers \
    -H "Authorization: Bearer your-secret-token" \
    -H "Content-Type: application/json" \
    -d '{"papers": [{"paper_title": "...", "paper_authors": [...], ...}]}'
  ```

  Response: `{ added, skipped, errors, paper_ids }`

- `DELETE /api/v1/admin/papers/{paper_id}` — Delete a paper and all its relations

  ```shell
  curl -X DELETE https://api.example.com/api/v1/admin/papers/{paper_id} \
    -H "Authorization: Bearer your-secret-token"
  ```

  Response: `{ deleted: true, paper_id }`
The paper JSON format is the same as the `snapshot update` input. Static files (PDF, markdown, images) are not handled by the API — upload them to your CDN separately.
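Because the batch endpoint caps each request at 200 papers, a client should chunk its list before posting. A minimal Python sketch: only the endpoint path, header, and payload shape come from the docs above; `chunk` and `post_batch` are hypothetical helper names, and real code would add retries and error handling:

```python
import json
import urllib.request

def chunk(items, size=200):
    """Split a list into batches no larger than the API's per-request cap."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def post_batch(api_base, token, papers):
    """POST one batch of papers to the admin endpoint (sketch; no retries)."""
    req = urllib.request.Request(
        f"{api_base}/api/v1/admin/papers",
        data=json.dumps({"papers": papers}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        # Documented response shape: { added, skipped, errors, paper_ids }
        return json.load(resp)
```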
### Push from Local DB to Remote

Use `api push` to merge a locally built snapshot DB into a remote deployment:

```toml
# remote.toml
[remote]
api_base_url = "https://api.example.com"
admin_token = "env:PAPER_DB_ADMIN_TOKEN"
batch_size = 100
```

```shell
# Preview what will be pushed
uv run deepresearch-flow paper db api push \
  --snapshot-db ./dist/paper_snapshot.db \
  --static-export-dir ./dist/paper-static \
  --config remote.toml \
  --dry-run

# Push to remote
uv run deepresearch-flow paper db api push \
  --snapshot-db ./dist/paper_snapshot.db \
  --static-export-dir ./dist/paper-static \
  --config remote.toml
```
- `--static-export-dir` is optional — when provided, summary JSON payloads are included so the remote side can build FTS indexes and preview text.
- Duplicate papers (same `paper_id`) are automatically skipped.
- Static files (PDF, markdown, images) are not pushed — sync them to your CDN separately (e.g., `rsync`, `aws s3 sync`).
### 3.2) MCP (FastMCP Streamable HTTP + SSE)

This project exposes MCP servers mounted on the snapshot API:

- Streamable HTTP endpoint: `http://<host>:8001/mcp`
- SSE endpoint: `http://<host>:8001/mcp-sse`
- Transport behavior:
  - `/mcp`: Streamable HTTP via `POST` only (`GET` returns 405)
  - `/mcp-sse`: SSE-enabled transport (supports a `GET` handshake)
- Protocol header: optional `mcp-protocol-version` (`2025-03-26` or `2025-06-18`)
- Static reads: summary/source/translation are served as text content by reading snapshot static assets (local-first via `PAPER_DB_STATIC_EXPORT_DIR`, HTTP fallback via `PAPER_DB_STATIC_BASE`/`PAPER_DB_STATIC_BASE_URL`)
Optional (avoid HTTP fetch by reading exported assets directly on the API host):

```shell
export PAPER_DB_STATIC_EXPORT_DIR=/data/paper-static
```
#### MCP Tools (API functions)

`search_papers(query, limit=10)` — full-text search (relevance-ranked)
- Args:
  - `query` (str): keywords / topic query
  - `limit` (int): number of results (clamped to API max page size)
- Returns: list of `{ paper_id, title, year, venue, snippet_markdown }`

`search_papers_by_keyword(keyword, limit=10)` — facet keyword search
- Args:
  - `keyword` (str): keyword substring
  - `limit` (int): number of results (clamped)
- Returns: list of `{ paper_id, title, year, venue, snippet_markdown }`

`get_paper_metadata(paper_id)` — metadata + available summary templates
- Args:
  - `paper_id` (str)
- Returns: dict with:
  - `paper_id`, `title`, `year`, `venue`
  - `doi`, `arxiv_id`, `openreview_id`, `paper_pw_url`
  - `has_bibtex`
  - `preferred_summary_template`, `available_summary_templates`

`get_paper_bibtex(paper_id)` — persisted BibTeX payload
- Args:
  - `paper_id` (str)
- Returns: dict with:
  - `paper_id`, `doi`, `bibtex_raw`, `bibtex_key`, `entry_type`
- Errors: `paper_not_found`, `bibtex_not_found`

`get_paper_summary(paper_id, template=None, max_chars=None)` — summary JSON as raw text
- Notes:
  - Uses `preferred_summary_template` if `template` is omitted
  - Returns the full JSON content (not a URL)
- Args:
  - `paper_id` (str)
  - `template` (str | null)
  - `max_chars` (int | null): truncation limit
- Returns: JSON string (may include a `[truncated: ...]` marker)

`get_paper_source(paper_id, max_chars=None)` — source markdown as raw text
- Args:
  - `paper_id` (str)
  - `max_chars` (int | null): truncation limit
- Returns: markdown string (may include a `[truncated: ...]` marker)

`get_database_stats()` — snapshot-level stats
- Returns:
  - `total`
  - `years`, `months`: list of `{ value, paper_count }`
  - `authors`, `venues`, `institutions`, `keywords`, `tags`: top lists of `{ value, paper_count }`

`list_top_facets(category, limit=20)` — top values for one facet
- Args:
  - `category`: `author | venue | keyword | institution | tag`
  - `limit` (int)
- Returns: list of `{ value, paper_count }`

`filter_papers(author=None, venue=None, year=None, keyword=None, tag=None, limit=10)` — structured filtering
- Args (all optional except `limit`):
  - `author`, `venue`, `keyword`, `tag`: substring match
  - `year`: exact match
  - `limit` (int): number of results (clamped)
- Returns: list of `{ paper_id, title, year, venue }`
#### MCP Resources (URI access)

- `paper://{paper_id}/metadata` — metadata JSON. Returns the same content as `get_paper_metadata(paper_id)` (as a JSON string).
- `paper://{paper_id}/summary` — preferred summary JSON. Returns the same content as `get_paper_summary(paper_id)` (preferred template; JSON string).
- `paper://{paper_id}/summary/{template}` — summary JSON for a template. Returns the same content as `get_paper_summary(paper_id, template=template)` (JSON string).
- `paper://{paper_id}/source` — source markdown. Returns the same content as `get_paper_source(paper_id)` (markdown string).
- `paper://{paper_id}/translation/{lang}` — translated markdown for `lang` (e.g. `zh`, `ja`) when available.
### 4) Frontend (static build or dev)

```shell
cd frontend
npm install

# Dev
VITE_PAPER_DB_API_BASE=https://api.example.com/api/v1 \
VITE_PAPER_DB_STATIC_BASE=https://static.example.com \
npm run dev

# Build for static hosting
VITE_PAPER_DB_API_BASE=https://api.example.com/api/v1 \
VITE_PAPER_DB_STATIC_BASE=https://static.example.com \
npm run build
```
## Comprehensive Guide

### 1. Translator: OCR-Safe Translation
The translator module is built for scientific documents. It uses a node-based architecture to ensure stability.
- Structure Protection: automatically detects and "freezes" code blocks, LaTeX (`$$...$$`), HTML tables, and images before sending text to the LLM.
- OCR Repair: use `--fix-level` to merge broken paragraphs and convert text references (`[1]`) to clickable Markdown footnotes (`[^1]`).
- Context-Aware: supports retries for failed chunks and falls back gracefully.
- Group Concurrency: use `--group-concurrency` to run multiple translation groups in parallel per document.
```shell
# Translate with structure protection and OCR repairs
uv run deepresearch-flow translator translate \
  --input ./paper.md \
  --target-lang ja \
  --fix-level aggressive \
  --group-concurrency 4 \
  --model claude/claude-3-5-sonnet-20240620
```
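The "freezing" idea can be sketched as a placeholder-substitution pass: mask protected spans before the LLM sees the text, then restore them afterward. This is illustrative only — the actual translator uses a node-based parser rather than regexes, and `freeze`/`thaw` are hypothetical names:

```python
import re

# Matches fenced code blocks, display LaTeX, and markdown images.
PROTECTED = re.compile(
    r"`{3}.*?`{3}"             # fenced code blocks
    r"|\$\$.*?\$\$"            # display LaTeX ($$...$$)
    r"|!\[[^\]]*\]\([^)]*\)",  # images
    re.DOTALL,
)

def freeze(text):
    """Replace protected spans with opaque placeholders before translation."""
    frozen = {}
    def stash(match):
        key = f"\u27e6BLOCK{len(frozen)}\u27e7"  # e.g. ⟦BLOCK0⟧, unlikely in prose
        frozen[key] = match.group(0)
        return key
    return PROTECTED.sub(stash, text), frozen

def thaw(text, frozen):
    """Restore protected spans after the translated text comes back."""
    for key, block in frozen.items():
        text = text.replace(key, block)
    return text
```

Since the LLM never sees the protected spans, it cannot mangle them; translation quality only affects the surrounding prose.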
### 2. Paper Extract: Structured Knowledge
Turn loose markdown files into a queryable database.
- Templates: built-in prompts like `simple`, `eight_questions`, and `deep_read` guide the LLM to extract specific insights.
- Async and throttled: precise control over concurrency (`--max-concurrency`), rate limits (`--sleep-every`), and request timeout (`--timeout`).
- Incremental: skips already processed files; resumes from where you left off.
- Stage resume: multi-stage templates persist per-module outputs; use `--force-stage <name>` to rerun a module.
- Stage DAG: enable `--stage-dag` (or `extract.stage_dag = true`) for dependency-aware parallelism; DAG mode only passes dependency outputs to a stage, and `--dry-run` prints the per-stage plan.
- Diagram hints: `deep_read` can emit inferred diagrams labeled `[Inferred]`; use `recognize fix-mermaid` on rendered markdown if needed.
- Stage focus: multi-stage runs emphasize the active module and summarize others to reduce context overload.
- Range filter: use `--start-idx`/`--end-idx` to slice inputs; the range applies before `--retry-failed`/`--retry-failed-stages` (`--end-idx -1` = last item).
- Retry failed stages: use `--retry-failed-stages` to re-run only failed stages (multi-stage templates); missing stages are forced to run. Retry runs keep existing results and only update retried items.
```shell
uv run deepresearch-flow paper extract \
  --input ./library \
  --output paper_data.json \
  --template-dir ./my-custom-prompts \
  --max-concurrency 10 \
  --timeout 180

# Extract items 0..99, then retry only failed ones from that range
uv run deepresearch-flow paper extract \
  --input ./library \
  --start-idx 0 \
  --end-idx 100 \
  --retry-failed \
  --model claude/claude-3-5-sonnet-20240620

# Retry only failed stages in multi-stage templates
uv run deepresearch-flow paper extract \
  --input ./library \
  --retry-failed-stages \
  --model claude/claude-3-5-sonnet-20240620
```
### 3. Recognize Fix: Repair Math and Mermaid

Fix broken LaTeX formulas and Mermaid diagrams in markdown or JSON outputs.

- Retry Failed: use `--retry-failed` with the prior `--report` output to reprocess only failed formulas/diagrams.
```shell
uv run deepresearch-flow recognize fix-math \
  --input ./docs \
  --in-place \
  --model claude/claude-3-5-sonnet-20240620 \
  --report ./fix-math-errors.json \
  --retry-failed

uv run deepresearch-flow recognize fix-mermaid \
  --input ./docs \
  --in-place \
  --model claude/claude-3-5-sonnet-20240620 \
  --report ./fix-mermaid-errors.json \
  --retry-failed
```
### 4. Database and UI: Your Personal ArXiv

The `db serve` command creates a local research station.
- Split View: read the original PDF/Markdown on the left and the Summary/Translation on the right.
- Full Text Search: search by title, author, year, or content tags (`tag:fpga year:2023..2024`).
- Stats: visualize publication trends and keyword frequencies.
- PDF Viewer: built-in PDF.js viewer prevents cross-origin issues with local files.
```shell
uv run deepresearch-flow paper db serve \
  --input paper_infos.json \
  --pdf-root ./pdfs \
  --cache-dir .cache/db
```
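A query like `tag:fpga year:2023..2024` mixes free-text terms with `field:value` filters, where `..` denotes a year range. A tiny parser for that shape might look like the following — purely illustrative, since the tool's actual query grammar may differ:

```python
def parse_query(query):
    """Split a search query into free-text terms and field filters (sketch).

    Tokens containing ':' become filters; a value with '..' is parsed as an
    inclusive integer range, e.g. year:2023..2024 -> (2023, 2024).
    """
    terms, filters = [], {}
    for token in query.split():
        if ":" in token:
            field, _, value = token.partition(":")
            if ".." in value:
                lo, _, hi = value.partition("..")
                filters[field] = (int(lo), int(hi))
            else:
                filters[field] = value
        else:
            terms.append(token)
    return terms, filters
```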
### 5. Paper DB Compare: Coverage Audit
Compare two datasets (A/B) to find missing PDFs, markdowns, translations, or JSON items, with match metadata.
```shell
uv run deepresearch-flow paper db compare \
  --input-a ./a.json \
  --md-root-b ./md_root \
  --output-csv ./compare.csv

# Compare translated markdowns by language
uv run deepresearch-flow paper db compare \
  --md-translated-root-a ./translated_a \
  --md-translated-root-b ./translated_b \
  --lang zh
```
### 6. Paper DB Extract: Matched Export
Extract matched JSON entries or translated Markdown after coverage comparison.
```shell
uv run deepresearch-flow paper db extract \
  --json ./processed.json \
  --input-bibtex ./refs.bib \
  --pdf-root ./pdfs \
  --output-json ./matched.json \
  --output-csv ./extract.csv

# Use a JSON reference list to filter the target JSON
uv run deepresearch-flow paper db extract \
  --json ./processed.json \
  --input-json ./reference.json \
  --pdf-root ./pdfs \
  --output-json ./matched.json \
  --output-csv ./extract.csv

# Extract translated markdowns by language
uv run deepresearch-flow paper db extract \
  --md-root ./md_root \
  --md-translated-root ./translated \
  --lang zh \
  --output-md-translated-root ./translated_matched \
  --output-csv ./extract.csv
```
### 7. Recognize: OCR Post-Processing
Tools to clean up raw outputs from OCR engines like MinerU.
- Embed Images: convert local image links to Base64 for a portable single-file Markdown.
- Unpack Images: extract Base64 images back to files.
- Organize: flatten nested OCR output directories.
- Fix: apply OCR fixes and rumdl formatting during organize, or as a standalone step.
- Fix JSON: apply the same fixes to markdown fields inside paper JSON outputs.
- Fix Math: validate and repair LaTeX formulas with optional LLM assistance.
- Fix Mermaid: validate and repair Mermaid diagrams (requires `mmdc` from mermaid-cli).
- Recommended order: `fix` -> `fix-math` -> `fix-mermaid` -> `fix`.
```shell
uv run deepresearch-flow recognize md embed --input ./raw_ocr --output ./clean_md

# Organize MinerU output and apply OCR fixes
uv run deepresearch-flow recognize organize \
  --input ./mineru_outputs \
  --output-simple ./ocr_md \
  --fix

# Fix and format existing markdown outputs
uv run deepresearch-flow recognize fix \
  --input ./ocr_md \
  --output ./ocr_md_fixed

# Fix in place
uv run deepresearch-flow recognize fix \
  --input ./ocr_md \
  --in-place

# Fix JSON outputs in place
uv run deepresearch-flow recognize fix \
  --json \
  --input ./paper_outputs \
  --in-place

# Fix LaTeX formulas in markdown
uv run deepresearch-flow recognize fix-math \
  --input ./docs \
  --model openai/gpt-4o-mini \
  --in-place

# Fix Mermaid diagrams in JSON outputs
uv run deepresearch-flow recognize fix-mermaid \
  --json \
  --input ./paper_outputs \
  --model openai/gpt-4o-mini \
  --in-place
```
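The reference repair mentioned earlier (`[1]` -> `[^1]`) boils down to a substitution that leaves markdown links and existing footnotes alone. A regex sketch of the idea, not the tool's implementation (`fix_references` is a hypothetical name):

```python
import re

def fix_references(markdown):
    """Convert bare numeric references [1] to footnote syntax [^1] (sketch).

    The lookahead skips [1](url)-style links; spans that already start with
    '^' never match because the pattern requires a digit right after '['.
    """
    return re.sub(
        r"\[(\d+)\](?!\()",  # [digits] not immediately followed by '('
        r"[^\1]",
        markdown,
    )
```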
## Docker Support

Don't want to manage Python environments?

```shell
docker run --rm -v $(pwd):/app -it ghcr.io/nerdneilsfield/deepresearch-flow:latest --help
```
Deploy image (API + frontend via nginx):

```shell
docker run --rm -p 8899:8899 \
  -v $(pwd)/paper_snapshot.db:/db/papers.db \
  -v $(pwd)/paper-static:/static \
  ghcr.io/nerdneilsfield/deepresearch-flow:deploy-latest
```
Notes:
- nginx listens on 8899 and proxies `/api`, `/mcp`, and `/mcp-sse` to the internal API at `127.0.0.1:8000`.
- Mount your snapshot DB to `/db/papers.db` inside the container.
- Mount snapshot static assets to `/static` when serving assets from this container (default `PAPER_DB_STATIC_BASE` is `/static`).
- If `PAPER_DB_STATIC_BASE` is a full URL (e.g. `https://static.example.com`), nginx still serves the frontend locally, while API responses use that external static base for asset links.
Docker Compose example (two modes):
docker compose -f scripts/docker/docker-compose.example.yml --profile local-static up
# or
docker compose -f scripts/docker/docker-compose.example.yml --profile external-static up
External static assets example:
docker run --rm -p 8899:8899 \
-v $(pwd)/paper_snapshot.db:/db/papers.db \
-e PAPER_DB_STATIC_BASE=https://static.example.com \
ghcr.io/nerdneilsfield/deepresearch-flow:deploy-latest
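The two compose profiles map onto the `docker run` commands above. The sketch below is illustrative only: the service name and exact layout are assumptions, and the authoritative file is `scripts/docker/docker-compose.example.yml`.

```yaml
# Illustrative sketch -- see scripts/docker/docker-compose.example.yml for the real file
services:
  paper-db:
    image: ghcr.io/nerdneilsfield/deepresearch-flow:deploy-latest
    ports:
      - "8899:8899"
    volumes:
      - ./paper_snapshot.db:/db/papers.db
      - ./paper-static:/static          # local-static mode only
    profiles: [local-static]
  paper-db-external:
    image: ghcr.io/nerdneilsfield/deepresearch-flow:deploy-latest
    ports:
      - "8899:8899"
    volumes:
      - ./paper_snapshot.db:/db/papers.db
    environment:
      PAPER_DB_STATIC_BASE: https://static.example.com
    profiles: [external-static]
```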
Configuration
The config.toml is your control center. It supports:
- Multiple Providers: mix and match OpenAI, DeepSeek (DashScope), Gemini, Claude, and Ollama.
- Model Routing: explicit routing to specific models (`--model provider/model_name`).
- Environment Variables: keep secrets safe using `env:VAR_NAME` syntax.
See config.example.toml for a full reference.
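As a rough illustration of how these pieces fit together, here is a minimal provider sketch. The `env:VAR_NAME` syntax is from the docs above, but the table and field names are assumptions; the actual schema is defined in `config.example.toml`.

```toml
# Illustrative sketch only -- field names are assumptions, see config.example.toml
[[providers]]
name    = "openai"
api_key = "env:OPENAI_API_KEY"   # env:VAR_NAME resolves the secret from the environment
models  = ["gpt-4o-mini"]

[[providers]]
name     = "ollama"
base_url = "http://localhost:11434"
```

A configured model would then be addressed on the command line as `--model openai/gpt-4o-mini`.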
Built with love for the Open Science community.