deepresearch-flow

Workflow tools for paper extraction, review, and research automation.

DeepResearch Flow provides command-line tools for document extraction, OCR post-processing, and paper database operations.

Quick Start

pip install deepresearch-flow
# or
uv pip install deepresearch-flow

# Development install
pip install -e .

cp config.example.toml config.toml

# Extract from a docs folder
uv run deepresearch-flow paper extract \
  --input ./docs \
  --model openai/gpt-4o-mini

# Serve a local UI
uv run deepresearch-flow paper db serve \
  --input ./paper_infos_simple.json \
  --host 127.0.0.1 \
  --port 8000

Docker images:

docker run --rm -it nerdneils/deepresearch-flow --help
# or
docker run --rm -it ghcr.io/nerdneilsfield/deepresearch-flow --help

Commands

deepresearch-flow is the top-level CLI. Workflows live under paper and recognize. Use deepresearch-flow --help, deepresearch-flow paper --help, and deepresearch-flow recognize --help to explore flags.

Configuration details

Copy config.example.toml to config.toml and edit providers.

  • Providers are configured under [[providers]].
  • Use api_keys = ["env:OPENAI_API_KEY"] to read from environment variables.
  • model_list is required for each provider and controls allowed provider/model values.
  • Explicit model routing is required: --model provider/model.
  • Supported provider types: ollama, openai_compatible, dashscope, gemini_ai_studio, gemini_vertex, azure_openai, claude.
  • Provider-specific fields: azure_openai requires endpoint, api_version, deployment; gemini_vertex requires project_id, location; claude requires anthropic_version.
  • Built-in prompt templates for extraction: simple, deep_read, eight_questions, three_pass.
  • Template rename: seven_questions is now eight_questions.
  • Render templates use paper db render-md --template-name with the same names.
  • --language defaults to en; extraction stores it as output_language and render uses that field.
  • When output_language is zh, render headings include both Chinese and English.
  • Complex templates (deep_read, eight_questions, three_pass) run multi-stage extraction and persist per-document stage files under paper_stage_outputs/.
  • Custom templates: use --prompt-system/--prompt-user with --schema-json, or --template-dir containing system.j2, user.j2, schema.json, render.j2.
  • Custom templates run in single-stage extraction mode.
  • Built-in schemas require publication_date and publication_venue.
  • The simple template requires abstract, keywords, and a single-paragraph summary that covers the eight-question aspects.
  • Extraction tolerates minor JSON formatting errors and ignores extra top-level fields when required keys validate.
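
Putting the bullets above together, a provider block in config.toml might look like the following sketch. Field names follow the bullets in this section; the `name` key and exact layout are assumptions, so verify against config.example.toml:

```toml
# Hypothetical provider entries; check exact keys against config.example.toml.
[[providers]]
name = "openai"                           # routing key used in --model openai/<model>
type = "openai_compatible"
api_keys = ["env:OPENAI_API_KEY"]         # read the key from an environment variable
model_list = ["gpt-4o-mini", "gpt-4o"]    # controls allowed provider/model values

[[providers]]
name = "azure"
type = "azure_openai"
api_keys = ["env:AZURE_OPENAI_API_KEY"]
endpoint = "https://example.openai.azure.com"   # required for azure_openai
api_version = "2024-02-01"                      # required for azure_openai
deployment = "gpt-4o-mini"                      # required for azure_openai
model_list = ["gpt-4o-mini"]
```

With a config like this, `--model openai/gpt-4o-mini` routes to the first provider and is only accepted because the model appears in that provider's model_list.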

paper extract — structured extraction from markdown

Extract structured JSON from markdown files using configured providers and prompt templates.

Key options:

  • --input (repeatable): file or directory input.
  • --glob: filter when scanning directories.
  • --prompt-template / --language: select built-in prompts and output language.
  • --prompt-system / --prompt-user / --schema-json: custom prompt + schema.
  • --template-dir: use a directory containing system.j2, user.j2, schema.json, render.j2.
  • --sleep-every / --sleep-time: throttle request initiation.
  • --max-concurrency: override concurrency.
  • --render-md: render markdown output as part of extraction.
  • --dry-run: scan inputs and show summary metrics without calling providers.

Outputs:

  • Aggregated JSON: paper_infos.json
  • Errors: paper_errors.json
  • Optional rendered Markdown: rendered_md/ by default

Incremental behavior:

  • Reuses existing entries when source_path and source_hash match.
  • Use --force to re-extract everything.
  • Use --retry-failed to retry only failed documents listed in paper_errors.json.
  • Use --verbose for detailed logs alongside progress bars.
  • Extract-time rendering defaults to the same built-in template as --prompt-template.
  • A summary table prints input/prompt/output character totals, token estimates, and throughput after each run.
  • Progress bars include a live prompt/completion/total token ticker.
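
The reuse check can be illustrated with a small sketch. The tool's actual hash scheme is internal; this sketch assumes source_hash is a SHA-256 of the file bytes, purely for illustration:

```python
import hashlib
from pathlib import Path


def needs_extraction(md_path: Path, existing: list[dict]) -> bool:
    """Return True when a document has no up-to-date entry.

    Illustration only: assumes entries carry source_path/source_hash
    and that source_hash is a SHA-256 digest of the file contents.
    """
    digest = hashlib.sha256(md_path.read_bytes()).hexdigest()
    for entry in existing:
        if (
            entry.get("source_path") == str(md_path)
            and entry.get("source_hash") == digest
        ):
            return False  # same path, unchanged content: reuse the entry
    return True
```

This mirrors the documented behavior: unchanged documents are skipped, while edited files (whose hash no longer matches) are re-extracted, and --force bypasses the check entirely.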

Examples:

# Scan a directory recursively (default: *.md)
deepresearch-flow paper extract \
  --input ./docs \
  --model openai/gpt-4o-mini

# Multiple inputs + custom output
deepresearch-flow paper extract \
  --input ./docs \
  --input ./more-docs \
  --output ./out/papers.json \
  --model openai/gpt-4o-mini

# Built-in template with output language
deepresearch-flow paper extract \
  --input ./docs \
  --prompt-template deep_read \
  --language zh \
  --model openai/gpt-4o-mini

# Custom template directory
deepresearch-flow paper extract \
  --input ./docs \
  --template-dir ./prompts \
  --model openai/gpt-4o-mini

# Extract + render in one run
deepresearch-flow paper extract \
  --input ./docs \
  --prompt-template eight_questions \
  --render-md \
  --model openai/gpt-4o-mini

# Throttle request initiation
deepresearch-flow paper extract \
  --input ./docs \
  --sleep-every 10 \
  --sleep-time 60 \
  --model openai/gpt-4o-mini

paper db — render, analyze, and serve extracted data

Render outputs, compute stats, and serve a local web UI over paper JSON.

JSON input formats:

  • For db render-md, db statistics, db filter, and db generate-tags, the input is the aggregated JSON list.
  • For db serve, each input JSON must be an object: {"template_tag": "simple", "papers": [...]}. When template_tag is missing, the server attempts to infer it as a fallback.

Web UI highlights:

  • Summary/Source/PDF/PDF Viewer views with tab navigation.
  • Split view: choose left/right panes independently (summary/source/pdf/pdf viewer) via URL params.
  • Summary view includes a collapsible outline panel (top-left) and a back-to-top control (bottom-left).
  • Summary template dropdown shows only available templates per paper.
  • Source view renders Markdown and supports embedded HTML tables plus data:image/...;base64 <img> tags.
  • PDF Viewer is served locally (PDF.js viewer assets) to avoid cross-origin issues with local PDFs.
  • Merge behavior for multi-input serve: entries whose titles are at least 95% similar are merged, preferring bibtex.fields.title and falling back to paper_title for the comparison.
  • Cache merged inputs with --cache-dir; bypass with --no-cache.
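
The title-based merge rule can be sketched with difflib. The server's actual similarity metric is not documented here, so SequenceMatcher and the field-access helper below are assumptions for illustration:

```python
from difflib import SequenceMatcher


def merge_key_title(paper: dict) -> str:
    """Prefer bibtex.fields.title, fall back to paper_title (per the merge rule)."""
    bib = paper.get("bibtex") or {}
    return (bib.get("fields", {}).get("title") or paper.get("paper_title", "")).strip()


def same_paper(a: dict, b: dict, threshold: float = 0.95) -> bool:
    """Treat two entries as the same paper when titles are >= 95% similar."""
    ta, tb = merge_key_title(a).lower(), merge_key_title(b).lower()
    return SequenceMatcher(None, ta, tb).ratio() >= threshold
```

Under this rule, near-identical titles from different template runs collapse into one paper in the UI, while unrelated titles stay separate.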

Examples:

# Render Markdown from JSON
deepresearch-flow paper db render-md --input paper_infos.json

# Render with a built-in template and language fallback
deepresearch-flow paper db render-md \
  --input paper_infos.json \
  --template-name deep_read \
  --language zh

# Generate tags
deepresearch-flow paper db generate-tags \
  --input paper_infos.json \
  --output paper_infos_with_tags.json \
  --model openai/gpt-4o-mini

# Filter papers
deepresearch-flow paper db filter \
  --input paper_infos.json \
  --output filtered.json \
  --tags hardware_acceleration,fpga

# Statistics (rich tables)
deepresearch-flow paper db statistics \
  --input paper_infos.json \
  --top-n 20

# Serve a local read-only web UI (loads charts/libs via CDN)
deepresearch-flow paper db serve \
  --input paper_infos_simple.json \
  --input paper_infos_deep_read.json \
  --cache-dir .cache/db-serve \
  --host 127.0.0.1 \
  --port 8000

# Serve with optional BibTeX enrichment and source roots
deepresearch-flow paper db serve \
  --input paper_infos_simple.json \
  --input paper_infos_deep_read.json \
  --bibtex ./refs/library.bib \
  --md-root ./docs \
  --md-root ./more_docs \
  --pdf-root ./pdfs \
  --cache-dir .cache/db-serve \
  --host 127.0.0.1 \
  --port 8000

Web search syntax (Scholar-style):

  • Default is AND: fpga kNN
  • Quoted phrases: title:"nearest neighbor"
  • OR: fpga OR asic
  • Negation: -survey or -tag:survey
  • Fields: title:, author:, tag:, venue:, year:, month:
  • Year range: year:2020..2024
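
The grammar above can be sketched as a small tokenizer. The server's real parser is internal; this sketch only illustrates how the documented forms decompose into terms (OR is left as a plain token here and full boolean grouping is omitted):

```python
import re

# One token: optional "-" negation, optional "field:" prefix,
# then a quoted phrase or a bare word.
TOKEN = re.compile(r'(-?)(?:(\w+):)?("([^"]*)"|\S+)')


def parse_query(q: str) -> list[dict]:
    """Split a Scholar-style query into {field, value, negated} terms (AND by default)."""
    terms = []
    for neg, field, raw, quoted in TOKEN.findall(q):
        value = quoted if quoted else raw
        term = {"field": field or None, "value": value, "negated": neg == "-"}
        # Expand year ranges like year:2020..2024 into explicit bounds.
        if field == "year" and ".." in str(value):
            lo, hi = value.split("..", 1)
            term["value"] = (int(lo), int(hi))
        terms.append(term)
    return terms
```

For example, `title:"nearest neighbor" -tag:survey year:2020..2024 fpga` yields a phrase term on title, a negated tag term, a year range, and a bare AND term.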

Other database helpers:

  • append-bibtex
  • sort-papers
  • split-by-tag
  • split-database
  • statistics
  • merge

recognize md — embed or unpack markdown images

recognize md embed replaces local image links in markdown with data:image/...;base64, URLs. recognize md unpack extracts embedded images into images/ and updates markdown links.
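
The embed step can be sketched as a regex pass over markdown image links. This is illustrative only; the real command's MIME detection, HTTP fetching (behind --enable-http), and error handling are more involved:

```python
import base64
import mimetypes
import re
from pathlib import Path

MD_IMAGE = re.compile(r'!\[([^\]]*)\]\(([^)]+)\)')


def embed_images(markdown: str, base_dir: Path) -> str:
    """Replace local image links with data:<mime>;base64,... URLs (sketch)."""

    def replace(match: re.Match) -> str:
        alt, target = match.group(1), match.group(2)
        if target.startswith(("http://", "https://", "data:")):
            return match.group(0)  # remote or already-embedded: leave untouched
        path = base_dir / target
        if not path.is_file():
            return match.group(0)  # missing file: leave the link as-is
        mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        payload = base64.b64encode(path.read_bytes()).decode("ascii")
        return f"![{alt}](data:{mime};base64,{payload})"

    return MD_IMAGE.sub(replace, markdown)
```

The unpack direction is the inverse: decode each data URL back to bytes under images/ and rewrite the link to the new relative path.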

Key options:

  • --input (repeatable): file or directory input.
  • --recursive: recurse into directories.
  • --output: output directory (flattened outputs).
  • --enable-http: allow embedding HTTP(S) images (embed only).
  • --workers: concurrent workers (default: 4).
  • --dry-run: report planned outputs without writing files.
  • --verbose: enable detailed logs for image resolution/HTTP fetches.

Notes:

  • Progress bars report completion; a rich summary table lists counts, image totals, duration, and output locations.
  • Summary paths are shown relative to the current working directory when possible.
  • If the output directory is not empty, the command logs a warning before writing files.

Examples:

# Embed local images (flatten outputs)
deepresearch-flow recognize md embed \
  --input ./docs \
  --recursive \
  --output ./out_md

# Embed HTTP images (with browser User-Agent)
deepresearch-flow recognize md embed \
  --input ./docs \
  --enable-http \
  --output ./out_md

# Unpack embedded images into output/images/
deepresearch-flow recognize md unpack \
  --input ./docs \
  --recursive \
  --output ./out_md

recognize organize — flatten OCR outputs

Organize OCR outputs (layout: mineru) into flat markdown files, with optional image embedding.

Key options:

  • --layout: OCR layout type (currently mineru).
  • --input (repeatable): directories containing full.md + images/.
  • --recursive: search for layout folders (required when inputs contain nested result directories).
  • --output-simple: copy markdown + images to output (shared images/).
  • --output-base64: embed images into markdown.
  • --workers: concurrent workers (default: 4).
  • --dry-run: report planned outputs without writing files.
  • --verbose: enable detailed logs for layout discovery and file copying.

Notes:

  • Use --recursive when the input directory contains nested layout folders (otherwise no layouts are discovered).
  • If output directories are not empty, the command logs a warning before writing files.
  • A summary table lists counts, image totals, duration, and output locations after completion.
  • Summary paths are shown relative to the current working directory when possible.

Examples:

# Copy markdown + images into a flat output directory
deepresearch-flow recognize organize \
  --layout mineru \
  --input ./ocr_results \
  --recursive \
  --output-simple ./out_simple

# Embed images into markdown
deepresearch-flow recognize organize \
  --layout mineru \
  --input ./ocr_results \
  --output-base64 ./out_base64

Data formats (examples)

Aggregated extraction output is a JSON list:

[
  {
    "paper_title": "Example Paper",
    "paper_authors": ["Author A", "Author B"],
    "publication_date": "2024-01-01",
    "publication_venue": "ExampleConf",
    "source_path": "/abs/path/to/doc.md"
  }
]

db serve expects each input to be an object with a template_tag and a papers list:

{
  "template_tag": "simple",
  "papers": [
    {
      "paper_title": "Example Paper",
      "paper_authors": ["Author A"],
      "publication_date": "2024-01-01",
      "publication_venue": "ExampleConf"
    }
  ]
}
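
Converting an aggregated list into the db serve input shape is a small transformation. A sketch (the template_tag value should match the template used at extraction time):

```python
import json
from pathlib import Path


def wrap_for_serve(aggregated_path: Path, template_tag: str, out_path: Path) -> None:
    """Wrap an aggregated JSON list into {"template_tag": ..., "papers": [...]}."""
    papers = json.loads(aggregated_path.read_text(encoding="utf-8"))
    out_path.write_text(
        json.dumps({"template_tag": template_tag, "papers": papers}, indent=2),
        encoding="utf-8",
    )
```

Note that db serve can also infer a missing template_tag as a fallback, so this wrapping is only needed when you want the tag to be explicit.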
