# deepresearch-flow

Workflow tools for paper extraction, review, and research automation.

DeepResearch Flow provides command-line tools for document extraction, OCR post-processing, and paper database operations.
## Quick Start

```bash
pip install deepresearch-flow
# or
uv pip install deepresearch-flow

# Development install
pip install -e .
cp config.example.toml config.toml
```

```bash
# Extract from a docs folder
uv run deepresearch-flow paper extract \
  --input ./docs \
  --model openai/gpt-4o-mini

# Serve a local UI
uv run deepresearch-flow paper db serve \
  --input ./paper_infos_simple.json \
  --host 127.0.0.1 \
  --port 8000
```

Docker images:

```bash
docker run --rm -it nerdneils/deepresearch-flow --help
# or
docker run --rm -it ghcr.io/nerdneilsfield/deepresearch-flow --help
```
## Commands

`deepresearch-flow` is the top-level CLI. Workflows live under `paper` and `recognize`. Use `deepresearch-flow --help`, `deepresearch-flow paper --help`, and `deepresearch-flow recognize --help` to explore flags.
## Configuration details

Copy `config.example.toml` to `config.toml` and edit providers.

- Providers are configured under `[[providers]]`.
- Use `api_keys = ["env:OPENAI_API_KEY"]` to read from environment variables. `model_list` is required for each provider and controls allowed `provider/model` values.
- Explicit model routing is required: `--model provider/model`.
- Supported provider types: `ollama`, `openai_compatible`, `dashscope`, `gemini_ai_studio`, `gemini_vertex`, `azure_openai`, `claude`.
- Provider-specific fields: `azure_openai` requires `endpoint`, `api_version`, `deployment`; `gemini_vertex` requires `project_id`, `location`; `claude` requires `anthropic_version`.
- Built-in prompt templates for extraction: `simple`, `deep_read`, `eight_questions`, `three_pass`.
- Template rename: `seven_questions` is now `eight_questions`.
- Render templates use `paper db render-md --template-name` with the same names.
- `--language` defaults to `en`; extraction stores it as `output_language` and render uses that field.
- When `output_language` is `zh`, render headings include both Chinese and English.
- Complex templates (`deep_read`, `eight_questions`, `three_pass`) run multi-stage extraction and persist per-document stage files under `paper_stage_outputs/`.
- Custom templates: use `--prompt-system`/`--prompt-user` with `--schema-json`, or `--template-dir` containing `system.j2`, `user.j2`, `schema.json`, `render.j2`.
- Custom templates run in single-stage extraction mode.
- Built-in schemas require `publication_date` and `publication_venue`.
- The `simple` template requires `abstract`, `keywords`, and a single-paragraph `summary` that covers the eight-question aspects.
- Extraction tolerates minor JSON formatting errors and ignores extra top-level fields when required keys validate.
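Putting the provider rules together, a `[[providers]]` entry might look like this (a hedged sketch based on the fields listed above; the exact layout and any additional fields should be taken from `config.example.toml`):

```toml
[[providers]]
name = "openai"
type = "openai_compatible"
# "env:" prefix reads the key from an environment variable
api_keys = ["env:OPENAI_API_KEY"]
# model_list controls which provider/model values are allowed,
# e.g. --model openai/gpt-4o-mini
model_list = ["gpt-4o-mini"]
```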
## `paper extract` — structured extraction from markdown

Extract structured JSON from markdown files using configured providers and prompt templates.

Key options:

- `--input` (repeatable): file or directory input.
- `--glob`: filter when scanning directories.
- `--prompt-template` / `--language`: select built-in prompts and output language.
- `--prompt-system` / `--prompt-user` / `--schema-json`: custom prompt + schema.
- `--template-dir`: use a directory containing `system.j2`, `user.j2`, `schema.json`, `render.j2`.
- `--sleep-every` / `--sleep-time`: throttle request initiation.
- `--max-concurrency`: override concurrency.
- `--render-md`: render markdown output as part of extraction.
- `--dry-run`: scan inputs and show summary metrics without calling providers.
Outputs:

- Aggregated JSON: `paper_infos.json`
- Errors: `paper_errors.json`
- Optional rendered Markdown: `rendered_md/` by default

Incremental behavior:

- Reuses existing entries when `source_path` and `source_hash` match.
- Use `--force` to re-extract everything.
- Use `--retry-failed` to retry only failed documents listed in `paper_errors.json`.
- Use `--verbose` for detailed logs alongside progress bars.
- Extract-time rendering defaults to the same built-in template as `--prompt-template`.
- Output JSON is written as `{"template_tag": "...", "papers": [...]}`.
- A summary table prints input/prompt/output character totals, token estimates, and throughput after each run.
- Progress bars include a live prompt/completion/total token ticker.
Examples:

```bash
# Scan a directory recursively (default: *.md)
deepresearch-flow paper extract \
  --input ./docs \
  --model openai/gpt-4o-mini

# Multiple inputs + custom output
deepresearch-flow paper extract \
  --input ./docs \
  --input ./more-docs \
  --output ./out/papers.json \
  --model openai/gpt-4o-mini

# Built-in template with output language
deepresearch-flow paper extract \
  --input ./docs \
  --prompt-template deep_read \
  --language zh \
  --model openai/gpt-4o-mini

# Custom template directory
deepresearch-flow paper extract \
  --input ./docs \
  --template-dir ./prompts \
  --model openai/gpt-4o-mini

# Extract + render in one run
deepresearch-flow paper extract \
  --input ./docs \
  --prompt-template eight_questions \
  --render-md \
  --model openai/gpt-4o-mini

# Throttle request initiation
deepresearch-flow paper extract \
  --input ./docs \
  --sleep-every 10 \
  --sleep-time 60 \
  --model openai/gpt-4o-mini
```
## `paper db` — render, analyze, and serve extracted data

Render outputs, compute stats, and serve a local web UI over paper JSON.

JSON input formats:

- For `db render-md`, `db statistics`, `db filter`, and `db generate-tags`, the input can be either an aggregated JSON list or `{"template_tag": "...", "papers": [...]}` (the commands operate on `papers`).
- For `db serve`, each input JSON must be an object: `{"template_tag": "simple", "papers": [...]}`. When `template_tag` is missing, the server attempts to infer it as a fallback (legacy list-only inputs are rejected).
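The two accepted input shapes can be normalized with a small helper like the following (a hedged sketch, not the tool's actual loader; `load_papers` is a hypothetical name):

```python
import json

def load_papers(path):
    """Accept either a bare JSON list of papers or the wrapped
    {"template_tag": ..., "papers": [...]} object form."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if isinstance(data, list):  # legacy aggregated list
        return data
    return data["papers"]       # wrapped object form
```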
Web UI highlights:

- Summary/Source/PDF/PDF Viewer views with tab navigation.
- Split view: choose left/right panes independently (summary/source/pdf/pdf viewer) via URL params.
- Summary/Source views include a collapsible outline panel (top-left) and a back-to-top control (bottom-left).
- Summary template dropdown shows only available templates per paper.
- Homepage filters: PDF/Source/Summary availability and template tags, plus a filter syntax input (`tmpl:...`, `has:pdf`, `no:source`).
- Homepage stats: total and filtered counts for PDF/Source/Summary plus per-template totals.
- Stats page includes keyword frequency charts.
- Source view renders Markdown and supports embedded HTML tables plus `data:image/...;base64` `<img>` tags (images are constrained to the content width).
- PDF Viewer is served locally (PDF.js viewer assets) to avoid cross-origin issues with local PDFs.
- PDF-only entries are surfaced for unmatched PDFs under `--pdf-root` (metadata title if available, otherwise filename), with badges and detail warnings.
- PDF-only entries are excluded from stats counts.
- Merge behavior for multi-input serve: title similarity (>= 0.95), preferring `bibtex.fields.title` and falling back to `paper_title`.
- Cache merged inputs with `--cache-dir`; bypass with `--no-cache`.
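The multi-input merge rule can be sketched with Python's `difflib` (an illustration only; the tool's actual similarity metric and normalization are assumptions here):

```python
from difflib import SequenceMatcher

def title_of(paper):
    # Prefer bibtex.fields.title, fall back to paper_title (per the merge rule).
    bib = paper.get("bibtex", {}).get("fields", {})
    return (bib.get("title") or paper.get("paper_title", "")).lower()

def same_paper(a, b, threshold=0.95):
    # Treat entries as the same paper when titles are >= 95% similar.
    return SequenceMatcher(None, title_of(a), title_of(b)).ratio() >= threshold
```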
Examples:

```bash
# Render Markdown from JSON
deepresearch-flow paper db render-md --input paper_infos.json

# Render with a built-in template and language fallback
deepresearch-flow paper db render-md \
  --input paper_infos.json \
  --template-name deep_read \
  --language zh

# Generate tags
deepresearch-flow paper db generate-tags \
  --input paper_infos.json \
  --output paper_infos_with_tags.json \
  --model openai/gpt-4o-mini

# Filter papers
deepresearch-flow paper db filter \
  --input paper_infos.json \
  --output filtered.json \
  --tags hardware_acceleration,fpga

# Statistics (rich tables)
deepresearch-flow paper db statistics \
  --input paper_infos.json \
  --top-n 20
# Statistics also include keyword frequency (normalized to lowercase)

# Serve a local read-only web UI (loads charts/libs via CDN)
deepresearch-flow paper db serve \
  --input paper_infos_simple.json \
  --input paper_infos_deep_read.json \
  --cache-dir .cache/db-serve \
  --host 127.0.0.1 \
  --port 8000

# Serve with optional BibTeX enrichment and source roots
deepresearch-flow paper db serve \
  --input paper_infos_simple.json \
  --input paper_infos_deep_read.json \
  --bibtex ./refs/library.bib \
  --md-root ./docs \
  --md-root ./more_docs \
  --pdf-root ./pdfs \
  --cache-dir .cache/db-serve \
  --host 127.0.0.1 \
  --port 8000
```
Web search syntax (Scholar-style):

- Default is AND: `fpga kNN`
- Quoted phrases: `title:"nearest neighbor"`
- OR: `fpga OR asic`
- Negation: `-survey` or `-tag:survey`
- Fields: `title:`, `author:`, `tag:`, `venue:`, `year:`, `month:` (content tags only)
- Year range: `year:2020..2024`
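Combining these operators, a single query might look like the following (an illustrative composite; the exact precedence of `OR` relative to field terms is not specified here):

```text
title:"nearest neighbor" fpga OR asic -survey tag:hardware year:2020..2024
```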
Other database helpers: `append-bibtex`, `sort-papers`, `split-by-tag`, `split-database`, `statistics`, `merge`.
## `recognize md` — embed or unpack markdown images

`recognize md embed` replaces local image links in markdown with `data:image/...;base64,` URLs. `recognize md unpack` extracts embedded images into `images/` and updates markdown links.

Key options:

- `--input` (repeatable): file or directory input.
- `--recursive`: recurse into directories.
- `--output`: output directory (flattened outputs).
- `--enable-http`: allow embedding HTTP(S) images (embed only).
- `--workers`: concurrent workers (default: 4).
- `--dry-run`: report planned outputs without writing files.
- `--verbose`: enable detailed logs for image resolution/HTTP fetches.
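The embed transformation can be sketched as follows (a simplified illustration of the idea, not the tool's implementation; it handles only a single markdown image pattern and local files):

```python
import base64
import mimetypes
import re
from pathlib import Path

IMG_LINK = re.compile(r"!\[([^\]]*)\]\(([^)]+)\)")

def embed_images(md_text, base_dir):
    """Replace local image links with data:<mime>;base64,<payload> URLs."""
    def repl(match):
        alt, src = match.group(1), match.group(2)
        path = Path(base_dir) / src
        if not path.is_file():
            return match.group(0)  # leave unresolved links untouched
        mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        data = base64.b64encode(path.read_bytes()).decode("ascii")
        return f"![{alt}](data:{mime};base64,{data})"
    return IMG_LINK.sub(repl, md_text)
```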
Notes:
- Progress bars report completion; a rich summary table lists counts, image totals, duration, and output locations.
- Summary paths are shown relative to the current working directory when possible.
- If the output directory is not empty, the command logs a warning before writing files.
Examples:

```bash
# Embed local images (flatten outputs)
deepresearch-flow recognize md embed \
  --input ./docs \
  --recursive \
  --output ./out_md

# Embed HTTP images (with browser User-Agent)
deepresearch-flow recognize md embed \
  --input ./docs \
  --enable-http \
  --output ./out_md

# Unpack embedded images into output/images/
deepresearch-flow recognize md unpack \
  --input ./docs \
  --recursive \
  --output ./out_md
```
## `recognize organize` — flatten OCR outputs

Organize OCR outputs (layout: `mineru`) into flat markdown files, with optional image embedding.

Key options:

- `--layout`: OCR layout type (currently `mineru`).
- `--input` (repeatable): directories containing `full.md` + `images/`.
- `--recursive`: search for layout folders (required when inputs contain nested result directories).
- `--output-simple`: copy markdown + images to output (shared `images/`).
- `--output-base64`: embed images into markdown.
- `--workers`: concurrent workers (default: 4).
- `--dry-run`: report planned outputs without writing files.
- `--verbose`: enable detailed logs for layout discovery and file copying.
Notes:

- Use `--recursive` when the input directory contains nested layout folders (otherwise no layouts are discovered).
- If output directories are not empty, the command logs a warning before writing files.
- A summary table lists counts, image totals, duration, and output locations after completion.
- Summary paths are shown relative to the current working directory when possible.
Examples:

```bash
# Copy markdown + images into a flat output directory
deepresearch-flow recognize organize \
  --layout mineru \
  --input ./ocr_results \
  --recursive \
  --output-simple ./out_simple

# Embed images into markdown
deepresearch-flow recognize organize \
  --layout mineru \
  --input ./ocr_results \
  --output-base64 ./out_base64
```
## Data formats (examples)

Aggregated extraction output is a JSON list:

```json
[
  {
    "paper_title": "Example Paper",
    "paper_authors": ["Author A", "Author B"],
    "publication_date": "2024-01-01",
    "publication_venue": "ExampleConf",
    "source_path": "/abs/path/to/doc.md"
  }
]
```

`db serve` expects each input to be an object with a `template_tag` and a `papers` list:

```json
{
  "template_tag": "simple",
  "papers": [
    {
      "paper_title": "Example Paper",
      "paper_authors": ["Author A"],
      "publication_date": "2024-01-01",
      "publication_venue": "ExampleConf"
    }
  ]
}
```
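A quick sanity check for the `db serve` input shape can be written like this (a hedged sketch; the server's actual validation is likely stricter and also covers `template_tag` inference):

```python
def valid_serve_input(data):
    """db serve expects an object with a "papers" list; bare lists are rejected."""
    return isinstance(data, dict) and isinstance(data.get("papers"), list)
```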
## Project details

### Release files (v0.2.1)

**deepresearch_flow-0.2.1.tar.gz** (source distribution, 5.5 MB; uploaded via twine/6.1.0 on CPython/3.13.7 using Trusted Publishing)

| Algorithm | Hash digest |
|---|---|
| SHA256 | `fdba6c1da5f09386a1406f4ec26285f05df15841b0d7067fe84f5bb3d34d9cbe` |
| MD5 | `ee8a395e5b77c4c12608751ced09c613` |
| BLAKE2b-256 | `0a7c306f6409fad3467adb79802587f3af56c43e8d3b79bff64f0653fe1caf5e` |

**deepresearch_flow-0.2.1-py3-none-any.whl** (Python 3 wheel, 5.9 MB; uploaded via twine/6.1.0 on CPython/3.13.7 using Trusted Publishing)

| Algorithm | Hash digest |
|---|---|
| SHA256 | `13c0eda731731ae0560abd6cb31915939d93bdb2ce16aa201397106d31a9459f` |
| MD5 | `5703932d0c83eb57c0fcf0cf8af44909` |
| BLAKE2b-256 | `fca97ecb16acc853529b9703d5a259c6d4450ff852776dec233e733de0712d9e` |

### Provenance

Attestation bundles were published for both files by the `push-to-pypi.yml` workflow on `nerdneilsfield/ai-deepresearch-flow` (commit `45662ea9c0e8b80c6cdf9ebc3c2af3ca426a1a95`, tag `refs/tags/v0.2.1`, trigger event: push, github-hosted runner, public access, token issuer `https://token.actions.githubusercontent.com`, owner https://github.com/nerdneilsfield).

- Statement type: `https://in-toto.io/Statement/v1`
- Predicate type: `https://docs.pypi.org/attestations/publish/v1`
- Sigstore transparency entries: 814172862 (sdist), 814172865 (wheel)