A multi-source pipeline for searching, screening, downloading, converting, and summarizing academic papers from arXiv, Google Scholar, and more.
research-pipeline
research-pipeline is a deterministic Python 3.12+ workflow for finding,
screening, downloading, converting, and synthesizing academic papers. It is
useful when you need an auditable literature review, not just a one-off paper
search.
It ships as both a Typer CLI and an MCP server for agent-driven research.
Contents
- What It Does
- Installation
- Quick Start
- Pipeline
- CLI Commands
- Readable Reports
- MCP Server
- AI Skill And Agents
- Configuration
- Artifacts
- Development
What It Does
- Searches arXiv, Google Scholar, Semantic Scholar, OpenAlex, DBLP, and HuggingFace daily papers, with cross-source deduplication.
- Screens candidates with BM25 heuristics, optional SPECTER2 semantic reranking, optional LLM judging, diversity-aware selection, and feedback-adjusted weights.
- Downloads PDFs politely with rate limits, retry, caching, and manifest tracking.
- Converts PDFs to Markdown through local or cloud backends: Docling, Marker, PyMuPDF4LLM, MinerU, Mathpix, Datalab, LlamaParse, Mistral OCR, or OpenAI Vision.
- Supports two-tier conversion: fast rough conversion for all papers and high-quality fine conversion for selected papers.
- Extracts structured chunks, bibliography data, citation contexts, and retrieval indexes from converted papers.
- Produces schema-first per-paper extraction records, design-neutral cross-paper synthesis, confidence scoring, evidence aggregation, BibTeX exports, templated Markdown reports, and self-contained HTML reports.
- Adds research quality layers: citation expansion, quality scoring, claim decomposition, knowledge graph ingestion, report validation, multi-run comparison, coherence checks, memory consolidation, blinding audits, Pass@k / Pass[k] metrics, case-based strategy reuse, KG quality checks, adaptive stopping, and 4-layer confidence calibration.
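To make the first screening step concrete, here is a minimal sketch of BM25 scoring over paper titles or abstracts. It uses the textbook Okapi BM25 formula with common default parameters (`k1 = 1.5`, `b = 0.75`); it is not the package's screening code, and the `tokenize` and `bm25_scores` names are illustrative.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25 (illustrative only)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs if n_docs else 0.0
    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Sorting candidates by these scores gives a cheap lexical ranking that the optional semantic reranker or LLM judge could then refine.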
Installation
# Base package
pip install research-pipeline
# Recommended local converter
pip install 'research-pipeline[docling]'
# Other local converters
pip install 'research-pipeline[marker]' # high accuracy, GPL-3.0
pip install 'research-pipeline[pymupdf4llm]' # fast CPU conversion, AGPL
pip install 'research-pipeline[mineru]' # scientific PDF parser
# Search and reranking extras
pip install 'research-pipeline[scholar]' # Google Scholar via scholarly
pip install 'research-pipeline[serpapi]' # Google Scholar via SerpAPI
pip install 'research-pipeline[reranker]' # sentence-transformers reranker
# Cloud conversion extras
pip install 'research-pipeline[datalab]'
pip install 'research-pipeline[llamaparse]'
pip install 'research-pipeline[mistral-ocr]'
pip install 'research-pipeline[openai-vision]'
# Development checkout
uv sync --extra dev --extra docling --extra scholar --extra reranker
Quick Start
# Fast abstract-only pass
research-pipeline run --profile quick "transformer architectures for time series"
# Full evidence-backed pipeline
research-pipeline run "local memory systems for AI agents"
# Deep profile with quality, expansion, claim analysis, and TER gap filling
research-pipeline run --profile deep "comprehensive survey of AI memory systems"
# Search every configured source family
research-pipeline run --source all "long-context retrieval augmented generation"
Run stages independently when you want control over review points:
research-pipeline plan "multimodal RAG for long-document QA"
research-pipeline search --run-id <RUN_ID> --source all
research-pipeline screen --run-id <RUN_ID> --diversity
research-pipeline quality --run-id <RUN_ID>
research-pipeline download --run-id <RUN_ID>
research-pipeline convert-rough --run-id <RUN_ID>
research-pipeline convert-fine --run-id <RUN_ID> --paper-ids "2401.12345"
research-pipeline extract --run-id <RUN_ID>
research-pipeline summarize --run-id <RUN_ID>
research-pipeline report --run-id <RUN_ID> --template structured_synthesis
research-pipeline validate --run-id <RUN_ID>
Pipeline
flowchart TD
A["Plan queries"] --> B["Search sources"]
B --> C["Screen candidates"]
C --> D["Quality and expansion"]
D --> E["Download PDFs"]
E --> F["Convert to Markdown"]
F --> G["Extract evidence"]
G --> H["Summarize papers"]
H --> I["Report, validate, export"]
Profiles:
| Profile | Stages | Use Case |
|---|---|---|
| `quick` | plan, search, screen, summarize | Fast abstract-only scan |
| `standard` | plan through summarize | Default full pipeline |
| `deep` | standard plus quality, expand, claim analysis, TER loop | Comprehensive literature review |
| `auto` | selected by query complexity | Mixed workloads |
Search sources:
| Source | Notes |
|---|---|
| `arxiv` | Polite arXiv API client with cache and rate limits |
| `scholar` | Google Scholar through scholarly or SerpAPI |
| `semantic_scholar` | Broad metadata, citations, and abstracts |
| `openalex` | Open bibliographic metadata |
| `dblp` | Computer science bibliography |
| `huggingface` | Recent HuggingFace daily papers |
| `all` | arXiv, Scholar, Semantic Scholar, OpenAlex, DBLP, HuggingFace |
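Searching several sources at once means the same paper can come back more than once, which is why cross-source deduplication matters. Below is a minimal sketch of one way to do it. The record fields (`doi`, `arxiv_id`, `title`, `source`) are hypothetical and not taken from the package's schema: each record is keyed on a strong identifier when one is present and on a normalized title otherwise.

```python
import re


def dedup_key(record: dict) -> str:
    """Build a merge key from the strongest identifier available (assumed fields)."""
    if record.get("doi"):
        return "doi:" + record["doi"].lower()
    if record.get("arxiv_id"):
        # Drop a trailing version suffix such as "v2".
        return "arxiv:" + re.sub(r"v\d+$", "", record["arxiv_id"])
    title = re.sub(r"[^a-z0-9]+", " ", record.get("title", "").lower()).strip()
    return "title:" + title


def deduplicate(records: list[dict]) -> list[dict]:
    """Merge records that share a key, keeping track of every contributing source."""
    merged: dict[str, dict] = {}
    for rec in records:
        key = dedup_key(rec)
        if key not in merged:
            merged[key] = {**rec, "sources": [rec.get("source")]}
            continue
        kept = merged[key]
        kept["sources"].append(rec.get("source"))
        # Fill in fields that the first record lacked.
        for field, value in rec.items():
            if field != "source" and not kept.get(field):
                kept[field] = value
    return list(merged.values())
```

This simple keying misses the case where one source reports only a DOI and another only an arXiv ID for the same paper; a fuller approach would also compare titles across keys.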
CLI Commands
| Group | Commands |
|---|---|
| Core pipeline | plan, search, screen, download, convert, extract, summarize, run, inspect |
| Search expansion and organization | quality, expand, cluster, enrich, watch |
| Conversion and export | convert-file, convert-rough, convert-fine, export-bibtex, export-html, report |
| Analysis and validation | analyze, analyze-claims, score-claims, confidence-layers, aggregate, validate, compare, evaluate |
| Feedback and memory | feedback, index, coherence, consolidate, memory-stats, memory-episodes, memory-search |
| Knowledge graph | kg-ingest, kg-stats, kg-query, kg-quality, cite-context |
| Reliability checks | blinding-audit, dual-metrics, adaptive-stopping, cbr-lookup, cbr-retain |
| Setup | setup installs the bundled skill and paper-analysis agents |
Useful examples:
# Citation graph expansion
research-pipeline expand --run-id <RUN_ID> --paper-ids "2401.12345" \
--direction both --bfs-depth 2 --bfs-query "memory,agents"
# Evidence-only aggregation
research-pipeline aggregate --run-id <RUN_ID> --min-pointers 1
# Multi-run comparison and coherence
research-pipeline compare --run-a <RUN_A> --run-b <RUN_B>
research-pipeline coherence <RUN_A> <RUN_B> <RUN_C>
# Evaluation metrics (Deep Research Report gap closures)
# Unified Horizon Metric (A3-5): single scalar combining quality, difficulty,
# horizon length, stability, and Pass[k] reliability.
research-pipeline horizon --score 0.8 --achieved 40 --target 50 \
--difficulty 0.6 --entropy-trend -0.1 --reliability 0.9
# Recall / Reasoning / Presentation diagnostic (Theme 16): localize the
# bottleneck axis of a synthesis report.
research-pipeline rrp --report report.md --shortlist shortlist.json
# Knowledge graph
research-pipeline kg-ingest --run-id <RUN_ID>
research-pipeline kg-stats
research-pipeline kg-query 2401.12345
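The `expand` example above bounds the citation walk with `--bfs-depth`. As a rough illustration of a depth-limited breadth-first expansion, the sketch below walks a toy in-memory graph. The `neighbors` mapping and the `bfs_expand` function are hypothetical; whatever filtering `--bfs-query` applies and however the real command retrieves citations are not modelled here.

```python
from collections import deque


def bfs_expand(seeds: list[str], neighbors: dict[str, list[str]], max_depth: int) -> dict[str, int]:
    """Return {paper_id: depth} for every paper within max_depth hops of the seeds."""
    depth = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        current = queue.popleft()
        if depth[current] >= max_depth:
            # Papers at the depth limit are kept but not expanded further.
            continue
        for nxt in neighbors.get(current, []):
            if nxt not in depth:
                depth[nxt] = depth[current] + 1
                queue.append(nxt)
    return depth
```

For `--direction both`, the `neighbors` mapping would hold the union of a paper's references and the papers citing it.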
Readable Reports
The pipeline can produce machine-readable synthesis JSON and human-readable Markdown or HTML reports. For human-facing reports, prefer:
- clear headings and a contents section with internal links;
- Mermaid diagrams for process charts, usually vertical `flowchart TD` charts;
- LaTeX for formulas, using `$...$` inline and `$$...$$` for display equations;
- tables for comparisons and coverage matrices;
- paper links that jump to references or evidence-map entries;
- recommendations linked back to findings, gaps, and evidence.
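A contents section with internal links can be generated from a report's headings. The sketch below assumes ATX-style Markdown headings and a GitHub-like anchor rule (lowercase, punctuation dropped, each space turned into a hyphen); other renderers may build anchors differently, and `slugify` and `build_contents` are illustrative helpers rather than part of the package.

```python
import re

FENCE = "`" * 3  # code-fence marker, built this way to keep the example readable


def slugify(heading: str) -> str:
    """Approximate a GitHub-style heading anchor (assumed rule)."""
    slug = heading.strip().lower()
    slug = re.sub(r"[^\w\s-]", "", slug)
    return slug.replace(" ", "-")


def build_contents(markdown: str, max_level: int = 3) -> str:
    """Build a linked contents list from level-2 to max_level headings."""
    lines = ["## Contents", ""]
    in_fence = False
    for line in markdown.splitlines():
        if line.startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue  # ignore '#' lines inside code blocks
        match = re.match(r"^(#{2,6})\s+(.*\S)\s*$", line)
        if not match or len(match.group(1)) > max_level:
            continue
        title = match.group(2)
        if title == "Contents":
            continue
        indent = "  " * (len(match.group(1)) - 2)
        lines.append(f"{indent}- [{title}](#{slugify(title)})")
    return "\n".join(lines)
```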
# Render Markdown from structured synthesis JSON
research-pipeline report --run-id <RUN_ID> --template structured_synthesis
# Export self-contained HTML
research-pipeline export-html --run-id <RUN_ID>
# Validate report completeness and readability signals
research-pipeline validate --run-id <RUN_ID>
MCP Server
Run the MCP server with:
research-pipeline mcp serve
# or, from a development checkout
uv run research-pipeline mcp serve
Current MCP surface:
- 42 tools covering pipeline stages, conversion, quality, expansion, validation, reporting, memory, KG, reliability, and the server-driven `research_workflow`.
- 15 resources for run manifests, plans, candidates, shortlists, PDFs, Markdown, summaries, synthesis, quality scores, config, index, workflow state, telemetry, and budget.
- 6 prompts for topic planning, workflow orchestration, paper analysis, comparison, search refinement, and quality assessment.
The research_workflow tool adds harness engineering: telemetry, bounded
context, governance gates, structural verification, doom-loop monitoring, and
crash recovery.
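Many MCP clients register a stdio server through a JSON entry holding a `command` and its `args`. The sketch below builds such an entry for the serve command above. The `mcpServers` layout is the convention used by clients such as Claude Desktop and is an assumption here; it is not necessarily what `research-pipeline setup` writes, so check your client's documentation for the exact file and schema.

```python
import json


def mcp_server_entry(name: str = "research-pipeline", use_uv: bool = False) -> dict:
    """Build a client config entry for the MCP server (assumed `mcpServers` layout)."""
    if use_uv:
        # Development checkout: launch through uv.
        command, args = "uv", ["run", "research-pipeline", "mcp", "serve"]
    else:
        command, args = "research-pipeline", ["mcp", "serve"]
    return {"mcpServers": {name: {"command": command, "args": args}}}


if __name__ == "__main__":
    print(json.dumps(mcp_server_entry(), indent=2))
```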
AI Skill And Agents
Install the bundled skill for Claude Code / GitHub Copilot and Codex, plus Claude Code sub-agent definitions:
research-pipeline setup # skills + agents + MCP config snippet
research-pipeline setup --symlink # symlink for development
research-pipeline setup --force # overwrite existing files
research-pipeline setup --skip-agents
research-pipeline setup --skip-skill
research-pipeline setup --skip-mcp
Installed files:
- Claude/GitHub Copilot skill: `~/.claude/skills/research-pipeline/`
- Codex skill: `~/.codex/skills/research-pipeline/`
- Agents: `~/.claude/agents/paper-screener.md`, `~/.claude/agents/paper-analyzer.md`, `~/.claude/agents/paper-synthesizer.md`
- MCP config snippet: `~/.config/research-pipeline/mcp.json`
The skill follows Anthropic's Skill-Building Guide: it declares explicit
trigger phrases and negative triggers, license and compatibility
frontmatter, concrete user-prompt → action examples, and progressive
disclosure into references/. Behaviorally, every run:
- Resumes on top of any prior same-topic report in the working directory — the prior file is snapshot-renamed, prior paper IDs seed the new run, and the new report fully replaces the old one.
- Iterates up to 4 gap-closure rounds — each round extracts the report's academic and engineering gaps, fills them (new pipeline iteration or implementation knowledge), and regenerates the report from scratch. Stops early when the gap list empties, a search returns no new papers, or the user marks gaps out-of-scope.
- Enforces human-report formatting: `## Contents`, `## Round History`, Mermaid for every chart, LaTeX for every formula, and per-section evidence citations validated by `research-pipeline validate`.
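The gap-closure rounds described above can be pictured as a bounded loop with early exits. This is only a control-flow sketch: `extract_gaps`, `fill_gaps`, and `regenerate` are hypothetical callables standing in for the skill's actual steps, gap filling is simplified to "returns new papers", and the gap and report representations are assumptions.

```python
from typing import Any, Callable

MAX_ROUNDS = 4  # upper bound on gap-closure rounds, as stated above


def run_gap_closure(
    report: Any,
    extract_gaps: Callable[[Any], list],
    fill_gaps: Callable[[list], list],
    regenerate: Callable[[Any, list], Any],
    out_of_scope: frozenset = frozenset(),
) -> tuple[Any, list[tuple[int, str]]]:
    """Iterate gap-closure rounds and record why each round ended."""
    history: list[tuple[int, str]] = []
    for round_no in range(1, MAX_ROUNDS + 1):
        # Gaps the user marked out-of-scope no longer count as open.
        gaps = [g for g in extract_gaps(report) if g not in out_of_scope]
        if not gaps:
            history.append((round_no, "no open gaps"))
            break
        new_papers = fill_gaps(gaps)
        if not new_papers:
            history.append((round_no, "no new papers"))
            break
        # The report is rebuilt from scratch rather than patched.
        report = regenerate(report, new_papers)
        history.append((round_no, f"filled {len(gaps)} gap(s)"))
    return report, history
```

The returned history is the kind of information a `## Round History` section could be built from.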
Configuration
Start from the example config:
cp config.example.toml config.toml
High-impact settings:
profile = "standard" # quick, standard, deep, auto
workspace = "runs"
[sources]
enabled = ["arxiv"] # or include scholar, semantic_scholar, openalex, dblp, huggingface
scholar_backend = "scholarly" # or "serpapi"
[screen]
diversity = false
use_semantic_reranking = false
[conversion]
backend = "docling"
fallback_backends = []
[llm]
enabled = false # enables LLM screening/summarization when configured
provider = "ollama" # ollama or openai-compatible
[gates]
enabled = false
auto_approve = true
Environment overrides:
| Variable | Purpose |
|---|---|
| `RESEARCH_PIPELINE_CONFIG` | Config file path |
| `RESEARCH_PIPELINE_CACHE_DIR` | Override cache directory |
| `RESEARCH_PIPELINE_WORKSPACE` | Override workspace directory |
| `RESEARCH_PIPELINE_DISABLE_LLM` | Force LLM features off |
| `RESEARCH_PIPELINE_LLM_PROFILE` | Select LLM profile |
Artifacts
Each run writes auditable outputs under runs/<run_id>/:
runs/<run_id>/
├── plan/query_plan.json
├── search/candidates.jsonl
├── screen/shortlist.json
├── download/pdf/*.pdf
├── convert/markdown/*.md
├── convert_rough/markdown/*.md
├── convert_fine/markdown/*.md
├── extract/*.extract.json
├── extract/*.bibliography.json
├── summarize/extractions/*.extraction.json
├── summarize/extractions/*.extraction.md
├── summarize/extractions/extraction_quality.json
├── summarize/*.summary.json
├── summarize/synthesis_report.json
├── summarize/synthesis_report.md
├── summarize/synthesis_traceability.json
├── summarize/synthesis_quality.json
├── summarize/synthesis.json
├── summarize/synthesis_confidence.json
├── quality/quality_scores.jsonl
├── expand/expanded_candidates.jsonl
├── analysis/
├── comparison/
└── logs/
The runs/ and workspace/ directories are generated outputs and are not
tracked by git.
Development
uv sync --extra dev --extra docling --extra scholar --extra reranker
uv run pytest tests/unit/ -xvs
uv run ruff format .
uv run ruff check . --fix
uv run mypy src/
uv run pre-commit run --all-files
See docs/architecture.md for architecture details and docs/user-guide.md for the full user guide.
License
MIT