A multi-source pipeline for searching, screening, downloading, converting, and summarizing academic papers from arXiv, Google Scholar, and more.

research-pipeline

research-pipeline is a deterministic Python 3.12+ workflow for finding, screening, downloading, converting, and synthesizing academic papers. It is useful when you need an auditable literature review, not just a one-off paper search.

It ships as both a Typer CLI and an MCP server for agent-driven research.

What It Does

  • Searches arXiv, Google Scholar, Semantic Scholar, OpenAlex, DBLP, and HuggingFace daily papers, with cross-source deduplication.
  • Screens candidates with BM25 heuristics, optional SPECTER2 semantic reranking, optional LLM judging, diversity-aware selection, and feedback-adjusted weights.
  • Downloads PDFs politely with rate limits, retry, caching, and manifest tracking.
  • Converts PDFs to Markdown through local or cloud backends: Docling, Marker, PyMuPDF4LLM, MinerU, Mathpix, Datalab, LlamaParse, Mistral OCR, or OpenAI Vision.
  • Supports two-tier conversion: fast rough conversion for all papers and high-quality fine conversion for selected papers.
  • Extracts structured chunks, bibliography data, citation contexts, and retrieval indexes from converted papers.
  • Produces schema-first per-paper extraction records, design-neutral cross-paper synthesis, confidence scoring, evidence aggregation, BibTeX exports, templated Markdown reports, and self-contained HTML reports.
  • Adds research quality layers: citation expansion, quality scoring, claim decomposition, knowledge graph ingestion, report validation, multi-run comparison, coherence checks, memory consolidation, blinding audits, Pass@k / Pass[k] metrics, case-based strategy reuse, KG quality checks, adaptive stopping, and 4-layer confidence calibration.
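As an illustration of the cross-source deduplication mentioned above, here is a minimal sketch. It is not the package's implementation: the record fields (`title`, `doi`, `arxiv_id`, `source`) and the matching rules are assumptions chosen for the example. Two records are merged when they share a DOI, an arXiv ID (ignoring the version suffix), or a normalized title.

```python
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def candidate_keys(record: dict) -> list[str]:
    """All keys under which this record may match an earlier one."""
    keys = []
    if record.get("doi"):
        keys.append("doi:" + record["doi"].lower())
    if record.get("arxiv_id"):
        # Drop a trailing version suffix such as "v5".
        keys.append("arxiv:" + re.sub(r"v\d+$", "", record["arxiv_id"]))
    title_key = normalize_title(record.get("title", ""))
    if title_key:
        keys.append("title:" + title_key)
    return keys


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the first record per paper and remember every source that saw it."""
    seen: dict[str, int] = {}  # key -> index into `unique`
    unique: list[dict] = []
    for record in records:
        keys = candidate_keys(record)
        match = next((seen[k] for k in keys if k in seen), None)
        if match is None:
            unique.append({**record, "sources": [record["source"]]})
            match = len(unique) - 1
        elif record["source"] not in unique[match]["sources"]:
            unique[match]["sources"].append(record["source"])
        for key in keys:
            seen.setdefault(key, match)
    return unique
```

Registering every key of a merged record lets a later record match on an identifier that the first source did not provide.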

Installation

# Base package
pip install research-pipeline

# Recommended local converter
pip install 'research-pipeline[docling]'

# Other local converters
pip install 'research-pipeline[marker]'       # high accuracy, GPL-3.0
pip install 'research-pipeline[pymupdf4llm]'  # fast CPU conversion, AGPL
pip install 'research-pipeline[mineru]'       # scientific PDF parser

# Search and reranking extras
pip install 'research-pipeline[scholar]'      # Google Scholar via scholarly
pip install 'research-pipeline[serpapi]'      # Google Scholar via SerpAPI
pip install 'research-pipeline[reranker]'     # sentence-transformers reranker

# Cloud conversion extras
pip install 'research-pipeline[datalab]'
pip install 'research-pipeline[llamaparse]'
pip install 'research-pipeline[mistral-ocr]'
pip install 'research-pipeline[openai-vision]'

# Development checkout
uv sync --extra dev --extra docling --extra scholar --extra reranker

Quick Start

# Fast abstract-only pass
research-pipeline run --profile quick "transformer architectures for time series"

# Full evidence-backed pipeline
research-pipeline run "local memory systems for AI agents"

# Deep profile with quality, expansion, claim analysis, and TER gap filling
research-pipeline run --profile deep "comprehensive survey of AI memory systems"

# Search every configured source family
research-pipeline run --source all "long-context retrieval augmented generation"

Run stages independently when you want control over review points:

research-pipeline plan "multimodal RAG for long-document QA"
research-pipeline search --run-id <RUN_ID> --source all
research-pipeline screen --run-id <RUN_ID> --diversity
research-pipeline quality --run-id <RUN_ID>
research-pipeline download --run-id <RUN_ID>
research-pipeline convert-rough --run-id <RUN_ID>
research-pipeline convert-fine --run-id <RUN_ID> --paper-ids "2401.12345"
research-pipeline extract --run-id <RUN_ID>
research-pipeline summarize --run-id <RUN_ID>
research-pipeline report --run-id <RUN_ID> --template structured_synthesis
research-pipeline validate --run-id <RUN_ID>
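If you prefer to drive the stages from a script, the sequence above could be wrapped roughly as follows. This is a sketch under assumptions: it uses only the commands and flags listed above, takes the run ID as an argument (how `plan` reports the run ID is not shown here), and skips `convert-fine` because that stage needs explicit paper IDs. The helper names `stage_commands` and `run_stages` are hypothetical.

```python
import subprocess

# (stage, extra arguments) in the order shown above.
STAGES: list[tuple[str, list[str]]] = [
    ("search", ["--source", "all"]),
    ("screen", ["--diversity"]),
    ("quality", []),
    ("download", []),
    ("convert-rough", []),
    ("extract", []),
    ("summarize", []),
    ("report", ["--template", "structured_synthesis"]),
    ("validate", []),
]


def stage_commands(run_id: str) -> list[list[str]]:
    """Build one argv list per stage for an existing run."""
    return [
        ["research-pipeline", stage, "--run-id", run_id, *extra]
        for stage, extra in STAGES
    ]


def run_stages(run_id: str) -> None:
    """Run the stages in order; check=True stops at the first failing stage."""
    for command in stage_commands(run_id):
        subprocess.run(command, check=True)
```

Stopping at the first failure leaves a natural review point: inspect the run directory, fix the problem, and rerun from that stage.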

Pipeline

flowchart TD
    A["Plan queries"] --> B["Search sources"]
    B --> C["Screen candidates"]
    C --> D["Quality and expansion"]
    D --> E["Download PDFs"]
    E --> F["Convert to Markdown"]
    F --> G["Extract evidence"]
    G --> H["Summarize papers"]
    H --> I["Report, validate, export"]

Profiles:

| Profile | Stages | Use Case |
| --- | --- | --- |
| quick | plan, search, screen, summarize | Fast abstract-only scan |
| standard | plan through summarize | Default full pipeline |
| deep | standard plus quality, expand, claim analysis, TER loop | Comprehensive literature review |
| auto | selected by query complexity | Mixed workloads |

Search sources:

| Source | Notes |
| --- | --- |
| arxiv | Polite arXiv API client with cache and rate limits |
| scholar | Google Scholar through scholarly or SerpAPI |
| semantic_scholar | Broad metadata, citations, and abstracts |
| openalex | Open bibliographic metadata |
| dblp | Computer science bibliography |
| huggingface | Recent HuggingFace daily papers |
| all | arXiv, Scholar, Semantic Scholar, OpenAlex, DBLP, HuggingFace |

CLI Commands

| Group | Commands |
| --- | --- |
| Core pipeline | plan, search, screen, download, convert, extract, summarize, run, inspect |
| Search expansion and organization | quality, expand, cluster, enrich, watch |
| Conversion and export | convert-file, convert-rough, convert-fine, export-bibtex, export-html, report |
| Analysis and validation | analyze, analyze-claims, score-claims, confidence-layers, aggregate, validate, compare, evaluate |
| Feedback and memory | feedback, index, coherence, consolidate, memory-stats, memory-episodes, memory-search |
| Knowledge graph | kg-ingest, kg-stats, kg-query, kg-quality, cite-context |
| Reliability checks | blinding-audit, dual-metrics, adaptive-stopping, cbr-lookup, cbr-retain |
| Setup | setup (installs the bundled skill and paper-analysis agents) |
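The Pass@k / Pass[k] pair behind the reliability checks can be illustrated with the standard combinatorial estimators. This sketch assumes the usual meanings (Pass@k: at least one of k sampled attempts succeeds; Pass[k]: all k sampled attempts succeed). The estimators that `dual-metrics` actually uses are not documented here, and the function names are hypothetical.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k attempts, drawn without replacement
    from n recorded attempts with c successes, is a success."""
    _check(n, c, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_all_k(n: int, c: int, k: int) -> float:
    """Chance that all k drawn attempts are successes (the Pass[k] reading)."""
    _check(n, c, k)
    return comb(c, k) / comb(n, k)
```

With 7 successes in 10 attempts and k = 3, Pass@k is about 0.992 while Pass[k] is about 0.292. The gap between the two is what makes the second metric a stricter reliability signal.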

Useful examples:

# Citation graph expansion
research-pipeline expand --run-id <RUN_ID> --paper-ids "2401.12345" \
  --direction both --bfs-depth 2 --bfs-query "memory,agents"

# Evidence-only aggregation
research-pipeline aggregate --run-id <RUN_ID> --min-pointers 1

# Multi-run comparison and coherence
research-pipeline compare --run-a <RUN_A> --run-b <RUN_B>
research-pipeline coherence <RUN_A> <RUN_B> <RUN_C>

# Evaluation metrics (Deep Research Report gap closures)
# Unified Horizon Metric (A3-5): single scalar combining quality, difficulty,
# horizon length, stability, and Pass[k] reliability.
research-pipeline horizon --score 0.8 --achieved 40 --target 50 \
  --difficulty 0.6 --entropy-trend -0.1 --reliability 0.9

# Recall / Reasoning / Presentation diagnostic (Theme 16): localize the
# bottleneck axis of a synthesis report.
research-pipeline rrp --report report.md --shortlist shortlist.json

# Knowledge graph
research-pipeline kg-ingest --run-id <RUN_ID>
research-pipeline kg-stats
research-pipeline kg-query 2401.12345

Readable Reports

The pipeline can produce machine-readable synthesis JSON and human-readable Markdown or HTML reports. For human-facing reports, prefer:

  • clear headings and a contents section with internal links;
  • Mermaid diagrams for process charts, usually vertical flowchart TD charts;
  • LaTeX for formulas, using $...$ inline and $$...$$ for display equations;
  • tables for comparisons and coverage matrices;
  • paper links that jump to references or evidence-map entries;
  • recommendations linked back to findings, gaps, and evidence.

# Render Markdown from structured synthesis JSON
research-pipeline report --run-id <RUN_ID> --template structured_synthesis

# Export self-contained HTML
research-pipeline export-html --run-id <RUN_ID>

# Validate report completeness and readability signals
research-pipeline validate --run-id <RUN_ID>

MCP Server

Run the MCP server with:

research-pipeline mcp serve
# or, from a development checkout
uv run research-pipeline mcp serve

Current MCP surface:

  • 42 tools covering pipeline stages, conversion, quality, expansion, validation, reporting, memory, KG, reliability, and the server-driven research_workflow.
  • 15 resources for run manifests, plans, candidates, shortlists, PDFs, Markdown, summaries, synthesis, quality scores, config, index, workflow state, telemetry, and budget.
  • 6 prompts for topic planning, workflow orchestration, paper analysis, comparison, search refinement, and quality assessment.

The research_workflow tool adds harness engineering: telemetry, bounded context, governance gates, structural verification, doom-loop monitoring, and crash recovery.
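Many MCP clients register a stdio server through a JSON entry under an `mcpServers` key. The sketch below generates such an entry for the launch commands shown above. The entry shape follows that common client convention and is an assumption here, the server name key is hypothetical, and the snippet that `research-pipeline setup` writes may differ.

```python
import json


def mcp_client_config(
    argv: list[str], server_name: str = "research-pipeline"
) -> dict:
    """Build an `mcpServers` entry that launches the server over stdio."""
    command, *args = argv
    return {"mcpServers": {server_name: {"command": command, "args": args}}}


# Installed package:
installed = mcp_client_config(["research-pipeline", "mcp", "serve"])
# Development checkout:
checkout = mcp_client_config(["uv", "run", "research-pipeline", "mcp", "serve"])

print(json.dumps(installed, indent=2))
```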

AI Skill And Agents

Install the bundled skill for Claude Code / GitHub Copilot and Codex, plus Claude Code sub-agent definitions:

research-pipeline setup              # skills + agents + MCP config snippet
research-pipeline setup --symlink    # symlink for development
research-pipeline setup --force      # overwrite existing files
research-pipeline setup --skip-agents
research-pipeline setup --skip-skill
research-pipeline setup --skip-mcp

Installed files:

  • Claude/GitHub Copilot skill: ~/.claude/skills/research-pipeline/
  • Codex skill: ~/.codex/skills/research-pipeline/
  • Agents: ~/.claude/agents/paper-screener.md, ~/.claude/agents/paper-analyzer.md, ~/.claude/agents/paper-synthesizer.md
  • MCP config snippet: ~/.config/research-pipeline/mcp.json

The skill follows Anthropic's Skill-Building Guide: it declares explicit trigger phrases and negative triggers, a license/compatibility frontmatter, concrete user-prompt → action Examples, and progressive disclosure into references/. Behaviorally, every run:

  • Resumes on top of any prior same-topic report in the working directory — the prior file is snapshot-renamed, prior paper IDs seed the new run, and the new report fully replaces the old one.
  • Iterates up to 4 gap-closure rounds — each round extracts the report's academic and engineering gaps, fills them (new pipeline iteration or implementation knowledge), and regenerates the report from scratch. Stops early when the gap list empties, a search returns no new papers, or the user marks gaps out-of-scope.
  • Enforces human-report formatting: ## Contents, ## Round History, Mermaid for every chart, LaTeX for every formula, and per-section evidence citations validated by research-pipeline validate.
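The gap-closure behavior described above amounts to a bounded loop with three early exits. The following sketch models only that control flow. `RoundResult`, `run_round`, and the reason strings are hypothetical names, and treating out-of-scope gaps as simply filtered from the list is an assumption.

```python
from collections.abc import Callable, Iterable
from dataclasses import dataclass

MAX_ROUNDS = 4  # "up to 4 gap-closure rounds"


@dataclass
class RoundResult:
    gaps: list[str]   # gaps extracted from the regenerated report
    new_papers: int   # papers the round's search added


def close_gaps(
    initial_gaps: list[str],
    run_round: Callable[[list[str]], RoundResult],
    out_of_scope: Iterable[str] = (),
) -> tuple[int, list[str], str]:
    """Return (rounds run, remaining gaps, reason for stopping)."""
    skipped = set(out_of_scope)
    gaps = [g for g in initial_gaps if g not in skipped]
    rounds = 0
    while rounds < MAX_ROUNDS:
        if not gaps:
            return rounds, gaps, "gap list empty"
        result = run_round(gaps)
        rounds += 1
        gaps = [g for g in result.gaps if g not in skipped]
        if result.new_papers == 0:
            return rounds, gaps, "no new papers"
    reason = "gap list empty" if not gaps else "round limit reached"
    return rounds, gaps, reason
```

For example, a `run_round` that closes one gap per round and always finds new papers stops after two rounds on a two-gap list, and hits the round limit on a six-gap list.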

Configuration

Start from the example config:

cp config.example.toml config.toml

High-impact settings:

profile = "standard"          # quick, standard, deep, auto
workspace = "runs"

[sources]
enabled = ["arxiv"]           # or include scholar, semantic_scholar, openalex, dblp, huggingface
scholar_backend = "scholarly" # or "serpapi"

[screen]
diversity = false
use_semantic_reranking = false

[conversion]
backend = "docling"
fallback_backends = []

[llm]
enabled = false               # enables LLM screening/summarization when configured
provider = "ollama"           # ollama or openai-compatible

[gates]
enabled = false
auto_approve = true

Environment overrides:

| Variable | Purpose |
| --- | --- |
| RESEARCH_PIPELINE_CONFIG | Config file path |
| RESEARCH_PIPELINE_CACHE_DIR | Override cache directory |
| RESEARCH_PIPELINE_WORKSPACE | Override workspace directory |
| RESEARCH_PIPELINE_DISABLE_LLM | Force LLM features off |
| RESEARCH_PIPELINE_LLM_PROFILE | Select LLM profile |

Artifacts

Each run writes auditable outputs under runs/<run_id>/:

runs/<run_id>/
├── plan/query_plan.json
├── search/candidates.jsonl
├── screen/shortlist.json
├── download/pdf/*.pdf
├── convert/markdown/*.md
├── convert_rough/markdown/*.md
├── convert_fine/markdown/*.md
├── extract/*.extract.json
├── extract/*.bibliography.json
├── summarize/extractions/*.extraction.json
├── summarize/extractions/*.extraction.md
├── summarize/extractions/extraction_quality.json
├── summarize/*.summary.json
├── summarize/synthesis_report.json
├── summarize/synthesis_report.md
├── summarize/synthesis_traceability.json
├── summarize/synthesis_quality.json
├── summarize/synthesis.json
├── summarize/synthesis_confidence.json
├── quality/quality_scores.jsonl
├── expand/expanded_candidates.jsonl
├── analysis/
├── comparison/
└── logs/

The runs/ and workspace/ directories are generated outputs and are not tracked by git.
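Because the layout is plain files, a run can be inspected with a few lines of standard-library Python. The sketch below counts artifacts using only the paths from the tree above. It makes no assumption about the JSON schema inside the files, and the helper names are hypothetical.

```python
import json
from pathlib import Path


def read_jsonl(path: Path) -> list[dict]:
    """Load one JSON object per non-empty line."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def summarize_run(run_dir: Path) -> dict[str, int]:
    """Count the main artifacts of a run; missing stages count as zero."""
    candidates = run_dir / "search" / "candidates.jsonl"
    return {
        "candidates": len(read_jsonl(candidates)) if candidates.exists() else 0,
        "pdfs": len(list((run_dir / "download" / "pdf").glob("*.pdf"))),
        "markdown": len(list((run_dir / "convert" / "markdown").glob("*.md"))),
        "summaries": len(list((run_dir / "summarize").glob("*.summary.json"))),
    }
```

Calling `summarize_run(Path("runs") / run_id)` after each stage gives a quick check that the stage produced output before moving on.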

Development

uv sync --extra dev --extra docling --extra scholar --extra reranker
uv run pytest tests/unit/ -xvs
uv run ruff format .
uv run ruff check . --fix
uv run mypy src/
uv run pre-commit run --all-files

See docs/architecture.md for architecture details and docs/user-guide.md for the full user guide.

License

MIT
